Bloom Filters: use cases and implementation
Use cases and a Redis-based implementation of a Bloom filter service
What is a Bloom filter?
A Bloom filter is a probabilistic data structure used to quickly and efficiently test whether an element is a member of a set. It is named after Burton Howard Bloom, who proposed it in 1970, and it trades a small amount of accuracy for much lower space usage and better performance.
Because a Bloom filter is probabilistic, the possibility of false positives is built in: elements that are not in the set may be incorrectly reported as members. However, there are no false negatives; if an element is in the set, it will always be identified as such. This makes Bloom filters a good fit when nothing catastrophic happens if a lookup is occasionally wrong, i.e. when you need to test set membership quickly and efficiently but do not need perfect accuracy. Below we discuss such cases, and implementations where fast lookups can make or break an application.
The idea of a ‘Bloom filter’ is to use a fixed-size bit array and a set of hash functions. When an element is inserted, the hash functions are applied to the element to generate a set of indices in the bit array. Each index is then set to 1. To test whether an element is a member of the set, the hash functions are again applied to the element to generate a set of indices in the bit array. If all of the bits at those indices are set to 1, then the element is probably a member of the set. If any of the bits are not set to 1, then the element is definitely not a member of the set.
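The mechanism above can be sketched in a few lines. This is a minimal, illustrative implementation: the class name, the sizes, and the double-hashing scheme are assumptions made for this sketch, not the service implementation described later.

```java
import java.util.BitSet;

// A minimal Bloom filter: a fixed-size bit array plus k derived hash indices.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th index from two base hashes (double hashing),
    // so we don't need k independent hash functions.
    private int index(String element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String element) {
        for (int i = 0; i < numHashes; i++) bits.set(index(element, i));
    }

    // true  -> element is *probably* in the set (false positives possible)
    // false -> element is *definitely not* in the set
    boolean mightContain(String element) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(index(element, i))) return false;
        return true;
    }
}
```

Note that `mightContain` returns false as soon as any one bit is unset, which is exactly why a negative answer is definitive.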
What are the use cases?
Although the main motivation is faster lookups with minimal storage requirements, we need to understand the limitations and costs. A Bloom filter can answer NO with complete confidence, but YES only with high probability. It's like JPEG: the image may not be pixel-perfect, but it still gives you the picture.
Maintaining an invalid set of time-sensitive stateless tokens/API keys
In server-side applications, JWT offers a best-in-class stateless, scalable option for building cloud-native applications: permissions travel with the token, so the server is a complete pass-through and need not store session state. One downside is that there is no way to tie a token to a user's logout (or other actions, such as deactivating or removing a user); the token stays valid until the expiry specified in the JWT. This limitation usually forces us to keep the timeout very low to minimise the impact, and to deal with refresh tokens. We can overcome this by maintaining all logged-out tokens, and tokens of deactivated/removed users, in a Bloom filter. That lets us reject such requests cheaply while keeping the token timeout aligned with the UI session timeout, or with whatever the API user requests.
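The validation order this implies can be sketched as follows. Here `revokedFilter` stands in for the Bloom filter lookup (e.g. a `BF.EXISTS` call); the names are illustrative, not a real JWT library's API.

```java
import java.time.Instant;
import java.util.function.Predicate;

// Sketch: a token is accepted only if it has not expired AND its ID is
// absent from the logout/deactivation Bloom filter.
class TokenGate {
    static boolean isAcceptable(String tokenId, Instant expiresAt,
                                Predicate<String> revokedFilter) {
        if (Instant.now().isAfter(expiresAt)) return false; // hard expiry
        // Bloom semantics: a positive lookup may be a false positive, so a
        // valid token is very occasionally rejected -- an acceptable trade
        // for a logout denylist, since there are no false negatives.
        return !revokedFilter.test(tokenId);
    }
}
```

A revoked token is therefore always rejected; only the rare false positive costs a legitimate user a re-login.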
De-duplication of IDs for Kafka streams and index-sensitive stores like Elasticsearch, where inserts and updates incur a high cost
With Kafka, and when storing data in Elasticsearch, data often arrives as CSVs, flat files, or database changes captured through Change Data Capture (CDC) techniques, all of which are prone to duplicate ingestion. Mutable operations on an index-sensitive data store are generally expensive. Creating a hash of each record's ID and checking a Bloom filter before writing can save all of those operations.
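A sketch of that guard, assuming the check happens before each index write. The in-process `BitSet` here is a local stand-in for the Redis-backed filter described later; class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.BitSet;

// Sketch: skip re-indexing records whose ID hash is already in the filter.
class Deduplicator {
    private final BitSet bits = new BitSet(1 << 20); // ~1M bits
    private static final int NUM_HASHES = 4;

    // Derive NUM_HASHES indices from slices of the record ID's SHA-256 digest.
    private int[] indices(String recordId) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(recordId.getBytes(StandardCharsets.UTF_8));
            int[] idx = new int[NUM_HASHES];
            for (int i = 0; i < NUM_HASHES; i++) {
                int h = ((d[4 * i] & 0xFF) << 24) | ((d[4 * i + 1] & 0xFF) << 16)
                      | ((d[4 * i + 2] & 0xFF) << 8) | (d[4 * i + 3] & 0xFF);
                idx[i] = Math.floorMod(h, 1 << 20);
            }
            return idx;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    // Returns true if the record was not seen before and should be indexed.
    boolean firstSighting(String recordId) {
        boolean seen = true;
        for (int i : indices(recordId)) {
            if (!bits.get(i)) seen = false;
            bits.set(i);  // record it either way
        }
        return !seen;
    }
}
```

A false positive here means a genuinely new record is skipped, so the filter should be sized generously for the expected ID volume.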
Managing cohorts
In AdTech, the most important aspect of a successful ad campaign is building effective cohorts (groups of people with a shared characteristic) and targeting them efficiently. Time, gender, geolocation, age, etc. aren't sufficient anymore. This requires high write throughput and very quick lookups to determine which cohorts an incoming request belongs to. There are many related use cases, such as routing API traffic to premium versus non-premium servers.
Design and Implementation
The intention here is to design and implement this feature so that it is easily accessible to a large number of developers to use directly. Hence it is developed with Spring Boot (the most popular) and Kotlin (the best JVM language). Beyond the use cases described above, developers looking at spell checking, network packet filtering, or DNA sequence analysis should also find this useful.
Redis is a popular in-memory distributed data store, frequently used as a cache or for fast access to frequently used data. Through the RedisBloom module (bundled with Redis Stack), Redis provides built-in support for creating and managing Bloom filters: BF.ADD adds an element to a Bloom filter, and BF.EXISTS tests whether an element is a member of one. These commands are very fast and efficient, and they let us implement a Bloom filter with very little code. Redis also provides features that make it easy to manage and monitor Bloom filters, and BF.DEBUG can be used to diagnose issues with them.
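A quick redis-cli session shows the shape of these commands, assuming a Redis instance with the RedisBloom module loaded; the key name `logout:tokens` and the token IDs are just examples.

```
# Reserve a filter sized for 1M items at a 1% false-positive rate
127.0.0.1:6379> BF.RESERVE logout:tokens 0.01 1000000
OK
127.0.0.1:6379> BF.ADD logout:tokens "jti-4f2a"
(integer) 1
127.0.0.1:6379> BF.EXISTS logout:tokens "jti-4f2a"
(integer) 1
127.0.0.1:6379> BF.EXISTS logout:tokens "jti-unknown"
(integer) 0
```

BF.RESERVE is optional (BF.ADD creates a filter with default parameters), but reserving up front lets you control capacity and error rate explicitly.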
Lua scripting lets us execute operations server-side in Redis, akin to a stored procedure. When checking membership across hundreds of cohort filters, a script can return the full list of matches with a single remote command.
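A sketch of such a script, assuming one RedisBloom filter per cohort (module commands like BF.EXISTS are callable from scripts via `redis.call`); the `cohort:*` key naming is an assumption for illustration.

```lua
-- Check one user ID (ARGV[1]) against every cohort filter passed in KEYS,
-- returning the names of the matching cohorts in a single round trip.
-- Invoked e.g. as: EVAL <script> 3 cohort:premium cohort:mobile cohort:eu user-42
local matches = {}
for i, key in ipairs(KEYS) do
  if redis.call('BF.EXISTS', key, ARGV[1]) == 1 then
    table.insert(matches, key)
  end
end
return matches
```

One EVAL call replaces hundreds of BF.EXISTS round trips, which is where the latency win comes from.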
For a simple single-JVM application that doesn't require high scalability and performance, Google Guava's Bloom filter implementation is designed to be easy to use and customisable. It provides a simple API for adding elements to a Bloom filter and checking whether an element is likely to be a member of the filter. Guava's Bloom filter also supports configurable parameters such as the expected number of elements and the desired false-positive rate.
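Guava's API in practice looks like this (requires the `com.google.guava:guava` dependency; the sizing numbers are tuning assumptions, not recommendations):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class GuavaBloomDemo {
    public static void main(String[] args) {
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                10_000,  // expected number of elements
                0.01);   // desired false-positive probability

        filter.put("user-123");

        System.out.println(filter.mightContain("user-123")); // true
        System.out.println(filter.mightContain("user-999")); // almost certainly false
    }
}
```

Guava derives the bit-array size and hash count from the two parameters you pass, which is exactly the customisability mentioned above.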
A separate service is written offering both ingestion and search capabilities for the use cases listed above. The SpringBoot/Kotlin/Redis stack, with a Thymeleaf/Htmx UI, will include both implementations.
<TODO>
References
https://www.youtube.com/watch?v=Z9_wrhdbSC4
https://redis.com/redis-best-practices/bloom-filter-pattern/
https://medium.com/lydtech-consulting/kafka-deduplication-patterns-1-of-2-ef0371a3331b