Bloom filters are one of those data structures that you don’t generally learn about in a typical Data Structures 101 class, but wish you had once you discover them. Despite reading several articles on Bloom filters, I still found the concepts hard to grasp until last week, when I decided to sit down and not get up until I got the hang of them. Below is my attempt to explain what I understood as clearly as I can (hopefully), so that others can learn too.
Before we talk about what a Bloom filter is and how it actually works, let me make an attempt to sell you on why you would need one. Let’s assume you have an idea to build your own search engine. You are fed up with Google’s monopoly and decide to write your own version of a search engine (much like DuckDuckGo). Now, the first step in building a search engine is to build a crawler. How does a crawler work? Quite simply, the work of a crawler can be reduced to the following steps (without the implementation details, of course):

1. Start with a pool of seed URLs.
2. Fetch the page at each URL and extract the URLs it links to.
3. For each extracted URL, check whether we have already seen it.
4. Add the previously unseen URLs to the pool and repeat.
As you can see, a crawler’s work is, in theory, quite simple: just keep scraping content for any previously unseen URLs (webpages). The step we are going to focus on is step 3 in the above algorithm. If you’ve taken a data structures class, you know that maintaining a plain list of all URLs and scanning it to check for membership is a bad idea (O(n) lookup time). So what you do instead is use a set (or a hash table) in memory that allows you to do quick lookups and test for membership of a URL. Now, this works fine as long as your hash table can reside in memory. Consider the case of Google, for example: there is clearly no way a hash table for a billion-plus URLs can reside in main memory. You could use the disk for storing and querying, but since that is significantly slower than accessing main memory, we are not going to consider that case for now.
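Sketched in Python, the in-memory approach is just a set and a membership test (the URL values here are of course made up):

```python
# A minimal sketch of URL deduplication with an in-memory set.
seen = set()

def should_crawl(url):
    """Return True the first time a URL is seen, False on every later call."""
    if url in seen:       # O(1) average-case membership test
        return False
    seen.add(url)
    return True

print(should_crawl("https://example.com/"))  # True  (first visit)
print(should_crawl("https://example.com/"))  # False (already seen)
```

This is exactly what breaks down at a billion URLs: the set itself no longer fits in memory.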
How do you tackle this situation? Is there a data structure that fits in main memory and can still represent membership over a vast amount of data? This is where Bloom filters come in. Bloom filters use much less space and take constant time to answer set-membership queries. More precisely:
A Bloom filter is a data structure that is used to check whether an element x is a member of a set. Bloom filters have a strong space advantage over other data structures like sets, hash tables, or binary search trees. Bloom filters also have the property that the time taken to add an item or to check for membership is a constant O(k), independent of the number of items already in the filter.
What’s the catch, you might ask? Well, the catch is that Bloom filters trade exactness for this efficiency, meaning that there are false positives: elements that are not part of the set but are claimed to be. However, as we shall soon see, the acceptable rate of false positives depends on the application, and the rate can be lowered at the expense of more memory. Like everything else in computer science, there is a trade-off, in this case between exactness and the amount of memory used.
At the heart of every Bloom filter lie two key elements:

1. A bit array of n bits, initially all set to 0.
2. k independent hash functions h(x). Each hash function takes a value v and generates a number i < n, which effectively maps to a position in the bit array.
The underlying idea of a Bloom filter is quite simple and can be explained in the following steps:

1. Initialize a bit array of n bits with zeros. Generally, n is chosen to be much greater than the number of elements in the set.
2. To add an element, apply each hash function h(x) to the element. Each value generated is an index in the bit array; set the bit at that index to 1. So if there are k hash functions, there will be k indices generated, and for each of these k positions i in the bit array we set array[i] = 1.
3. To check for membership, again generate k values by applying the k hash functions to the input. If at least one of these k indices in the bit array is set to zero, then the element is a new element; otherwise it is reported as an existing element of the set.
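The steps above translate almost directly into code. Here is a minimal sketch in Python; deriving the k indices by salting a single SHA-256 hash is one common illustrative trick, not the scheme used in the example below:

```python
import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n = n              # number of bits
        self.k = k              # number of hash functions
        self.bits = [0] * n     # the bit array, initially all zeros

    def _indices(self, item):
        # Derive k indices by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1    # set all k positions to 1

    def __contains__(self, item):
        # If any of the k bits is 0, the item was definitely never added.
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter(n=100, k=3)
bf.add("https://example.com")
print("https://example.com" in bf)  # True
print("https://nope.com" in bf)     # False (with high probability)
```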
Let’s just jump to an example1. We start by defining the hash functions. To keep the example simple, we’ll use two hash functions h(x) and g(x). This is how they work for an integer x:

For h(x), convert x to its binary equivalent; let’s call it b. Take every odd-numbered bit of b and generate a new number y with these bits. Then h(x) is y modulo 11.

g(x) is the same as h(x), the only difference being that instead of taking the odd-numbered bits, we take the even-numbered bits.
Quick example: let x = 25. Then b = 11001. Taking every odd bit gives us the binary number 101, which is binary for 5. Similarly, taking every even bit gives us 10, which is binary for 2. Hence h(25) = 5 % 11 = 5 and g(25) = 2 % 11 = 2.
Here’s a simple implementation in Python for the hash functions defined above.
def h(x):
    # Keep the odd-numbered bits (1st, 3rd, ...) of x's binary form.
    return int("".join([v for i, v in enumerate(bin(x)[2:]) if i % 2 == 0]), 2) % 11

def g(x):
    # Keep the even-numbered bits (2nd, 4th, ...) of x's binary form.
    return int("".join([v for i, v in enumerate(bin(x)[2:]) if i % 2 == 1]), 2) % 11
With that out of the way, let’s proceed with our filter. Assume our bit array has 11 bits, all of which are initially set to zero:

00000000000
Let the first number our Bloom filter sees be 25. As shown above, h(25) = 5 and g(25) = 2. Hence the next step is to set the bits at index 5 and index 2 of our bit array to 1. Doing that, our bit array now transforms to 00100100000 (reading left to right, starting from index 0).
Extending the same logic to a couple more values of x, we get the table shown below:

x      h(x)   g(x)
25     5      2
159    0      7
585    7      9
This can be verified using the functions defined above:

list(map(lambda x: (h(x), g(x)), (25, 159, 585)))
# [(5, 2), (0, 7), (7, 9)]
As is quite clear, the bit array now becomes 10100101010. How about lookups? Let’s say the next value that comes to our filter is 118. We compute h(118) and g(118) and find these values to be 3 and 5 respectively. The next thing we do is check the bit array at these indices, and we find that

array[3] = 0
array[5] = 1

Since one of the bits is set to 0, our Bloom filter reports that the number 118 is not an existing member of our set.
Getting back to the caveat pointed out earlier, let me use an example to demonstrate how the Bloom filter is susceptible to false positives.

Assume the next value that comes to our filter is 162. Calculating the hash functions, we find that h(162) = 2 and g(162) = 0. Looking up these indices in our bit array, we find that both are set to 1. Hence, in this case, the Bloom filter will wrongly report that the value 162 exists in our set.
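The whole walkthrough can be reproduced in a few lines; h and g are repeated here so the snippet runs on its own:

```python
def h(x):
    return int("".join(v for i, v in enumerate(bin(x)[2:]) if i % 2 == 0), 2) % 11

def g(x):
    return int("".join(v for i, v in enumerate(bin(x)[2:]) if i % 2 == 1), 2) % 11

bits = [0] * 11
for x in (25, 159, 585):       # add the three elements from the table
    bits[h(x)] = bits[g(x)] = 1

def maybe_contains(x):
    # Report membership only if both hashed positions hold a 1.
    return bits[h(x)] == 1 and bits[g(x)] == 1

print(maybe_contains(118))  # False: bit 3 is still 0, 118 was never added
print(maybe_contains(162))  # True: a false positive, 162 was never added
```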
If after reading the above you are thinking to yourself that we just need to reduce the number of collisions to reduce the rate of false positives, then you are right. A simple improvement to the above example is to use more hash functions and a larger bit array. If instead of 11 bits we had 100 bits, and instead of just the 2 hash functions h(x) and g(x) we had a few more, the probability of a non-existing value hashing only to set bits would be reduced, thereby reducing the rate of false positives.
Let’s do some math to corroborate the intuition above. If you feel the statement above is intuitive enough, feel free to skip this section.
From the example in the previous section, it should be clear that a false positive occurs whenever all the positions output by the k hash functions are already set to 1. Hence we can say that the probability of a false positive depends on the density of 1s in the array and on the number of hash functions: the more 1s there are, the higher the probability of a false positive, and the fewer hash functions there are, the fewer bits need to coincide for a false positive to occur.

Also, we can roughly say the following about the number of ones:

number of 1s = (no. of elements x no. of hash functions) - collisions
From the above example, we have 3 input elements and 2 hash functions, so there could be up to 6 bits set to 1. Since we have one collision (g(159) == h(585) == 7), the total number of bits set in our bit array is 5.
Lastly, using an analogy of darts and targets, one can show that for d darts thrown uniformly at t targets, the probability that a given target is hit by no dart is

(1 - 1/t)^d ≈ e^(-d/t)

where e is the base of the natural logarithm2. In our case, the targets are the bits in the array and the darts are the outputs of the hash functions.
Okay, now we are ready for the example. Assume we have a bit array of 1 billion bits, 5 hash functions, and 100 million elements. That is:

t = 10^9        (the bits in the array are the targets)
d = 5 x 10^8    (the outputs from the hash functions are the darts: 5 hash functions x 10^8 elements)

P(no dart hits a given target) = e^(-d/t) = e^(-0.5) = 0.607
Hence the density of 0s in the bit array = 0.607, so the density of 1s in the bit array = 1 - 0.607 = 0.393.

The probability of a false positive = the probability that all 5 hash functions map to an index that holds a 1 in the bit array = 0.393^5 = 0.00937.
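These numbers are easy to check with a few lines of Python:

```python
import math

t = 1e9   # bits in the array (targets)
d = 5e8   # hash outputs (darts): 5 hash functions x 10^8 elements

density_zeros = round(math.exp(-d / t), 3)  # fraction of bits still 0
density_ones = round(1 - density_zeros, 3)  # fraction of bits set to 1
false_positive = density_ones ** 5          # all 5 hashes land on a 1

print(density_zeros)             # 0.607
print(density_ones)              # 0.393
print(round(false_positive, 5))  # 0.00937
```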
As seen above, P(false positive) is indeed quite low. By tweaking the number of hash functions and the number of bits in the array, it can be reduced even further.
The space advantage of a Bloom filter comes from the compactness of the data structure. A Bloom filter with a 1% error rate and an optimal value of k requires only about 9.6 bits (1.2 bytes) per element, regardless of the size of the elements3. Compare that to storing a typical integer in a hash table, which would take 32 bits (4 bytes).
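That 9.6-bits figure comes from the standard sizing formulas for Bloom filters, which we can verify quickly:

```python
import math

p = 0.01  # target false-positive rate of 1%

# Standard Bloom filter sizing: bits per element m/n = -ln(p) / (ln 2)^2
bits_per_element = -math.log(p) / (math.log(2) ** 2)
# Optimal number of hash functions: k = (m/n) * ln 2
optimal_k = bits_per_element * math.log(2)

print(round(bits_per_element, 1))  # 9.6
print(round(optimal_k, 1))         # 6.6, so round up to 7 in practice
```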
Hopefully, if you’ve come this far, you are sold on giving the data structure a shot. Armon Dadgar’s bloomd is an excellent implementation inspired by Memcached, and there are a good number of client libraries available in popular languages. If you’re using Redis, you can use the SETBIT and GETBIT commands to build a Redis-backed Bloom filter. The documentation has some helpful examples to get you started.
I hope this article was useful in whetting your appetite for Bloom filters. Probabilistic data structures are really cool and, in my opinion, Bloom filters are a perfect introduction to the topic.
Feel free to leave any comments or suggestions below.