Image for post
Image for post
Source : Cloudflare

A while back, I was working on a task that required me to generate a random unique string and use it as an identifier in a database column and I thought to myself, “It’s a trivial task, will be done in 2 minutes!”.

Rakishly, I began to formulate the code.

while(dbRepository.hasId(id)){
id = createRandomString(6)
}

“This code will work. But at what cost?”

Time complexity is O(N *M* k), where N is the size of the db and k is the length of the random string to be generated and M is the number of collisions in the DB .

Let’s optimise it.

We can create an index on Id and if N >> k then we can safely ignore k. This drops the complexity to O(MlogN).

“That looks pretty good”, I proclaimed. But then I started to calculate what would be the worst case time taken for a typical relational database.

Search time = log(10^8) = 27opsOn average -If the random function generates duplicates 0.001% times
then we will iterate 10^-5 * 10^8 = 1000 times
The average case of 1000 db calls is unacceptable for an API which will tend to worsen when the table size increases.

But what can we do to remedy this?

  1. Write better random function or increase the length of the String. That might work from small amount of entries in database but as that keeps on increasing then we will be in the same predicament again.
  2. Use Bloom Filters.

What is a Bloom Filter?

A simple filter that can answer “Do I have this string already in the storage or not?” in O(1) time.

Let’s first talk about Hashmap.

A hashmap is a simple array with each index calculated by the hash of the String/Object being inserted against this index. This makes retrieval O(1).

Let’s use the same principle but instead of saving the object against the index we just mark that value as true or 1.

Image for post
Image for post
A Bloom filter (from Wikipedia)

A bloom filter can be implemented with a binary array. Similar to a Hashmap, we have hash function(s) that generate an integer value which corresponds to an index in the binary array.

Algorithm

1. String s is hashed by K hash functions generating K different indices.2. The values against these indices are marked as 1.
EXISTS(S)
1. String s is hashed by K hash functions generating K different indices.2. The values against these indices are checked, if all of them are 1 then there is a strong possibility of having this data in storage. If any of them is zero then we definitely don't have this value persisted.

Bloom Filter can definitely answer if that string is not present in the DB but it can’t definitely say that string is in the DB.

Confused?

Let’s take an example —

Image for post
Image for post
Empty filter

Insert “Maple”.

Here K(number of hash functions) = 3

so, we get 3 indices — 0, 3, 8

Image for post
Image for post
0,3,8 marked as 1

Insert “Canada”

hashed indices — 4, 6, 8

Image for post
Image for post
4, 6, 8 marked as 1

Check if “India” exists —

hashed indices — 5, 8, 9

Here 9 and 5 are zero so we return false.

Check if “Apple” exists —

hashed indices — 6, 4, 3

All of them are 1 so we have to return true here. But Apple was never inserted in the filter!

This is called a false positive and is a downside of this data structure. We can mitigate it to a certain extent by increasing the array size.

Important Parameters

There are 2 important parameters of a bloom filter.

  1. Array Size
  2. Number of Hash Functions

Let’s understand Bloom filter and importance of these parameters by code.

A Bloom Filter Interface.
Abstract Implementation of the Bloom Filter

We are using BitSet to represent the binary array.

We can calculate expectedElements by using the formulas (from GeeksforGeeks) —

Probability of False positivies: Let m be the size of bit array, k be the number of hash functions and n be the number of expected elements to be inserted in the filter, then the probability of false positive p can be calculated as:

Image for post
Image for post

Size of Bit Array: If expected number of elements n is known and desired false positive probability is p then the size of bit array m can be calculated as :

Image for post
Image for post

Optimum number of hash functions: The number of hash functions k must be a positive integer. If m is size of bit array and n is number of elements to be inserted, then k can be calculated as :

Image for post
Image for post

Let’s create a BasicBloomFilter add some hash functions :

Here I have used 2 Hash Functions

  1. MurmurHash3
  2. Java Hashcode

Let’s Test this class —

We will add a lot of words in the filter and check against words that aren’t in the filter to see how much false positives we are getting.

Result —

present :: 5 out of 5TestCase 2 - 
BloomFilter
::
size :: 26048
falsePosProbability :: 0.01
expectedElements :: 3714
optimumHashFunctions :: 7
present :: 0 out of 5

As we can see if the size of the BitSet is small as in TestCase 1, every bit is set and will cause all strings to be reported as false positives.

When we calculated the optimal size when falsePosProbability set to 0.01 it drops to 0. However now the size is 26,048 for 3714 elements.

As this data needs to be in memory to be quickly accessed we need to strike a balance between falsePosProbability and size.

Thanks for reading!

Read more to learn how to implement Bloom Filter via Redis here.

Written by

I like building stuff.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store