# Bloom Filters in Action

A while back, I was working on a task that required me to generate a random unique string and use it as an identifier in a database column and I thought to myself,* “It’s a trivial task, will be done in 2 minutes!”*.

Rakishly, I began to formulate the code.

String id = createRandomString(6) //length 6while(dbRepository.hasId(id)){

id = createRandomString(6)

}

*“This code will **work**. But at what **cost**?”*

Time complexity is **O(N *M* k)**, where N is the size of the db and k is the length of the random string to be generated and M is the number of collisions in the DB .

Let’s optimise it.

We can create an index on Id and if N >> k then we can safely ignore k. This drops the complexity to **O(MlogN).**

*“That looks pretty good”*, I proclaimed. But then I started to calculate what would be the worst case time taken for a typical relational database.

Number of rows (N) - 10^8Search time = log(10^8) = 27opsOn average -If the random function generates duplicates 0.001% times

then we will iterate 10^-5 * 10^8 = 1000 timesThe average case of1000 db callsisunacceptablefor an API which will tend to worsen when the table size increases.

But what can we do to remedy this?

- Write better random function or increase the length of the String. That might work from small amount of entries in database but as that keeps on increasing then we will be in the same predicament again.
- Use Bloom Filters.

# What is a Bloom Filter?

A simple filter that can answer* “Do I have this string already in the storage or not?”* in O(1) time.

Let’s first talk about Hashmap.

A hashmap is a simple array with each index calculated by the hash of the String/Object being inserted against this index. This makes retrieval O(1).

Let’s use the **same principle** but instead of saving the object against the index we just mark that value as true or 1.

A bloom filter can be implemented with a binary array. Similar to a Hashmap, we have hash function(s) that generate an integer value which corresponds to an index in the binary array.

## Algorithm

INSERT(S)1. String s ishashedby K hash functions generating K different indices.2. The values against these indices are marked as 1.EXISTS(S)1. String s ishashedby K hash functions generating K different indices.2. The values against these indices are checked, if all of them are 1 then there is astrong possibilityof having this data in storage. If any of them is zero then wedefinitelydon't have this value persisted.

Bloom Filter can **definitely answer** if that string is **not present** in the DB but it **can’t** **definitely **say that string is in the DB.

## Confused?

Let’s take an example —

Insert **“Maple”.**

Here K(number of hash functions) = 3

so, we get 3 indices — 0, 3, 8

Insert **“Canada”**

hashed indices — 4, 6, 8

Check if **“India” **exists —

hashed indices — 5, 8, 9

Here 9 and 5 are zero so we return false.

Check if **“Apple” **exists —

hashed indices — 6, 4, 3

All of them **are 1** so we have to return **true** here. But **Apple** was never inserted in the filter!

This is called a ** false positive** and is a downside of this data structure. We can mitigate it to a certain extent by increasing the array size.

## Important Parameters

There are 2 important parameters of a bloom filter.

- Array Size
- Number of Hash Functions

**Let’s understand Bloom filter and importance of these parameters by code.**

We are using BitSet to represent the binary array.

`falsePosProbability - How much false positives can we tolerate`

expectedElements - How many elements are going to be in the DB.

We can calculate ** expectedElements **by using the formulas (from

*GeeksforGeeks*) —

**Probability of False positivies:** Let **m **be the size of bit array, k be the number of hash functions and **n** be the number of expected elements to be inserted in the filter, then the probability of false positive **p** can be calculated as:

**Size of Bit Array:** If expected number of elements **n** is known and desired false positive probability is **p** then the size of bit array **m** can be calculated as :

**Optimum number of hash functions:** The number of hash functions **k** must be a positive integer. If **m** is size of bit array and **n** is number of elements to be inserted, then k can be calculated as :

Let’s create a **BasicBloomFilter** add some hash functions :

Here I have used 2 Hash Functions

- MurmurHash3
- Java Hashcode

## Let’s Test this class —

We will add a lot of words in the filter and check against words that aren’t in the filter to see how much false positives we are getting.

**Result —**

TestCase 1 -::

BloomFilter

size :: 64

falsePosProbability :: -1.0

expectedElements :: -1

optimumHashFunctions :: -1present :: 5 out of 5TestCase 2 -::

BloomFilter

size :: 26048

falsePosProbability :: 0.01

expectedElements :: 3714

optimumHashFunctions :: 7present :: 0 out of 5

As we can see if the size of the **BitSet is small** as in TestCase 1, every bit is set and will cause all strings to be reported as **false positives**.

When we calculated the optimal size when falsePosProbability set to 0.01 it drops to 0. However now the size is **26,048 for 3714 elements.**

As this data needs to be in memory to be quickly accessed we need to strike a balance between **falsePosProbability and size.**

Thanks for reading!

*Read more to learn how to implement Bloom Filter via Redis **here**.*