Avoid unnecessary database queries by using bloom filters

Bloom filters is a technique to determine if an element is possibly contained in a set or it is not contained at all. What can this mechanism be useful for? Let's see an example.

We have an API to create users with an email. We can make a POST request to /users endpoint and we will get a 201 CREATED or 409 CONFLICT, when email is being used by another user.

First obvious approach could be query the database to check if email exists:

fun createIfNotExists(email: String) {
	if (existsInDatabase(email)) {
        throw RuntimeException("Email is already in use!")
    } 
    createUserInDatabase(email) // of course, this goes to database as well
}

However, this is not the best solution in terms of performance; we are hitting the database to check if email is present in every single request. We can minimize database hits using bloom filters. Using this mechanism, when a new email is received, we can indeed be sure that it is not present in database, without making a single query. But, how is it done?

val arr = BooleanArray(128) // 1

fun existsEmail(email: String): Pair<Boolean, Int> {
    val idx = Math.abs(email.hashCode()) % 128 // 2
    val exists = if (!arr[idx]) false else existsEmailInDatabase(email) // 3
    return Pair(exists, idx)
}
  1. We create a boolean array with 128 items initialized to false. The larger is the array, the lesser is the amount of useless database queries will be made. (bit array is the optimum implementation, because of memory usage.)
  2. Email is mapped to an array index.
  3. If array stores false value in that index, we can be absolutely sure that email is not being used by another user. Otherwise, when it stores true, we cannot be sure; it might be used or not, That's the reason we need to query the database in this case.

Of course, when we insert a new email, we need to update the array position value to true. This would be the final code:

val arr = BooleanArray(128)

fun createIfNotExists(email: String) {
    val (exists, idx) = existsEmail(email)
	if (exists) {
        throw RuntimeException("Email is already in use!")
    } 
    createUser(email, idx)
}

fun existsEmail(email: String): Pair<Boolean, Int> {
    val idx = Math.abs(email.hashCode()) % 128
    val exists = if (!arr[idx]) false else existsEmailInDatabase(email) 
    return Pair(exists, idx)
}

fun existsEmailInDatabase(email: String): Boolean {
    // select from users...
}

fun createUser(email: String, idx: Int) {
    arr[idx] = true
 	insertUserInDatabase(email)   
}

fun insertUserInDatabase(email: String) {
    // insert into users...
}

Note that when bloom filter array stores true in every position, no optimization is taking place since database is being reached in every request. This will eventually happen because email hashes will be always mapped to an array index. As said before, the more elements the array is storing, the longer optimization will take place, but more memory will be needed.

Analyze your use case and determine if this technique can be useful for it.

I hope you enjoyed the post 🙃