Membership Queries with Big Data
Performing membership tests is one of the most common requirements in many applications.
- In web crawling we want to know whether a URL has been crawled or not.
- In recommendation systems we want to know whether an item has already been recommended to a user or not.
- To prevent DDoS attacks and IP spoofing we want to know whether an IP address has been blacklisted or not.
- In AdWords, we want to know whether there are any ads corresponding to the words and n-grams present in a query.
- In search we want to know whether a query term is indexed before looking up an inverted index.
- During sign-up, we want to know whether a user's password is weak or has been compromised.
- In spell-checking we want to know whether a word is in the vocabulary or not.
…and many more
Approach 1: Databases
In the above applications we can always leverage a database, but doing so comes at the cost of disk I/O and puts unnecessary load on the DB for items that may not exist in it at all.
For example, in web crawling we can store the crawled URLs in a NoSQL database like MongoDB or Cassandra, and every time we want to crawl a URL, we first look up whether it already exists in the DB. If it exists we skip it; otherwise we crawl it and add the URL to the DB. This becomes expensive when there are billions of URLs in the DB.
Pros:
- Data is read from and written to a single source of truth.
- Data is persistent. Even if the system crashes, we still know which URLs have been crawled.
Cons
- Membership queries are slow as they involve both network I/O (assuming the DB is hosted on a remote cloud server) and disk I/O.
- May involve unnecessary DB lookups. For example, while crawling URLs for a new domain, most URLs would not have been seen before.
Approach 2: Sets and Hash Tables
The entire data is stored in an in-memory data structure such as a Set or a Hash Table and held in RAM.
Note that an in-memory Set is an auxiliary data structure, since we still need to persist the data on disk; otherwise, if the system crashes, the data in RAM would be lost.
Let h(x) be a hash function with range size ‘m’, i.e. it maps input data into the range [0, m-1]. Let A be an array of size ‘m’. Given a URL x, compute h(x) and set A[h(x)] = x. To check whether a URL is present, check A[h(x)].
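As a concrete illustration, a minimal sketch of this scheme (with Python's built-in hash standing in for h and m = 2²⁰ chosen arbitrarily) could look like this:

```python
# Minimal sketch of the naive scheme above: a fixed-size array indexed by h(x).
# Note that a colliding insert silently overwrites the previous value, which is
# one of the problems discussed below.
m = 2 ** 20
A = [None] * m

def h(x):
    return hash(x) % m          # Python's built-in hash taken modulo m

def insert(x):
    A[h(x)] = x

def contains(x):
    return A[h(x)] == x
```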
There are a few downsides to this:
- RAM is limited. The size of the data can run into multiple gigabytes or even terabytes.
- Sets and Hash Tables work on the principle of hashing. To enable fast lookups, i.e. O(1) expected time, the size of the hash table must be >> the size of the data and the hash function should produce uniformly distributed keys.
Assuming 32 bit hash keys, the size of the membership hash table (assuming an array of size 2³²) would be approximately 16GB.
A collision in a hash table occurs when, for two inputs x and y with x != y, h(x) = h(y), where h is the hash function.
Since the data keeps growing but the size of the hash table is limited by the RAM size, there will be many collisions when either the size of the data approaches the size of the hash table or the hash function is a bad one and produces a skewed distribution of keys.
For example, consider the hash function h(x) = (3x+5) % 8.
This hash function produces keys in the range [0, 7]. So if the number of inputs to hash is greater than 8, then by the pigeonhole principle at least one key will definitely map to multiple values.
For example, for x = 1 and y = 9, h(x) = h(y) = 0.
Even if the key space is much larger, say h(x) = (3x+5) % (2³⁰), the keys are in the range [0, 2³⁰-1]. By the birthday paradox, if we have hashed 38582 inputs there is a 50% chance that at least two of them share the same hash index.
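As a quick sanity check of that figure, the 50% threshold from the birthday approximation can be computed directly; this is an illustrative calculation, not part of the original benchmarks:

```python
import math

# Number of inserts at which a collision becomes ~50% likely for m = 2**30 slots,
# using the birthday-bound approximation n ≈ 0.5 + sqrt(0.25 + 2*m*ln(2)).
m = 2 ** 30
n_half = 0.5 + math.sqrt(0.25 + 2 * m * math.log(2))
print(round(n_half))  # ~38582
```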
One can do dynamic resizing when the load factor, i.e. the ratio of the number of inputs to the number of hash keys, rises above a certain threshold. For example, if the current hash is h(x) = (3x+5) % m and the load factor exceeds 0.75, create a new hash h(x) = (3x+5) % (2*m) and re-index all the data.
- We cannot use a bit vector instead of an array, because then we cannot store the actual values needed to detect and resolve collisions.
To resolve collisions there are techniques such as chaining with Linked Lists, Linear Probing, Cuckoo Hashing, etc. All of these techniques have an average run-time performance of O(1).
- Dynamic resizing is used to reduce the number of collisions, whereas Linear Probing, Cuckoo Hashing, etc. are used to perform CRUD operations in the presence of collisions.
A simple Python implementation of a Set using a hash table with Linear Probing is sketched below.
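This is a minimal sketch, assuming Python's built-in hash and doubling-based resizing at a 0.75 load factor; the class name LinearProbingSet is illustrative:

```python
class LinearProbingSet:
    def __init__(self, m=2 ** 10, max_load=0.75):
        self.m = m                    # current table size
        self.max_load = max_load      # resize threshold (load factor)
        self.table = [None] * m       # slots hold the stored keys
        self.n = 0                    # number of stored keys

    def _probe(self, key):
        # Walk the table sequentially from hash(key) % m until we find
        # either the key itself or an empty slot.
        i = hash(key) % self.m
        while self.table[i] is not None and self.table[i] != key:
            i = (i + 1) % self.m
        return i

    def add(self, key):
        if (self.n + 1) / self.m > self.max_load:
            self._resize(2 * self.m)  # double the table and re-index everything
        i = self._probe(key)
        if self.table[i] is None:
            self.table[i] = key
            self.n += 1

    def __contains__(self, key):
        return self.table[self._probe(key)] == key

    def _resize(self, new_m):
        old_keys = [k for k in self.table if k is not None]
        self.m, self.table, self.n = new_m, [None] * new_m, 0
        for k in old_keys:
            self.add(k)
```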
Inserting 1 million strings into this set takes around 6 seconds. Searching for a string takes around 0.0015 ms on average.
Depending on the available RAM on the system, we can start with a smaller hash table, i.e. smaller values of ‘m’, but insertion may take longer as it will require multiple re-indexing steps.
For example, starting with a hash table of size 2¹⁰ it took 9 seconds to insert the data, but only 6 seconds with a hash table of size 2²⁰. The final size of the hash table was 2²¹, i.e. it took 11 re-indexing steps with the smaller hash table as compared to just 1 re-indexing with the latter.
We can test how good the hash function is at distributing the input uniformly across the range of keys. We can use the Kolmogorov-Smirnov test to check how close the distribution of hash indices is to a sample from a uniform distribution.
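A small illustrative check, assuming SciPy is available (the sample size and m = 2²⁰ here are arbitrary choices):

```python
import random
import string
from scipy import stats

# Hash a batch of random strings and normalise the indices to [0, 1).
m = 2 ** 20
words = ["".join(random.choices(string.ascii_lowercase, k=10)) for _ in range(100_000)]
indices = [(hash(w) % m) / m for w in words]

# One-sample KS test against the Uniform(0, 1) distribution; a large p-value
# means we cannot reject the hypothesis that the indices are uniform.
statistic, p_value = stats.kstest(indices, "uniform")
print(statistic, p_value)
```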
The p-value is 0.90, so we cannot reject the hypothesis that the hash indices are uniformly distributed.
Pros
- Faster reads and writes compared to databases since the data is stored entirely in memory, although for persistence we need to periodically flush the data from memory to disk.
- The main database is not touched often and can thus be utilised for more complex queries.
- Linear probing is much faster than chaining with Linked Lists or Cuckoo Hashing because records are stored and searched sequentially starting from a pointer in memory. Sequential access of memory addresses is faster than random access, as in a Linked List where the next pointer can point to any memory address in RAM.
Cons
- The size of the hash table scales linearly with the data.
- The hash table size in GB is limited by the available RAM.
- The number of collisions will increase with the data, and hence so will the time to check whether a record is present.
- Time complexity is O(1) in the average case, not the worst case, since we may need to linearly probe for the record.
- On distributed systems, each instance has its own hash table in RAM and no way of coordinating, so the same URL may be crawled by processes running on separate instances. We can use a distributed in-memory store such as Redis to implement the hash table, but that slows down reads and writes since it requires network I/O to communicate and coordinate between instances as well as waiting for write locks to be released by concurrent processes.
Approach 3: Bloom Filters
Bloom Filters are probabilistic data structures built on hashing, but with a ‘twist’.
These are called probabilistic data structures because Bloom filters can return false positives. For example, if a URL has never been crawled, a Bloom filter may say that it has been.
But the good part is that we can control this error rate by selecting the optimal size of the bit vector and the number of hash functions.
In Bloom filters, we use a bit vector of size ‘m’ and ‘k’ hash functions instead of a single hash function. The ‘k’ hash functions are chosen to be independent.
One good part is that in Bloom filters we work with bit vectors instead of integer arrays, which reduces the data size by a factor of the number of bits used to represent each integer. For example, a bit vector of size m = 2³² needs 0.5GB of memory, as compared to 16GB for a 32-bit integer array of the same size.
Let h1(x), h2(x) … hk(x) be the ‘k’ hash functions. Then for a given input ‘x’ and bit vector A set A[h1(x)]=1, A[h2(x)]=1 … and so on.
To search for a given input ‘x’, check if the bits at positions h1(x), h2(x) … hk(x) are set or not. If any one of the bits is not set, then ‘x’ is definitely not there but if all the bits are set then ‘x’ is ‘probably’ there.
It is ‘probably’ there because it is possible that a combination of other inputs has set all these bits.
To reduce the probability of false positives, we can increase the size ‘m’ of the bit vector and/or increase the number ‘k’ of hash functions. But too many hash functions will fill up the bit vector fast and as a result the number of false positives will increase. So we have to balance ‘k’.
Let ‘n’ be the estimated number of points that will be stored in the Bloom filter. For convenience we assume it is 10⁶.
The optimal values of ‘m’ and ‘k’ to achieve an error rate of at most p are (ln denotes the natural logarithm):
m = -n*ln(p)/(ln 2)²
k = (m/n)*ln 2
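A minimal Bloom filter sketch based on these formulas; the class name BloomFilter is illustrative, and the k indices are derived from a single SHA-256 digest via the common double-hashing trick:

```python
import math
import hashlib


class BloomFilter:
    def __init__(self, n, p):
        # Size the bit vector and number of hashes from the formulas above.
        self.m = int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.k = int(math.ceil((self.m / n) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)  # bit vector packed into bytes

    def _positions(self, item):
        # Derive k indices from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```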
Inserting 10⁶ points into the Bloom filter with a false positive rate of p = 0.005 took around 5–6 seconds, with a bit vector of size m = 11027754 and k = 8.
To validate that the false positive rate is indeed bounded by 0.005, we can query some random points that were never inserted and measure how often the filter claims they are present.
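One rough way to do this, assuming the BloomFilter sketch above (the counts and string lengths are illustrative):

```python
import random
import string

# Build the filter for n = 10**6 expected items at p = 0.005.
bf = BloomFilter(n=10 ** 6, p=0.005)
inserted = {"".join(random.choices(string.ascii_lowercase, k=12)) for _ in range(10 ** 6)}
for s in inserted:
    bf.add(s)

# Probe with strings that were never inserted (uppercase, so disjoint from the
# lowercase inserts) and measure how often the filter wrongly says "present".
probes = ["".join(random.choices(string.ascii_uppercase, k=12)) for _ in range(100_000)]
fp_rate = sum(1 for s in probes if s in bf) / len(probes)
print(fp_rate)  # should come out close to or below 0.005
```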
Searching within the bloom filter takes 0.005 ms on average.
Pros
- For the same amount of data, say 1 million strings, a standard hash table would require an array of size at least approximately 2²¹ to maintain a load factor of at most 0.75, which is approximately 8 MB of RAM (assuming 32-bit integers), whereas a Bloom filter would require approximately 1.3 MB to maintain a false positive rate of 0.5%.
- We can reduce the RAM requirement further in Bloom filters by increasing the false positive rate we can live with. For example, with a 5% FPR we need approximately 0.75MB of RAM to store 1 million strings.
- There are many applications that do not care much about the false positive rate. For example, in recommendation systems it does not matter much if a user is not shown 0.5% of the items they have not seen before.
- In many applications Bloom filters are used in conjunction with databases. The idea is that before looking up the database index to check if a value exists, we can look up a local Bloom filter first. A Bloom filter never returns false negatives, so if a value is not in the filter it is definitely not in the database and we do not have to go to the DB unnecessarily. But if the filter says the value is there, then we have to look up the DB index, as sketched below.
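That guard pattern might look roughly like this; db.lookup and the surrounding names are hypothetical, and the filter is assumed to be updated whenever the DB is:

```python
# Guarding database lookups with a local Bloom filter (illustrative only).
def is_crawled(url, bf, db):
    if url not in bf:          # definitely not present: skip the DB round trip
        return False
    return db.lookup(url)      # "probably present": confirm against the database
```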
Cons
- Time complexity of a lookup is O(k), where k is the number of hash functions. Thus it may take longer than a lookup in a hash table.
- For each lookup query we have to check k index positions, which may not be sequential in RAM and is thus sub-optimal.
- Deletion of values is not possible with the above Bloom filter implementation; for that we need something known as a Counting Bloom Filter.
- In distributed systems, the benefits will be overshadowed by the network I/O needed to maintain a shared Bloom filter and by handling concurrency with locks.
Approach 4: Quotient Filters
One of the drawbacks of Bloom Filters for membership queries is that if the data becomes too large and we need to store part of the Bloom filter in a file on disk, then the ‘k’ hash functions need to access random locations in the file. Since random reads from disk are slower than sequential reads, Bloom filters do not perform up to the mark when stored on disk.
To overcome this, there is another probabilistic data structure known as the Quotient Filter, which solves this problem by storing all values corresponding to a particular key sequentially and in sorted order.
I will not go into the details of the Quotient Filter in this post and will instead refer to this blog post to help understand how insert and search work with Quotient Filters.
Here is a simplified Python sketch of the idea:
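This sketch only illustrates the quotient/remainder decomposition (sorted remainders grouped by quotient) and skips the three metadata bits and in-place shifting of a real quotient filter; the class and parameter names are illustrative:

```python
import hashlib
from collections import defaultdict


class SimpleQuotientFilter:
    """Illustrative only: groups sorted remainders by quotient. A real quotient
    filter stores everything in one flat array with three metadata bits per slot."""

    def __init__(self, q_bits=21, r_bits=8):
        self.q_bits = q_bits                 # high-order bits -> bucket index (quotient)
        self.r_bits = r_bits                 # low-order bits stored in the bucket (remainder)
        self.buckets = defaultdict(list)

    def _fingerprint(self, item):
        h = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
        f = h % (1 << (self.q_bits + self.r_bits))
        return f >> self.r_bits, f & ((1 << self.r_bits) - 1)

    def add(self, item):
        q, r = self._fingerprint(item)
        bucket = self.buckets[q]
        if r not in bucket:
            bucket.append(r)
            bucket.sort()                    # keep remainders in sorted order

    def __contains__(self, item):
        q, r = self._fingerprint(item)
        return r in self.buckets[q]
```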
Also observe that a QF uses only a single hash function as compared to ‘k’ hash functions in a BF. Although the worst-case time complexity of insert and lookup in a BF is O(k), with a QF the amortized run-time complexity is O(1), but it may sometimes take more time than a BF since we need to sequentially search for the position to insert at.
With the QF, insertion of 1 million strings takes around 5 seconds and searching takes around 0.006 ms on average. The false positive rate was around 0.2% and the memory requirement was around 2.75MB as compared to 1.3MB for the BF, since we are also storing (the remainders of) the hash values.
For lookups with the Quotient Filter, it took on average 4 sequential index scans, with a median of only 3 index scans. Thus Quotient Filters are quite efficient compared to Bloom Filters, where ‘k’ was around 8.
Pros:
- Sequential reads unlike random reads in bloom filters. Beneficial when data is stored on disk.
- Only one hash function as compared to ‘k’ hash functions in BF.
- Quotient filters support deletion of values unlike Bloom Filters.
Cons:
- Run time complexity is O(1) amortized but sometimes it may take more time as compared to BF since we need to sequentially search for the position to insert at.
- Insertion and lookup logic is not very straightforward.
In general, we do not need to store the Bloom/Quotient filter on disk for membership queries where we are working in conjunction with a database and the purpose of the filter is only to prevent unnecessary lookups hitting the DB and thus reduce load on the DB server. For this purpose, there is usually enough RAM to hold these filters in memory.
For example, to hold 1 billion strings a Bloom filter requires only about 1.3GB of RAM at a false positive rate of 0.5%, which is manageable on most basic cloud servers.
In case we want to use the filters as full-fledged hash maps where we also need to support deletion, we can use Cuckoo Filters or Quotient Filters as alternatives to Counting Bloom Filters.