
HyperLogLog, Probabilities and Dynamic Programming …

Abhijit Mondal

--

In this post we will be discussing the HyperLogLog algorithm for cardinality estimation.

Consider a stream of input data points: for example, IP addresses, user ids on a social networking site, geolocations such as latitude-longitude pairs from GPS-enabled devices, and so on.

Suppose we want to find the number of unique visitors to a website over the last 30 days. Assuming 1 million requests per second, there will be 86.4 billion (non-unique) user ids logged into the system each day.

Multiply that by 30 days to get 2.6 trillion user ids in a month.

Assuming that each user id is 32 bytes, that is roughly 76 TB of data to analyze in order to find the number of unique user ids.

Of course, one can store all 76 TB on disk and scan the data to find the number of unique user ids, but each such disk scan is an expensive operation. Based on the latency numbers chart, it takes around 10-20 ms to read 1 MB of data from disk, so reading all 76 TB would take somewhere around 15-20 hours.

Latency Numbers Every Programmer Should Know (github.com)

If we store all that data in memory, we would need at least 4,864 RAM modules of 16 GB each, and since memory is more expensive than disk, this option seems infeasible at the moment.
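These figures can be sanity-checked with a few lines of arithmetic (a small sketch using the request rate and 32-byte id size assumed above):

requests_per_sec = 1_000_000
ids_per_day = requests_per_sec * 86_400   # 86.4 billion per day
ids_per_month = ids_per_day * 30          # ~2.6 trillion per month
total_bytes = ids_per_month * 32          # 32 bytes per user id

print(total_bytes / 2**40)                # ~75 TB of raw ids
print(total_bytes / (16 * 2**30))         # ~4,800+ modules of 16 GB RAM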

Alternatively, we can use a ‘set’ data structure to store the stream of user ids. The set could be implemented using a hash table, a balanced binary search tree such as an AVL tree, or a skip list.

The space complexity is O(n), where n is the number of unique user ids. If n ~ 10⁹, we would still require around 30 GB of RAM to store the set.
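For comparison, the exact-counting baseline really is just a set; a minimal sketch:

unique_users = set()

def record_visit(user_id):
    # O(1) average insertion, but memory grows linearly
    # with the number of unique user ids
    unique_users.add(user_id)

def unique_count():
    return len(unique_users)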

Can we live with some error in the count of unique visitors but use only around 1 KB of memory?

This is a desirable trade-off when the exact count does not matter and an approximation is enough, such as counting the number of unique visitors to a website or the number of virtual machines created on a cloud platform.

The idea behind HyperLogLog is pretty intuitive.

Assume we have a ‘good’ hash function that generates hashes with a uniform distribution in the range [0, 2³² - 1].

If the maximum number of trailing zeros in the binary representation of the hash is K, then the approximate cardinality is 2^K.

For example, if we use the Murmur3 family of hash functions, the 32-bit hash of the string “Hello world” (with seed 42) is:

import mmh3

hsh = mmh3.hash("Hello world", 42, signed=False) & ((1 << 32) - 1)
print(bin(hsh))

0b10010000010010111101110011111100

We can see that the number of trailing zeros is 2.
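The trailing-zero count can be read off the hash directly; a small sketch using a standard bit trick to isolate the lowest set bit:

import mmh3

hsh = mmh3.hash("Hello world", 42, signed=False) & ((1 << 32) - 1)

# hsh & -hsh keeps only the lowest set bit; its bit length minus 1
# equals the number of trailing zeros (assuming hsh != 0)
print((hsh & -hsh).bit_length() - 1)   # 2, matching the binary string above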

Let X be the event that we saw K trailing zeros in the binary representation of the hash.

The number of possible hashes with trailing K bits as 0 is 2^(N-K) because we can place 0 or 1 in each of the remaining N-K places.

Thus the probability that a hash ends with K zeros is:

p = 2^(N-K) / 2^N = (1/2)^K

To calculate the expected number of hashes before we see K trailing zeros, let the expected number of hashes be E. Then

E = 1·p + (1 + E)·(1 - p)

If we see the event with K trailing zeros, the number of trials is 1 (the term 1·p); if we do not see it, we effectively start over, so the number of trials is 1 + E (the term (1 + E)·(1 - p)).

Solving for E, we get:

E = 1/p = 2^K

Thus the expected number of hashes after which we see K trailing zeros is 2^K.

The standard deviation of this waiting time (a geometric random variable) is given by:

σ = sqrt((1 - p)/p²) = sqrt(2^(2K) - 2^K) ≈ 2^K

Note that the above analysis does not assume that the maximum number of trailing zeros seen is K; it only assumes that we see an event with K trailing zeros.
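A quick Monte Carlo check of this expectation; a sketch where K and the number of trials are arbitrary choices:

import random

K = 8
trials = 10_000
total = 0

for _ in range(trials):
    draws = 0
    while True:
        draws += 1
        h = random.getrandbits(32)      # stand-in for a uniform 32-bit hash
        if h & ((1 << K) - 1) == 0:     # the last K bits are all zero
            break
    total += draws

print(total / trials)                   # should come out close to 2**K = 256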

In HLL, instead of considering the event that we see a hash with K trailing zeros we consider the event that the maximum number of trailing zeros seen is K.

The benefit of the latter strategy for streaming data is that the maximum number of trailing zeros is a non-decreasing function, which matches the fact that the cardinality is also non-decreasing.

The former strategy is not useful because if at time T we see a hash with 10 trailing zeros and at time T+1 a hash with 5 trailing zeros, then the estimated cardinality at time T+1 would be lower than at time T, which contradicts our intuition.

Let’s calculate the probability that the maximum number of trailing zeros seen so far is K. Let the number of inputs seen so far be n.

Let r = 2^N, where N is the total number of bits in a hash.

Let P be the number of ways in which we can have n N-bit numbers such that every one of the n numbers has at most K trailing zeros. Since the number of N-bit values with at most K trailing zeros is r·(1 - (1/2)^(K+1)),

P = (r·(1 - (1/2)^(K+1)))^n

Similarly, let Q be the number of ways in which we can have n N-bit numbers such that every one of the n numbers has at most K-1 trailing zeros:

Q = (r·(1 - (1/2)^K))^n

Then the number of ways in which we can have n N-bit numbers such that at least one of them has exactly K trailing zeros and none has more is P - Q.

The probability that the maximum number of trailing zeros seen is K is therefore:

Pr(max = K) = (P - Q)/r^n = (1 - (1/2)^(K+1))^n - (1 - (1/2)^K)^n

The expected cardinality in HLL with a single hash function is then given by (using the estimate 2^(K+1), i.e. the position of the lowest set bit, which is also what the code further below uses):

E = Σ_{K=0}^{N} 2^(K+1) · [ (1 - (1/2)^(K+1))^n - (1 - (1/2)^K)^n ]

The variance of the cardinality is:

Var = Σ_{K=0}^{N} 2^(2(K+1)) · [ (1 - (1/2)^(K+1))^n - (1 - (1/2)^K)^n ] - E²

The standard deviation and standard error are:

SD = sqrt(Var),  SE = SD/sqrt(n)

For N = 32 and n = 2³², the standard error is as high as 51,610.
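These formulas can also be evaluated numerically; a short sketch that just plugs the probabilities into the expectation and variance above:

import math

N = 32
n = 1 << N

# P(maximum number of trailing zeros over n hashes = k)
probs = [(1 - 0.5**(k + 1))**n - (1 - 0.5**k)**n for k in range(N + 1)]

e1 = sum(p * 2**(k + 1) for k, p in enumerate(probs))        # E[estimate]
e2 = sum(p * (2**(k + 1))**2 for k, p in enumerate(probs))   # E[estimate^2]

std_dev = math.sqrt(e2 - e1**2)
print(std_dev / math.sqrt(n))   # standard error, roughly 5.16e4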

One can get a feel for how noisy the single-hash estimate is by running the following Python simulation:

import random, mmh3, math, numpy as np, string

def get_trailing_zeros(n, N):
    if n == 0:
        j = N
    else:
        j = 0
        while n > 0:
            if (n & 1) == 1:
                break

            n = n >> 1
            j += 1

    return j


N = 32
n = 1 << N
a = []

for j in range(1000):
    g = -1
    for i in range(n):
        data = ''.join(
            random.choices(string.ascii_lowercase + string.digits,
                           k=random.randint(1, 20)))

        # generate hash between 0 and 2^N-1
        hsh = mmh3.hash(data, 42, signed=False) & ((1<<N)-1)

        # number of trailing zeros in binary of hsh
        b = get_trailing_zeros(hsh, N)
        g = max(g, b)

    # estimated cardinality for this trial
    a += [1<<g]

# standard deviation of the estimates
print(np.std(a))

Assume that we have seen n inputs so far in the data stream and that, across their hashes, the highest position of the lowest set bit is K, i.e. the maximum number of trailing zeros is K-1.

......100
.......10
....10000
........1
..1000000

Given this data, can we estimate the expected cardinality of the stream ?

Let X be the total number of ways of obtaining a maximum of K-1 trailing zeros with n input hashes. Using the same reasoning as earlier,

X = (2^N·(1 - (1/2)^K))^n - (2^N·(1 - (1/2)^(K-1)))^n

Let a be the number of N-bit hashes with exactly K-1 trailing zeros, i.e. with a 1 at the K-th position. Then:

a = 2^(N-K)

Let b be the number of N-bit hashes with fewer than K-1 trailing zeros. Then:

b = 2^N·(1 - (1/2)^(K-1))

We want to count the sequences of n input hashes in which at least one hash has exactly K-1 trailing zeros and no hash has more than K-1 trailing zeros.

Let the cardinality with n hashes be h i.e. out of n hashes seen, h are unique hashes.

I can choose j hashes from ‘a’ hashes (‘a’ is defined above) and h-j hashes from ‘b’ hashes varying j from 1 to h. After filling the h hashes out of n hashes, I can fill the remaining n-h hashes using repetition from the h unique hash values.

Let H(n, j, h-j) be the number of ways of arranging n input hashes with h unique hashes, where j of the unique hashes come from set ‘a’ (exactly K-1 trailing zeros) and h-j come from set ‘b’ (fewer than K-1 trailing zeros).

Then we can use the following recurrence to solve for H (writing x = j and y = h-j):

H(n, x, y) = (x + y)·H(n-1, x, y) + (a - x + 1)·H(n-1, x-1, y) + (b - y + 1)·H(n-1, x, y-1)

The 1st component refers to the case where we have formed n-1 hashes with j hashes from set ‘a’ and h-j from set ‘b’. For the remaining n-th hash, I can pick any one of the h hashes selected so far.

The 2nd component refers to the case where we have formed n-1 hashes with j-1 hashes from set ‘a’ and h-j from set ‘b’. For the remaining n-th hash, I have to pick one of the a-(j-1) hashes from set ‘a’ because I need to have j hashes from set ‘a’.

The 3rd component refers to the case where we have formed n-1 hashes with j hashes from set ‘a’ and h-j-1 from set ‘b’. For the remaining n-th hash, I have to pick one of the b-(h-j-1) hashes from set ‘b’ because I need to have h-j hashes from set ‘b’.

Here is the Python code to find H using memoization:

def H(n, x, y, dp):
    if x < 0 or y < 0:
        dp[(n, x, y)] = 0
        return dp[(n, x, y)]

    if n < x + y:
        dp[(n, x, y)] = 0
        return dp[(n, x, y)]

    if n == 1:
        if x + y == 1:
            dp[(n, x, y)] = a if x == 1 else b
        else:
            dp[(n, x, y)] = 0

        return dp[(n, x, y)]

    u = H(n-1, x, y, dp) if (n-1, x, y) not in dp \
        else dp[(n-1, x, y)]

    v = H(n-1, x-1, y, dp) if (n-1, x-1, y) not in dp \
        else dp[(n-1, x-1, y)]

    w = H(n-1, x, y-1, dp) if (n-1, x, y-1) not in dp \
        else dp[(n-1, x, y-1)]

    dp[(n, x, y)] = u*(x + y) + (a - x + 1)*v + (b - y + 1)*w

    return dp[(n, x, y)]

Compute H for all different cardinalities h ranging from 1 to n.

To find the expected cardinality, multiply each cardinality h by its probability, i.e. the sum of H(n, j, h-j) over j divided by X (calculated above), and sum over h.

n = 300
N = 10
k = 7

q = (1 << N)
# use exact integer arithmetic to avoid float overflow:
# q*(1-(1/2)**k) equals q - q/2**k
x = (q - (q >> k))**n - (q - (q >> (k-1)))**n

a = (1 << (N-k))
b = q - (q >> (k-1))

dp = {}

e = 0
for h in range(1, n+1):
    y = 0
    for j in range(1, h+1):
        y += H(n, j, h-j, dp) if (n, j, h-j) not in dp else dp[(n, j, h-j)]

    e += h*(y/x)

print(e)

The above code is for finding the expected cardinality when we have seen 300 input data points each of 10 bits and the maximum number of trailing zeros seen is 7.

The expected cardinality is somewhere around 260.

We can verify this is true by running a simulation.

import random, mmh3, math, string

v, c = 0, 0
n = 300
N = 10
k = 7

for j in range(10000):
    g = -1
    d = set()
    for i in range(n):
        data = ''.join(
            random.choices(string.ascii_lowercase + string.digits,
                           k=random.randint(1, 20)))

        hsh = mmh3.hash(data, 42, signed=False) & ((1<<N)-1)
        b = get_trailing_zeros(hsh, N)
        g = max(g, b)
        d.add(hsh)

    if g == k:
        v += len(d)
        c += 1

print(v/c)

We will see that the expected cardinality is also somewhere around 260.

This is one way to find the cardinality without having to store all the input data, but there are issues:

  1. We are using O(n³) memory to store the cached values dp.
  2. Running time complexity is O(n³).

Both of these drawbacks mean that we cannot use this analysis to calculate the expected cardinality at scale.

Coming back to our HLL algorithm: instead of using a single hash function, what if we use multiple hash functions? Will that bring the standard error down?

Using m hash functions, let K1, K2, … Km be the maximum number of trailing zeros corresponding to each hash function.

Then the estimated cardinalities from the individual hashes are 2^K1, 2^K2, … respectively. To get the final cardinality we can take the geometric mean of the individual cardinalities, i.e.

(2^K1 · 2^K2 · … · 2^Km)^(1/m) = 2^((K1 + K2 + … + Km)/m)

The expected cardinality is the sum over all possible combinations of K1, K2, …, Km:

E = Σ_{K1,…,Km} [ Π_{i=1}^{m} ( (1 - (1/2)^(Ki+1))^n - (1 - (1/2)^Ki)^n ) ] · 2^(1 + (K1 + … + Km)/m)

The number of possible combinations of K1, K2, …, Km is (N + 1)^m, since each Ki can take any value from 0 to N.

If we run the below Python code, we can see that the standard deviation decreases on increasing m i.e. the number of hash functions.

import math

# Function to generate all possible combinations of k1, k2, ..., km
def hn(N, m, curr, out):
    if m == 0:
        out += [curr]
    else:
        for k in range(N+1):
            hn(N, m-1, curr + [k], out)

N = 10
n = 1 << N

for m in range(1, 10):
    s1, s2 = 0, 0

    out = []
    hn(N, m, [], out)

    # For each combination of k1, k2, ..., km
    for x in out:
        p = 1

        # Estimate based on the mean of k1, k2, ..., km
        q = 2**(1.0 + sum(x)/m)

        for y in x:
            p *= ((1-(1/2)**(y+1))**n - (1-(1/2)**y)**n)

        s1 += p*q**2
        s2 += p*q

    print(m)

    # print standard error
    print(math.sqrt(s1-s2**2)/math.sqrt(n))

The plot of the standard error vs. the number of hash functions m looks like the one below:

standard error vs. number of hash functions (m)

Because the number of combinations of the Ki grows exponentially with the number of hash functions, computing the standard error for larger values of m becomes prohibitively expensive; hence we have only shown values up to m = 6.

Here is Python code demonstrating cardinality estimation with 1000 independent hash functions:

import random, mmh3, math, numpy as np, string

N = 32
m = 1000
k = 5123

# Generate m seeds for the murmur3 hash, corresponding to m hash functions
seeds = [random.randint(0, 10**5) for _ in range(m)]

# Generate k random strings
data = list(set([''.join(
    random.choices(string.ascii_lowercase + string.digits,
                   k=random.randint(1, 20))) for _ in range(k)]))

# True cardinality
print(len(data))

g = [-1]*m

for i in range(10*k):
    h = random.sample(data, 1)[0]

    for j in range(m):
        # generate hash between 0 and 2^N-1
        hsh = mmh3.hash(h, seeds[j], signed=False) & ((1<<N)-1)

        # number of trailing zeros in binary of hsh
        b = get_trailing_zeros(hsh, N)
        g[j] = max(g[j], b)

# Estimated cardinality
w = sum(g)/m
a = 0.79402*2**w

print(a)

In the above code, we multiply the estimated cardinality by 0.79402. This constant comes from the analysis in the original paper.

One of the drawbacks of the above approach is that we have to use m different hash functions, and if m is large, i.e. m > 1000, the time complexity is quite high: O(m·w) for each input string of length w in the stream.

Another approach to simulating m different hash functions is to use the first p bits of the hash as a bucket index. With p bits we get 2^p buckets, and we estimate the cardinality within each bucket using the remaining N-p bits.

As with m hash functions, we take the geometric mean of the per-bucket cardinalities, but the final estimate has to be multiplied by m.

This is because the input stream is divided into m sub-streams (unlike m hash functions, which are all applied to every input), so each bucket corresponds to approximately n/m inputs.

The updated code using the first p bits as the bucket identifier:

import random, mmh3, math, numpy as np, string

N = 32
p = 10
m = 1 << p
k = 51236

# Generate k random strings
data = list(set([''.join(
    random.choices(string.ascii_lowercase + string.digits,
                   k=random.randint(1, 20))) for _ in range(k)]))

# True cardinality
print(len(data))

g = [-1]*m

for i in range(10*k):
    h = random.sample(data, 1)[0]
    hsh = mmh3.hash(h, 42, signed=False) & ((1<<N)-1)

    # First p bits give the bucket index r
    # Remaining N-p bits are used for estimating the cardinality
    u = N-p

    r = hsh >> u
    q = hsh & ((1<<u)-1)

    # number of trailing zeros in binary of the remaining bits q
    b = get_trailing_zeros(q, u)
    g[r] = max(g[r], b)

# Estimated cardinality
w = sum(g)/m
a = 0.79402*m*2**w

print(a)

For each input in the stream, the time complexity is now reduced to O(w) from O(m*w).

The expected cardinality when we use m buckets instead of m hashes can be written as:

E = Σ_{K1,…,Km} 2^(1 + (K1 + … + Km)/m) · Σ_{n1+…+nm=n} [ n!/(n1!·…·nm!) · (1/m)^n · Π_{j=1}^{m} ( (1 - (1/2)^(Kj+1))^(nj) - (1 - (1/2)^Kj)^(nj) ) ]

Note the additional inner summation: out of the n inputs, n1 can go into the 1st bucket, n2 into the 2nd bucket and so on, which is a multinomial split with probability 1/m per bucket. The maximum number of trailing zeros in bucket j is Kj (same as before).

The code for computing standard errors is updated as follows:

# Compute nCr values for r from 0 to n using dynamic programming
def ncr(n):
    cache = [[0]*(n+1) for i in range(n+1)]

    for i in range(n+1):
        for j in range(i+1):
            if j == 0 or i == j:
                cache[i][j] = 1
            else:
                cache[i][j] = cache[i-1][j] + cache[i-1][j-1]

    return cache

# Compute the term inside the 2nd summation in the expectation formula
# dp = memoization variable
def gn(arr, i, n, m, u, dp):
    if m == 0:
        return 0 if n > 0 else 1

    s = 0
    for j in range(n+1):
        p = (1-(1/2)**(arr[i]+1))**j - (1-(1/2)**arr[i])**j

        a = gn(arr, i+1, n-j, m-1, u, dp) if (n-j, m-1) not in dp \
            else dp[(n-j, m-1)]

        s += p*u[n][j]*a

    dp[(n, m)] = s
    return s


N = 5
n = 1 << N
u = ncr(n)

for m in range(1, 10):
    # Get all combinations of k1, k2, ..., km similar to previous
    out = []
    hn(N, m, [], out)

    s1, s2 = 0, 0

    # Compute the 1st summation in the above expectation formula
    for x in out:
        p = gn(x, 0, n, m, u, {})/m**n
        q = 2**(1.0 + sum(x)/m)

        s1 += p*q**2
        s2 += p*q

    print(m)
    print(math.sqrt(s1-s2**2)/math.sqrt(n))

Due to the combinatorial nature of the expectation, it is computationally expensive to compute the standard errors for values of N beyond 6 and m beyond 5.

The final version of HLL has a few other modifications, such as:

  1. Using harmonic mean instead of geometric mean
  2. Handling cases for small and large cardinalities separately.

The final code looks as follows:

import mmh3, random, math
from scipy.stats import hmean

class HyperLogLog:
    def __init__(self, p=10):
        self.N = 32
        self.p = p
        self.m = 1 << p

        # registers start at 0 so that empty buckets are detected
        # by the small-cardinality (linear counting) branch below
        self.buckets = [0]*self.m

        self.alpha = 0.7213/(1 + 1.079/self.m)

        if self.m == 16:
            self.alpha = 0.673
        elif self.m == 32:
            self.alpha = 0.697
        elif self.m == 64:
            self.alpha = 0.709

    def add(self, data):
        h = mmh3.hash(data, 42, signed=False) & ((1<<self.N)-1)

        # Exclude the first p bits of h
        v = self.N - self.p

        # Remaining N-p bits (22 for p=10) are used for trailing zeros
        q = h & ((1<<v)-1)

        # First p bits give the bucket index
        r = h >> v

        # Rank = number of trailing zeros of q, plus 1
        # (get_trailing_zeros is defined earlier in this post)
        j = get_trailing_zeros(q, v)

        self.buckets[r] = max(self.buckets[r], j+1)

    def get_size(self):
        u = 1 << self.N
        v = [2**x for x in self.buckets]
        n = self.alpha*self.m*hmean(v)

        # Case for small cardinalities (linear counting on empty buckets)
        if n <= 2.5*self.m:
            z = sum([self.buckets[i] == 0 for i in range(self.m)])
            if z != 0:
                return self.m*math.log(self.m/z)

        # Case for very large cardinalities
        elif n > (1/30)*u:
            return -u*math.log(1-(n/u))

        return n
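
A quick usage sketch of the class above (the random id generation is just for the demo; it assumes the class and the get_trailing_zeros helper defined earlier are in scope):

import random, string

hll = HyperLogLog(p=10)

# generate some user ids and feed them in with many duplicates
ids = [''.join(random.choices(string.ascii_lowercase + string.digits, k=12))
       for _ in range(50000)]

for _ in range(200000):
    hll.add(random.choice(ids))

print(len(set(ids)))   # true cardinality (~50000)
print(hll.get_size())  # HLL estimate, typically within a few percent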

There are 1024 buckets, and in each bucket we store the maximum number of trailing zeros K, which ranges from 0 to 22 (10 of the 32 bits are used for the bucket index, leaving 22 bits). To store K in the range 0 to 22 we need 5 bits.

Thus each of the 1024 buckets stores 5 bits, for a total of 1024 × 5 bits = 640 bytes ≈ 0.625 KB. The error rate from the above code varies around 1-3%.

Compare this to the roughly 30 GB of memory needed for exact counting.

Using the harmonic mean, the standard error values are lower compared to using the geometric mean.

standard error with harmonic mean

Read more about the theory behind the hyperloglog and error bounds in the original paper here.
