
Why should you care about hash functions

Abhijit Mondal

--

Ever since I learnt about data structures and algorithms in college, hash tables have been among the most famous and most commonly used data structures in real-life software engineering. Hash tables are also one of the most commonly asked data structures in interviews for problem solving, system design, low-level design and so on.

The backbone of a hash table is the hash function.

In this post we dive deep into how hash functions work and how we can design one ourselves.

So what’s the big deal with hash functions?

Hash functions are not used only in hash tables; there are several other applications where they are useful.

Bloom Filter

A probabilistic data structure for answering membership queries with very limited memory usage. A good hash function reduces the false positive rate in Bloom filters.

Membership Queries with Big Data. Doing membership tests is one of the… | by Abhijit Mondal | Medium

Count Min Sketch

A probabilistic data structure for approximating frequencies of occurrence with limited memory usage.

Count–min sketch — Wikipedia

HyperLogLog

A probabilistic data structure for approximating cardinality, i.e. the number of unique items in a dataset, using very limited memory.

Cardinality Estimation with Big Data | by Abhijit Mondal | Medium

Consistent Hashing

A ring-based scheme for assigning objects to partitions in a distributed system. A good hash function ensures that each partition contains a roughly equal number of objects.

Deep dive into Chord for Distributed Hash Table | by Abhijit Mondal | Medium

Rsync

A utility for uploading and downloading files over the network and for syncing files across devices in cloud-based services such as Google Drive, Dropbox, Evernote etc. A hash function is used to quickly compare the rolling hashes of an incoming stream with those of an existing one.

The rsync algorithm (samba.org)

Apart from the above, cryptographic hash functions such as MD5 and SHA-1 are used in many cryptographic applications such as digital signatures and SSL/TLS.

But in this post we will limit ourselves to non-cryptographic hash functions only.

Properties of a good non-cryptographic hash function

The main trait of any good hash function is that it should minimize the number of collisions.

For example, with a hash table of size M, i.e. M bins, and N objects, a good hash function would assign roughly N/M objects per bin.

Generally M is chosen (or grown) to be greater than N, so on average each bin holds at most about one object.

But with a bad hash function, even with M > N, some bins may contain up to O(N) objects (chained using a linked list), making lookups effectively O(N).

# Simple hash table in Python supporting insert and get operations

class HashTable:
    def __init__(self, nbins=1000000):
        self.arr = [None]*nbins
        self.bins = nbins

    def insert(self, key, value):
        hv = hash(key)
        index = hv % self.bins

        if self.arr[index] is None:
            self.arr[index] = [[key, value]]
        else:
            contains = False
            # If len(self.arr[index]) = O(N) then time complexity for
            # each insertion is O(N)
            for entry in self.arr[index]:
                if entry[0] == key:
                    # Key already present, update its value
                    entry[1] = value
                    contains = True
                    break
            if contains is False:
                self.arr[index] += [[key, value]]

    def get(self, key):
        hv = hash(key)
        index = hv % self.bins

        if self.arr[index] is None:
            return None
        else:
            # If len(self.arr[index]) = O(N) then time complexity for
            # each search is O(N)
            for entry in self.arr[index]:
                if entry[0] == key:
                    return entry[1]
            return None

Following are some of the properties of a good non-cryptographic hash function that minimize collisions.

Uniform distribution of hashes

If there are M bins into which a hash function can hash an object, then the probability that an object is hashed to any particular bin should be 1/M, i.e. each bin is equally likely. This should hold irrespective of the distribution of the objects before hashing.

In other words, even if the original objects were sampled from a non-uniform distribution, the hash values of these objects should follow a uniform distribution.

There are two ways we can achieve this given an arbitrary bit distribution: “stretching” and “translating”.

Stretching a bit vector can be done by multiplying it with a constant, and translating a bit vector by adding a constant.

hash(x) = ax + b
where a and b are constants.
a and b are chosen from a random uniform distribution.

Multiplying x by a randomly chosen constant produces a resultant bit vector with a roughly uniform distribution of 1s and 0s, even if x does not have a uniform distribution of 1s and 0s.

Similarly, adding a random constant produces a resultant bit vector with a roughly uniform distribution of 1s and 0s. Remember that when adding two bit vectors [u1, u2, …, uN] and [v1, v2, …, vN], each sum bit is computed as ui XOR vi (plus a carry from the lower bits).

In XOR, the output is 50% 1s and 50% 0s because 0 ^ 0 = 0, 0 ^ 1 = 1, 1 ^ 0 = 1 and 1 ^ 1 = 0. Thus it helps to achieve a random uniform distribution of 1s and 0s.

import random, collections

h = [1]*8 + [0]*2

a = random.randint(1, 1000)
b = random.randint(1, 1000)

for i in range(100):
    # Generate bit vector with 80% of 1s and 20% of 0s (non-uniform)
    inp = ''.join([str(x) for x in random.choices(h, k=1000)])
    res = int(inp, 2)

    # Calculate hash
    hsh = res*a + b
    hsh = list(bin(hsh)[2:])

    # Compare distribution of 1s and 0s before and after hashing
    print(collections.Counter(inp))
    print(collections.Counter(hsh))
    print()

If you run the above program, you can see that the distribution of 1s and 0s in the final hash is almost uniform:

Counter({'1': 798, '0': 202})
Counter({'1': 526, '0': 482})

Counter({'1': 827, '0': 173})
Counter({'1': 527, '0': 481})

Counter({'1': 824, '0': 176})
Counter({'1': 519, '0': 489})

Counter({'1': 794, '0': 206})
Counter({'0': 514, '1': 495})

Counter({'1': 794, '0': 206})
Counter({'0': 511, '1': 498})

Counter({'1': 793, '0': 207})
Counter({'1': 536, '0': 473})

Counter({'1': 805, '0': 195})
Counter({'0': 513, '1': 496})

Avalanche Effect

If we flip one bit in the original object, the expected number of bits flipped in the hash value should be close to 50%. This property ensures that it is not easy to reverse engineer a hash function.

Also the bits flipped in the hash should be randomly chosen.

For example, if flipping one bit of the object flips only one bit of the hash, then it is possible for an adversary to create many objects that hash into the same bin, artificially increasing the number of collisions. This in turn degrades read performance of the hash table from O(1) to O(N).

This is known as a DoS (Denial of Service) attack on hash tables.

The idea is that it should be very difficult to guess which objects hash to the same value without doing an exhaustive search.

Consider this hash function:

unsigned long hash(const char *str){
    unsigned int hash = 0;
    int c;

    while ((c = *str++))
        hash = hash*7 + c;

    return hash;
}

If we call the above with hash(“balm”) we get a hash value of 39232, and when we call it with hash(“baln”), we get a hash value of 39233.

Changing the last character of “balm” to get “baln” changes the hash value by exactly one, i.e. only a single bit of the hash flips.

This is because incrementing the last character increments the hash value by 1, incrementing the 2nd-last character increments it by 7, incrementing the 3rd-last increments it by 7² = 49, and so on.

Thus someone can easily create artificial collisions: the string obtained by incrementing the last character by 7 and the string obtained by incrementing the 2nd-last character by 1 hash to exactly the same value, as the sketch below shows.
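To make this concrete, here is a minimal Python sketch (weak_hash is my own re-implementation of the C function above, not code from the article) that constructs two different strings colliding under this hash:

# Python re-implementation of the hash*7 + c function above,
# used to show how an adversary can manufacture collisions.
def weak_hash(s):
    h = 0
    for ch in s:
        h = h*7 + ord(ch)
    return h

# Start from "balm", then:
#  - increment the last character by 7      -> "balt"
#  - increment the 2nd-last character by 1  -> "bamm"
# Both changes add exactly 7 to the hash, so the two strings collide.
x = "balm"
y = x[:-1] + chr(ord(x[-1]) + 7)          # "balt"
z = x[:-2] + chr(ord(x[-2]) + 1) + x[-1]  # "bamm"

print(weak_hash(x))  # 39232
print(weak_hash(y))  # 39239
print(weak_hash(z))  # 39239, same as weak_hash(y)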

Even if more than 50% of the bits are flipped on some inputs, there is still no guarantee that it is a good hash function:

unsigned long hash(const char *str){
    unsigned int hash = 0;
    int c;

    while ((c = *str++))
        hash = hash + c;

    return hash;
}

In the above hash function, hash(“abcdu”) returns 511 (0111111111), and hash(“abcdv”) returns 512 (1000000000). Thus more than 50% of the bits are flipped on changing a single character in the input.

But observe that the flipped bits are not random (a run of consecutive bits flips, as in a carry chain), and this is a one-off example: running the hash over many pairs of inputs, we observe that on average only about a single bit is flipped in the output, as the sketch below illustrates.
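As a rough check (additive_hash and flipped_bits are helper names of my own, not from the article), we can re-implement the additive hash in Python and measure how many hash bits change when one input character changes:

import random, string

# Python re-implementation of the purely additive hash above
def additive_hash(s):
    return sum(ord(ch) for ch in s)

def flipped_bits(a, b, width=16):
    # Count differing bits between two hash values (within a fixed width)
    return bin((a ^ b) & ((1 << width) - 1)).count('1')

total = 0
trials = 10000
for _ in range(trials):
    # Random 5-character lowercase word, then change one character
    w = ''.join(random.choice(string.ascii_lowercase) for _ in range(5))
    i = random.randrange(len(w))
    w2 = w[:i] + random.choice(string.ascii_lowercase) + w[i+1:]
    total += flipped_bits(additive_hash(w), additive_hash(w2))

# For the additive hash this average stays small (a few bits),
# well below half of the output bits.
print(total / trials)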

You might ask: why don’t we use random numbers inside the hash function, so that we can achieve randomness in the outputs?

If we used a fresh random number inside the hash function, then when we insert an object into the hash table it would compute one hash value, but when we try to read it back a different random number would most likely be chosen, producing a different hash value, and we would look in the wrong bin.

The same problem occurs during repeated insertions or updates of the same object. The sketch below illustrates the issue.
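A minimal sketch of why this breaks (bad_hash is a purely illustrative function, not something from the article):

import random

def bad_hash(key, nbins=1000):
    # Mixing in a fresh random number makes the hash non-deterministic
    return (hash(key) + random.randint(0, 10**9)) % nbins

key = "some_key"
print(bad_hash(key))  # e.g. 512
print(bad_hash(key))  # almost certainly a different bin
# The bin used at insert time no longer matches the bin used at read time,
# so lookups miss. (A constant chosen randomly once, like a and b earlier,
# is fine; the problem is re-randomizing on every call.)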

Fast in performance

The hash should be very fast to compute. Most cryptographic hash functions trade off speed for high collision resistance by using complicated constructions. For non-cryptographic use cases, we should strike a balance between speed and collision resistance.

Most hash functions use a combination of multiplication, addition, bitwise OR and bitwise XOR. If there are too many of these operations, hashing stops being cheap, and it needs to stay cheap to maintain O(1) average time complexity for table operations.

On some computer architectures, using bitwise shifts instead of multiplication and XOR instead of addition may be faster, and many hash functions choose to use bitwise operations for this reason.

For example, in the C++ code below, we compute the hash function as:

hash(x) = x*41943104 + 987

Instead of addition we can also use XOR; it produces different results but the idea is the same. Addition is similar to XOR but with carry bits floating around.

hash(x) = (x*41943104) ^ 987

One can write the multiplicative constant using bitwise shift operators.

For e.g. x*41943104 can be written as (x << 25) + (x << 23) + (x << 6)

because 41943104 = 2²⁵ + 2²³ + 2⁶

#include <iostream>
#include <stdlib.h>
#include <random>
#include <chrono>

unsigned long getRandom(unsigned int a, unsigned int b) {
    // Get uniform random number between a and b
    std::random_device dev;
    std::mt19937 rng(dev());
    std::uniform_int_distribution<std::mt19937::result_type> dist6(a, b);

    return dist6(rng);
}

int main() {
    for (int i = 1; i <= 1000; i++) {
        // Get random number b/w 1 and 10000
        unsigned long x = getRandom(1, 10000);

        auto start = std::chrono::high_resolution_clock::now();
        unsigned long hsh = (x*41943104 + 987);
        auto end = std::chrono::high_resolution_clock::now();

        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count() << "ns\n";

        start = std::chrono::high_resolution_clock::now();
        // Replace 41943104 with bitwise shift operators and + with XOR (^)
        hsh = (((x << 25) + (x << 23) + (x << 6)) ^ 987);
        end = std::chrono::high_resolution_clock::now();

        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count() << "ns\n";

        std::cout << std::endl;
    }
    return 0;
}

If you run the above program, the timings will be almost the same on most modern architectures, but on older architectures bitwise operations tend to be faster by a few nanoseconds.

c++ — Is multiplication and division using shift operators in C actually faster? — Stack Overflow

Testing “goodness” of a hash function

Once we have designed a hash function, how do we test whether it is a good one or not?

We can use different tests for measuring different properties.

Testing uniformity and randomness

Using the chi-squared test we can check whether the distribution of objects into bins is uniform or not.

In the chi-squared test, we create a contingency table as follows:

Chi-Square Contingency Table:

Bin    Assigned    Not assigned
1      10          40
2      25          25
3      8           42
4      4           46
5      3           47

In the above table there are 5 bins and 50 objects to assign. The expected number of objects per bin assuming uniform distribution is 10.

We then compute the chi-squared statistic as follows:

Sum of (observed-expected)²/expected for each of the 10 cells.

All cells under the ‘assigned’ column have an expected value of 10 and all cells under the ‘not assigned’ column have an expected value of 40.

Thus:

s = (10-10)^2/10 + (25-10)^2/10 + (8-10)^2/10 + (4-10)^2/10 + (3-10)^2/10 +
    (40-40)^2/40 + (25-40)^2/40 + (42-40)^2/40 + (46-40)^2/40 + (47-40)^2/40

s = 0 + 22.5 + 0.4 + 3.6 + 4.9 + 0 + 5.625 + 0.1 + 0.9 + 1.225 = 39.25
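A quick throwaway check of this hand computation, using the numbers from the table above:

# Verify the hand-computed chi-squared statistic for the 5x2 table above
observed = [(10, 40), (25, 25), (8, 42), (4, 46), (3, 47)]
expected = (10, 40)

s = sum((o - e)**2 / e for row in observed for o, e in zip(row, expected))
print(s)  # 39.25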

Below is the Python code computing the chi-squared value from hash table assignments.

import random, collections
from scipy import stats
from scipy.stats import chi2_contingency

# Word size 8 bits
word_size = 8

# Total number of objects
n = 100000

# Total number of bins
m = 1002569
m_bits = len(bin(m)) - 2

# Sample from 1 and 0
s = '10'

# Size of object fixed at 10 words
s_char = 10

# Define the hash function
def hash_fn(x, a, b, m):
    h = b

    # Compute hash with each block of word_size bits
    for i in range(0, len(x), word_size):
        h = (h*a + int(x[i:i+word_size], 2)) % m

    return h % m


def uniformity_test():
    # Random multiplier and offset
    a = random.randint(1, 1000)
    b = random.randint(1, 1000)

    values = []

    for i in range(n):
        # Generate random string of bits 1s and 0s
        x = ''.join(random.choice(s) for i in range(word_size*s_char))

        # Calculate hash
        hsh = hash_fn(x, a, b, m)
        values += [hsh]

    # data[i][0] = Number of objects assigned to the i-th bin
    # data[i][1] = Number of objects not assigned to the i-th bin = n-data[i][0]
    data = [[0, n] for i in range(m)]

    for k, v in collections.Counter(values).items():
        data[k][0] = v
        data[k][1] = n-v

    # Calculate chi-squared statistic
    stat, p, dof, expected = chi2_contingency(data)

    # Interpret p-value
    alpha = 0.05
    print("p value is " + str(p))
    if p <= alpha:
        print('Non uniform distribution')
    else:
        print('Uniform distribution')

Our hash function ‘hash_fn’ works with bit vectors. The hash function takes in a block of 8 bits from the input object and computes the hash(i) for the i-th block as follows:

hash(i) = (hash(i-1)*a + integer rep. of the i-th block of 8 bits) % m
hash(0) = b

m = number of bins
a, b = constants

On running the above code with our hash function, the p-value comes out around 0.69, which is > 0.05, so we fail to reject the null hypothesis; in other words, the distribution of objects into the bins is consistent with a uniform (random) assignment.

Testing for Avalanche Effect

To test for the avalanche effect, flip a random bit in the original object and check how many bits are flipped in the hash. With a good hash function, close to 50% of the bits are flipped in the hash.

def avalanche_test():
    # Random multiplier and offset
    a = random.randint(1, 1000)
    b = random.randint(1, 1000)

    u = 0

    for i in range(n):
        # Generate random string
        x = ''.join(random.choice(s) for i in range(word_size*s_char))

        # Calculate hash of original object
        hsh1 = hash_fn(x, a, b, m)

        # Calculate bit representation of original object hash
        b1 = bin(hsh1)[2:]
        b1 = '0'*(m_bits-len(b1)) + b1

        # Flip a random bit in original string
        j = random.randint(0, word_size*s_char-1)
        f = '1' if x[j] == '0' else '0'
        x = x[:j] + f + x[j+1:]

        # Calculate hash of modified object
        hsh2 = hash_fn(x, a, b, m)

        # Calculate bit representation of modified object hash
        b2 = bin(hsh2)[2:]
        b2 = '0'*(m_bits-len(b2)) + b2

        # Calculate number of bits flipped
        g = 0
        for k in range(m_bits):
            g += 1 if b1[k] != b2[k] else 0

        u += g/m_bits

    # Average fraction of bits flipped
    print(u/n)

Running the above program returns an average ratio of bits flipped of 0.48, which is very close to 0.5. Thus our hash function can be considered relatively good on this metric.

Collision Test

Last but not least, we can run a simulation to count collisions when M >> N, i.e. the number of bins is much greater than the number of objects. Ideally each bin should contain at most one object, but with a bad hash function some bins may contain more than one.

def collision_test():
    # Random multiplier and offset
    a = random.randint(1, 1000)
    b = random.randint(1, 1000)

    hash_values = []

    for i in range(n):
        # Generate random string
        x = ''.join(random.choice(s) for i in range(word_size*s_char))

        # Calculate hash
        hsh = hash_fn(x, a, b, m)

        hash_values += [hsh]

    bins = []
    for k, v in collections.Counter(hash_values).items():
        bins += [v]

    # Percentage of bins with more than 1 object
    print(len([x for x in bins if x > 1])*100/m)

    # Maximum size of a bin
    print(max(bins))

With the above hash function, 0.45% of the bins have more than one object, while the maximum number of objects in any bin is 4.

These are certainly not bad statistics.

Now let’s test a popular hash function using our above metrics:

Java’s hashCode()

The hash function is defined as below:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

s[i] is the i-th character and n is the total number of characters.
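As a sanity check, here is a small Python sketch of my own (emulating Java’s signed 32-bit int overflow); for the string “abc” it should match Java’s "abc".hashCode():

# Evaluate s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] for a short string
def java_string_hash(s):
    h = 0
    for ch in s:
        h = h*31 + ord(ch)
    # Java keeps the result in a signed 32-bit int; emulate the wrap-around
    h &= 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

print(java_string_hash("abc"))  # 96354, same as "abc".hashCode() in Java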

Replace the hash_fn method in the code above with the following method:

def hash_fn(x, a, b, m):
    h = 0
    p = 1

    for i in range(len(x)-1, -1, -word_size):
        h += int(x[(i-word_size+1):(i+1)], 2)*p
        p *= 31

    return h % m

Strictly speaking, hashCode() produces a 32-bit integer, i.e. 2³² possible hash values, but for the sake of the calculations above we keep the number of bins lower so that the test results can be generated.

The results are as follows:

  • Uniformity Test — p-value = 0.85, indicating objects are uniformly distributed across bins.
  • Avalanche Test — average ratio of bits flipped is 0.4.
  • Collision Test — percentage of bins with more than 1 object = 0.47%.
