Probability and Statistics for Software Engineers

Abhijit Mondal · Published in Gaussian Machine · Jan 2, 2023 · 19 min read

Probability and statistics are two of the most useful mathematical subjects for a software engineer, and a working knowledge of both will keep you ahead in a competitive field.

In this post we work through a series of software engineering problems that call for probability and statistics, so that you can analyze them rigorously instead of relying on heuristics or 'gut feeling'.

Given a hash table with N slots and M keys to insert, what is the probability of a collision?

Let P be the probability that none of the keys' hashes collide. Then the probability that at least 2 keys hash to the same slot, i.e. the probability of at least one collision, is 1-P.

Let’s compute P first.

After the 1st key is inserted, the 2nd key has N-1 slots left, because landing in the 1st key's slot would be a collision. Probability of the 2nd key having no collision after the 1st key is inserted = (N-1)/N

Similarly, after the 2nd key is placed, the 3rd key has N-2 slots left (excluding the slots of the 1st and 2nd keys).

Probability of the 3rd key having no collision after the 1st and 2nd keys are inserted = (N-2)/N

P = N/N * (N-1)/N * (N-2)/N * (N-3)/N * ... * (N-M+1)/N
P = N!/((N-M)! * N^M)

Of course, if M > N then a collision is unavoidable: by the Pigeonhole Principle, at least one slot must hold 2 or more keys.

Thus when M > N, P = 0

When M ≤ N, the probability of at least one collision = 1-P
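As a sanity check, here is a minimal sketch of the closed form (the function name is mine):

from math import prod

def collision_probability(m, n):
    # P(at least one collision) = 1 - N/N * (N-1)/N * ... * (N-M+1)/N
    if m > n:
        return 1.0  # pigeonhole: a collision is certain
    return 1 - prod((n - i)/n for i in range(m))

For example, collision_probability(23, 365) ≈ 0.507, which is the classic birthday paradox.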

What is the probability of a collision for the i-th key?

For the i-th key, let the probability it leads to a collision be P(i), then we can write P(i) as follows:

P(i)=P(i|1)*P(1) + P(i|2)*P(2) + … + P(i|N)*P(N)

P(i|k) is the probability that the i-th key causes a collision given that it lands in the k-th slot, and P(k) is the probability of picking the k-th slot, which is just 1/N because there are N equally likely slots.

The i-th key causes a collision in slot k if there is at least one key already in slot k.

The probability that the k-th slot is non-empty after i-1 insertions = 1 - (probability that none of the i-1 keys landed in slot k)

Probability that any one key lands in a slot other than slot k = (N-1)/N

Probability that none of the i-1 keys land in slot k (assuming independence) = (1-1/N)^(i-1)

Probability that at least one key landed in slot k = 1-(1-1/N)^(i-1)

Thus:

P(i|k) = 1-(1-1/N)^(i-1)
P(i) = N*(1/N)*(1-(1-1/N)^(i-1)) = 1-(1-1/N)^(i-1)
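In code, this is a one-liner (the naming is mine):

def prob_ith_key_collides(i, n):
    # P(i) = 1 - (1 - 1/n)^(i-1)
    return 1 - (1 - 1/n)**(i - 1)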

What is the expected number of collisions?

The expected number of collisions = M - (expected number of occupied slots).

This is because in each occupied slot, the first key to land there did not collide, while every subsequent key did. Subtracting the collision-free keys (exactly one per occupied slot) from M gives the total number of collisions.

Expected number of slots which are occupied = N-(Expected number of slots which are empty).

Let Expected number of slots which are empty = E

Let X_i be an indicator variable which is 1 if the i-th slot is empty and 0 if it is occupied.

E = E[X_1 + X_2 + … + X_N] = E[X_1] + E[X_2] + … + E[X_N]

E[X_i] = 0*P(X_i=0) + 1*P(X_i=1)

P(X_i=1), i.e. the probability that the i-th slot is empty, equals (1-1/N)^M, because none of the M keys landed in the i-th slot (assuming independence between keys).

E[X_i] = (1-1/N)^M

E = N*(1-1/N)^M

Expected number of slots which are occupied = N-E = N-N*(1-1/N)^M

Expected number of collisions = M-(N-E) = M-N+E = M-N+N*(1-1/N)^M
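Here is a minimal sketch comparing the closed form against a Monte Carlo simulation (the simulation and names are mine, assuming uniform independent hashing):

import random

def expected_collisions(m, n):
    # Closed form: M - N + N*(1 - 1/N)^M
    return m - n + n*(1 - 1/n)**m

def simulate_collisions(m, n, trials=10000):
    total = 0
    for _ in range(trials):
        slots = [0]*n
        for _ in range(m):
            slots[random.randrange(n)] += 1
        # each occupied slot contributes exactly one collision-free key
        total += m - sum(1 for c in slots if c > 0)
    return total/trials

The two should agree closely, e.g. expected_collisions(50, 100) ≈ 10.5.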

What is the probability that the maximum number of keys in any slot is at most H?

Let P(M, N, H) be the number of ways to distribute M keys over N slots such that every slot receives at most H keys. For the 1st slot we can choose 0, 1, 2, ... up to H keys and then recursively solve for the remaining keys on the remaining N-1 slots.

Then we can write a recurrence relation as follows:

P(M, N, H) = (sum K=0 to H) (M choose K) * P(M-K, N-1, H)

This says that for the 1st slot we select K keys out of M, where K varies from 0 to H. After this selection we have M-K keys remaining and N-1 slots, so we recurse with P(M-K, N-1, H).

For the base cases:

P(M, 1, H) = 0 if M > H else 1
P(0, N, H) = 1 and
P(M, N, H) = 0 if M < 0

Python implementation:

def ncr(n):
    # Compute all nCr values (Pascal's triangle) using dynamic programming
    cache = [[0]*(n+1) for i in range(n+1)]

    for i in range(n+1):
        for j in range(i+1):
            if j == 0 or i == j:
                cache[i][j] = 1
            else:
                cache[i][j] = cache[i-1][j] + cache[i-1][j-1]

    return cache

def maximum_keys(m, n, h):
    # Computes the nCr values using dynamic programming
    cache = ncr(m)
    p = recursion(m, n, h, cache, {})/(n**m)

    return p

def recursion(m, n, h, cache, dp):
    # base case
    if n == 1:
        return 0 if m > h else 1

    s = 0
    for k in range(h+1):
        if m >= k:
            # Use cached values wherever applicable
            a = recursion(m-k, n-1, h, cache, dp) if (m-k, n-1) not in dp else dp[(m-k, n-1)]
            s += cache[m][k]*a

    dp[(m, n)] = s
    return dp[(m, n)]

Time complexity is O(M*N*H).

Probability that no slot holds more than H keys = P(M, N, H)/N^M

The probability that a server is down is given as p. How many replicas are required so that the probability that at least a majority of the replicas are up is 99.99%?

Let the number of replicas be N (odd). A majority is (N+1)/2 servers.

Probability that k servers are up = (N choose k)*(1-p)^k*p^(N-k)

i.e. choose which k of the N servers are up, multiply by (1-p) for each of the k servers that is up, and by p for each of the N-k servers that are down.

Thus the probability that at least (N+1)/2 servers are up (denote it f) is:

f = (sum k=(N+1)/2 to N) (N choose k) * (1-p)^k * p^(N-k) >= 0.9999

For a given probability of failure p, vary N and check when the above probability is at least 0.9999.

Since increasing the number of servers N increases the probability f (when p < 0.5), we can binary search on N to find the minimum N such that f ≥ 0.9999.

def ncr(n):
    # Calculate nCr values using dynamic programming
    cache = [[0]*(n+1) for i in range(n+1)]

    for i in range(n+1):
        for j in range(i+1):
            if j == 0 or i == j:
                cache[i][j] = 1
            else:
                cache[i][j] = cache[i-1][j] + cache[i-1][j-1]

    return cache

def prob(n, p, cache):
    # Calculate the probability f (above) for given n and p
    s = 0
    for k in range((n+1)//2, n+1):
        s += cache[n][k]*((1-p)**k)*(p**(n-k))

    return s

def binary_search(p, m=1000):
    # Maximum 1000 servers
    cache = ncr(m)

    # We want only an odd number of servers
    odd_numbers = [x for x in range(1, m+1) if x % 2 == 1]

    left, right = 0, len(odd_numbers)-1
    q = -1

    while left <= right:
        mid = (left+right)//2
        n = odd_numbers[mid]

        h = prob(n, p, cache)

        if h >= 0.9999:
            # Found a solution; save it in q and search left for a smaller one
            q = n
            right = mid-1
        else:
            left = mid+1

    return q
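As a rough illustration, with p = 0.1 this search should return 13, i.e. around 13 replicas are needed before the probability of a majority being up crosses 0.9999 (for N = 11 the probability is only about 0.9997).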

Probability that a server crashes in a 60 minute window is given as ‘p’. What is the probability that the server will crash in a 10 minute window?

Assume crashes follow a Poisson distribution (arrival-type events are commonly modelled as Poisson), i.e. the probability that x crashes happen in one unit of time is

P(X=x) = e^(-l)*l^x/x!

where l = the average number of crashes in one unit of time.

In this example let 1 unit of time = 10 minutes, so a 60 minute window is 6 units.

If on average l crashes happen in 1 unit of time, then on average 6*l crashes happen in 6 units of time.

Probability that at least one crash happens in 6 units of time = 1 - (probability that no crash happens in 6 units of time).

P(X ≥ 1; 6 units) = 1-P(X=0; 6 units)
P(X=0; 6 units) = e^(-6l) because the rate over 6 units is 6l, and x=0 so x! = 1

P(X ≥ 1; 6 units) = 1-e^(-6l) = p

e^(-6l) = 1-p, hence e^(-l) = (1-p)^(1/6)

Probability that at least one crash happens in 1 unit of time, i.e. 10 minutes:

P(X ≥ 1; 1 unit) = 1-P(X=0; 1 unit)
P(X=0; 1 unit) = e^(-l)

P(X ≥ 1; 1 unit) = 1-e^(-l) = 1-(1-p)^(1/6)
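A tiny helper capturing this conversion (the name is mine):

def crash_prob_10min(p_60min):
    # P(at least one crash in 10 min) = 1 - (1 - p_60min)^(1/6)
    return 1 - (1 - p_60min)**(1/6)

For example, if p = 0.5 over 60 minutes, the 10 minute crash probability is about 0.109.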

Given a distributed database system with 1 leader and N replica servers. During a write, the leader replicates the data to all N replicas. If any replica fails, the leader retries until the data is successfully replicated everywhere. What is the expected number of attempts before all replicas are successfully replicated? Assume the probability that a replica fails is p, and that in each failed attempt all N replicas are retried (even those that succeeded).

If replication succeeds on the k-th attempt, then it failed on each of the first k-1 attempts.

The probability that an attempt fails = 1-(1-p)^N

because the probability that an attempt fails = 1 - (probability that all replicas were successfully replicated)

Probability that all replicas were successfully replicated=(1-p)^N

The probability that the first k-1 attempts fail = (1-(1-p)^N)^(k-1)

The probability that the k-th attempt succeeds = (1-p)^N

Probability that replication succeeds on the k-th attempt =
(1-(1-p)^N)^(k-1)*(1-p)^N

Letting (1-p)^N = q

Probability that replication succeeds on the k-th attempt, P(k) =
(1-q)^(k-1)*q

Expectation E = 1*P(1) + 2*P(2) + 3*P(3) + .... (an infinite series)

E = q + 2*(1-q)*q + 3*(1-q)²*q + ....
(1-q)*E = (1-q)*q + 2*(1-q)²*q + 3*(1-q)³*q + ....

Subtracting: E - (1-q)*E = q + (1-q)*q + (1-q)²*q + .... = q*(1/(1-(1-q))) = q*(1/q) = 1

E*q = 1, i.e. E = 1/q

Thus expected number of attempts = 1/(1-p)^N
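As a sketch (the naming is mine):

def expected_attempts(n, p):
    # geometric distribution with success probability q = (1-p)^n
    return 1/(1 - p)**n

For example, with N = 10 replicas and p = 0.1, on average 1/0.9^10 ≈ 2.87 attempts are needed.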

In the above problem, suppose that instead of retrying on all servers, the leader retries only the failed servers. A single attempt replicates, in parallel, all servers that have not yet been replicated.

Let E(n) be the expected number of attempts with n servers.

In the 1st attempt, either 0, 1, 2, ... or n servers will be successfully replicated.

If k out of n servers are successfully replicated, then we need E(n-k) additional attempts in expectation, because k servers have already been replicated.

The probability of this is nCk*(1-p)^k*p^(n-k).

E(n) = 1 + (sum k=0 to n) nCk * (1-p)^k * p^(n-k) * E(n-k)
E(n) = 1 + p^n*E(n) + (sum k=1 to n) nCk * (1-p)^k * p^(n-k) * E(n-k)
E(n)*(1-p^n) = 1 + (sum k=1 to n) nCk * (1-p)^k * p^(n-k) * E(n-k)

E(n) = (1 + (sum k=1 to n) nCk * (1-p)^k * p^(n-k) * E(n-k))/(1-p^n)

Base case: E(0) = 0

Time complexity of this approach is O(n²).

def recursion(n, p, cache, dp):
    # Base case
    if n <= 0:
        return 0

    s = 0
    for k in range(1, n+1):
        # If already computed, then don't call recursion
        a = recursion(n-k, p, cache, dp) if n-k not in dp else dp[n-k]
        s += cache[n][k]*((1-p)**k)*(p**(n-k))*a

    # Divide by (1-p^n) at every level of the recursion, since each
    # E(n-k) used above must itself already include this factor
    dp[n] = (1+s)/(1-p**n)
    return dp[n]

def retry_attempts(n, p):
    cache = ncr(n)
    return recursion(n, p, cache, {})

Given a database with n partitions. When a request arrives at a server, the server processes it if it is intended for its own partition; otherwise it forwards the request to a random partition. What is the expected number of hops before a request reaches its correct partition?

Probability that a request reaches the correct partition on any given hop = 1/n

From the geometric expectation above, the expected number of hops = 1/(1/n) = n

In a leader election protocol, how many terms pass before all N servers get a chance to become leader?

This is a variant of the Coupon Collector’s Problem.

Let the order in which the servers become leader for the 1st time (by timestamp) be the sequence S1, S2, ... Sn.

Let Xi be a variable denoting the number of terms it takes for Si to become leader after S(i-1) first became leader, i.e. after S(i-1) became leader, for the next Xi-1 terms only servers among S1 to S(i-1) became leader again, and in the Xi-th term Si became leader.

Then the expected number of terms before all N servers become leader is:

E = E[X1 + X2 + … + Xn] = E[X1] + E[X2] + …. + E[Xn]

E[Xi] = expected number of terms for Si to become leader after S(i-1) first did

Let p = probability that a server among S1 to S(i-1) wins an election = (i-1)/N

Then the probability that Si wins an election = 1-p = 1-(i-1)/N

Note that no server other than S1 to Si can win at this point: by our ordering, S(i+1) onwards become leader only after Si has become leader for the 1st time.

Thus from the geometric expectation above (number of trials until success), E[Xi] = 1/(1-p) = N/(N-i+1)
E = N*(1/N + 1/(N-1) + 1/(N-2) + .... + 1)

In an in-memory cache like Redis, what is the expected number of queries before all cache entries are accessed?

This is the same problem as above (the Coupon Collector's Problem); let N be the number of cache entries.

Here Xi denotes the number of queries it takes for the i-th distinct cache entry to be accessed for the 1st time, after i-1 distinct entries have already been accessed.

Thus, E = N*(1 + 1/2 + 1/3 + …. + 1/N)
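A quick sketch of this expectation (the naming is mine):

def expected_queries(n):
    # N * (1 + 1/2 + ... + 1/N), roughly N*ln(N) for large N
    return n*sum(1/i for i in range(1, n+1))

For example, with N = 100 cache entries, about 519 queries are expected before every entry has been accessed at least once.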

In TTL-based caching, the average duration between accesses of a key is T. How should you choose the TTL so that 99.99% of requests are honored?

In TTL-based caching, a key is deleted once its TTL expires. If the TTL is too small, many requests will miss the cache because keys have expired. If the TTL is too large, there could be too many stale keys.

Assume that the duration X between accesses is exponentially distributed with mean T, i.e. with rate 1/T.

The probability that the duration is less than some Y: P(X ≤ Y) = 1-e^(-Y/T)

If we set the TTL to Y, we want P(X ≤ Y) = 0.9999

Solving for Y:

e^(-Y/T) = 1-0.9999
Y/T = -ln(0.0001)
Y = -T*ln(0.0001) = T*ln(10⁴) ≈ 9.21*T
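In code (assuming exponentially distributed inter-access durations; the naming is mine):

import math

def ttl_for_hit_rate(t_mean, target=0.9999):
    # smallest TTL Y with P(X <= Y) >= target, for X ~ Exp(mean=t_mean)
    return -t_mean*math.log(1 - target)

So for a 99.99% target, the TTL should be roughly 9.21 times the mean inter-access duration T.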

In a given distributed hash table, the probabilities of accessing each key are p1, p2, ... pN. What strategy would you use to partition the keys across K partitions such that the expected number of requests to each partition is roughly the same?

One observation: if the load were spread evenly across the K partitions, each partition would be accessed with probability roughly 1/K.

Given m keys with access probabilities p1, p2, p3, ... pm in a single partition, the probability that any one of them is accessed is p1+p2+...+pm.

Thus p1+p2+…pm should be approximately equal to 1/K.

For each partition, find the subset of keys whose total probability is closest to 1/K.

Since the sums need not all equal 1/K exactly, we can write a backtracking approach that minimizes the difference between the largest partition sum and the smallest partition sum.

def backtracking(i, n, k, probabilities, assignments, sums):
    if i == n:
        # all the keys have been assigned
        return min(sums), max(sums), assignments[:]

    x, y = -float("Inf"), float("Inf")
    best = None
    min_diff = float("Inf")

    # For the i-th key loop over all possible partitions
    for j in range(k):
        if sums[j] <= 1/k:
            # heuristic added to reduce the search space, i.e.
            # only add a key to a partition if the partition's sum of
            # probabilities is less than or equal to 1/k
            assignments[i] = j
            sums[j] += probabilities[i]

            # a - minimum sum of probabilities over all partitions
            # b - maximum sum of probabilities over all partitions
            # c - best assignment with minimum diff between max and min sums
            a, b, c = backtracking(i+1, n, k, probabilities, assignments, sums)

            if b-a < min_diff:
                min_diff = b-a
                x, y, best = a, b, c

            sums[j] -= probabilities[i]
            assignments[i] = -1

    return x, y, best

def partition(probabilities, k):
    # assignments[i] is the partition the i-th key is assigned to
    assignments = [-1]*len(probabilities)

    # sums[j] is the sum of the probabilities of keys in the j-th partition
    sums = [0]*k

    a, b, c = backtracking(0, len(probabilities), k,
                           probabilities, assignments, sums)

    return c

The time complexity of this approach is exponential in the number of keys, so it cannot be used for a large number of keys and partitions.

Another non-optimal but much faster greedy approach:

  1. Create a min-heap (priority queue) keyed on the running sum of each partition.
  2. Iterate over the probabilities; for each one, pop the partition with the minimum sum, add the probability to it, and push it back. This way every key is added to the partition with the current minimum sum.
  3. Also maintain the list of assigned key indices inside each heap entry.

import heapq

def partition(probabilities, k):
    # Each heap entry: (partition sum, partition index, assigned key indices)
    heap = [(0, i, []) for i in range(k)]

    for i in range(len(probabilities)):
        # Remove the entry with the minimum partition sum
        x, j, h = heapq.heappop(heap)

        # Push back the entry after adding the new probability
        heapq.heappush(heap, (x+probabilities[i], j, h + [i]))

    # Assign its partition to each key
    out = [-1]*len(probabilities)

    while len(heap) > 0:
        x, j, h = heapq.heappop(heap)
        for i in h:
            out[i] = j

    return out

Time complexity of this approach is O(N*log K) for N keys and K partitions. The result may be sub-optimal, but in practice it usually produces partitions with very similar sums of probabilities.

Given a message queue where messages arrive at an average rate of 80 per second (Poisson arrivals) and are processed at an average rate of 100 per second (exponential service times). What is the expected number of messages waiting in the queue and the expected waiting time in the queue?

The calculations are derived from this sheet: queuing_formulas.pdf (mst.edu)

This is an M/M/1 queue with lambda = 80, mu = 100 and c = 1.

rho = lambda/(c*mu) = 80/100 = 0.8

Expected number of messages in the queue Lq = rho²/(1-rho) = 0.64/0.2 = 3.2
Expected waiting time in the queue = Lq/lambda = 3.2/80 = 40ms
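A small helper for these M/M/1 quantities (the naming is mine):

def mm1_queue(lam, mu):
    # Lq = rho^2/(1-rho); Wq = Lq/lambda by Little's law
    rho = lam/mu
    lq = rho**2/(1 - rho)
    return lq, lq/lam

mm1_queue(80, 100) returns approximately (3.2, 0.04), i.e. 3.2 messages and a 40ms wait.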

In the above problem, what if there are 3 consumers (c = 3)?

rho = lambda/(c*mu) = 80/300 ≈ 0.267

Using the same formula as an approximation (the exact M/M/c queue length requires the Erlang C formula):

Expected number of messages in the queue Lq = rho²/(1-rho) ≈ 0.097
Expected waiting time in the queue = Lq/lambda ≈ 0.097/80 ≈ 1.21 ms

An average of 1 million requests per second arrive at an app. Each request takes 50ms to process. How many worker processes are required so that the expected waiting time for each request is at most 1 second?

The calculations are derived from this sheet: queuing_formulas.pdf (mst.edu)

lambda (incoming requests per second) = 10⁶

mu (requests served per second per worker) = 1/50ms = 20

Expected waiting time in the queue = rho²/(lambda*(1-rho)) <= 1s

rho²/(1-rho) <= 10⁶; solving the quadratic for rho gives rho <= 0.999999
rho = lambda/(c*mu) <= 0.999999

c >= lambda/(0.999999*mu) = 10⁶/(20*0.999999) ≈ 50,000

Thus, we need about 50K workers to serve 1 million requests per second.

These calculations assume that inter-arrival times and service times follow exponential distributions.
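A sketch of the worker-count calculation, solving the quadratic rho² + a*rho - a <= 0 where a = lambda*max_wait (the naming is mine):

import math

def min_workers(lam, mu, max_wait=1.0):
    a = lam*max_wait
    # largest utilization rho satisfying rho^2/(1-rho) <= a
    rho = (-a + math.sqrt(a*a + 4*a))/2
    return math.ceil(lam/(rho*mu))

min_workers(10**6, 20) should return 50001, matching the roughly 50K workers computed above.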

Given a consistent hashing ring with M keys and N servers, what is the probability that at least 90% of the keys land on one server?

Instead of reasoning about the ring architecture directly, we can use the fact that each key maps to exactly one server. The problem then reduces to the hash table collision setting, i.e. the probability that 90% of the keys go into a single slot.

Let u = int(0.9*M) be 90% of the M keys.

Let p be the probability that every slot has fewer than u keys; then the probability that at least one slot has u or more keys is 1-p.

Let G(M, N, H) be the number of ways to assign M keys to N slots such that every slot has fewer than H keys. We can compute G(M, N, H) recursively.

For the 1st slot we can select 0 to H-1 keys and then recursively call G with the remaining keys on the remaining N-1 slots; the recursion repeats this for each slot.

G(M, N, H) = (sum k=0 to H-1) (M chooses k) * G(M-k, N-1, H)

base case:
G(M, 1, H) = 1 if M < H else 0

Probability p = G(M, N, H)/N^M

Probability that at least one slot has u or more keys =
1-(G(M, N, u)/N^M)

Python implementation:

Time complexity is O(M*N*H)

def consistent_hash(m, n, q=0.9):
    # Computes the nCr values using dynamic programming as seen above
    cache = ncr(m)

    # q is the fraction of keys in the same slot, here q=0.9
    u = int(q*m)
    p = recursion(m, n, u, cache, {})/(n**m)

    return 1-p

def recursion(m, n, h, cache, dp):
    # base case
    if n == 1:
        return 0 if m >= h else 1

    s = 0
    for k in range(h):
        if m >= k:
            # Use cached values wherever applicable
            a = recursion(m-k, n-1, h, cache, dp) if (m-k, n-1) not in dp else dp[(m-k, n-1)]
            s += cache[m][k]*a

    dp[(m, n)] = s
    return dp[(m, n)]

You want to perform load testing on an unreleased feature in development. For that you sampled production traffic data from the last month. But after productionizing, you find that the app crashes too often under load. How do you figure out whether the traffic data used in development differs from the production data?

We need a statistical test to determine whether the distributions of the 2 samples are similar or different.

One such test, which works for arbitrary distributions, is the Kolmogorov-Smirnov (K-S) test.

Here is a good reference on how to conduct K-S Test in Python: How to Perform a Kolmogorov-Smirnov Test in Python — Statology
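A minimal sketch using scipy's two-sample K-S test (the samples here are synthetic placeholders):

import numpy as np
from scipy import stats

# hypothetical request inter-arrival times: dev sample vs production
dev_traffic = np.random.exponential(scale=1.0, size=1000)
prod_traffic = np.random.exponential(scale=1.5, size=1000)

statistic, p_value = stats.ks_2samp(dev_traffic, prod_traffic)

# a small p-value (say < 0.05) is evidence that the two samples
# come from different distributions
print(statistic, p_value)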

There are N servers behind a load balancer. The i-th server is currently serving Qi requests. For the next request, you want to sample a server with probability that decreases with the number of requests it is serving, i.e. a server serving 90% of the requests should be selected far less often than a server serving 5% of the requests.

One such algorithm is as follows:

1. Compute S = Q1+Q2+…+Qn

2. Invert the requests Qi for each server to Q'i = S-Qi

3. Probability of inverted requests Pi = Q'i/(Q'1+Q'2+…+Q'n)

4. Compute cumulative Pi's denoted by Si = P1+P2+…+Pi (sum of P1 to Pi).

5. Generate a uniform random number b/w 0 and 1. Call this H.

6. Find the smallest i, s.t. Si ≥ H.
Then we select this i-th server for serving the next request.
We can find the smallest i using binary search.

Python implementation:

import random

def load_balancer(requests):
    s = sum(requests)

    # Invert the requests since selection should decrease with load
    inv_requests = [s-x for x in requests]

    s = sum(inv_requests)
    probs = [x/s for x in inv_requests]

    # Compute the cumulative probabilities
    cum_probs = [0]*len(requests)
    h, j = 0, 0
    for x in probs:
        h += x
        cum_probs[j] = h
        j += 1

    # Generate a uniform random number between 0 and 1
    u = random.random()

    # Do binary search to find the smallest i s.t. cum_probs[i] >= u
    left, right = 0, len(cum_probs)-1
    p = -1

    while left <= right:
        mid = (left+right)//2
        if cum_probs[mid] >= u:
            p = mid
            right = mid-1
        else:
            left = mid+1

    return p

def experiment(requests, num_trials=5000):
    for i in range(num_trials):
        server = load_balancer(requests)
        # Update requests for the server that was chosen
        requests[server] += 1
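As a side note, Python's standard library can do this weighted sampling directly; the following sketch is equivalent to the function above, without the explicit binary search:

import random

def load_balancer_weighted(requests):
    s = sum(requests)
    # random.choices performs weighted sampling; k=1 returns one pick
    return random.choices(range(len(requests)),
                          weights=[s - x for x in requests], k=1)[0]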

A good reference for this approach: Darts, Dice, and Coins (keithschwarz.com)

With a few servers this approach is fine, but when the number of servers is large we end up re-computing the probabilities and cumulative probabilities on every request.

An online alternative is Thompson Sampling, where each server's distribution is updated as it serves requests. This approach is widely used in A/B testing and multi-armed bandits.

With it, we only need to draw one sample per server per request and update the parameters of a single beta distribution.

Here we assume the probability that a server, once assigned a request, actually starts serving it is p (we can simply set it to 1).

class ThompsonSampling:
    def __init__(self, a=1, b=1, p=0.95):
        # a, b - parameters of the beta distribution
        # p - probability that the server will serve the request
        self.pos = a
        self.neg = b
        self.prob = p

        self.num_requests = 0

    def sample(self):
        return random.betavariate(self.pos, self.neg)

    def serve(self):
        return 1 if random.random() <= self.prob else 0

    def new_request(self):
        x = self.serve()

        if x == 1:
            self.num_requests += 1

        # neg is incremented when the server serves a request, because
        # we want to sample servers that are serving less more often
        self.neg += x
        self.pos += 1-x

def experiment(requests, num_trials=5000):
    servers = []

    # Since we want to direct new requests to servers with fewer
    # existing requests, we use each server's request count as the
    # negative parameter of its beta distribution.
    for i in range(len(requests)):
        servers += [ThompsonSampling(1, requests[i]+1, p=1.0)]

    for i in range(num_trials):
        max_val = -float("Inf")
        best = None

        # Pick the server with the highest sampled value
        for j in range(len(servers)):
            s = servers[j]
            u = s.sample()

            if u > max_val:
                max_val = u
                best = s

        # Route the new request to the best server
        if best:
            best.new_request()

    for i in range(len(requests)):
        print(i, servers[i].num_requests)

At the end of the experiment we can see that more of the new requests land on the servers that started with fewer requests.

The beauty of this algorithm is that there is no need to update the requests array after each new request, and no need to re-compute the probabilities; we only update the parameters of the beta distributions.


References:

main.tex (asarkar.com)

L-18.pdf (duke.edu)

105_hash_and_prob_pub (jhu.edu)

Hash Collision Probabilities (preshing.com)

Probability of collisions (ucsd.edu)

12-hashing.pdf (illinois.edu)

Probability of Collision in Hash Function [Complete Analysis] (opengenus.org)

The probability of data loss in large clusters — Martin Kleppmann’s blog

Queueing_Problems_Solutions_2021_Sztrik.pdf (unideb.hu)

queuing_formulas.pdf (mst.edu)

Queueing theory (brunel.ac.uk)

Randomized Binary Search Algorithm — GeeksforGeeks

Randomized Algorithms — GeeksforGeeks

Sampling: Two Basic Algorithms (gregorygundersen.com)

Baby’s First Algorithmic Sampling from a Distribution: Methods Available and How to Use Them | by Gabriela Morgenshtern | Towards Data Science

slides_sampling.pdf (ucl.ac.uk)

Randomized-Algo.pdf (iitkgp.ac.in)

book.dvi (iitd.ac.in)

Small11.pdf (stanford.edu)

Randomized Graph Algorithms (ntua.gr)

Lecture 23: Randomized Algorithms (washington.edu)

Randomized Algorithms (utah.edu)

notes-random.dvi (mit.edu)

Probability and Computing, Oxford 2017–18

lec2final.pdf (princeton.edu)

Treap (A Randomized Binary Search Tree) — GeeksforGeeks

03-treaps.pdf (illinois.edu)

Exponential Distribution Examples in Statistics — VrcAcademy

Lecture_11_Poisson_Process_Exponential_Erlang_Gamma_Distributions.pptx (illinois.edu)

queue3.pdf (unl.edu)

probability — erlang distribution — Mathematics Stack Exchange
