# Redis Sorted Sets and Skip Lists

I am going to discuss a very useful data structure in Redis — The Sorted Sets.

I discovered the Sorted Sets data structure in Redis while searching for Priority Queues in Redis. To my surprise there is no Priority Queue implementation in Redis, although it is one of the most important data structure used in many practical scenarios.

The advantages of Priority Queue are:

- Fast minimum and maximum get() operations. In fact for a min-priority queue, getting the minimum is O(1). Similarly for a max-priority queue. But you cannot get both in O(1) using the same priority queue.
- Removing the minimum and maximum is O(logN), where N is the total size of priority queue. But again you cannot remove both in O(logN) using the same priority queue.
- Inserting a new element into the priority queue is O(logN). Similarly deletion of an existing element is O(logN)
- Updating an existing index or key is again O(logN).

Priority Queues are implemented using Heap data structures where for a Min-Heap, the value at the i-th index is smaller than both values at indices 2*i+1 and 2*i+2.

These are useful in following scenarios:

- Maintaining K highest or lowest elements. For e.g. track the 100 most frequent search queries in a stream of query data.
- Finding the K closest elements.
- …and many more.

For finding the K highest or lowest elements, we can always sort the array and then take the first or last K elements but imagine you are receving a stream of data such as search queries where you have to sort the entire data seen so far for every new query. That is very inefficient.

Python has a library ‘heapq’ that implements Min Priority Queue. But the impementation suffers from not able to update an index or a key in the heap.

For e.g. If I wanted to update the frequency of a query, then if the query is not the at the root of the heap then I cannot directly update the frequency. Thus I have written an implementation in Python that enables updates in O(logN).

To my surprise Redis did not have Priority Queue implementation but it relies on something known as Sorted Sets. Sorted Sets are nothing fancy but its kind of a Hash-Map where the entries are sorted based on their values.

{‘facebook’: 1.0, ‘instagram’: 3.0, ‘apple’: 6.0, ‘google’: 10.0}

Generally when we use HashMap (or dict in Python) there is no guarantee that the keys or values will be ordered.

Simple HashMaps offers the following time complexities (in the average case):

- Search by key — O(1)
- Search by value — O(N) (we need to scan the entire hashmap)
- Insertion, deletion and updation — O(1)
- Get key-value pair with minimum/maximum value — O(N) (we need to scan the entire hashmap)
- Remove minimum/maximum value (key-value pair)— O(N) (same as point 4)

But with HashMaps ordered by values, we can improve search by value, get key-value pair with minimum/maximum value and remove minimum/maximum value to O(logN) from O(N).

But this also affects searching, insertion, deletion and updation because now we cannot just do it in O(1) since we need to maintain the order.

In many practical scenarios, Sorted Sets can be implemented using a Balanced Binary Search Tree.

Balanced BSTs are BSTs where the difference between the heights of left and right sub-tree at any node is at-most 1. Popular algorithms for constructing Balanced BSTs are AVL-Tree and Red-Black Tree.

These can be implemented for streaming data too.

Generally when elements are added to a BST, it can be ordered or can be randomly shuffled. When consecutive burst of elements are ordered in a stream, searching, insertion/deletion/updation is O(N) in the worst case because the tree would be skewed.

Thus we require Balanced BSTs so that search, insertion/deletion/updation is O(logN) in the worst case.

- Search by key — O(1)
- Search by value — O(logN)
- Insertion, deletion and updation — O(logN)
- Get key-value pair with minimum/maximum value — O(logN)
- Remove minimum/maximum value (key-value pair) — O(logN)

Here is simple Python implementation for insertion into AVL Tree.

In Redis, Sorted Sets are implemented using a probabilistic data structure called as Skip Lists.

The motivation for coming up with Skip Lists was to improve the search time complexity of linked lists.

The difference between sorted arrays and sorted linked lists is that we can search an element in O(logN) time in sorted arrays but we cannot do so in sorted linked lists because we need to traverse the next pointers to arrive at the search node.

In Skip Lists, there are multiple levels of sorted lists. Basically each node can have multiple next pointers and previous pointers. Each node in the Skip List is assigned a random level L. For a level ‘i’ in the skip list, there are next pointers between nodes whose levels are at-least ‘i’.

Let’s look at the following diagram taken from the original paper.

The values are sorted in the above list. Also each node has different levels (or heights if you prefer that). For e.g. the node with value 6 is of level 4 (because it is present in 4 levels), similarly node with values 9 and 17 are of level 2, 25 of level 3 and remaining are level 1.

The head node is assigned maximum level i.e. is present at all levels.

The levels are assigned probabilistically. In a more common scenario a node of level ‘i’ is ‘promoted’ to a level ‘i+1’ with a probability of 50%. Or you can thinks in this way:

Assumming that the number of levels ranges from 1 to 4, then if a node has 100% chance of being assigned level 1, it also has a 50% chance of assigning level 2, 25% chance of level 3 and 12.5 % chances of level 4. Thus 1/8 th of all nodes will be assigned level 4.

The following ‘Sample’ class is used to sample a level L randomly for a new node. This is simple ‘random sample with weights’ implementation.

The search in skip list proceeds as follows:

Let maximum level = L = 4. Search the value 17 in the above list.

- Start from head node pointer of level L.
- Follow next pointers at level L until the next is 17 or next is NULL or the next has a value greater than 17.
- If next at level L is NULL or has a value greater than 17, then do L = L-1
- Repeat the steps 2 and 3. If the value 17 is present then it must be definitely discovered at level 1. Else if we reach NULL pointer at level 1 then the value is not present.

The advantage of the above algorithm over normal sorted linked list is that in a single level sorted linked list we need to traverse each node before reaching our search value.

But with skip lists we have skipped the nodes 3, 9 and 12 while search for the value 17 in the above diagram. On an average the search time complexity is O(logN), the base of log being 1/p where ‘p’ is the probability with which a node at level ‘i’ is also assigned a level ‘i+1’. In our case it was 0.5.

Similarly for insertion of a node with value 17, proceed as follows:

Let maximum level = L = 4.

- For the new node, randomly sample a level between 1 to L. Let’s call this level H.
- Maintain an array update[] for all levels ≤ H.
- The array update[] will contain pointers to all nodes that will be previous to our new node in the final list. i.e. update[j] will contain the pointer to the node at level ‘j’ which has the maximum value less than 17.
- Start from head node pointer of level L.
- Follow next pointers at level L until the next is NULL or the next has a value greater than 17.
- If next at level L is NULL or has a value greater than 17, then do L = L-1
- If next at level L is NULL or has a value greater than 17 and L ≤ H, then do update[L] = current_node.
- Repeat the steps 5 to 7.
- Once we reach level 1, we create the new node and then assign the next pointers to all nodes present in array update[] to point to the new node.

Time complexity for insertion into Skip List is again O(logN).

Similarly for deletion and updation we have O(logN) time complexities.

Not explaining all the other algorithms which is pretty much similar and also easy to understand. Here is the Skip List class implementation with all required functionalities:

- Search by Key (‘search_key’)
- Search by Value (‘search_val’)
- Insert (‘insert’)
- Delete (‘delete’)
- Get Minimum (‘get_min’) and Get Maximum (‘get_max’)
- Pop Minimum (‘pop_min’) and Pop Maximum (‘pop_max’)

Sorted Sets apart from providing similar time complexity guarantees for all Priority Queue operations also provides the additional benefits over Priority Queue:

- Can obtain minimum and maximum or do min-pop and max-pop in the same data structure unlike in heap for which we need min-heap and max-heap separately.
- Useful for analytics with streaming data which requires sorted snapshot of data at each time instance. For e.g. computing the median of a stream of numbers, quartile ranges, percentiles, number of values less than or greater than X etc.
- Obtain sorted list of values in O(N) (because list is already sorted) as compared to O(NlogN) in Priority Queue.

I did a comparison of Skip-List with AVL Tree for insertion of **1 million real numbers**.

The AVL Tree implementation in Python took around **116 seconds** while Skip List implementation in Python took **61 seconds**.

In Redis, Sorted Sets are implemented in ZADD ‘family’ of commands. As seen in two of my earlier posts on distributed crawlers, ZADD commands are really useful whenever we need Priority Queue implementation.