“If the only tool you have is a hammer, it is tempting to treat everything as if it were a nail.” — Abraham Maslow
In the field of software engineering, every choice you make requires making one or more tradeoffs. From passing technical interviews (especially system design) to designing real systems within a company requires one to do tradeoff analysis.
Some benefits of doing tradeoff analysis are as follows:
- Making informed choice.
“I used NoSQL because it is better than SQL and scalable” is not a very credible statement from an engineer.
- Makes collaboration easier with other engineers.
“I used JAVA because I only know JAVA” is something not expected from an engineer working in a team of 20 people having different programming language backgrounds.
- Better design documentation.
When one uses NoSQL over SQL and after few years realizes that NoSQL is not suitable, they do not have a way to go back and evaluate why they chose NoSQL over SQL. They may need to start evaluation from scratch.
- Brings clarity to design.
When one evaluates multiple options, they can consider multiple perspectives of the problem.
For e.g. when one evaluates JAVA vs. Python, they think from compiled vs interpreted language perspective. Adding C++ to the comparison, will allow them to also think in terms of garbage collection, memory allocation/deallocation, performance etc.
In this post I will show using a few examples from our day to day software engineering and ML problems how to do tradeoff analysis.
Hash Table vs. Trie for Search Autocomplete
Both Trie and Hash Table have an average case time complexity of O(W) for search and insert, where W is the average length of a query string.
In Hash Table it is O(W) because we need to first calculate the hash of the input by traversing the entire string.
But in Hash Tables once we find the hash, accessing the memory is in O(1) but with Tries we have to access O(W) memory addresses using pointers to child nodes. Same applies for inserts.
In Trie, space complexity is O(N*W) where N is the number of query strings and W is the average length of a query string.
In Hash Table, the number of keys are O(N*W) because for each string of length W, we insert all the prefixes in the Hash Table and since the number of prefixes are W for a W length string, thus number of keys are O(N*W).
Apart from these, each value corresponding to a key is a list query ids. Since a query id can occur with W keys, thus space complexity of storing the values is also O(N*W).
Using Tries, the number of nodes can be reduced significantly if many query string share common prefixes. Using compressed tries, instead of each node storing single letters, we can store entire prefix thus reducing the space requirements.
But Tries may not give a lot of upper hand in terms of memory usage, because they use pointers to traverse along the tree and pointers incur significant memory overheads. Thus there will not be too much of a difference in the memory consumption of a Hash Table vs a Compressed Trie.
Tries are non trivial to distribute across multiple servers as compared to a Hash Table.
Should we partition by levels or prefixes or range of prefixes?
For e.g. given a query “cat”, in order to find all prefixes with “cat”, we might find that queries with “c” is in one server, “ca” in another and “cat” in yet another server because we are traversing the Trie nodes for each letter.
With Hash Table, once we find the hash of “cat”, we can now lookup the partition where hash(“cat”) is located.
When a query is inserted into a Trie each node along the path from root needs to be locked in order to update the frequency of the prefix as well as add the new query id.
But with a Hash Table, onec we find the hash of the query, we need to only lock the corresponding key to insert query id or update frequency.
Hash Tables are relatively straightforward to serialize into a file or a string as these are just key value pairs.
Tries are tricky because we also need to maintain parent child relations. One possibility could be to serialize a Trie by levels i.e. store one level per file and along with the level, store the parent node ids.
Djikstra’s vs Bellman Ford for Finding Shortest Paths
In Djikstra’s, time complexity for search is O(V*logE) where V is the number of nodes and E is the number of edges. Since E=O(V²), we can also write the complexity as O(V*logV).
Bellman Ford algorithm has a complexity of O(V*E), thus in the worst case it can be O(V³).
Another advantage of Bellmand Ford algorithm is that it can work iteratively i.e. in iteration K over the number of nodes V, the algorithm finds the shortest path from source to all nodes located at-most K edges away.
Thus if are interested in shortest paths to only the neighbors at-most K edges away, then we can run the algorithm for only K iterations instead of V where V >> K.
In Djikstra’s, we need to store the edges in a min priority queue in order to access the lowest edge weight edge first. Storing the edges has a space complexity of O(E).
We also need to store the vector of edge weights from source to V nodes — space complexity of O(V).
In Bellman Ford, we only need to store a vector of the shortest path weights from source, thus space complexity of O(V).
Parallel & Distributed Setting
Bellmand Ford algorithm do not consider any ordering of the edges whereas in Djikstra’s ordering of the edges from a node matters as we take the smallest edge weight first and so on.
Thus we can easily parallelize and distribute the shortest path computations for a given iteration.
Knuth Morris Pratt (KMP) vs Rabin Karp vs Suffix Trie for String Search
KMP has worst case run time complexity of O(N+M) where N is the base string and M is the pattern to search.
While Rabin Karp has average case run time complexity of O(N+M) but worst case complexity of O(N*M).
Suffix Trie has a run time complexity of O(M).
KMP algorithm is suitable when the base string is changing often as compared to the pattern because we need to pre-process every time a new pattern is seen. But even with pre-processing time complexity is still O(N+M) since complexity of pre-processing is O(M).
An application for KMP could be log mining for errors i.e. extract all lines from log files with log level ERROR, KMP algorithm can be used to search streaming logs for application or server errors.
But for something like search engine, KMP would be bad idea because search queries (pattern) are changing frequently but base corpus is relatively static. A Suffix Trie is more suitable for this.
Suffix Trie can be adapted for multiple base strings but generally they are suitable when a single base string is used for searching multiple patterns.
Regarding space complexity, KMP requires addition O(M) space to hold the Longest Proper Prefix array for the pattern of length M. Rabin Karp do not require any additional space.
Suffix Trie has a worst case space complexity of O(N²) where N is the size of the base string. Thus if the base string is too large such as in GBs, it is infeasible to use a Suffix Trie to store it.
With streaming data such as server logs, although suffix trie can be adapated to update the trie in real time by adding the extra nodes at the leaves but it may be costly as it would require to add O(N) new nodes.
In Rabin Karp, we can pre-compute the rolling hashes of the stream for each new sliding window in O(1) time. If any hash matches with pattern, we can check for exact match in O(M) time.
AVL Trees vs Skip List vs Red Black Tree for Sorted Sets
AVL Trees, Red Black Trees and Skip Lists have search and insert time complexity of O(logN).
While both AVL Tree and RB Tree maintain a height of O(logN) but AVL tree has stricter rules than RB Tree and thus require making more updates during insertion.
Thus reads are faster with AVL Trees while writes are faster with RB Trees.
Both AVL Tree and RB Tree have a space complexity of O(N) for N nodes plus the additional overheads of left and right pointers at each node.
Skip Lists also have an expected space complexity of O(N). But in worst case can go upto O(N*logN).
There are O(logN) levels of linked lists and at level k, there are O(N/(2^k)) nodes. Thus for each node, it is inserted in the k-th level with a probability of O(N/(2^k)). But this is a probability and there is no guarantee that the random number generator will not insert a node at every level.
Both AVL Tree and RB Tree has to lock multiple nodes during concurrent read/write operations which is overcome by using a Skip List. Thus in a highly concurrent access environment it is better to go with a Skip List rather than AVL Tree or RB Tree.
We only need to lock nodes directly linked to the node being updated i.e. the previous and the next nodes which is O(logN) in case of a skip list.
Leader vs Leaderless replication in distributed systems
With a leader based replication, all writes happen through the leader, which ensures writes are always strongly consistent.
In the case of leaderless replication, 2 or more servers can see different states at any point in time. Thus writes are eventually consistent implying that at some time in future all replicas will converge to the same state.
For reads, one can either opt for reading through leader or any replica. If reads always happen through the leader then reads are also strongly consistent but if some other replica is used for reads, then there is a chance that the replica might not be updated and thus it is eventually consistent.
Reads and writes are faster with leaderless replication because any replica can be used to read or write if some other replica is overloaded.
With only a single leader, if the leader is overwhelmed with requests, reads and writes can be slow.
If the leader crashes, then a new leader must be chosen through an election process. Until a leader is chosen, new requests are queued up.
In leaderless replication, all requests are served even if few replicas crashed because any replica can serve read/write requests.
Hash vs Range Partitioning in distributed systems
Hash based partitioning is good for point queries i.e. find if X exists or get the value of X whereas Range based partitioning is good for range queries i.e. find all points between X and Y or get all records in the last 30 days.
SQL vs NoSQL databases
SQL databases are good for the following query and data patterns:
- Using JOIN operations on multiple tables
- Require ACID based transactions
- Fixed schema for tables
- Data is normalized.
NoSQL databases are good for the following query and data patterns:
- No JOIN operations are required
- No ACID gaurantees
- Schema of tables changing frequently.
- Data is denormalized i.e. possibly redundant data in multiple tables.
SQL databases generally use B-Trees for indexing which gives good read performance but writes are slower because B-Trees need to re-balance and update itself based on the branching factor.
NoSQL databases generally use LSM Tree for indexing. Writes are fast because writes happen by appending to a log file and an in-memory hash table Memtable. The memtable is periodically flushed to disk.
Thus any read queries which is very recently written will be fast as they served from memtable, but if the data has been flushed to disk, then reads will be slower as it requires disk I/O to get the record.
SQL databases use leader based replication hence reads and writes are strongly consistent.
NoSQL databases generally use leaderless replication hence reads and writes are eventually consistent.
Neural Network vs Logistic Regression for ML prediction
In LR, the output variable is dependent linearly on the features whereas in neural network, the output is a non-linear combination of the features. Thus neural networks are better suited for modelling non-linear datasets.
Time and Space Complexity
Neural networks have multiple layers of weights and non-linear functions whereas LR is only a single layer. Thus LR is trained much faster per epoch than neural networks.
Similarly for inferencing, number of predictions per second with LR is much higher as compared to neural networks.
Memory requirements for neural networks is also higher as compared to LR as we need to store multiple layers of weights.
Both neural networks and LR can be trained in a distributed manner.
Neural networks usually have higher prediction accuracy than LR, but many times a complicated model cannot generalize well across different examples (termed overfitting).
The above are just a handful of examples from a software engineer’s day to day life. Let me know in the comments what trdeoff analysis have you done or doing at your work.
- KMP Algorithm for Pattern Searching — GeeksforGeeks
- Rabin-Karp Algorithm for Pattern Searching — GeeksforGeeks
- The Rabin-Karp Algorithm Explained (freecodecamp.org)
- rec06-rabin-karp-spring2011.pdf (mit.edu)
- skip_bst.dvi (clemson.edu)
- c++ — Why is std::map implemented as a red-black tree? — Stack Overflow