Problem solving with data structures in OOPs and LLDs
Having worked on variety of problems over the past 10 years or so, one of the most challenging aspect that still remains for me is how to tackle inter-dependency problems in OOPs efficiently?
The confusions that arises when dealing with problems that require dealing with inter-dependency in an OOPs are :
- What data structures should we use?
- What is the optimal number of data structures that will solve this problem?
The 2nd problem is there because in inter-dependent problems, it is not trivial to come up with some data structures that will solve all the requirements.
One might come up with a different data structure for each requirement but it need not be optimal.
In this post, I will walk through some approaches that will help to clear doubts on how to approach such kind of problems in both interviews as well as on the job.
On the way we will realize that is not just about the data structures that we have to worry about.
The E-commerce Problem
Design a class for an E-commerce website with the following requirements:
- Sellers must be able to upload, edit and delete products.
- Products must be listed under one or multiple categories such as sports, apparel, electronics etc.
- Products will have some price associated with them
- It must be possible to fetch the price of a given product.
- It must be possible to fetch the products sorted by price lowest to highest for a given category.
- If a category is deleted all products listed within that category will be removed but if any product is listed in multiple categories then it is only removed if all categories are removed.
What kind of in-memory data structures are you going to use for this problem? Discuss the pros and cons of your design.
Solution:
One way to approach such problems is to first list down all the entities that are interacting.
We can define entities as all such variables in the problem which are either an input or output for a requirement.
So in this case, we have:
- Seller
- Category
- Product
- Price
Next we list down how we want to connect the entities given our requirements.
In this format, we define an input and an output. Output can be a list or an array of entities and in that case we have to define how this list is ordered.
Thus we have the following relationships b/w the entities:
seller -> [products]
product -> price
product -> [categories]
category -> [products] (ordered by price)
Given this relationships, now it is easy to see that we can apply relevant data structure to each of these relationships separately:
seller -> [products]
Hash Map + Set
product -> price
Hash Map
product -> [categories]
Hash Map + Set
category -> [products] (ordered by price)
Hash Map + Sorted Set
Next, step is to define the functions for the class as per the requirements of the problem:
The constructor with all the Hash Maps:
from sortedcontainers import SortedSet
class Ecommerce:
def __init__(self):
self.seller_products = {}
self.product_seller = {}
self.product_price = {}
self.product_categories = {}
self.category_products = {}
Next we add the sellers and categories
def addCategory(self, cat_id):
# Add a category
if cat_id not in self.category_products:
self.category_products[cat_id] = SortedSet(key=lambda k:k[0])
def addSeller(self, seller_id):
# Add a seller
if seller_id not in self.seller_products:
self.seller_products[seller_id] = set()
Then we add the products under a seller and a category
def addProduct(self, seller_id, prod_id, price, cat_ids):
# Add a product
try:
# Relationship b/w seller and product
self.seller_products[seller_id].add(prod_id)
self.product_seller[prod_id] = seller_id
# Relationship b/w product and price
self.product_price[prod_id] = price
# Relationship b/w product and categories
if prod_id not in self.product_categories:
self.product_categories[prod_id] = set()
for cat_id in cat_ids:
self.product_categories[prod_id].add(cat_id)
self.category_products[cat_id].add((price, prod_id))
except Exception as e:
print("Error: ", e)
How we can update the price of a product?
def setPriceOfProduct(self, prod_id, price):
# Set price of product
try:
# From sorted set in category to products, remove the
# corresponding key and add a new key
cat_ids = self.product_categories[prod_id]
old_price = self.product_price[prod_id]
for cat_id in cat_ids:
self.category_products[cat_id].remove((old_price, prod_id))
self.category_products[cat_id].add((price, prod_id))
# Update product price
self.product_price[prod_id] = price
except Exception as e:
print("Error: ", e)
Methods to delete a category for a product and add an existing category to a product. This is useful for updating product categories
def deleteProductCategory(self, prod_id, cat_id):
# Delete a category from a product
try:
self.product_categories[prod_id].remove(cat_id)
price = self.product_price[prod_id]
self.category_products[cat_id].remove((price, prod_id))
except Exception as e:
print("Error: ", e)
def setProductCategory(self, prod_id, cat_id):
# Add a category to a product
try:
self.product_categories[prod_id].add(cat_id)
price = self.product_price[prod_id]
self.category_products[cat_id].add((price, prod_id))
except Exception as e:
print("Error: ", e)
How to delete a product or a category?
def deleteProduct(self, prod_id):
# Delete a product
try:
# Remove connections from seller
seller_id = self.product_seller[prod_id]
self.seller_products[seller_id].remove(prod_id)
self.product_seller.pop(prod_id)
# Remove connections from categories
price = self.product_price[prod_id]
self.product_price.pop(prod_id)
cat_ids = self.product_categories[prod_id]
for cat_id in cat_ids:
self.category_products[cat_id].remove((price, prod_id))
# Delete the keys
self.product_categories.pop(prod_id)
except Exception as e:
print("Error: ", e)
def deleteCategory(self, cat_id):
# Delete a category
try:
# Remove cat_id from all associated products
for prod_id in self.category_products[cat_id]:
self.product_categories[prod_id].remove(cat_id)
if len(self.product_categories[prod_id]) == 0:
self.deleteProduct(prod_id)
# Delete key
self.category_products.pop(cat_id)
except Exception as e:
print("Error: ", e)
That looks like a decent implementation (although we have skipped some methods required to complete the class)
But there are a few things that we can implement in a better way:
What if we want to add a new entity such as Brand of a product. For that we must also create 2 new hash maps product -> brand, brand -> [products].
Moreover during deletion of a product we must also remember to update the two new hash maps. Now think of adding 10 more entities to the problem.
Similarly, when we update the price or other attributes of a product, we must also take care of the hash maps that uses product as a key.
How would you update your code to handle all these ?
If your answer is using Graph, then you are right !!!
We can largely simplify the problem by re-implementing all the hash maps as a graph data structure where each entity represents a node and the relationships are edges b/w them.
But how do we convert hash maps into Graphs?
Remember that Graphs are represented as adjacency lists which are nothing but hash maps from one node to another node or a set of nodes.
For e.g. if we have a hash map a -> b and another from b ->[c], then we have a graph with nodes a, b and c and edges b/w them a -> b -> [c].
Let us define 2 classes ‘Node’ and ‘Graph’ for representing the Graph:
Node class
class Node:
def __init__(self, id, type, attributes=None):
self.id = id
self.type = type
self.uid = (self.id, self.type)
self.attributes = attributes
def __hash__(self):
return hash(self.uid)
def __eq__(self, other):
return self.uid == other.uid
The hash and eq methods are required for comparing two or more Node objects.
The Graph class
class Graph:
def __init__(self):
self.adj_list = {}
self.parents = {}
Graph class methods : addNode, addEdge, deleteEdge, deleteNode etc.
def addNode(self, node:Node):
# Add a node
self.adj_list[node.uid] = {}
def addEdge(self, srcNode:Node, dstNode:Node, sortKey:str=None):
# Add an edge b/w 2 nodes
if srcNode.uid not in self.adj_list:
self.adj_list[srcNode.uid] = {}
if dstNode.type not in self.adj_list[srcNode.uid]:
if sortKey:
# Add node with ordering enabled
self.adj_list[srcNode.uid][dstNode.type] \
= SortedSet(key=lambda k: k.attributes[sortKey])
else:
self.adj_list[srcNode.uid][dstNode.type] = set()
self.adj_list[srcNode.uid][dstNode.type].add(dstNode)
# Add parent nodes
if dstNode.uid not in self.parents:
self.parents[dstNode.uid] = set()
self.parents[dstNode.uid].add(srcNode)
def deleteEdge(self, srcNode:Node, dstNode:Node):
# Delete an edge b/w 2 nodes
if srcNode.uid in self.adj_list and \
dstNode.type in self.adj_list[srcNode.uid] and \
dstNode in self.adj_list[srcNode.uid][dstNode.type]:
# Remove child and parent connections
self.adj_list[srcNode.uid][dstNode.type].remove(dstNode)
self.parents[dstNode.uid].remove(srcNode)
if len(self.adj_list[srcNode.uid][dstNode.type]) == 0:
self.adj_list[srcNode.uid].pop(dstNode.type)
def getChildren(self, srcNode:Node, dstType:str):
# Get all children of a node of speecific type
if srcNode.uid in self.adj_list and \
dstType in self.adj_list[srcNode.uid]:
return list(self.adj_list[srcNode.uid][dstType])
return []
def deleteNode(self, node:Node):
# Delete a node
# Delete all child edges and orphan nodes
for _, dstNodes in dict(self.adj_list[node.uid]).items():
for dstNode in set(dstNodes):
# Delete all edges first
self.deleteEdge(node, dstNode)
self.deleteEdge(dstNode, node)
# If node is "orphan" then recursively delete that node
if len(self.adj_list[dstNode.uid]) == 0 and \
len(self.parents[dstNode.uid]) == 0:
self.deleteNode(dstNode)
# Delete all parent edges and orphan nodes
for dstNode in set(self.parents[node.uid]):
# Delete all edges first
self.deleteEdge(node, dstNode)
self.deleteEdge(dstNode, node)
# If node is "orphan" then recursively delete that node
if len(self.adj_list[dstNode.uid]) == 0 and \
len(self.parents[dstNode.uid]) == 0:
self.deleteNode(dstNode)
if len(self.adj_list[node.uid]) == 0:
self.adj_list.pop(node.uid)
Finally we re-implement our Ecommerce class using Graph instead of hash maps.
class Ecommerce:
def __init__(self):
self.graph = Graph()
self.nodes = {}
Add sellers and categories
def addSeller(self, seller_id):
# Add seller node
try:
node = Node(seller_id, "SELLER", {})
self.nodes[node.uid] = node
except Exception as e:
print("Error: ", e)
def addCategory(self, cat_id):
# Add category node
try:
node = Node(cat_id, "CATEGORY", {})
self.nodes[node.uid] = node
except Exception as e:
print("Error: ", e)
Add products
def addProduct(self, seller_id, prod_id, price, cat_ids):
# Add a product node
try:
if (prod_id, "PRODUCT") not in self.nodes:
prod_node = Node(prod_id, "PRODUCT", {"price":price})
self.nodes[prod_node.uid] = prod_node
prod_node = self.nodes[(prod_id, "PRODUCT")]
if (seller_id, "SELLER") not in self.nodes:
self.addSeller(seller_id)
# Create edges with seller node
seller_node = self.nodes[(seller_id, "SELLER")]
self.graph.addEdge(seller_node, prod_node)
self.graph.addEdge(prod_node, seller_node)
# Create edges with categories
for cat_id in cat_ids:
if (cat_id, "CATEGORY") not in self.nodes:
self.addCategory(cat_id)
cat_node = self.nodes[(cat_id, "CATEGORY")]
self.graph.addEdge(prod_node, cat_node)
self.graph.addEdge(cat_node, prod_node, "price")
except Exception as e:
print("Error: ", e)
Set the price of a product
def setPriceOfProduct(self, prod_id, price):
# Set price of product
try:
prod_node = self.nodes[(prod_id, "PRODUCT")]
seller_id = self.graph.getChildren(prod_node, "SELLER")
if len(seller_id) > 0:
seller_id = seller_id[0].id
else:
raise Exception("Seller ID Missing")
cat_ids = self.graph.getChildren(prod_node, "CATEGORY")
if len(cat_ids) == 0:
raise Exception("Category ID Missing")
else:
cat_ids = [x.id for x in cat_ids]
# First delete the node
self.graph.deleteNode(prod_node)
# Then add back with updated price
self.addProduct(seller_id, prod_id, price, cat_ids)
except Exception as e:
print("Error: ", e)
Remove category for a product and add an existing category
def deleteProductCategory(self, prod_id, cat_id):
# Delete category of a product
try:
n1 = self.nodes[(prod_id, "PRODUCT")]
n2 = self.nodes[(cat_id, "CATEGORY")]
# Remove the edges
self.graph.deleteEdge(n1, n2)
self.graph.deleteEdge(n2, n1)
except Exception as e:
print("Error: ", e)
def setProductCategory(self, prod_id, cat_id):
# Add existing category to a product
try:
n1 = self.nodes[(prod_id, "PRODUCT")]
n2 = self.nodes[(cat_id, "CATEGORY")]
# Add edges
self.graph.addEdge(n1, n2)
self.graph.addEdge(n2, n1, sortKey="price")
except Exception as e:
print("Error: ", e)
Finally delete product and category
def deleteProduct(self, prod_id):
# Delete a product
try:
prod_node = self.nodes[(prod_id, "PRODUCT")]
self.graph.deleteNode(prod_node)
except Exception as e:
print("Error: ", e)
def deleteCategory(self, cat_id):
# Delete a category
try:
cat_node = self.nodes[(cat_id, "CATEGORY")]
self.graph.deleteNode(cat_node)
except Exception as e:
print("Error: ", e)
*The implementations are not thoroughly unit tested so there could be bugs in them
Finding equivalence b/w Hash Maps and Graph:
Lets take an example of entity relationship:
a -> b, a->c, c->[a], b->[d], d->[a]
The adjacency list for the Graph would look like as follows:
(a : b, c) (b : [d]) (c : [a]) (d : [a])
Add a new entity X and associated relationships
- Hash Map — Create new hash map(s) with key as X.id. If there are multiple relationships emerging out of X, then create multiple hash maps corresponding to each relation.
- Graph — Create new node(s) with id as X.uid. Create edges to other entities for each type of relationship.
X.uid is a combination of X.id and X.type because id value may not be unique across all types.
Delete an entity X and associated relationships
- Hash Map — Delete X from all Hash Maps with X.id as the key.
For Hash Maps where X is a value, update those Hash Maps too to remove any reference to X.
If the key does not have any value after removing X, then delete the key too from Hash Map (although this is a special requirement only).
Note that how this process is recursive because now we are also deleting other entities which were only dependent on X. - Graph — Remove all edges going out to and coming in from other nodes to X.
If after removal of the edges, any node which is an “orphan” node with no outgoing or incoming edges, then delete those nodes too (special case).
Do BFS or DFS to take this action over the entire graph.
Update entity X
Updation of entities is not a simple process with inter-dependent data structures.
If the entity X do not have any dependency on other entities, then it is just a simple update to an attribute of X.
But if the entity X is linked to other entities, then there is a chance that updating X, will alter how the links with other entities work. For e.g. if there is a link b/w product price and category, if price is less than 100. Then if price is updated to 150. The link do not exists anymore.
To overcome the challenges with such dependencies, the safest way to update is to first delete the entity X and then add back X with updated attributes.
The approach of deletion with graph is more generic and less error prone than deleting bunch of keys keeping in mind the dependencies b/w multiple hash maps.
For e.g. if there are N different relationships from Product to other entities, then on deleting a Product, we also have to remove reference of product from the N other hash maps. For that we have to keep in mind, which hash maps holds references to products.
Using graphs, we have to only deal with Nodes and Edges. Since we tag each edge with the type of destination node, thus it is pretty easy to delete a Product node because we already know which edges go out of a product node and which edges come to a product node (using parents).
There are many things we have skipped while doing the implementation:
- How to serialize and deserialize the class? This is an useful step when we want to run the program at multiple intervals of time.
- What happens when the system crashes? Since the data structures are in-memory, they will be lost. One way is to create checkpoints with serialization. But that would also lead to data loss. Another technique is to use a WAL (write ahead log) to persist each command in a text file.
- What happens when the size of in-memory data increases without bounds say few 100GBs. One solution is to use a distributed implementation where we partition the graph across multiple servers.
- How to ensure atomicity? During addition of the new product, it can happen that some nodes and edges were created successfully, while some failed.
- Thread Safety? How to ensure that the graph object is thread safe? Locks are expensive but sometimes necessary.
One thing worth mentioning in the above solution is that we implicitly assumed that each requirement will be returned to us in the most time efficient complexity possible i.e. O(1) or O(logN).
In that process, we tried to compromise on the space complexity by creating additional data structures to do some precomputations.
For e.g. instead of returning the products in a category in O(N) time by keeping them in sorted sets, we can also keep them in unordered lists and at the time of query we sort the list based on product price with a time complexity of O(NlogN).