Going beyond DSA — The Pain of Real Life Software Engineering

Abhijit Mondal
9 min read · Aug 1, 2022


source: quantamagazine

I have often heard that real-life software engineering does not correlate with the ability to solve data structures and algorithms problems in interviews, and that companies should re-evaluate their hiring practices.

Although there is some truth to that statement, the ability to solve data structures and algorithms problems in interviews is an indicator of whether a candidate will be able to solve hard software engineering problems when the situation arises.

In this “evolving” post we will look at some of the considerations we need to take into account while designing real-life software applications. These are all based on my own personal experience and are by no means exhaustive. Feel free to add your own experiences in the comments.

Never write code assuming you are the only person who will ever work on it.

It’s a common mistake that fresh software engineers often make. Common symptoms of this anti-pattern are:

  • All the code is written in one or two files.
  • Redundant code.
  • Constants, secrets, environment variables etc. are defined inline.
  • No test cases.
  • Parameters are hard-coded instead of being accepted as arguments.
  • Variable names are not meaningful.
  • No comments.
  • Reinventing the wheel — e.g. writing a sorting function from scratch instead of using library functions.
  • … and so on.

Your code should be independent of the operating system, of whether it runs locally or in the cloud, and of the environment (development, staging or production).

Using Docker for development and deployment mitigates the following issues:

  • Code fails to compile or throws runtime errors due to differences in OS (Linux vs. Windows).
  • Errors due to differences in installed libraries (missing or different versions) between development and production.
  • Lack of permission to download and install packages from the internet in production.

Reusable cloud infrastructure. The same infrastructure settings that you use for development should be replicated as is in staging and production.

Using an Infrastructure-as-Code (IaC) template can help mitigate this issue. Some open source IaC tools are Terraform, Pulumi, Chef etc. Cloud-specific options are CloudFormation (AWS), ARM (Azure Resource Manager) and so on.

  • Manually deploying cloud resources is error prone, especially when there are many resources. What if you missed one configuration, or used a different configuration in production than in staging?
  • The manual process is time consuming.
  • When multiple developers are working on the same resources, how do you track the changes made? With IaC, the templates themselves can be kept under source control (a hedged sketch follows this list).
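As a rough illustration, here is a minimal sketch of what an IaC definition could look like using Pulumi’s Python SDK; the bucket name, versioning setting and stack names are assumptions for illustration, not a prescription.

```python
# Minimal Pulumi sketch (assumes the pulumi and pulumi_aws packages are installed
# and cloud credentials are configured). Resource names here are illustrative only.
import pulumi
from pulumi_aws import s3

# The stack name (e.g. "dev", "staging", "prod") parameterizes the template,
# so every environment is created from the same checked-in definition.
env = pulumi.get_stack()

bucket = s3.Bucket(
    f"pipeline-artifacts-{env}",
    versioning=s3.BucketVersioningArgs(enabled=True),
)

pulumi.export("bucket_name", bucket.id)
```

Running `pulumi up` against each stack then applies identical settings to development, staging and production, and the template itself lives under source control.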

Idempotency of pipelines. For data pipelines it is important that, for the same input, the final state of all components after a pipeline run is the same every time.

In data pipelines we have multiple data sources, data transformations and sinks (databases, filesystems etc.). It might happen that the system crashes during a run before all sinks are updated, or that some particular sink has been updated only partially.

If we retry the failed operation, the system is in a different state from the one before the failed operation, because some sinks are fully or partially updated while others are not updated at all. Repeating the operation on this updated state can then cause errors due to duplication.

One possible solution is to log the state of the affected rows before and after an operation, in a log file per sink, and if there is a failure somewhere, revert to the previous state using that log.
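A minimal sketch of this idea, assuming hypothetical sink-specific helpers snapshot_rows, apply_update and restore_rows:

```python
# Hedged sketch of the "log the state before you write" approach described above.
# snapshot_rows, apply_update and restore_rows are hypothetical helpers for your sink.
import json

def run_sink_update(sink, rows, log_path):
    # Record the current state of the rows we are about to touch, one log per sink.
    before = snapshot_rows(sink, [r["id"] for r in rows])
    with open(log_path, "w") as f:
        json.dump(before, f)

    try:
        apply_update(sink, rows)
    except Exception:
        # On failure, roll the sink back to the logged state so that a retry
        # starts from exactly the same state as the original attempt.
        with open(log_path) as f:
            restore_rows(sink, json.load(f))
        raise
```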

Data, code and models are going to evolve over time, and it is important to version them so that we can replicate and debug production issues.

Just as we version our code with builds and releases, it is important to version data and models too when working with ML systems.

For example, suppose we put a new release of an ML model into production. Due to some mistake, the ML model in production gets deleted or corrupted. Meanwhile the data for the model has also been updated, so there is now no way to get back the exact model that was running in production.

Unless we version and snapshot the data used to generate a production model, such issues can frequently arise.
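One way to make this concrete, sketched below under the assumption that the model is a plain bytes artifact and the training data is a single file (both assumptions for illustration), is to tag every saved model with a hash of the data snapshot and the code version that produced it:

```python
# Hedged sketch: tie each model artifact to the data snapshot and code version
# that produced it. Paths, file layout and the model_bytes input are illustrative.
import hashlib
import json
import pathlib

def data_fingerprint(data_path):
    # Hash the raw training data so the exact snapshot can be identified later.
    return hashlib.sha256(pathlib.Path(data_path).read_bytes()).hexdigest()

def save_versioned_model(model_bytes, data_path, code_version, out_dir="models"):
    meta = {
        "data_sha256": data_fingerprint(data_path),
        "code_version": code_version,  # e.g. the git commit hash of the training code
    }
    version = hashlib.sha256(json.dumps(meta, sort_keys=True).encode()).hexdigest()[:12]
    out = pathlib.Path(out_dir) / version
    out.mkdir(parents=True, exist_ok=True)
    (out / "model.bin").write_bytes(model_bytes)
    (out / "meta.json").write_text(json.dumps(meta, indent=2))
    return version
```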

Many times product managers come up with new metrics to be computed and compared over time.

These metrics can easily be accommodated in a new release, but for older releases where the logic was not implemented, we cannot go back and apply it to the historical data if that data has since changed.

Similarly, suppose we see that the new model is not performing as expected in production and is often giving empty responses, so we want to revert to the old model.

Without model versioning there is no way to go back to the old model.

Using retry logic for failed I/O operations.

CPU bound tasks fail mostly due to incorrect code, i.e. some edge cases were not handled correctly. But I/O bound tasks, such as reading a file from disk or fetching a URL over an HTTP connection, may also fail for several other reasons, such as:

  • Network interruption, disconnection & reconnection.
  • Slow internet and imposed timeouts.
  • Throttling due to rate limit imposed by the downstream service.
  • External filesystem not mounted or taking time to mount.
  • … and so on.

For the above reasons, we need to add retry logic to the code. Retries can be implemented using an exponential backoff strategy: if we retry fetching a URL after X seconds and it fails, we retry again after 2X seconds, then 4X, and so on, until either we succeed or we exhaust the maximum number of retries.
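A minimal sketch of exponential backoff in Python (the URL, timeout and retry counts are placeholders):

```python
# Retry an I/O bound call with exponential backoff: wait X, 2X, 4X, ... seconds.
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == max_retries - 1:
                raise  # retries exhausted, surface the error
            time.sleep(base_delay * (2 ** attempt))
```

In practice a small random jitter is often added to the delay so that many clients do not retry in lockstep.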

Logging, Tracing, Monitoring & Alerting

Systems and code break, and will break even in the hands of the best programmers, but we need to:

  • Detect where it broke (Logging)
  • Detect what kind of scenarios led it to break (Monitoring)
  • Detect the flow of code that led to the break (Tracing)
  • Notify the people responsible for the piece of code or the system, along with the log path, monitoring metrics, trace path etc., so that it can be fixed (Alerting)
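As a small example of the logging part, here is a sketch using Python’s standard logging module; the service name and the payment-gateway call are placeholders:

```python
# Minimal logging sketch; real systems would ship these logs to a centralized
# store and attach trace ids, but the idea is the same.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("payment-service")

def charge(user_id, amount):
    logger.info("charge started user_id=%s amount=%s", user_id, amount)
    try:
        ...  # call the downstream payment gateway (placeholder)
    except Exception:
        # logger.exception records the stack trace, answering "where did it break?"
        logger.exception("charge failed user_id=%s amount=%s", user_id, amount)
        raise
```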

Track and maintain shared and global variables in centralized data storage such as a DB server (but with care !!!).

Systems with multiple microservices or multi-activity data pipelines leverage global or shared variables for different purposes such as:

  • Date — the date on which a pipeline is run.
  • Environment such as development, staging and production.
  • Feature flags for enabling certain features only for a subset of users.
  • … and so on.

For example, if each microservice maintains a local hashmap of the feature flags, or uses an IF..ELSE block, then updating a feature flag requires code changes across multiple microservices, which is error prone.

Similarly with the date: assuming we track the date on which a pipeline is run, if the pipeline starts at 11:45 PM and completes at 12:15 AM the next day, then although it ran for only 30 minutes, the “date” value is not the same for activities that ran before midnight and activities that ran after.

It is best to compute the date once, insert it into a centralized DB server, and have each activity fetch the date from the DB server instead of computing it itself.

But there’s a catch. What if someone mistakenly updates a shared variable in the DB while an end-to-end pipeline is running? Half of the pipeline will see one value and the other half a different value.

Ideally we would fetch the shared variable(s) from the DB once, at the start, and then pass them along the entire pipeline with each request.
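A sketch of that pattern, assuming a hypothetical db connection object and hypothetical extract/transform/load activities:

```python
# Hedged sketch: read shared variables once at the start of the run, then pass
# the same context to every activity. db, extract, transform and load are placeholders.
def load_run_context(db):
    row = db.execute(
        "SELECT run_date, environment FROM pipeline_config WHERE pipeline = ?",
        ("daily_ingest",),
    ).fetchone()
    return {"run_date": row[0], "environment": row[1]}

def run_pipeline(db):
    ctx = load_run_context(db)
    # Every activity sees the same values, so a midnight rollover or a concurrent
    # config change cannot split the run across two different dates or environments.
    extract(ctx)
    transform(ctx)
    load(ctx)
```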

Avoiding cold start problems (warming up caches)

Often when we release a new feature in production, if the feature relies on backend SQL queries involving multiple joins, such as finding friends of friends, the real-time performance of these queries is going to be bad.

Often the first few users of the feature suffer, since the queries are not cached the first time. To overcome this we can explicitly cache the queries and results for some selected user ids.
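A rough sketch of such cache warming, where the query text, the cache client and the list of user ids are all placeholders:

```python
# Hedged sketch: pre-compute the expensive query for a selected set of users so
# that the first real users hit a warm cache. EXPENSIVE_FOF_QUERY, db and cache
# stand in for your own query, database handle and cache client.
def warm_friend_of_friend_cache(db, cache, user_ids):
    for user_id in user_ids:
        key = f"fof:{user_id}"
        if cache.get(key) is None:
            rows = db.execute(EXPENSIVE_FOF_QUERY, (user_id,)).fetchall()
            cache.set(key, rows, ttl=3600)  # cache for an hour (placeholder TTL)
```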

Trading off algorithm performance for reusability, maintenance etc. in production.

When I was a junior, whenever I went through some production code I used to wonder: why the hell did they not use a trie for searching substrings, or why are people still writing linear search instead of binary search for a list of sorted ids?

Well, it’s always debatable. On the one hand, using a trie or binary search improves time complexity; on the other, it increases code complexity in production, which means writing more test cases, more custom code when built-in libraries are not available, and so on.

Decouple systems as much as possible but not too much.

Nothing new here. Microservices are built on top of it. Decoupling can happen at multiple levels in an application.

  • Within a codebase or repository — decoupling models (data access layers, database connections etc.) from controllers (API logic handlers, i.e. how to route and handle each GET/PUT/POST request) and from views (how to display in the UI).
  • Decoupling repositories — for large projects with distributed teams, it makes sense to split Git repositories, e.g. keeping machine learning model training code in a different repository from the deployment or infrastructure-as-code codebase.

Some advantages of decoupling are:

  • Granular role based access controls.
  • Fewer test cases and thus faster release cycles.
  • Separation of responsibilities.
  • Smaller codebases and files, and thus fewer chances of bugs.

Some disadvantages as well:

  • It is somewhat harder to navigate multiple repositories for the same project.
  • Common utility functions might be replicated across repositories.

Handle secrets such as DB passwords securely. They should not be accessible to anyone who does not need them.

It is good practice to use cloud key management services such as KMS in AWS or Key Vault in Azure for storing and accessing sensitive data such as passwords or client keys.

Passwords, client keys, access keys etc. should never be checked into git repositories as plain text, since anybody could misuse them.
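A minimal sketch of the difference, using an environment variable that would be populated from a key vault or the deployment system (the variable name is illustrative):

```python
# Read secrets from the environment instead of hard-coding them in the repository.
import os

DB_PASSWORD = os.environ["DB_PASSWORD"]  # fails fast if the secret is not provided

# Anti-pattern (never commit this):
#   DB_PASSWORD = "hunter2"
```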

Test your own algorithms against brute force approaches.

Sometimes I have to write an algorithm or data structure from scratch because it is not available as a library in my programming language and it is essential to do so.

For example, a segment tree for finding the max/min in an interval is generally not available in most languages’ standard libraries.

So how do you evaluate whether the implementation you have written is correct or not ?

Use and compare against brute force methods. If we are 100% confident in our brute force approach, we can run it against our segment tree implementation on the same input; if there is a mismatch, our implementation has bugs.

But if it matches, we cannot always be 100% sure, because the test case may have been “easy”.

Generate millions of random test cases using built-in libraries and compare.
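A sketch of such randomized comparison testing, where SegmentTree stands in for whatever custom structure you implemented and range_min_brute is the trusted brute force baseline:

```python
# Compare a custom implementation against a trusted brute force on many random inputs.
import random

def range_min_brute(arr, lo, hi):
    return min(arr[lo:hi + 1])

def fuzz_test(num_cases=100_000):
    for _ in range(num_cases):
        arr = [random.randint(-1000, 1000) for _ in range(random.randint(1, 50))]
        tree = SegmentTree(arr)  # your implementation under test (hypothetical class)
        lo = random.randrange(len(arr))
        hi = random.randrange(lo, len(arr))
        assert tree.range_min(lo, hi) == range_min_brute(arr, lo, hi), (arr, lo, hi)
```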

User inputs must be validated before making SQL queries

Any data generated by users, such as GET and POST request parameters, must be validated before being used in SQL queries. Otherwise people could use these inputs for SQL injection and possibly delete your table or your entire database.

I made the mistake of only looking at WHERE clauses in SQL queries to check whether their parameters were user generated. It might happen that inputs pass through a pipeline with multiple intermediate variables and the WHERE clause is built from those variables, so it is an indirect user input that must also be validated.
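The standard defence, shown here as a small sqlite3 sketch with illustrative table and column names, is to always use parameterized queries rather than string concatenation:

```python
# Parameterized queries let the driver escape user input, preventing SQL injection.
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    cur = conn.execute("SELECT id, email FROM users WHERE username = ?", (username,))
    return cur.fetchone()

# Dangerous alternative (do NOT do this):
#   conn.execute(f"SELECT id, email FROM users WHERE username = '{username}'")
```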

Don’t use multithreading or asynchronous programming without proper understanding of how control is flowing.

Some common pitfalls of multithreading I have seen:

  • Using multithreading or multiprocessing for CPU bound tasks such as joining two lists. Most often, built-in libraries or single threaded implementations work better than multithreading for CPU bound tasks. Multithreading is recommended for I/O bound tasks such as downloading or uploading images/files, or reading/writing to DBs.
  • Not ensuring thread safety. When 2 threads process the same transaction, it can lead to incorrect results without thread safety. For example, take x = x + 1: if x = 10 initially, with 2 threads one would expect a result of 12, but the 2 threads may read the same value 10 simultaneously and both update x to 11 (see the sketch after this list).
  • Not using blocking implementations. When using shared data structures such as a shared list or shared queue, it can happen that a thread tries to read a value while the queue is empty. With a non-blocking queue, the thread will exit even if new elements are still incoming.
  • Ignoring dependencies. Multithreading is applicable to tasks which are independent. If dependent tasks are multithreaded, we can get errors randomly when a task is executed before the task it depends on. The same applies to writing to memory: if multiple threads write to a shared data structure, they should always write to disjoint memory addresses.
  • Spawning raw threads instead of using thread pools. Creating and destroying threads is unnecessary overhead; most often, existing pool threads are already waiting for tasks to be assigned.
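A short sketch of two of the points above, a lock around shared state and a thread pool for I/O bound work (the URLs and the download body are placeholders):

```python
# Lock shared state, and prefer a thread pool over spawning raw threads for I/O work.
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

counter = 0
counter_lock = Lock()

def increment():
    global counter
    # Without the lock, two threads can both read 10 and both write 11.
    with counter_lock:
        counter += 1

def download(url):
    ...  # I/O bound work, e.g. fetching a file (placeholder)

urls = ["https://example.com/a", "https://example.com/b"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(download, urls))
```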

To Be Continued…
