ML System Design — ETA Prediction

Abhijit Mondal
9 min read · Mar 25, 2022

Design a system to show the ETA of a food delivery for Uber Eats/DoorDash. The ETA is shown after an order is placed at a restaurant.

Type of ML problem

This is a regression problem, as we need to predict the duration until the food arrives.

But this problem is not so simple, because one of the non-functional requirements is that the estimate should be as accurate as possible. If the estimate is too high, the user might cancel the order; if it is too low, the food arrives later than promised and the user might complain or uninstall the app. In either case the company loses.

Instead of a single regression problem, we can divide the problem into multiple components based on the different stages of food delivery:

  • Time taken by the restaurant to prepare the order (P1)
  • Time taken to find a rider match and pick up the order (P2)
  • Time taken to deliver the order from the restaurant to the user's location (P3)

Most often P1 and P2 run concurrently, i.e. they happen in parallel. Thus

ETA = max(P1, P2) + P3

Since it is not possible to know exactly when a restaurant has finished cooking the food items, it is technically challenging to compute P1 directly. We will therefore use the time at which the rider picks up the order as a proxy: the pickup happens only after both the food is ready and the rider has arrived, so the observed preparation time already absorbs max(P1, P2), and we can replace max(P1, P2) with P1:

ETA = P1 + P3

Also, in order to ensure that riders do not wait at the restaurant unnecessarily while the food is being cooked, the system must dispatch the rider request such that the total time to dispatch the request, have a rider accept it, and have the rider arrive at the restaurant is roughly equal to the food preparation time.

Thus, if the order is placed at time T0 and the request for a rider is dispatched at T1, then abs((T0 + P1) - (T1 + P2)) should be as small as possible.

Thus we can conclude that T1 = T0 + P1 - P2.
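
As a small illustration, once the models return P1, P2 and P3, the dispatch time and the displayed ETA could be computed as in the sketch below (function and variable names are illustrative, not part of the original design):

def compute_dispatch_and_eta(t0, p1, p2, p3):
    """t0: order creation time (epoch seconds); p1, p2, p3: model predictions in seconds."""
    # Dispatch the rider so that arrival at the restaurant coincides with food readiness:
    # T1 = T0 + P1 - P2 (but never before the order itself is placed).
    t1 = max(t0, t0 + p1 - p2)
    # Pickup time serves as the proxy for max(P1, P2), so the displayed ETA is P1 + P3.
    eta = p1 + p3
    return t1, eta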

Before finding P2, we also need to match a rider to an order. Let's assume that there is a separate matching service which returns the optimal rider for an order.

What are the features?

We can use some of the following features for our models:

Model P1:

  • Number of orders in queue at the restaurant
  • Number of cooking staff
  • Average time to prepare order in the last 30 days.
  • Type of restaurant — Fast Food, Chinese, Indian, etc.
  • Average number of customers eating in the restaurant at any given time in the last 30 days
  • Day of the week
  • Hour of the day

Model P2 (After rider-order matching algorithm):

  • Average speed of rider in the last 30 days.
  • Estimated distance from rider location to restaurant.
  • Estimated ETA from the Map.
  • Rider's vehicle type (model)
  • Traffic conditions — Light, Moderate, Heavy
  • Average speed of all riders picking up from the same restaurant in the last 1–2 hours.
  • Average speed of all riders picking up from the same restaurant in the last 7 days.
  • Number of riders already dispatched to the same restaurant.
  • Day of the week
  • Hour of the day
  • Weather data — Is it raining, Temperature, Humidity, Wind Speed

Model P3:

  • Average speed of rider in the last 30 days.
  • Estimated distance from restaurant location to user location.
  • Estimated ETA from the Map.
  • Rider's vehicle type (model)
  • Traffic conditions — Light, Moderate, Heavy
  • Average speed of all riders going from same pickup to same drop location in the last 1–2 hours.
  • Average speed of all riders going from same pickup to same drop location in the last 7 days.
  • Day of the week
  • Hour of the day
  • Weather data — Is it raining, Temperature, Humidity, Wind Speed

Note that models P2 and P3 are quite similar, so we can combine them into a single model. We just have to change the source and destination locations for the two scenarios.
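
As an illustration, the shared feature record for such a combined model could look like the sketch below (field names are illustrative): for the P2 leg the source is the rider's current location and the destination is the restaurant, while for the P3 leg the source is the restaurant and the destination is the user.

from dataclasses import dataclass

@dataclass
class TravelLegFeatures:
    # Source/destination are swapped in meaning between the P2 and P3 legs.
    source_lat: float
    source_lon: float
    dest_lat: float
    dest_lon: float
    rider_avg_speed_30d: float
    map_estimated_distance: float
    map_estimated_eta: float
    vehicle_type: str
    traffic_condition: str        # "light" | "moderate" | "heavy"
    route_avg_speed_2h: float
    route_avg_speed_7d: float
    day_of_week: int
    hour_of_day: int
    is_raining: bool
    temperature: float
    humidity: float
    wind_speed: float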

How to get training and testing data, how to get labels?

For restaurant details we will have one table:

restaurants: (res_id, res_name, res_type, number_of_staffs, in_seating_capacity, latitude, longitude, city, country)

For user details we will have one table:

users: (user_id, firstname, lastname, email, phone, address, latitude, longitude, city, country)

Orders, as they are created, will be ingested into Kafka, then read by consumers and written to a Cassandra or Hive table:

orders: (order_id, user_id, res_id, rider_id, list_of_food_items, total_cost, created_time, rider_dispatch_time, pickup_time, delivered_time, estimated_eta_map, estimated_distance_map, vehicle_type, day_of_week, hour_of_day, temperature, humidity, wind_speed, is_raining, is_active, is_picked_up, total_distance)
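
A rough sketch of such a consumer, assuming kafka-python and the DataStax Cassandra driver (the topic name, keyspace and subset of columns written are illustrative):

import json
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "orders",                                   # assumed topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
session = Cluster(["127.0.0.1"]).connect("food_delivery")  # assumed keyspace

for message in consumer:
    order = message.value
    # Only a few columns shown for brevity; the real table has the full schema above.
    session.execute(
        "INSERT INTO orders (order_id, user_id, res_id, created_time) VALUES (%s, %s, %s, %s)",
        (order["order_id"], order["user_id"], order["res_id"], order["created_time"]),
    )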

To compute the number of orders in queue at a restaurant, we can run a Spark SQL job as follows:

SELECT res_id, COUNT(*) as num_active_orders FROM orders WHERE is_picked_up=0 GROUP BY res_id

Average time to prepare an order in the last 30 days:

SELECT res_id, AVG(pickup_time-created_time) as prep_time FROM orders WHERE created_time >= NOW() - INTERVAL 30 DAYS GROUP BY res_id

Average speed of a rider in the last 30 days:

SELECT rider_id, AVG(total_distance/(delivered_time-pickup_time)) as speed FROM orders WHERE created_time >= NOW() - INTERVAL 30 DAYS GROUP BY rider_id

Average speed of all riders picking up from the same restaurant (and delivering from it) in the last 1–2 hours:

SELECT res_id, AVG(total_distance/(pickup_time-rider_dispatch_time)) as to_res_speed FROM orders WHERE created_time >= NOW() - INTERVAL 2 HOURS GROUP BY res_id

SELECT res_id, AVG(total_distance/(delivered_time-pickup_time)) as to_user_speed FROM orders WHERE created_time >= NOW() - INTERVAL 2 HOURS GROUP BY res_id

Since these features might be expensive to compute in real time, we can also use an in-memory database such as Redis to maintain the real-time features.

In Redis, create a hash map with the res_id as key and a doubly linked list as value. Each linked list node stores an order_id.

HashMap A: res_id -> [OrderNode1, OrderNode2, … OrderNodek]

struct OrderNode {
    string order_id;
    OrderNode* prev;
    OrderNode* next;
};

Whenever a new order is created for a res_id, append a new tail node; whenever an order is picked up, delete its node.

To get the number of active orders, return the length of the linked list for res_id key.
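
The same behaviour can be approximated with Redis's built-in list type via redis-py; a minimal sketch (key names are assumptions):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def on_order_created(res_id, order_id):
    # Append the new order at the tail of the restaurant's queue.
    r.rpush(f"active_orders:{res_id}", order_id)

def on_order_picked_up(res_id, order_id):
    # Remove the picked-up order from the queue.
    r.lrem(f"active_orders:{res_id}", 1, order_id)

def num_active_orders(res_id):
    # Number of orders not yet picked up.
    return r.llen(f"active_orders:{res_id}")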

To compute the average preparation time in the last 30 days:

Create a Redis hash map with key res_id and value a linked list, where each node stores the sum of preparation times and the number of orders prepared in a 30-minute interval [T, T+30].

struct PrepNode {
    int total_preparation_time;
    int total_orders_prepared;
    double startTime;
    double endTime;
    PrepNode* next;
};

Whenever a new order is prepared at time T1: if T1 ≤ T+30 for the current tail node, add the order's preparation time to total_preparation_time and increment total_orders_prepared by 1; else append a new tail node with interval [T1, T1+30].

HashMap B: res_id -> [PrepNode1, PrepNode2, … PrepNodek]

To get the average preparation time over the last 30 days, sum the preparation times and the order counts over the nodes whose intervals fall within the last 30 days, and divide the first by the second. This is much faster than running SQL queries on a huge database.
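
A rough sketch of the same idea with redis-py, using two hashes per restaurant keyed by the 30-minute bucket start time (key names and bucket size are assumptions):

import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
BUCKET_SECONDS = 30 * 60          # 30-minute buckets
WINDOW_SECONDS = 30 * 24 * 3600   # 30-day lookback

def record_prep_time(res_id, prep_seconds, now=None):
    now = now or time.time()
    bucket = int(now // BUCKET_SECONDS) * BUCKET_SECONDS
    r.hincrby(f"prep_time_sum:{res_id}", bucket, int(prep_seconds))
    r.hincrby(f"prep_order_count:{res_id}", bucket, 1)

def avg_prep_time_30d(res_id, now=None):
    cutoff = (now or time.time()) - WINDOW_SECONDS
    sums = r.hgetall(f"prep_time_sum:{res_id}")
    counts = r.hgetall(f"prep_order_count:{res_id}")
    total_time = sum(int(v) for k, v in sums.items() if int(k) >= cutoff)
    total_orders = sum(int(v) for k, v in counts.items() if int(k) >= cutoff)
    return total_time / total_orders if total_orders else 0.0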

Another approach to time-based aggregation is to use Spark Structured Streaming with a window function and watermarking, so that stale data is not kept in memory.

from pyspark.sql.functions import window

# 30 days = 43200 minutes; slide the window every 30 minutes
windowedCountsDF = \
    eventsDF \
    .withWatermark("eventTime", "43200 minutes") \
    .groupBy(
        "res_id",
        window("eventTime", "43200 minutes", "30 minutes")) \
    .count()
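
If the quantity of interest is the average preparation time rather than a count, the same windowed aggregation can use avg (assuming eventsDF carries a prep_time column):

from pyspark.sql.functions import avg, window

avgPrepDF = \
    eventsDF \
    .withWatermark("eventTime", "43200 minutes") \
    .groupBy(
        "res_id",
        window("eventTime", "43200 minutes", "30 minutes")) \
    .agg(avg("prep_time").alias("avg_prep_time"))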

Labels are obtained as follows:

Model P1:

SELECT order_id, (pickup_time-created_time) as preparation_time FROM orders;

Model P2:

SELECT order_id, (pickup_time-rider_dispatch_time) as rider_arrival_time FROM orders;

Model P3:

SELECT order_id, (delivered_time-pickup_time) as delivery_time FROM orders;

What feature pre-processing steps are required?

Similar to Surge Price Prediction.

How to compute the feature representations? Handle high cardinality features?

Similar to Surge Price Prediction.

Where to store the feature representations?

Similar to Surge Price Prediction.

How to train the model using the features?

Let's take one of the models, P3. All models will be trained similarly since they are all regression models; a minimal training sketch follows the list below.

  • Read the features for each order from Cassandra/Redis.
  • Read the delivery_time output for each order from Cassandra/Redis.
  • Split the set of orders into training and testing (80–20).
  • On the training dataset, further do a 5-fold cross-validation split.
  • Fit a linear regression model on the features and labels of the training folds and validate the model's RMSE on the validation fold.
  • Use L2 regularization to prevent overfitting.
  • Train the regression model using SGD or Adam, as the data is too large to compute the exact closed-form solution.
  • Tune the hyperparameters, i.e. the regularization constant, with grid search.
  • The best hyperparameters are chosen based on the RMSE score on the validation dataset averaged over the 5 folds.
  • The final model is trained on the entire training data with the best hyperparameters.
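
A minimal sketch of this training loop with scikit-learn (load_features is an assumed helper that pulls the feature matrix and labels from Cassandra/Redis; the hyperparameter grid is illustrative):

from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_features()  # assumed helper: features and delivery_time labels per order

# 80-20 split over orders.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression trained with SGD and L2 regularization (alpha).
model = make_pipeline(StandardScaler(), SGDRegressor(penalty="l2", max_iter=1000))

# 5-fold cross-validation grid search over the regularization constant, scored by RMSE.
grid = GridSearchCV(
    model,
    param_grid={"sgdregressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)

# GridSearchCV refits the best hyperparameters on the full training set (refit=True by default).
best_model = grid.best_estimator_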

How to evaluate the model offline? Metrics?

RMSE (root mean squared error) on the held-out test set.
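
For reference, RMSE on the held-out test set can be computed as follows (best_model, X_test and y_test come from the training sketch above):

import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = best_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))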

How to save the model and model weights, architecture etc.?

Similar to Fare Estimation.

How to train when the training data size is too large for a single machine?

Similar to Fare Estimation.

How to deploy the inference code into production?

Similar to Fare Estimation.

The historical features such as:

  • Average speed of rider in the last 30 days.
  • Rider's vehicle type (model)
  • Average speed of all riders going from same pickup to same drop location in the last 7 days.

…etc.

will be fetched from the Redis feature store, whereas the real time features such as:

  • Estimated ETA from the Map.
  • Traffic conditions — Light, Moderate, Heavy
  • Average speed of all riders going from same pickup to same drop location in the last 1–2 hours.
  • Day of the week
  • Hour of the day
  • Weather data — Is it raining, Temperature, Humidity, Wind Speed

…etc. will be computed in real time using Spark Streaming or from the Redis hash maps/linked lists described above.
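
At inference time the feature vector for the P2/P3 model could be assembled roughly like this (Redis key names and the real-time helper functions are assumptions for illustration):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def build_delivery_features(order, rider_id, res_id):
    # Historical features precomputed into the Redis feature store (assumed key layout).
    rider_hist = r.hgetall(f"rider_features:{rider_id}")
    route_hist = r.hgetall(f"route_features:{res_id}:{order['user_id']}")

    return {
        "rider_avg_speed_30d": float(rider_hist.get("avg_speed_30d", 0)),
        "vehicle_type": rider_hist.get("vehicle_type", "unknown"),
        "route_avg_speed_7d": float(route_hist.get("avg_speed_7d", 0)),
        # Real-time features; these helpers are assumed wrappers over the Map/weather
        # services and the Redis/Spark Streaming aggregates described above.
        "map_eta": get_map_eta(res_id, order["user_id"]),
        "traffic": get_traffic_condition(res_id),
        "route_avg_speed_2h": get_route_speed_last_2h(res_id, order["user_id"]),
        "day_of_week": order["day_of_week"],
        "hour_of_day": order["hour_of_day"],
        "is_raining": order["is_raining"],
    }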

How to monitor the models in production?

Similar to Fare Estimation.

How to do online evaluation?

There are different metrics which we can track using A/B testing, such as:

  • Number of orders placed.
  • Difference between the predicted and the actual ETA (where actual ETA > predicted ETA)
  • Number of users who ordered again from the app.

The 1st metric captures whether the predicted ETA is small enough for a user to place an order, but it misses users who never order again after receiving a delivery later than the predicted ETA.

The 2nd metric does not map to a clear business outcome: it does not tell us whether users are OK with receiving orders later than the predicted ETA.

Thus we look at the 3rd metric, which captures both the business outcome and whether a user orders again.

For each order attempt we assign a label: '0' if the user cancels the order after placing it, '1' if the user placed the order but has not ordered before (during the observation period), and '2' if the user placed the order and has also ordered before (during the observation period).

An increase in label 1 but not in label 2 indicates that the predicted ETA was much less than the actual ETA, so users try the service once but do not return. An increase in both labels 1 and 2 indicates the predicted ETA was close to accurate.

The tricky part is that we do not know how frequently a user places an order, so it could very well happen that a user will place another order in the future.

Before running the A/B test, we can sample users for both groups from those whose average duration between two orders is less than 30 days.

The base model (model0, the null hypothesis) in this case shows the user an ETA based purely on the Maps data, i.e. Maps_ETA(rider location to restaurant) + Maps_ETA(restaurant to user address).

Our Linear Regression model is model1.

We perform a chi-square test on the observations over 30 days. Divide the set of users into 2 equal groups, with one group receiving ETA predictions from model0 and the other from model1.

Thus we would have 6 different counts:

  • Number of users with label 0 from model0
  • Number of users with label 1 from model0
  • Number of users with label 2 from model0
  • Number of users with label 0 from model1
  • Number of users with label 1 from model1
  • Number of users with label 2 from model1

We can compute the chi-square statistic from the above 6 counts.
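
As a sketch, the test can be run with scipy on the 2×3 contingency table of counts (the numbers below are placeholders, not real observations):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: model0 (control), model1 (treatment); columns: counts of labels 0, 1, 2.
# Placeholder counts for illustration only.
observed = np.array([
    [120, 400, 310],   # model0
    [ 95, 420, 365],   # model1
])

chi2, p_value, dof, expected = chi2_contingency(observed)
# A small p-value (e.g. < 0.05) means the label distribution depends on which
# model served the ETA, i.e. the two models lead to different user behaviour.
print(chi2, p_value, dof)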

Pipeline

Training

Inference
