Identity Resolution using PostgreSQL

Source: quantamagazine.com

In many CRM platforms, customers can connect multiple sources of data for different analytics and ML use cases such as Recommendations, Churn Prediction, Identity Resolution etc.

In this post we look at one of such use case which is Identity Resolution i.e. identifying which profile are the same. For e.g. users of a customer can come through multiple channels such as app, website, mobile browser, whatsapp, api etc. with different profiles.

Source: segment.com

With each channel there can be certain associated identifiers for e.g.

  • Mobile App — phone number, email id, device id, IMEI number, FB email, Google email, etc.
  • Website — email id, username, FB email etc.
  • API — API key, email id, phone number etc.

It is often the case that the same user has logged into the platform via different channels at different point in time and probably with different identifiers.

Now with so many different profiles of same users, it would be hell for customer support to reach out to these users for marketing and sales promotions, running analytics such as customer segmentation or providing recommendations and so on.

If only we could identify which profiles belong to the same user, then we can club them for analysis and ML related workflows.

Although there are several strategies ranging from simple deterministic approaches to complex ML models to identify which profiles belong to the same user, in this post we look at a simple yet effective approach — Union Find (or Breadth First Search).

  • Profiles which have at-least one common unique identifier between them would be considered to be the same user.
  • Profile A — (email: ‘abc@xyz.com’, phone:123456789)
  • Profile B — (email: ‘abc@xyz.com’, phone:987654321)
  • Profile C — (email: ‘NULL’, phone:987654321)
  • Profile D — (email: ‘def@xyz.com’, phone:234567890)
  • Then Profiles A, B and C belong to same user whereas Profile D belongs to different user.

To keep it concise for this post, I will not get into the details of different table schemas we use in Postgres to store different metadata, let’s assume that there is a ‘profiles’ table in Postgres with profile details as well as different identifier fields as columns.

This is a sample data with a handful of identifier fields shown. In real world use case there could be around 100s of such identifier fields.

profiles table in Postgres
  • Note that first_name and last_name are not identifier fields as these names can belong to anyone.

Whenever a profile is inserted into the above table, there is a Postgres trigger that will update an inverted index. The inverted index table for the above data will look somethings like:

inverted_index table

Note that in order to query profiles by the identifier fields, we can off-course index the identifier field columns, but in all practical cases there are 100s of such fields. Moreover the postgres schema for profile is customer specific i.e. there are some customers who allow API access whereas some customers have only website and thus all the identifier fields are meaningless for all the customers.

To handle such dynamic environments we chose to create the inverted index in order to handle profile queries.

The trigger written in Postgres is as follows:

There will be another table that will store the adjacency list for the profiles. Whenever a profile is inserted or updated, based on the identifier fields, a function will decide which all profile ids are neighbours of this profile in the adjacency graph.

adjacency_list table

For this we have written another trigger function in Postgres:

Once the adjacency _list table is populated, doing Union Find operation is just doing BFS over the adjacency_list for a given profile id or identifier fields.

BFS Traversal (Source: Hackerearth)

The BFS method written in PostgreSQL + V8 (Javascript):

When the above SQL method is called with the SQL query:

SELECT get_identities(‘{“email_id”: “ironman@avengers.com”, “phone_number”: “1234567890”}’)

It would return union of all the profiles found using BFS starting with the identifier field “email_id”: “ironman@avengers.com” and “phone_number”: “1234567890”. Thus all these profiles point towards the same user on the platform.

Some features of the above approach

  • The Postgres functions are written in pgsql or plv8 programming languages. I wanted to use C for writing those functions since it would be much faster but unfortunately we are using Postgres in AWS RDS which do not support C extensions due to security reasons.
  • We can cache the union find algorithm results for frequently accessed identifier fields or profiles. But caching everything comes at a cost since any update to the profile will trigger updates in inverted index as well as on adjacency_list tables. Thus cache will be invalidated and have to recalculated.
  • The above approach is quite scalable given that number of identities for an user ranges somewhere between 1 to 10. Thus making real time BFS queries very efficient.
  • Using PostgreSQL to write the BFS and adjacency list methods saves round-trip I/O times when we implement union-find algorithm with our application code for e.g. in Java or Python.

ML Engineer @ Twilio