# From EM to Embeddings

Expectation Maximization is a quite an old tool/concept in the Machine Learning domain. Although it is an old tool but it took me quite some time to grasp the concept and the intuition behind it given that most tutorials and articles out there explain it with heavy mathematical equations. But eventually I found out that, the maths behind the intuition is pretty simple to understand, only the long equations might be a bit put off.

To explain it to someone with not much mathematical background but with some idea about machine learning. In clear and concise english :

Given some observations (or data, as you prefer), you have to build a model, that can be used to explain the observations and also generalize well to new observations. For observations, which we have all the information required to compute the model parameters, we can apply some optimization technique to compute them. But sometimes the observations might not have all the information required to compute the parameters. Given such a scenario where there is missing data in observations, we apply the EM algorithm, which is an iterative procedure.

Note that if we knew the parameters of the model then we can easily compute all the data (including missing data) for the observations, and also if we had all the data we could compute the parameters of the model. Now that this has become a chicken and egg problem, we can use this two facts to generate our algorithm :

Initialization : Assign arbitrary values to the model parameters.

“E” Step : Using the model parameters (initial or from a previous iteration), compute the expected value of the missing data. “Expected” because there could be multiple possible ways of inferring the missing data, each with certain probability.

“M” Step : Using the expected value of the missing data determined in the E step above, plus the observed data, compute the most likely model parameters using optimization techniques .

Repeat the E and the M step until the model parameters converge.

As you see this technique is very intuitive and many of us have knowingly or unknowingly applied this technique elsewhere too (although without any theoretical understanding of why this works).

I wanted to learn the Python programming language (The Model) by learning the different components and their syntaxes (The Parameters). So I pick up this Python Cookbook that teaches through different examples (Observations).

So I start reading this book and encounter an example that uses Classes and Objects, Map and List data structures. If I already know what are Classes and Objects, Maps and Lists from my previous programming experiences, or if the book also had the explanation for each of the components then it would be a piece of cake to understand how a Map or a List is implemented or a Class is defined in Python.

But generally cookbooks will not introduce basic data structures and algorithms. They are only meant for teaching through examples. So this becomes the missing data in our context. One straightforward way to fill in the missing data is to use a data structures and algorithms book to learn about Maps and Lists and an OOPS books to learn about Classes and Objects. But the availability of external resources to make the incomplete data a complete one, is not an use-case for EM as it assumes that by no means, apart from only using the observations we can infer anything about the missing data.

Thus in order to only make use of the Observations i.e. the examples in the book, I could try to look at several similar examples and try to figure out what a particular block of code is trying to do. That is to say, if you look at sufficient number of examples that uses Maps and Lists and their outputs, you can deduce what a Map or a List does but maybe I will not know how hash functions work. This is similar to ‘E’ step in the algorithm. After I infer some understanding of what a Map or List does, I try to fit them in the context of the examples (The “M” Step).

To understand the complete Python programming language, I will similarly approach the understanding of other components and simultaneously deduce knowledge about missing data. This is probably not a good use-case to demonstrate EM in human learning as often, because we have easy access to external resources that would make the observed data a complete one (no missing information). But the take-away is that even in the absence of resources like Google or Stack Overflow, one could guess how a particular data structure works even if the ultimate goal is to learn Python.

This idea of mental frameworks to understand concepts naturally without any external supervision has been used quite effectively in latest Deep Learning research to infer semantics of natural languages. Artificial Neural Networks (modeled after the human brain), can be made to learn hidden relations from multiple observations. Think of what an Arts student will make out of a bunch of Python code snippets. She may not have any knowledge of programming but given many examples and the outputs (observations) on running the examples, she will eventually learn that a List can be used to store values in memory sequentially or a Map can used to retrieve a value from a key without understanding of how hashing works and so on.

This idea holds good with word embeddings i.e. individual words in a sentence are represented as multi-dimensional vectors. Someone who does not know, that the words “happy” and “glad” are semantically similar, would look at the vectors for this two words and can compute the cosine similarity to infer that these two words are similar. The meaning of “happy” and “glad” are hidden from them, but that does not prevent them from inferring that these words are semantically similar.

*Originally published at **www.stokastik.in** on May 6, 2017.*