For the past few days I have been learning several deep learning algorithms, both for work in my organization and as a hobby. The number of deep learning papers, resources, articles and blogs is increasing at an exponential pace, and it's hard to keep up with these developments.
Although the deep learning era is upon us, most of the mathematical and statistical foundations were laid decades ago; for lack of sufficiently powerful hardware at the time, deep learning lay dormant. Many of the recent breakthroughs and state-of-the-art results come from experiments with neural network architectures. Built on the principle of artificial neural networks, the most commonly used architectures are the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), or its LSTM variant.
In this series of posts we are going to understand CNN and RNN/LSTM networks from a practitioner's point of view: why they are so effective, and which recent modifications have achieved state-of-the-art results on several ML challenges. We start with the most fundamental question: why did we need to move towards deep neural networks, CNN or RNN/LSTM, given that we already had several “effective” approaches to similar problems, such as SVM, Gradient Boosting (XGBoost), Random Forests, Conditional Random Fields, HMM, Artificial Neural Networks etc.?
One reason CNN and RNN/LSTM are so popular is that they are end-to-end systems. Compare this with a traditional pipeline. For example, if we want to build a basic question answering system that automatically selects the best matching answer for a question from a corpus, we first need to extract the tokens from both questions and answers, clean the tokens to remove unwanted characters, drop stop words and very short or very long tokens, create a TF-IDF matrix, and generate positive and negative Q&A pairs. Lastly, we would probably use a logistic regression classifier to score each Q&A pair.
But this method is not very effective and would probably not give a very good AUC or F1 score on validation data. The reason is that, based on just the TF-IDF values of tokens, a question can match closely with multiple possible answers, and similarly an answer can match with multiple possible questions.
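The first steps of such a pipeline can be sketched in plain Python. This is only a rough illustration: the stop-word list, the length limits and the names `clean_tokens` and `make_pairs` are assumptions made for this example, and negatives are simply sampled from the answers to other questions.

```python
import random
import re

# Hypothetical, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "is", "in", "of", "a", "an", "by", "was"}

def clean_tokens(text, min_len=2, max_len=20):
    """Lowercase, keep alphanumeric tokens, drop stop words and
    tokens that are too short or too long."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens
            if t not in STOP_WORDS and min_len <= len(t) <= max_len]

def make_pairs(qa, negatives_per_question=1, seed=0):
    """qa is a list of (question, correct_answer) tuples.
    Returns (question, answer, label) triples, label 1 for the correct
    answer and 0 for answers sampled from other questions."""
    rng = random.Random(seed)
    pairs = []
    for i, (question, answer) in enumerate(qa):
        pairs.append((question, answer, 1))
        wrong = [a for j, (_, a) in enumerate(qa) if j != i]
        k = min(negatives_per_question, len(wrong))
        for negative in rng.sample(wrong, k):
            pairs.append((question, negative, 0))
    return pairs

qa_corpus = [
    ("Where is the Taj Mahal located?",
     "The Taj Mahal is located in Agra, India"),
    ("Who is the prime minister of India?",
     "Narendra Modi is the current prime minister of India"),
]
training_pairs = make_pairs(qa_corpus)
```

Each labeled pair would then be turned into TF-IDF features and passed to the classifier.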
For example, the answers “The Taj Mahal is located in Agra, India” and “The Taj Mahal located in Agra, India was built by Shah Jahan” might be very similar in the TF-IDF space but they answer completely different questions.
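To make this concrete, here is a small sketch that computes TF-IDF vectors by hand and compares the two answers with cosine similarity. The third sentence in the corpus, the smoothed IDF formula and the function names are assumptions chosen for this example; a library implementation may weight terms differently.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Deliberately crude tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {token: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({tok: count * idf[tok] for tok, count in tf.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [
    "The Taj Mahal is located in Agra, India",
    "The Taj Mahal located in Agra, India was built by Shah Jahan",
    "Narendra Modi is the current prime minister of India",
]
vecs = tfidf_vectors(corpus)
sim_taj_answers = cosine(vecs[0], vecs[1])   # the two Taj Mahal answers
sim_unrelated = cosine(vecs[0], vecs[2])     # an unrelated answer
```

The two Taj Mahal answers share most of their tokens, so `sim_taj_answers` comes out well above `sim_unrelated`, even though one answers a “where” question and the other a “who” question.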
There is no way to infer that a question and an answer are semantically similar or dissimilar based on just the TF-IDF scores of the tokens. One possible improvement is to use joint features for the question and the answer.
For example, if a question has the feature set [“who”, “is”, “the”, “prime”, “minister”, “of”, “india”] and an answer has the features [“narendra”, “modi”, “is”, “the”, “current”, “prime”, “minister”, “of”, “india”], then the joint features would be:
[(“who”, “narendra”), (“who”, “modi”), … (“prime”, “india”) … and so on]
i.e. every possible pair of one question feature and one answer feature.
This might lead to some improvement, as we are injecting explicit co-occurrences, but it would suffer from many more noisy features (the total number of features is the product of the number of question features and the number of answer features). For example, although (“prime”, “narendra”), (“minister”, “narendra”) etc. are good features, (“who”, “is”), (“who”, “current”) etc. are just plain noise.
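The joint features are simply the Cartesian product of the two token lists. A minimal sketch using the example above (the variable names are only illustrative):

```python
from itertools import product

question = ["who", "is", "the", "prime", "minister", "of", "india"]
answer = ["narendra", "modi", "is", "the", "current",
          "prime", "minister", "of", "india"]

# Every (question token, answer token) pair becomes one joint feature.
joint_features = list(product(question, answer))

print(len(joint_features))  # 7 * 9 = 63
```

Even for this one short pair we already get 63 features, which shows how quickly the noisy pairs can outnumber the useful ones.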
Along with the tokens, we could improve the results by also using POS tags as features, because the answer often depends on the sequence of POS tags in the question. Another possibility is to use sequence information (i.e. word order) in the classifier. For that we would probably need to feed N-grams as tokens into the logistic regression classifier, or use a sequence model such as an HMM or CRF as the classifier.
Yet other potential features could be the question type, such as “what”, “which”, “how”, “where” etc., or the number of features common to the question and the answer. To capture the meanings of features, one can use topic modeling techniques such as LDA. The point is that we need to experiment with several possible features and classifiers before we reach the desired level of performance. Engineers without a background in computational linguistics would find it difficult to come up with such features, as these require certain domain knowledge.
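As a small illustration of the N-gram idea, the helper below (the name `ngrams` is an assumption for this sketch) turns a token list into word N-grams, which could then be used as features in place of, or alongside, single tokens:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

question = ["who", "is", "the", "prime", "minister", "of", "india"]

bigrams = ngrams(question, 2)
# [('who', 'is'), ('is', 'the'), ('the', 'prime'),
#  ('prime', 'minister'), ('minister', 'of'), ('of', 'india')]
```

A bigram such as (“prime”, “minister”) keeps a little of the word order that single tokens throw away.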
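Two of these hand-crafted features can be sketched in a few lines. The list of question words and the function names are assumptions for illustration; a real system would need more careful rules.

```python
# Illustrative list of question words; extend as needed.
QUESTION_WORDS = ("what", "which", "how", "where", "who", "when", "why")

def question_type(question_tokens):
    """Return the first question word found, or 'other' if there is none."""
    for token in question_tokens:
        if token in QUESTION_WORDS:
            return token
    return "other"

def overlap_count(question_tokens, answer_tokens):
    """Number of distinct tokens shared by the question and the answer."""
    return len(set(question_tokens) & set(answer_tokens))

question = ["who", "is", "the", "prime", "minister", "of", "india"]
answer = ["narendra", "modi", "is", "the", "current",
          "prime", "minister", "of", "india"]

q_type = question_type(question)          # 'who'
n_common = overlap_count(question, answer)  # 6
```

Each such feature is easy to write, but deciding which ones to write is exactly where the domain knowledge comes in.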
If instead we can simply use the extracted tokens to generate embeddings for the words in questions and answers, train a deep neural network with a stack of CNN, LSTM and fully connected layers, and put a logistic regression classifier at the end to output a final score, then the implementation effort drops greatly, because much of the need for domain expertise goes away.
But while we reduce the work of selecting features with domain expertise, we introduce a new difficulty: choosing the best way to connect the embedding layer, the CNN layer and the LSTM layer into one network that gives the best results. Experimenting with different architectures is made much easier by tools like TensorFlow, Theano and Keras. That is what deep learning is all about.
A hidden layer in a neural network transforms the inputs into a distributed representation (a space of dense vectors) that captures semantic and syntactic relations between the inputs.
For Q&A, the hidden layer representations can capture the question type, the intent of a question etc. along their different dimensions. Thus if we cluster the representations, we might see that one cluster predominantly contains “what” type questions, another “where” type questions, and so on. Similarly, if we learn hidden layer representations with both the question and the answer as inputs, the representations can capture that a “where” type question has a “location” in its answer, or that a “who” type question has a “person” in its answer. Thus we do not need additional preprocessing steps such as joint Q&A features, topic modeling, POS tags or named entities as extra features.
Before beginning with CNN or LSTM, one must be aware of the word embedding layer commonly used as the first layer in many deep learning architectures. Embeddings are distributed vector representations for the words and phrases in the dataset. The word vectors can be generated in an unsupervised fashion, i.e. class labels are not required to compute them, and thus they can be generated independently of the training objective. These word vectors serve two purposes. First, they capture the semantic and syntactic meanings of the words and phrases in the dataset.
A plain TF-IDF scheme produces a scalar value for each word. It considers each word individually and does not take word co-occurrences or word order into account, and thus cannot capture the meaning of the words.
For example, in TF-IDF space, the words “Paris”, “France” and “Baseball” are just three separate dimensions, each equally unrelated to the other two, but in reality we know that “Paris” and “France” are much more similar in context to each other than either is to “Baseball”. In a word vector representation, the distance between “Paris” and “France” would be much smaller than the distance from either of them to “Baseball”.
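The contrast can be shown with a toy example. The dense vectors below are made up by hand purely for illustration (they do not come from any trained model), and three dimensions are far fewer than real embeddings use.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# One dimension per word, as in a bag-of-words or TF-IDF vocabulary.
one_hot = {
    "paris":    [1.0, 0.0, 0.0],
    "france":   [0.0, 1.0, 0.0],
    "baseball": [0.0, 0.0, 1.0],
}

# Hypothetical dense word vectors, hand-picked so that related words
# point in similar directions.
dense = {
    "paris":    [0.9, 0.8, 0.1],
    "france":   [0.8, 0.9, 0.2],
    "baseball": [0.1, 0.2, 0.9],
}

for name, space in (("one-hot", one_hot), ("dense", dense)):
    print(name,
          round(cosine_similarity(space["paris"], space["france"]), 3),
          round(cosine_similarity(space["paris"], space["baseball"]), 3))
```

In the one-hot space every pair of distinct words has similarity zero, so “Paris” is no closer to “France” than to “Baseball”. In the dense space the “Paris”/“France” similarity is much higher than the “Paris”/“Baseball” one, which is the property trained embeddings are meant to provide.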
Secondly, the word vectors help reduce the dimensionality of the inputs to deep neural networks with CNN and RNN/LSTM layers. One could also have used Singular Value Decomposition (Latent Semantic Analysis) to generate word representations. But even setting aside the reduced memory requirements and better quality word vectors of the neural network based Skip-Gram approach compared with SVD (LSA), an end-to-end neural network pipeline is much more convenient and simpler to use.
The next post in this series briefly explains CNN and RNN/LSTM and how these architectures have transformed deep learning altogether.