Machine Learning Interview Questions

Abhijit Mondal

5 min read · Jun 2, 2020

A compilation of ML interview questions asked at top product-based companies.

How do generative and discriminative classifiers differ? What are some examples of both types of classifiers? When would you prefer a generative classifier over a discriminative one?
How is K-Fold Cross Validation used to find the best hyperparameters for a model?
How do you handle the class imbalance problem for a binary classifier?
Explain ROC and AUC. How would you compute the AUC score for a multi-class classifier?
When should you use the Expectation-Maximization algorithm, and why?
How do you choose the optimal number of clusters in K-Means clustering?
How are bias and variance affected by increasing the number of clusters in K-Means?
What are the advantages of Gaussian Mixture Models over K-Means clustering?
When would you choose Hierarchical clustering over K-Means? When would you choose K-Means?
What is the ‘Naive’ Bayes assumption?
You come to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?
Should you remove correlated variables before running PCA? If not, how will PCA handle them?
How do you interpret the p-value for a model?
What is the Central Limit Theorem? How does it help in ML?
What is the difference between covariance and correlation?
What do you understand by eigenvectors and eigenvalues?
How do you know whether a loss function has a global minimum?
How does SVM handle non-linear classification problems? Explain the Kernel Trick in SVM.
What is the role of the misclassification cost parameter C in SVM?
How do you efficiently determine whether 2 classes are linearly separable?
When would you prefer LinearSVM over Logistic Regression for a linearly separable binary classification problem?
For ranking recommendations, which would you choose: LinearSVM or Logistic Regression?
Would SVM work well in high dimensions? Why or why not?
What is the role of gamma in RBF kernels?
Why do we need a large margin classifier in SVM? Why would “any margin classifier” not work well?
How does a decision tree split a node?
How do you handle overfitting with a decision tree classifier?
Can a decision tree be used for regression problems?
How do Random Forests handle the Bias-Variance tradeoff?
How are Gradient Boosted Trees different from Random Forests?
How do you handle overfitting in Gradient Boosted Trees?
How do you handle both numerical and categorical features with tree-based algorithms? What about ordinal features?
Why and how does Random Forest prevent overfitting in decision trees?
How do GBDTs decide to split a node? What do they minimize?
How do you find feature importances with GBDTs?
How can GBDTs be used for feature transformation?
What are the regularization factors in GBDTs?
How would you compute PCA of a feature matrix X?
What is the difference between PCA and SVD?
How would you determine how many principal components to consider in PCA?
Describe a situation where PCA is not a good method for dimensionality reduction.
When do we need to standardize the variables before doing PCA?
How would you use an LSTM for Named Entity Recognition problems?
Which one should you prefer for NER: LSTM only, Linear Chain CRF only, or LSTM + Linear Chain CRF, and why?
What problem does a Bi-LSTM solve that a plain LSTM does not?
What is the purpose of the pooling operation in a CNN?
How would you choose the number of filters and the filter size at each CNN layer?
How do Conv1D and MaxPool1D work?
What are the advantages of parameter sharing in convolution?
How does a CNN help with translation and rotation invariance of images?
How would you choose which layers to freeze and which to retrain in transfer learning?
What are some advantages of using character embeddings instead of word embeddings?
How do you train CNN models in parallel? Can LSTM models be trained in parallel? Why or why not?
Why can large filter sizes in early layers be a bad choice? How do you choose the filter size?
Why are weights initialized with small random numbers in a neural network? What happens when the weights are all 0 or constant values?
Why is sigmoid activation not a good choice? Why are ReLU or Tanh preferred over the sigmoid activation function?
How do you handle the dying node problem with the ReLU activation function?
How does Dropout help in regularization? How is it different from L1 or L2 regularization?
When and why would you use Dropout instead of L1 or L2 regularization?
When and why would you use BatchNormalization?
How do you handle the vanishing gradient problem in Neural Networks?
Why do we need the bias term?
What are the advantages and disadvantages of SGD compared with gradient descent?
How does the momentum technique help in SGD?
Would you use squared error loss or binary cross-entropy loss for binary classification, and why?
For online learning, which would you prefer, SGD or Adagrad, and why?
How do you train deep neural networks in a distributed manner? What are the advantages and disadvantages?
How do you handle the exploding gradient problem?
How does BatchNormalization differ between training and inference?
Why don’t we use Dropout during inference?
Why do we need to shuffle data during training?
How can we alter the learning rate depending on the training loss? Is it OK to have a constant learning rate? Why or why not?
For distributed training with K machines/cores, should we use a higher or lower learning rate, and why?
How does batch size affect the training of neural networks? How do you choose the batch size? If we choose a batch size of 1, will it give better or worse results?
How is word2vec different from GloVe?
For infrequent/rare words, which of CBOW and SkipGram should be used for word2vec training?
Is it possible for both validation loss and validation accuracy to increase? Why and when can such a scenario arise?
What can go wrong if we use a linear activation instead of ReLU?
How does Item-Based Collaborative Filtering work in recommendations? What if the number of items is in the billions and the number of users is in the millions?
When would you choose item-based CF over user-based CF?
How is matrix factorization useful in recommendation systems?
What are the advantages and disadvantages of SGD compared with Alternating Least Squares (ALS) in matrix factorization?
How would you find K nearest neighbors efficiently among billions of data points?
How does Locality Sensitive Hashing (LSH) work for finding nearest neighbors? What hash function is used?
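
As a rough illustration for the last question, here is a minimal sketch of one common LSH family: random hyperplane hashing for cosine similarity. It is only one possible answer (MinHash, for instance, is typically used for Jaccard similarity). The function names, the number of hash bits and the single hash table are assumptions made for this example; practical systems usually combine several tables to improve recall.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplane normals, each of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector is on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by their bit signature."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[lsh_signature(vec, hyperplanes)].append(idx)
    return buckets


def cosine(a, b):
    """Exact cosine similarity, used only to rank the few bucket candidates."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query(q, vectors, buckets, hyperplanes, k=1):
    """Approximate k-NN: score only vectors that share the query's bucket."""
    candidates = buckets.get(lsh_signature(q, hyperplanes), [])
    ranked = sorted(candidates, key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:k]
```

Vectors pointing in similar directions tend to receive the same signature, so a query is compared against one bucket instead of the whole dataset. More bits give smaller buckets and faster lookups, at the cost of missing some true neighbors.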
