Machine Learning Interview Questions
Jun 2, 2020
A compilation of ML interview questions asked at top product-based companies.
- How do generative and discriminative classifiers differ ? What are some examples of both types of classifiers ? When would you prefer to use a generative classifier over a discriminative classifier ?
- How is K-Fold Cross Validation used to find the best hyperparameters for a model ?
- How to handle the class imbalance problem for a binary classifier ?
- Explain ROC and AUC. How would you compute the AUC score for a multi-class classifier ?
- When to use the Expectation-Maximization algorithm, and why ?
- How to choose the optimal number of clusters with K-Means clustering ?
- How are bias and variance affected by increasing the number of clusters in K-Means ?
- What are the advantages of Gaussian Mixture Models over K-Means clustering ?
- When would you choose Hierarchical clustering over K-Means ? When would you choose K-Means ?
- What is the ‘Naive’ Bayes assumption ?
- You find that your model suffers from low bias and high variance. Which algorithm should you use to tackle it ? Why ?
- Should you remove correlated variables before running PCA ? If not then how will PCA handle them ?
- How do you interpret the p-value for a model ?
- What is the Central Limit Theorem ? How does it help in ML ?
- What is the difference between covariance and correlation ?
- What do you understand by eigenvectors and eigenvalues ?
- How do you know whether a loss function has a global minimum or not ?
- How does SVM handle non-linear classification problems ? Explain the kernel trick in SVM.
- What is the role of the misclassification cost parameter C in SVM ?
- How do you efficiently determine whether two classes are linearly separable ?
- When would you prefer LinearSVM over Logistic Regression for a linearly separable binary classification problem ?
- For ranking recommendations, which one would you choose: LinearSVM or Logistic Regression ?
- Would SVM work well in high dimensions ? Why or why not ?
- What is the role of gamma in RBF kernels ?
- Why do we need a large margin classifier in SVM ? Why would “any margin classifier” not work well ?
- How does a decision tree split a node ?
- How to handle overfitting with a decision tree classifier ?
- Can a decision tree be used for regression problems ?
- How do Random Forests handle the bias-variance tradeoff ?
- How are Gradient Boosted Trees different from Random Forests ?
- How to handle overfitting in Gradient Boosted Trees ?
- How to handle both numerical and categorical features with tree based algorithms ? What about ordinal features ?
- Why and how does Random Forest prevent overfitting in decision trees ?
- How do GBDTs decide to split a node ? What do they minimize ?
- How to find feature importances with GBDTs ?
- How can GBDTs be used for feature transformation ?
- What are the regularization factors in GBDTs ?
- How would you compute PCA of a feature matrix X ?
- What is the difference between PCA and SVD ?
- How would you determine how many principal components to consider in PCA ?
- Describe a situation where PCA is not a good method for dimensionality reduction.
- When do we need to standardize the variables before doing PCA ?
- How would you use LSTM for Named Entity Recognition problems ?
- Which one should you prefer for NER: LSTM only, Linear Chain CRF only, or LSTM + Linear Chain CRF, and why ?
- What problem does a Bi-LSTM solve that a plain LSTM does not ?
- What is the purpose of the pooling operation in a CNN ?
- How would you choose the number of filters and the filter size at each CNN layer ?
- How do Conv1D and MaxPool1D work ?
- What are the advantages of parameter sharing in convolution ?
- How does a CNN help with translation and rotation invariance of images ?
- How would you choose which layers to freeze and which to retrain in case of transfer learning ?
- What are some advantages of using character embeddings instead of word embeddings ?
- How to train CNN models in parallel ? Can LSTM models be trained in parallel ? Why or why not ?
- Why can large filter sizes in early layers be a bad choice ? How to choose the filter size ?
- Why are weights initialized with small random numbers in a neural network ? What happens when the weights are all 0 or all the same constant value ?
- Why is sigmoid activation not a good choice ? Why are ReLU or Tanh preferred over the sigmoid activation function ?
- How to handle the dying node problem with the ReLU activation function ?
- How does Dropout help in regularization ? How is it different from L1 or L2 regularization ?
- When and why use Dropout instead of L1 or L2 regularization ?
- When and why use BatchNormalization ?
- How to handle vanishing gradient problem in Neural Networks ?
- Why do we need the bias term ?
- What are the advantages and disadvantages of SGD over gradient descent ?
- How does momentum technique help in SGD ?
- Would you use squared error loss or binary cross-entropy loss function for binary classification and why ?
- For online learning, which one would you prefer: SGD or Adagrad, and why ?
- How to train deep neural networks in a distributed manner ? What are the advantages and disadvantages ?
- How to handle exploding gradient problem ?
- How does BatchNormalization differ between training and inference ?
- Why don’t we use Dropout during inference ?
- Why do we need to shuffle data during training ?
- How can we alter the learning rate depending on the training loss ? Is it ok to have a constant learning rate ? Why or why not ?
- For distributed training with K machines/cores, should we use a higher or lower learning rate and why ?
- How does batch size affect training of neural networks ? How to choose batch size ? What if we choose batch size of 1, will it give better or worse results ?
- How is word2vec different from GloVe ?
- For infrequent/rare words, which of CBOW and SkipGram should be used for word2vec training ?
- Is it possible that both validation loss and validation accuracy increase ? Why and when can such a scenario arise ?
- What can go wrong if we use a linear activation instead of ReLU ?
- How does Item Based Collaborative Filtering work in recommendations ? What if the number of items is in the billions and the number of users is in the millions ?
- When would you choose Item based CF over User based CF ?
- How is matrix factorization useful in recommendation systems ?
- What are the advantages and disadvantages of SGD over Alternating Least Squares (ALS) in matrix factorization ?
- How would you find K nearest neighbors efficiently with billions of data points ?
- How does Locality Sensitive Hashing (LSH) work for finding nearest neighbors ? What hash function is used ?
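As a starting point for the last question, here is a minimal sketch of one common LSH family: random hyperplane hashing for cosine similarity. It is only one possible answer, not the only hash function used with LSH. The function names (`make_hyperplanes`, `signature`, `build_index`, `query`) and the single-table design are assumptions made for illustration; a real system would typically use several hash tables and a vectorized library.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """Hash a vector to a bit tuple: one bit per hyperplane, set by the sign of the dot product."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group vector indices into buckets keyed by their signature."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, hyperplanes)].append(idx)
    return buckets


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def query(q, vectors, buckets, hyperplanes, k=1):
    """Return up to k indices from q's bucket, ranked by exact cosine similarity."""
    candidates = buckets.get(signature(q, hyperplanes), [])
    ranked = sorted(candidates, key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:k]
```

Vectors separated by a small angle are likely to fall on the same side of most hyperplanes, so they tend to share a bucket, and only that bucket is scanned exactly at query time. The result is approximate: a true neighbor that lands in a different bucket is missed, which is why practical setups usually query several independent tables.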