Machine Learning Interview Questions

Abhijit Mondal
5 min read · Jun 2, 2020


A compilation of ML interview questions asked at top product-based companies.

  1. How do generative and discriminative classifiers differ? What are some examples of each type of classifier? When would you prefer a generative classifier over a discriminative one?
  2. How is K-Fold Cross Validation used to find the best hyperparameters for a model? (a sketch follows this list)
  3. How to handle the class imbalance problem for a binary classifier?
  4. Explain ROC and AUC. How would you compute the AUC score for a multi-class classifier? (a sketch follows this list)
  5. When to use the Expectation-Maximization algorithm, and why?
  6. How to choose the optimal number of clusters with K-Means clustering?
  7. How are bias and variance affected as the number of clusters in K-Means increases?
  8. What are the advantages of Gaussian Mixture Models over K-Means clustering?
  9. When would you choose Hierarchical clustering over K-Means? When would you choose K-Means?
  10. What is the ‘Naive’ Bayes assumption?
  11. You discover that your model is suffering from low bias and high variance. Which algorithm should you use to tackle this, and why?
  12. Should you remove correlated variables before running PCA? If not, how will PCA handle them?
  13. How do you interpret the p-value for a model?
  14. What is the Central Limit Theorem? How does it help in ML?
  15. What is the difference between covariance and correlation?
  16. What do you understand by eigenvectors and eigenvalues?
  17. How do you know whether a loss function has a global minimum or not?
  18. How does SVM handle non-linear classification problems? Explain the kernel trick in SVM.
  19. What is the role of the misclassification cost parameter C in SVM?
  20. How do you efficiently determine whether two classes are linearly separable?
  21. When would you prefer a Linear SVM over Logistic Regression for a linearly separable binary classification problem?
  22. For ranking recommendations, which would you choose: a Linear SVM or Logistic Regression?
  23. Would SVM work well in high dimensions? Why or why not?
  24. What is the role of gamma in RBF kernels?
  25. Why do we need a large margin classifier in SVM? Why would “any margin classifier” not work well?
  26. How does a decision tree split a node?
  27. How to handle overfitting with a decision tree classifier?
  28. Can decision trees be used for regression problems?
  29. How do Random Forests handle the bias-variance tradeoff?
  30. How are Gradient Boosted Trees different from Random Forests?
  31. How to handle overfitting in Gradient Boosted Trees?
  32. How to handle both numerical and categorical features with tree-based algorithms? What about ordinal features?
  33. Why and how does a Random Forest prevent the overfitting seen in decision trees?
  34. How do GBDTs decide to split a node? What do they minimize?
  35. How to find feature importances with GBDTs?
  36. How can GBDTs be used for feature transformation? (a sketch follows this list)
  37. What are the regularization factors in GBDTs?
  38. How would you compute the PCA of a feature matrix X? (a sketch follows this list)
  39. What is the difference between PCA and SVD?
  40. How would you determine how many principal components to consider in PCA?
  41. Describe a situation where PCA is not a good method for dimensionality reduction.
  42. When do we need to standardize the variables before doing PCA?
  43. How would you use an LSTM for Named Entity Recognition problems?
  44. Which should you prefer for NER: LSTM only, Linear Chain CRF only, or LSTM + Linear Chain CRF? Why?
  45. What problem does a Bi-LSTM solve that a plain LSTM does not?
  46. What is the purpose of the pooling operation in a CNN?
  47. How would you choose the number of filters and the filter size at each CNN layer?
  48. How do Conv1D and MaxPool1D work?
  49. What are the advantages of parameter sharing in convolutional layers?
  50. How do CNNs help with translation and rotation invariance of images?
  51. How would you choose which layers to freeze and which to retrain in transfer learning?
  52. What are some advantages of using character embeddings instead of word embeddings?
  53. How to train CNN models in parallel? Can LSTM models be trained in parallel? Why or why not?
  54. Why can large filter sizes in early layers be a bad choice? How to choose the filter size?
  55. Why are weights initialized with small random numbers in a neural network? What happens when the weights are all 0 or the same constant value?
  56. Why is the sigmoid activation not good? Why is ReLU or tanh preferred over the sigmoid activation function?
  57. How to handle the dying node problem with the ReLU activation function?
  58. How does Dropout help in regularization? How is it different from L1 or L2 regularization?
  59. When and why would you use Dropout instead of L1 or L2 regularization?
  60. When and why would you use BatchNormalization?
  61. How to handle the vanishing gradient problem in neural networks?
  62. Why do we need the bias term?
  63. What are the advantages and disadvantages of SGD over gradient descent?
  64. How does the momentum technique help SGD?
  65. Would you use the squared error loss or the binary cross-entropy loss for binary classification, and why?
  66. For online learning, which would you prefer: SGD or Adagrad? Why?
  67. How to train deep neural networks in a distributed manner? What are the advantages and disadvantages?
  68. How to handle the exploding gradient problem?
  69. How does BatchNormalization differ between training and inference?
  70. Why don’t we use Dropout during inference?
  71. Why do we need to shuffle data during training?
  72. How can we alter the learning rate depending on the training loss? Is it OK to have a constant learning rate? Why or why not?
  73. For distributed training with K machines/cores, should we use a higher or lower learning rate, and why?
  74. How does batch size affect the training of neural networks? How to choose the batch size? If we choose a batch size of 1, will it give better or worse results?
  75. How is word2vec different from GloVe?
  76. For infrequent/rare words, which of CBOW and Skip-gram should be used for word2vec training?
  77. Is it possible for both validation loss and validation accuracy to increase? Why and when can such a scenario arise?
  78. What can go wrong if we use a linear activation instead of ReLU?
  79. How does Item-based Collaborative Filtering work in recommendations? What if the number of items is in the billions and the number of users is in the millions?
  80. When would you choose Item-based CF over User-based CF?
  81. How is matrix factorization useful in recommendation systems?
  82. What are the advantages and disadvantages of SGD over Alternating Least Squares (ALS) in matrix factorization?
  83. How would you efficiently find the K nearest neighbors among billions of data points?
  84. How does Locality Sensitive Hashing (LSH) work for finding nearest neighbors? What hash function is used? (a sketch follows this list)

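A few of these lend themselves to short code sketches. For question 2, here is a minimal sketch of hyperparameter search with K-Fold cross-validation: every candidate from the grid is scored by its mean metric across the K held-out folds, and the grid point with the best mean wins. The scikit-learn estimator, grid, and toy dataset are my illustrative choices, not prescribed by the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Toy data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Illustrative hyperparameter grid for an RBF-kernel SVM.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Each grid point is scored by its mean accuracy over the 5 held-out folds;
# the candidate with the best mean cross-validation score is selected.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```
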
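For question 4, one common answer for multi-class AUC is to average one-vs-rest (or one-vs-one) AUCs over classes; scikit-learn's roc_auc_score supports both. The dataset and classifier below are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # one probability column per class

# One-vs-rest: AUC of each class against the rest, macro-averaged.
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
# One-vs-one: AUC averaged over all pairs of classes.
auc_ovo = roc_auc_score(y_te, proba, multi_class="ovo", average="macro")
print(auc_ovr, auc_ovo)
```
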
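For question 36, a sketch of the well-known leaf-index trick: encode each sample by the leaf it lands in within every tree, one-hot those leaf memberships, and feed them to a linear model. I am assuming scikit-learn here; the dataset and model settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Fit a GBDT, then read off the leaf index each sample reaches in every tree.
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]   # shape: (n_samples, n_trees)

# One-hot encode the leaf memberships and use them as features downstream.
X_new = OneHotEncoder().fit_transform(leaves)
lr = LogisticRegression(max_iter=1000).fit(X_new, y)
print(X_new.shape, lr.score(X_new, y))
```
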
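For questions 38–40, a NumPy sketch of PCA via the SVD of the centered feature matrix, including one common rule for choosing the number of components: keep enough to explain 95% of the variance. That threshold, like the toy matrix, is an assumption of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # toy feature matrix, n_samples x n_features

Xc = X - X.mean(axis=0)         # 1. center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # 2. SVD of centered data

# Rows of Vt are the principal directions (eigenvectors of the covariance);
# the squared singular values give the corresponding eigenvalues.
explained_var = S ** 2 / (len(X) - 1)
ratio = explained_var / explained_var.sum()

k = int(np.searchsorted(np.cumsum(ratio), 0.95)) + 1   # smallest k covering 95%
X_reduced = Xc @ Vt[:k].T       # 3. project onto the top-k components
print(k, X_reduced.shape)
```
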
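For questions 83–84, a sketch of one standard LSH construction for cosine similarity: random-hyperplane (SimHash) hashing, where each hash bit is the sign of a random projection. The table count, hash width, and data are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes, n_tables = 64, 16, 8
data = rng.normal(size=(10_000, dim))   # stand-in for billions of vectors

# Each table hashes a vector to the sign pattern of n_planes random
# projections; similar vectors tend to collide in at least one table.
planes = rng.normal(size=(n_tables, n_planes, dim))
tables = [defaultdict(list) for _ in range(n_tables)]

def signature(x, P):
    return tuple((P @ x > 0).astype(int))

for i, x in enumerate(data):
    for t in range(n_tables):
        tables[t][signature(x, planes[t])].append(i)

def query(q, k=5):
    # Union of colliding buckets gives candidates; rank them exactly.
    cand = {i for t in range(n_tables)
            for i in tables[t].get(signature(q, planes[t]), [])}
    cand = np.fromiter(cand, dtype=int)
    sims = data[cand] @ q / (np.linalg.norm(data[cand], axis=1) * np.linalg.norm(q))
    return cand[np.argsort(-sims)[:k]]

print(query(data[0]))   # nearest neighbors of the first vector (itself first)
```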