What exactly do you need to know to become a specialist in Data Science and Machine Learning?

Today we present the 100 most common questions asked in job interviews at IT companies.

Do you know the answers to these questions?

Check yourself!

If not, the educational program “Computer Science. Artificial Intelligence and Project Management” will help you find the answers!

Questions on mathematical statistics

What is a normal distribution?
The average project grade in a group of 10 students is 7, but the median is 8. How is that possible? Which metric is more reliable?
What is the probability that a patient is infected if their test is positive, but the disease prevalence in their country is 0.1%?
What is the Central Limit Theorem? What is its practical significance?
Can you give examples of datasets with a non-Gaussian distribution?
What is the maximum likelihood method?
You are running for office. In a sample of 100 voters, 60 will vote for you. Can you be confident in your victory?
How do you assess the statistical significance of an analysis?
How many different paths can a mouse take to reach cheese by moving along grid lines?
What’s the difference between linear and logistic regression?
Give three examples of long-tailed distributions. Why are they important in classification and regression tasks?
What is the Law of Large Numbers?
What does a p-value indicate?
What is the binomial probability formula?
A Geiger counter records 100 radioactive decays in 5 minutes. Estimate the 95% confidence interval for the hourly rate.
How do you calculate the required sample size?
When would you use MSE and MAE?
When is the median a better measure than the mean?
What’s the difference between the mode, the median, and the mean (expected value)?
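
To make one of the questions above concrete, here is a minimal sketch of the Geiger counter problem. It assumes the decays follow a Poisson process (so the standard deviation of the count is its square root) and uses the normal approximation with z = 1.96. The function name poisson_rate_ci and its parameters are illustrative and do not come from any library.

```python
import math


def poisson_rate_ci(count, minutes, z=1.96, per_minutes=60):
    """Approximate confidence interval for a Poisson rate.

    count       -- number of events observed
    minutes     -- length of the observation window, in minutes
    z           -- normal quantile (1.96 for a 95% interval)
    per_minutes -- time unit to report the rate in (60 = per hour)
    """
    # For a Poisson count the variance equals the mean, so the
    # standard deviation is estimated as sqrt(count).
    half_width = z * math.sqrt(count)
    # Rescale the interval from the observed window to the target unit.
    scale = per_minutes / minutes
    return (count - half_width) * scale, (count + half_width) * scale


low, high = poisson_rate_ci(count=100, minutes=5)
print(f"95% CI for the hourly rate: {low:.1f} to {high:.1f}")
```

With 100 decays in 5 minutes the point estimate is 1200 per hour, and the interval comes out at roughly 965 to 1435 per hour. For small counts the normal approximation becomes poor and an exact Poisson interval would be a better choice.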

Questions on Python programming

What are the differences between Series and DataFrame in Pandas?
Write a function that calculates the number of steps to convert one word into another.
What are the advantages of NumPy arrays over nested Python lists?
How do map, apply, and applymap differ in Pandas?
What is the simplest way to implement a moving average using NumPy?
Does Python support regular expressions?
Continue: “try, except, …”.
How do you build a simple logistic regression model in Python?
How do you select rows from a DataFrame based on column values?
How do you determine the data type of elements in a NumPy array?
What’s the difference between loc and iloc in Pandas?
Write code to generate all N-grams from a sentence.
What are the ways to load an array from a text file in Python?
What’s the difference between multithreading and multiprocessing?
How can you use groupby + transform?
Write the final values of A0,…, A7.
How do mean() and average() differ in NumPy?
Give an example of using filter and reduce on an iterable.
How do you combine two NumPy arrays?
Write a one-liner to count uppercase letters in a file.
How would you clean a dataset using Pandas?
What’s the difference between an array and an ndarray?
Calculate the minimum element in each row of a 2D array.
How do you check whether a dataset or time series is random?
What’s the difference between pivot and pivot_table?
Implement the k-means method using SciPy.
What are the options for iterating through a DataFrame object?
What is a decorator? How do you write your own?
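
Two of the coding questions above can be sketched in a few lines of plain Python. The example assumes that "number of steps to convert one word into another" means the Levenshtein edit distance (single-character insertions, deletions, and substitutions), and that N-grams are taken over whitespace-separated words. Both function names are illustrative.

```python
def ngrams(sentence, n):
    """Return all word-level N-grams of a sentence as tuples."""
    words = sentence.split()
    # Slide a window of n words across the sentence; if the sentence
    # has fewer than n words the range is empty and so is the result.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def edit_distance(source, target):
    """Levenshtein distance between two words (dynamic programming)."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


bigrams = ngrams("the quick brown fox", 2)
steps = edit_distance("kitten", "sitting")
print(bigrams)
print(steps)
```

Keeping only the previous row of the table keeps memory proportional to the length of the target word, which is usually enough when only the distance is needed and not the edit script itself.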

Questions on data analysis

What is sampling? How many sampling methods do you know?
How does correlation differ from covariance?
What is cross-validation? What problems does it solve?
What is a confusion matrix? Why is it needed?
How does the Box-Cox transformation improve model performance?
What methods can be used to fill in missing data, and what are the consequences of not filling in the data?
What is an ROC curve? What is AUC?
What are recall and precision?
How would you deal with different forms of seasonality in time series modeling?
What mistakes can occur during sampling?
What is RCA (Root Cause Analysis)? How do you distinguish cause from correlation?
What are outliers and inliers? How would you detect and handle them in a dataset?
What is A/B testing?
When does a General Linear Model fail?
Is imputing missing values with the mean acceptable? Why?
You have call duration data. Create a plan for analyzing this data. What might the distribution look like? How would you check whether your assumptions hold?
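
The confusion matrix, precision, and recall questions above are closely related, and a small sketch may help connect them. The labels below are made-up illustrative data for a binary classifier, and the helper names confusion_counts and precision_recall are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Guard against division by zero when there are no predicted
    # positives or no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels only, not real data.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
print(counts, precision, recall)
```

Precision describes how many of the predicted positives were correct, while recall describes how many of the actual positives were found. Which one matters more depends on the relative cost of false positives and false negatives.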

Questions on machine learning

What is TF-IDF vectorization?
What is overfitting, and how can it be avoided?
You’re given a dataset of tweets and need to predict their sentiment (positive or negative). How would you preprocess the data?
Tell us about SVM.
When would you use SVM over Random Forest, and vice versa?
What are the consequences of setting an incorrect learning rate?
Explain the difference between epoch, batch, and iteration.
Why is the nonlinear Softmax function often the final operation in a deep neural network?
Explain and provide examples of collaborative filtering, content-based filtering, and hybrid filtering.
What is the difference between bagging and boosting for an ensemble?
How would you choose the number k for k-means clustering without visualizing the clusters?
What is the most effective way to represent data with five dimensions?
What are ensembles, and why are they useful?
Your computer has 5GB of RAM, but you need to train a model on a 10GB dataset. How would you do it?
Do gradient descent methods always converge to the same point?
What are recommendation systems?
Explain the bias-variance tradeoff and give examples of algorithms with high and low bias.
What is PCA, and how can it help?
Explain the difference between L1 and L2 regularization.
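
The distinction between epoch, batch, and iteration is easiest to see in a training loop. The sketch below fits a single weight w to toy data generated as y = 2x using mini-batch gradient descent on the mean squared error. The data, the learning rate, and the function names are assumptions chosen for illustration, and the batches are deliberately not shuffled so that the run is deterministic.

```python
def iterate_minibatches(data, batch_size):
    """Yield consecutive slices of `data`, each with at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def train(data, epochs, batch_size, lr=0.01):
    """Fit y = w * x by mini-batch gradient descent on the MSE."""
    w = 0.0
    iterations = 0
    for _ in range(epochs):                      # one epoch = one full pass
        for batch in iterate_minibatches(data, batch_size):
            # Gradient of mean((w*x - y)^2) with respect to w on this batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
            iterations += 1                      # one iteration = one update
    return w, iterations


# Toy dataset: 10 samples on the line y = 2x.
data = [(float(x), 2.0 * x) for x in range(1, 11)]
w, iterations = train(data, epochs=50, batch_size=4)
print(w, iterations)
```

With 10 samples and a batch size of 4, each epoch consists of 3 iterations (batches of 4, 4, and 2 samples), so 50 epochs give 150 weight updates. In practice the data would normally be shuffled at the start of every epoch.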