Data Science Interview Questions

1. What is Data Science?

Data science is a multidisciplinary field that uses scientific methods, processes, and algorithms to extract meaningful insights from different types of data. It helps solve analytically complex problems in a simplified way, turning raw data into business value.

2. What do you mean by the term Data Science?

Data science is the extraction of knowledge from large volumes of structured or unstructured data. It is a continuation of the fields of data mining and predictive analytics, and is also known as knowledge discovery and data mining.

3. Why do you want to work as a data scientist?

This question plays off of your definition of data science. However, now recruiters are looking to understand what you’ll contribute and what you’ll gain from this field. Focus on what makes your path to becoming a data scientist unique – whether it be a mentor or a preferred method of data extraction.

4. What is the Law of Large Numbers?

It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that, as the number of trials grows, the sample mean, the sample variance, and the sample standard deviation converge to the quantities they are trying to estimate.
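
The convergence described above can be seen in a small simulation. This is only a sketch: the fair six-sided die, the function name `sample_mean_of_die_rolls`, and the fixed seed are assumptions chosen for illustration.

```python
import random


def sample_mean_of_die_rolls(n_rolls, seed=0):
    """Return the sample mean of n_rolls simulated fair die rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = sum(rng.randint(1, 6) for _ in range(n_rolls))
    return total / n_rolls


# The expected value of a fair die is 3.5; the sample mean should
# drift towards it as the number of rolls grows.
for n in (10, 1_000, 100_000):
    print(n, sample_mean_of_die_rolls(n))
```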

5. Where can you seek help in case of discrepancies in Tableau?

When you face any issue regarding Tableau, try searching the Tableau community forum. It is one of the best places to get your queries answered. You can post your question there and usually get an answer within an hour or a day. You can also post on LinkedIn and follow practitioners there.

6. Why is data cleaning essential in Data Science?

Data cleaning is essential in Data Science because the outcomes of any data analysis depend on the quality of the existing data. Useless or unimportant data needs to be cleaned out periodically once it is no longer required. This ensures data reliability and accuracy, and also frees up memory.

7. What is A/B testing in Data Science?

A/B testing is also called Bucket Testing or Split Testing. It is a method of comparing two versions of a system or application against each other to determine which version performs better. This is important in cases where multiple versions can be shown to customers or end-users and you need to know which one best achieves the goal.

8. How Is Machine Learning Deployed In Real World Scenarios?

Here are some of the scenarios in which machine learning finds applications in the real world:

Ecommerce: Understanding customer churn, deploying targeted advertising, remarketing.
Search engine: Ranking pages depending on the personal preferences of the searcher.
Finance: Evaluating investment opportunities and risks, detecting fraudulent transactions.
Healthcare: Designing drugs depending on the patient’s history and needs.
Robotics: Machine learning for handling situations that are out of the ordinary.
Social media: Understanding relationships and recommending connections.
Extraction of information: Framing questions for getting answers from databases over the web.

9. What Is Power Analysis?

Power analysis is a vital part of experimental design. It is the process of determining the sample size needed to detect an effect of a given size with a certain degree of assurance. It can also be run the other way round: given a sample size constraint, it tells you the probability of detecting the effect.
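
As a rough illustration of the sample size calculation, the sketch below uses the normal approximation for a two-sided comparison of two group means. The function name `sample_size_per_group`, the standardized effect size (difference in means divided by the standard deviation), and the defaults of 5% significance and 80% power are assumptions, not something the answer above prescribes.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-sample z-test.

    effect_size is the standardized difference between the group means.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller effects need many more observations to detect.
print(sample_size_per_group(0.5))
print(sample_size_per_group(0.2))
```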

10. What Is K-means? How Can You Select K For K-means?

K-means clustering is a basic unsupervised learning algorithm. It partitions data into a fixed number of clusters, K, so that similar data points are grouped together. A common way to select K is the elbow method: run K-means for a range of K values, plot the within-cluster sum of squares against K, and choose the point after which increasing K brings little further improvement.
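
A minimal sketch of the idea on one-dimensional data follows. It is a simplified version of the usual iterative procedure (assign each point to its nearest centroid, then recompute centroids); the function names, the toy data, and the fixed iteration count are assumptions made for illustration.

```python
import random


def kmeans_1d(points, k, n_iter=20, seed=0):
    """Cluster 1-D points into k groups and return the sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)


def within_cluster_ss(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
# Elbow-style check: the drop from K=1 to K=2 is large, later drops are small.
for k in (1, 2, 3):
    print(k, within_cluster_ss(points, kmeans_1d(points, k)))
```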

11. Why is resampling done?

Resampling is done in any of these cases:

Estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points
Substituting labels on data points when performing significance tests
Validating models by using random subsets (bootstrapping, cross-validation)
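
The first case, drawing randomly with replacement to estimate the accuracy of a sample statistic, might look like the sketch below. The function name `bootstrap_standard_error`, the number of resamples, and the choice of the mean as the statistic are assumptions for illustration.

```python
import random
import statistics


def bootstrap_standard_error(data, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by bootstrapping."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))  # draw with replacement
        means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)


data = list(range(1, 101))
print(bootstrap_standard_error(data))
```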

12. What tools or devices help you succeed in your role as a data scientist?

The purpose of this question is to learn which programming languages and applications the candidate knows and has experience using. The answer will show whether the candidate needs additional training in basic programming languages and platforms, and will reveal any transferable skills. This is vital to understand, as it can cost more time and money to train a candidate who is not knowledgeable in all of the languages and applications required for the position.

13. What are the differences between overfitting and underfitting?

In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data, so as to be able to make reliable predictions on general unseen data.

Overfitting: It occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting happens when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted has poor predictive performance, as it overreacts to minor fluctuations in the training data.
Underfitting: It occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would also have poor predictive performance.

14. What is Machine Learning?

Machine Learning explores the study and construction of algorithms that can learn from and make predictions on data. It is closely related to computational statistics. It is used to devise complex models and algorithms that lend themselves to prediction, which in commercial use is known as predictive analytics.

15. What is underfitting?

Underfitting occurs when a model predicts poorly on both the training set and the test set, which limits its value to the business. If the error rate on the training set is high and the error rate on the test set is also high, we can conclude that the model is underfitting.

16. How Do Data Scientists Use Statistics?

Statistics help Data Scientists look into the data for patterns and hidden insights, and convert Big Data into big insights. It helps to get a better idea of what the customers are expecting. Data Scientists can learn about consumer behavior, interest, engagement, retention and finally conversion, all through the power of insightful statistics. It helps them to build powerful data models in order to validate certain inferences and predictions. All this can be converted into a powerful business proposition by giving users what they want precisely when they want it.

17. What is collaborative filtering?

Collaborative filtering is a process used by recommender systems to find patterns and information from numerous data sources, several agents, and collaborating perspectives. In other words, the collaborative method is a process of making automatic predictions about a user from the preferences or interests of many users.

18. What are feature vectors?

A feature vector is an n-dimensional vector of numerical features that represent some object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics, called features, of an object in a mathematical, easily analyzable way.

19. What is Cluster Sampling?

Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample where each sampling unit is a collection or cluster of elements. For example, a researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities). Then the researcher selects a number of clusters, depending on his research, through simple or systematic random sampling.

20. What is the difference between Machine Learning and Data Mining?

Data mining is about working on large volumes of data and extracting from it previously unknown and unusual patterns. Machine learning is the study, design, and development of algorithms that give computers the ability to learn.

21. What are the types of biases that can occur during sampling?

Some simple types of sampling bias are listed below. Undercoverage, for example, occurs when some members of the population are inadequately represented in the sample, as in a survey whose sample is drawn from telephone directories and car registration lists.

Selection bias
Undercoverage bias
Survivorship bias

22. Why does data cleaning play a vital role in the analysis?

Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process: as the number of data sources increases, the time taken to clean the data grows rapidly because of the number of sources and the volume of data they generate. Cleaning alone might take up to 80% of the time, making it a critical part of the analysis task.

23. Do you prefer Python or R for text analytics?

Here, you’re being asked to give your own opinion. However, many data scientists favor Python. This is because Python has the Pandas library, which offers strong data analysis tools and easy-to-use data structures. What’s more, Python is typically faster for text analytics.

24. Explain Star Schema?

It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables, and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.

25. What do you understand by the term hash table collisions?

A hash table (hash map) is a data structure used to implement an associative array, a structure that can map keys to values. Ideally, the hash function will assign each key to a unique bucket, but sometimes two keys generate an identical hash, causing both keys to point to the same bucket. This is known as a hash collision.
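
A small sketch of a collision and one common way to handle it (separate chaining, where each bucket holds a list of entries) is shown below. The class name `ChainedHashTable` and the tiny bucket count are assumptions chosen so that a collision is easy to provoke.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, n_buckets=4):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        # Map the key's hash onto one of the available buckets.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # colliding keys share the bucket

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable(n_buckets=4)
table.put(1, "a")
table.put(5, "b")  # 1 and 5 land in the same bucket when there are 4 buckets
print(table.get(1), table.get(5))
```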

26. Explain Cross-validation?

It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set for testing the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
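
One common form is k-fold cross-validation, where the data is split into k parts and each part serves once as the validation set. The sketch below only produces the index splits; the function name `k_fold_indices` is an assumption, and in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(10, 3):
    print("train:", train, "validation:", validation)
```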

27. What is Linear Regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
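
For a single predictor, the fitted line can be computed with ordinary least squares, as in this sketch. The function name `fit_line` and the toy data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of X and Y divided by the variance of X gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)      # the data lie exactly on y = 2x + 1
print(slope * 5 + intercept) # predicted Y for X = 5
```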

28. Can you explain the difference between a Test Set and a Validation Set?

A validation set can be considered a part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built. On the other hand, the test set is used for testing or evaluating the performance of a trained machine learning model.

29. How do you define data science?

This question allows you to show your interviewer who you are. For example, what’s your favorite part of the process, or what’s the most impactful project you’ve worked on? Focus first on what data science is to everyone – a means of extracting insights from numbers – then explain what makes it personal.

30. What are the types of machine learning?

Supervised learning
Unsupervised learning
Reinforcement Learning

31. How often should an algorithm be updated?

This quasi-trick question has no specific time-based answer. This is because an algorithm should be updated whenever the underlying data is changing or when you want the model to evolve over time. Understanding the outcomes of dynamic algorithms is key to answering this question with confidence.

32. What is an Auto-Encoder?

Auto-encoders are learning networks that transform inputs into outputs with minimal error, which means the output must be very close to the input. A few layers are added between the input and the output, and these layers are smaller than the input layer. The auto-encoder is given unlabelled input, compresses it through these smaller layers, and then reconstructs the input from the compressed form.

33. What makes the difference between “Long” and “Wide” Format data?

In wide format, the repeated responses of a subject are recorded in a single row, and each response is in a separate column. In long format, each row represents one time point per subject. In wide format, the columns are generally divided into groups, whereas in long format the rows are divided into groups.

34. What is meant by supervised and unsupervised learning in data?

Supervised learning: It is a process of training machines with labeled data. The machine uses the labeled data as a base to predict the answer for new inputs.
Unsupervised learning: It is another form of training machines, using information that is unlabeled or unstructured. Unlike supervised learning, there is no teacher or predefined labeled data for the machine to learn from.

35. How can the outlier values be treated?

We can identify outlier values by using a graphical analysis method or a univariate method. When the outlier values are few, they are easier to handle and can be assessed individually, but when they are many, they need to be substituted with either the 1st or the 99th percentile value. Below are the common ways to treat outlier values.

Change the value to bring it within range
Remove the value
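
Substituting extreme values with the 1st and 99th percentile values can be sketched as follows. The function name `cap_outliers` is an assumption, and the "inclusive" percentile method is chosen here only so that the limits stay inside the observed data range.

```python
import statistics


def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Clip values to the given lower and upper percentiles."""
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


values = list(range(1, 101)) + [1000]  # 1000 is an obvious outlier
capped = cap_outliers(values)
print(max(values), max(capped))
```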

36. What are Eigenvalue and Eigenvector?

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. Eigenvalues are the factors by which the transformation scales along those directions.

37. What do you mean by Deep Learning and Why has it become popular now?

Deep Learning is a paradigm of machine learning that has shown incredible promise in recent years, in part because it shows a strong analogy with the functioning of the human brain. Although Deep Learning has been around for many years, the major breakthroughs from these techniques came only in recent years.

38. What are the variants of Back Propagation?

Stochastic Gradient Descent: We use only a single training example for calculation of gradient and update parameters.
Batch Gradient Descent: We calculate the gradient for the whole dataset and perform the update at each iteration.
Mini-batch Gradient Descent: It’s one of the most popular optimization algorithms. It’s a variant of Stochastic Gradient Descent in which a mini-batch of samples is used instead of a single training example.
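
The three variants differ only in how many examples feed each gradient update. The sketch below shows this on a one-parameter model rather than a neural network; the function name `gradient_descent`, the learning rate, and the toy data are assumptions for illustration.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.01, epochs=200, seed=0):
    """Fit y = w * x by minimising mean squared error.

    batch_size=1 gives stochastic gradient descent, batch_size=len(xs)
    gives batch gradient descent, anything in between is mini-batch.
    """
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w


xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]  # generated by y = 2x
for batch_size in (1, 2, 4):
    print(batch_size, gradient_descent(xs, ys, batch_size))
```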

39. Compare SAS, R, And Python Programming?

SAS: It is one of the most widely used analytics tools, used by some of the biggest companies on earth. It has some of the best statistical functions and a graphical user interface, but it comes with a price tag and hence cannot be readily adopted by smaller enterprises.
R: It is an open source software suite for statistical computing, graphical representation, data manipulation, and calculation. Because it is free, it does not carry the licensing cost of SAS.
Python: Python is a powerful open source programming language that is easy to learn, works well with most other tools and technologies. The best part about Python is that it has innumerable libraries and community created modules making it very robust. It has functions for statistical operation.

40. Describe the structure of Artificial Neural Networks?

Artificial Neural Networks work on the same principle as a biological neural network. They consist of inputs that are processed as weighted sums plus a bias and then passed through activation functions.

41. What is a Random Forest?

Random forest is a versatile method in machine learning that performs both classification and regression tasks. It also helps with tasks such as treating missing values, dimensionality reduction, and handling outlier values. It works by combining many weak models to form a robust model.

42. How regularly must an algorithm be updated?

You will want to update an algorithm when:

You want the model to evolve as data streams through infrastructure
The underlying data source is changing
There is a case of non-stationarity

43. What are the time series algorithms?

Time series algorithms like ARIMA, ARIMAX, SARIMA, and Holt-Winters are very interesting to learn and can be used to solve a lot of complex problems for businesses. Data preparation plays a vital role in time series analysis. Stationarity, seasonality, cycles, and noise all need time and attention. Take as much time as you need to get the data right. Then you can run any model on top of it.

44. Explain The Various Benefits Of R Language?

The R programming language is a software suite used for graphical representation, statistical computing, data manipulation, and calculation.

46. What is Interpolation and Extrapolation?

Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value by extending a known set of values or facts beyond its range.
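
With a straight line through two known points, the same formula covers both cases, and only the position of the query point differs. The function name `linear_estimate` and the sample points are assumptions for illustration.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Known points: (0, 0) and (10, 20).
print(linear_estimate(0, 0, 10, 20, 5))   # interpolation: x lies between the known points
print(linear_estimate(0, 0, 10, 20, 15))  # extrapolation: x lies outside the known range
```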

47. What is power analysis?

An experimental design technique for determining the sample size needed to detect an effect of a given size.

48. What is Collaborative filtering?

The process of filtering used by most of the recommender systems to find patterns or information by collaborating viewpoints, various data sources and multiple agents.

49. What is the difference between Cluster and Systematic Sampling?

Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample where each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is traversed in a circular manner, so once you reach the end of the list, you continue from the top again. The best-known example of systematic sampling is the equal-probability method.

50. Are expected value and mean value different?

They are not different, but the terms are used in different contexts. Mean is generally used when talking about a probability distribution or sample population, whereas expected value is generally used in the context of a random variable.

For sampling data: The mean value is the only value that comes from the sampling data. The expected value is the mean of all the means, i.e. the value that is built from multiple samples. The expected value is the population mean.

For distributions: The mean value and the expected value are the same irrespective of the distribution, under the condition that the distribution is in the same population.

51. Do gradient descent methods always converge to same point?

No, they do not, because in some cases the method reaches a local minimum or local optimum rather than the global optimum. It depends on the data and the starting conditions.

52. What is the goal of A/B Testing?

It is a statistical hypothesis test for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example could be comparing the click-through rate of two versions of a banner ad.
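
One common way to run such a test on click-through counts is a two-proportion z-test, sketched below. The function name `two_proportion_z_test` and the example counts are assumptions, and other tests could be used instead.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) comparing the rates of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 100 clicks out of 1000 views for A, 150 out of 1000 for B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(z, p_value)
```

A small p-value suggests the difference between the two versions is unlikely to be due to chance alone.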

54. How can you assess a good logistic model?

There are various methods to assess the results of a logistic regression analysis:

Using a classification matrix to look at the true negatives and false positives.

Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening.

Lift helps assess the logistic model by comparing it with random selection.

55. What are various steps involved in an analytics project?

Understand the business problem

Explore the data and become familiar with it.

Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.

After data preparation, start running the model, analyse the result and tweak the approach. This is an iterative step till the best possible outcome is achieved.

Validate the model using a new data set.

Start implementing the model and track the result to analyse the performance of the model over the period of time.

56. How can you iterate over a list and also retrieve element indices at the same time?

This can be done using the enumerate function which takes every element in a sequence just like in a list and adds its location just before it.
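
A short example of `enumerate` in use (the list contents are made up for illustration):

```python
fruits = ["apple", "banana", "cherry"]

# enumerate yields (index, element) pairs, counting from 0 by default.
for index, fruit in enumerate(fruits):
    print(index, fruit)

# The optional start argument changes the first index.
indexed = list(enumerate(fruits, start=1))
print(indexed)
```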

57. What is the difference between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?

In a Bayesian estimate, we have some prior knowledge about the data or problem (the prior). There may be several values of the parameters that explain the data, and hence we can look for multiple parameters, like 5 gammas and 5 lambdas, that do this. As a result of the Bayesian estimate, we get multiple models for making multiple predictions, i.e. one for each pair of parameters but with the same prior. So, if a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose. MLE, by contrast, does not use a prior and returns the single parameter value that maximizes the likelihood of the observed data.

58. Can you cite some examples where both false positive and false negatives are equally important?

In the banking industry, giving loans is the primary source of making money, but at the same time, if your repayment rate is not good you will not make any profit; rather, you will risk huge losses. Banks don’t want to lose good customers, and at the same time they don’t want to acquire bad customers. In this scenario both the false positives and the false negatives become very important to measure.

These days we hear of many cases of players using steroids during sport competitions. Every player has to go through a steroid test before the game starts. A false positive can ruin the career of a great sportsman, and a false negative can make the game unfair.

59. What are the basic assumptions to be made for linear regression?

Normality of error distribution, statistical independence of errors, linearity and additivity.

60. Can you write the formula to calculate R-square?

R-Square can be calculated using the formula below:

R-Square = 1 - (Residual Sum of Squares / Total Sum of Squares)
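
The same formula written out in code, as a small sketch (the function name `r_squared` and the sample values are assumptions for illustration):

```python
def r_squared(y_true, y_pred):
    """R-Square = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot


print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```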

61. What is the advantage of performing dimensionality reduction before fitting an SVM?

Support Vector Machine Learning Algorithm performs better in the reduced space. It is beneficial to perform dimensionality reduction before fitting an SVM if the number of features is large when compared to the number of observations.

62. How will you assess the statistical significance of an insight whether it is a real insight or just by chance?

The statistical significance of an insight can be assessed using hypothesis testing.

63. How would you create a taxonomy to identify key customer trends in unstructured data?

The best way to approach this question is to mention that it is good to check with the business owner and understand their objectives before categorizing the data. Having done this, it is always good to follow an iterative approach: pull new data samples, improve the model accordingly, and validate it for accuracy by soliciting feedback from the stakeholders of the business. This helps ensure that your model is producing actionable results and improving over time.

65. What does NLP stand for?

NLP stands for Natural Language Processing. It is a branch of artificial intelligence that gives machines the ability to read and understand human languages.

66. Do you think 50 small decision trees are better than a large one? Why?

Another way of asking this question is “Is a random forest a better model than a decision tree?” The answer is generally yes, because a random forest is an ensemble method that combines many weak decision trees to make a strong learner. Random forests are usually more accurate, more robust, and less prone to overfitting.

67. Why is mean square error a bad measure of model performance? What would you suggest instead?

Mean Squared Error (MSE) gives a relatively high weight to large errors, so it tends to put too much emphasis on large deviations. A more robust alternative is MAE (mean absolute error).
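
The effect of one large error on each metric can be seen in a small sketch (the function names and the sample predictions are assumptions for illustration):

```python
def mse(y_true, y_pred):
    """Mean squared error: squaring magnifies large errors."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 0, 0, 0]
y_pred = [1, 1, 1, 10]  # three small errors and one large one

print(mse(y_true, y_pred))  # dominated by the single error of 10
print(mae(y_true, y_pred))
```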

69. What is boosting?

Boosting is an ensemble method to improve a model by reducing its bias and variance, ultimately converting weak learners to strong learners. The general idea is to train a weak learner and sequentially iterate and improve the model by learning from the previous learner.

70. What are the important libraries of Python that are used in Data Science?

Some of the important libraries of Python that are used in Data Science are:

NumPy

SciPy

Pandas

Matplotlib

Keras

TensorFlow

Scikit-learn

71. Can you name the types of biases that occur in machine learning?

There are four main types of biases that occur while building machine learning algorithms:

Sample Bias

Prejudice Bias

Measurement Bias

Algorithm Bias

72. For tuning hyperparameters of your machine learning model, what will be the ideal seed?

There is no fixed value for the seed and no ideal value. The seed is typically chosen arbitrarily; fixing it simply makes the hyperparameter tuning runs of the machine learning model reproducible.
