Here is a list of Data Scientist interview questions recently asked at Accenture. The questions cover both freshers and experienced professionals. Our **Data Science Training** answers all of the questions below.

### 1. How can you assess a good logistic model?

- Likelihood Ratio Test and Pseudo R^2.
- Hosmer-Lemeshow, Wald Test.
- Variable Importance, Classification Rate.
- ROC Curve, K-Fold Cross-Validation.
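As a rough illustration of one of these checks, the area under the ROC curve (AUC) can be computed as the fraction of positive/negative pairs in which the positive example gets the higher score. This is a minimal sketch; the function name `roc_auc` and the sample labels and scores are made up for the example.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for four observations.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.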

### 2. What are the various steps involved in an analytics project?

- Find an interesting topic, then obtain and understand the data.
- Prepare the data and build the model.
- Model Evaluation.
- Deployment and Visualization.

### 3. During analysis, how do you treat missing values?

- Delete rows with missing values.
- Impute missing values: for continuous variables use the mean or median, for categorical variables use the mode.
- Use other imputation methods where simple statistics are not appropriate.
- Use algorithms that support missing values, or predict the missing values with a model.
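A minimal sketch of simple imputation, assuming missing entries are stored as `None`. The helper `impute` and the sample columns are hypothetical; real projects would more likely use a library such as Pandas.

```python
from statistics import mean, mode


def impute(values, strategy="mean"):
    """Fill None entries with the mean (numeric) or mode (categorical) of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


# Illustrative columns with one missing value each.
imputed_ages = impute([20, None, 30, 40])
imputed_cities = impute(["Pune", "Delhi", None, "Pune"], strategy="mode")
```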

### 4. Explain the Box-Cox transformation in regression models.

A Box-Cox transformation transforms a non-normal dependent variable so that its distribution is closer to normal. It applies only to strictly positive data. Normality is an important assumption for many statistical techniques; if your data isn't normal, applying a Box-Cox transformation lets you run a broader range of tests.
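The transformation itself is short enough to sketch directly: for a parameter λ it maps y to (y^λ − 1)/λ, and to ln(y) when λ = 0. The function name `box_cox` is chosen for this example, and the value of λ is passed in by hand rather than estimated from the data.

```python
import math


def box_cox(y, lam):
    """Apply the Box-Cox transform with parameter lam to a list of positive values."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]


# lam = 0.5 is a square-root-like transform; lam = 0 is the log transform.
transformed = box_cox([1, 4, 9], 0.5)
```

In practice λ is usually chosen by maximum likelihood rather than fixed in advance.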

### 5. Can you use machine learning for time series analysis?

Yes. Time series forecasting is an important area of machine learning that is often neglected. It matters because so many prediction problems involve a time component. A forecasting problem can be framed as supervised learning by using past (lagged) observations as input features.

### 6. Write a function that takes in two sorted lists and outputs a sorted list that is their union.

- Read the number of elements for the first list and store it in a variable.
- Read the elements of the list one by one.
- Do the same for the second list.
- Merge the two lists using the `+` operator, then sort the result.
- Display the elements of the sorted list and exit.
- Since both inputs are already sorted, a two-pointer merge can replace the concatenate-and-sort step and runs in linear time.
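Because the question asks for a function, here is one possible sketch. It uses a two-pointer merge, which takes advantage of the inputs already being sorted, instead of concatenating and re-sorting. It keeps duplicates, as the concatenate-and-sort approach would; the name `merge_sorted` is just for illustration.

```python
def merge_sorted(a, b):
    """Merge two sorted lists into one sorted list in O(len(a) + len(b)) time."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller front element of the two lists.
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # One list is exhausted; the rest of the other is already in order.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


result = merge_sorted([1, 3, 5], [2, 3, 6])
```

If a set-style union without repeats is wanted, duplicates can be skipped while appending.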

### 7. What is Regularization and what kind of problems does regularization solve?

Overfitting occurs when a machine learning model fits the training set too closely and does not perform well on unseen data. Regularization is a technique that adds a penalty on model complexity (for example, on the size of the coefficients) to the training objective, which reduces error on unseen data and helps avoid overfitting.
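To make the idea concrete, here is a small sketch of L2 (ridge) regularization for a one-variable model without an intercept. Minimizing the squared error plus `alpha * w**2` gives the closed form `w = sum(x*y) / (sum(x*x) + alpha)`. The function name and data are invented for the example.

```python
def ridge_slope(x, y, alpha):
    """Slope w of y ~ w*x minimizing sum((y - w*x)**2) + alpha * w**2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


x = [1, 2, 3]
y = [2, 4, 6]
unregularized = ridge_slope(x, y, alpha=0)   # ordinary least squares
regularized = ridge_slope(x, y, alpha=14)    # the penalty shrinks the slope
```

A larger `alpha` pulls the coefficient toward zero, trading a little bias for lower variance.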

### 8. What is multicollinearity and how can you overcome it?

Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results. Common remedies include removing or combining correlated predictors, applying principal component analysis, or using ridge regression.
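One common diagnostic is the variance inflation factor (VIF). In the special case of only two predictors, the VIF reduces to `1 / (1 - r**2)`, where `r` is their Pearson correlation. The sketch below covers just that two-predictor case; function names and data are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r2 = pearson(x1, x2) ** 2
    return math.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)


vif = vif_two_predictors([1, 2, 3, 4], [2, 1, 4, 3])
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity.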

### 9. What is the curse of dimensionality?

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
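One such phenomenon can be shown with a few lines of arithmetic: the share of a hypercube's volume that lies inside its inscribed ball collapses toward zero as the dimension grows, so most of the volume ends up in the "corners". This is only one illustration among many, and the function name is made up.

```python
import math


def inscribed_ball_fraction(d):
    """Fraction of the cube [-1, 1]^d occupied by the unit ball in d dimensions."""
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube_volume = 2.0 ** d
    return ball_volume / cube_volume


fractions = {d: inscribed_ball_fraction(d) for d in (2, 3, 10, 20)}
```

In two dimensions the ball covers π/4 of the square, while in ten dimensions it covers well under one percent of the cube.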

### 10. How do you decide whether your linear regression model fits the data?

- Make sure the assumptions are satisfactorily met.
- Examine potential influential points and the change in the R² and adjusted R² statistics.
- Check for necessary interaction terms, then apply the model to another data set and check its performance.
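The two statistics mentioned above can be computed as follows. This is a sketch with invented data; `n` is the number of observations and `p` the number of predictors.

```python
def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 penalizes R^2 for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


y_true = [1, 2, 3, 4, 5]
y_hat = [1.5, 2, 3, 4, 4.5]   # hypothetical model predictions
r2 = r_squared(y_true, y_hat)
adj_r2 = adjusted_r_squared(r2, n=len(y_true), p=1)
```

Unlike R², the adjusted version can go down when a predictor that adds little is included.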

### 11. What is Data Science?

Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data.

### 12. What is the Law of Large Numbers?

The law of large numbers is a theorem from probability and statistics stating that the average result from repeating an experiment many times converges to the true or expected underlying result as the number of repetitions grows. It assumes that all sample observations for an experiment are drawn independently from the same idealized population of observations.
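A quick simulation gives a feel for this. The sketch below flips a simulated fair coin `n` times and returns the share of heads; the seed is fixed only so that runs are repeatable, and the function name is hypothetical.

```python
import random


def share_of_heads(n, seed=0):
    """Simulate n fair coin flips and return the fraction that came up heads."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(n)) / n


# The sample mean should drift toward the expected value 0.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, share_of_heads(n))
```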

### 13. How is machine learning deployed in real-world scenarios?

Deployment is the method by which you integrate a machine learning model into an existing production environment to make practical business decisions based on data. It is one of the last stages in the machine learning life cycle and can be one of the most cumbersome.

### 14. What is collaborative filtering?

Collaborative filtering uses a large set of data about user interactions to generate a set of recommendations. The idea behind collaborative filtering is that users with similar evaluations of certain items will enjoy the same things both now and in the future. User preference data can also be gathered implicitly.
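A toy sketch of user-based collaborative filtering: similarity between users is measured with cosine similarity over the items both have rated (a common simplification), and an unseen item's rating is predicted as a similarity-weighted average. All user names, items, and ratings are invented.

```python
import math

# Hypothetical explicit ratings on a 1-5 scale.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity of two users over their co-rated items."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None


predicted = predict("alice", "D")
```

Here the prediction for Alice leans toward Bob's rating because his past ratings match hers more closely than Carol's do.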

### 15. What are the important libraries of Python that are used in Data Science?

- TensorFlow, NumPy.
- SciPy, Matplotlib.
- Pandas, Keras.
- scikit-learn, statsmodels.

### 16. What is the difference between squared error and absolute error?

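The absolute error is the magnitude of the difference between an individual measurement (or prediction) and the true value of the quantity. The arithmetic mean of all the absolute errors is the mean absolute error. The squared error is the square of that same difference, and its mean is the mean squared error. Because squaring magnifies large differences, squared error penalizes outliers more heavily than absolute error.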
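The two error measures can be compared side by side on a small made-up example:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (true - predicted)^2; large errors weigh more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3, 5, 2, 7]
y_pred = [2, 5, 4, 7]   # errors of 1, 0, -2, 0
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
```

The single error of size 2 contributes twice as much as the error of size 1 to the absolute measure, but four times as much to the squared one.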

### 17. What is Machine Learning?

Machine-learning algorithms use statistics to find patterns in massive amounts of data. And data, here, encompasses a lot of things—numbers, words, images, clicks, what have you. If it can be digitally stored, it can be fed into a machine-learning algorithm.

### 18. How are confidence intervals constructed and how will you interpret them?

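A confidence interval is a range of values, computed from sample data, that is likely to contain the true value of a population parameter. It is typically constructed as the point estimate plus or minus a margin of error (a critical value multiplied by the standard error). Confidence intervals measure the degree of uncertainty in a sampling method, and they are most often constructed using confidence levels of 95% or 99%. A 95% level means that if the sampling were repeated many times, about 95% of the intervals built this way would contain the true parameter.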
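A sketch of constructing an interval for a mean, assuming a normal approximation (a z critical value). For small samples a t critical value would usually be preferred; the normal version is used here only to stay within the standard library. The sample values are invented.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    m = mean(sample)
    standard_error = stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for level=0.95
    return m - z * standard_error, m + z * standard_error


sample = [10, 12, 11, 13, 9, 11, 12, 10]
low95, high95 = confidence_interval(sample, 0.95)
low99, high99 = confidence_interval(sample, 0.99)
```

Raising the confidence level from 95% to 99% widens the interval, since more certainty requires a larger margin of error.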

### 19. How will you explain logistic regression to an economist, physical scientist and biologist?

Logistic regression is a statistical analysis method used to predict a binary outcome (such as yes/no) based on prior observations of a data set. A logistic regression model estimates the probability of the dependent variable by analyzing its relationship with one or more existing independent variables.
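The core of the model is small: a weighted sum of the inputs is passed through the logistic (sigmoid) function to give a probability between 0 and 1. In this sketch the weights and bias are hypothetical numbers rather than fitted values.

```python
import math


def predict_probability(features, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two input variables.
weights, bias = [0.8, -0.5], 0.1
p_low = predict_probability([0.0, 2.0], weights, bias)
p_high = predict_probability([3.0, 0.0], weights, bias)
```

Each weight can be read as the change in the log-odds of the outcome per unit change in its variable.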

### 20. How can you overcome Overfitting?

- Cross-validation and train with more data.
- Remove features and early stopping.
- Regularization, ensembling.

### 21. Differentiate between wide and tall data formats?

Wide data has a column for each variable, whereas tall (long) data has one column listing the variable names and another column holding the values of those variables.
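A small sketch of reshaping wide records into the long layout, using plain dictionaries. The column names and the helper `to_long` are invented; with Pandas the same reshaping is typically done with `melt`.

```python
# Wide format: one row per subject, one column per variable.
wide = [
    {"id": 1, "height": 170, "weight": 65},
    {"id": 2, "height": 180, "weight": 80},
]


def to_long(rows, id_key):
    """Turn each non-id column into its own (id, variable, value) row."""
    return [
        {id_key: row[id_key], "variable": name, "value": value}
        for row in rows
        for name, value in row.items()
        if name != id_key
    ]


long_rows = to_long(wide, "id")
```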

### 22. Is Naïve Bayes bad? If yes, under what aspects.

One of the disadvantages of Naïve Bayes is that if a class label and a certain attribute value never occur together in the training data, the frequency-based probability estimate will be zero. Because all the probabilities are multiplied together, this single zero forces the whole product to zero. Laplace (add-one) smoothing is a common fix.
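The zero-frequency problem and the usual remedy, Laplace (additive) smoothing, can be sketched in a few lines. The counts below are made up, and `n_values` is the number of distinct values the attribute can take.

```python
def conditional_probability(count_value_and_class, count_class, n_values, alpha=1.0):
    """Estimate P(attribute value | class) with additive smoothing alpha."""
    return (count_value_and_class + alpha) / (count_class + alpha * n_values)


# The value never appears with this class in 10 training rows; 3 possible values.
unsmoothed = conditional_probability(0, 10, 3, alpha=0)
smoothed = conditional_probability(0, 10, 3)
```

With smoothing, the unseen combination gets a small nonzero probability, so it no longer wipes out the product.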

### 23. How would you develop a model to identify plagiarism?

- Tokenize the document.
- Remove all the stop words using the NLTK library.
- Use the Gensim library to find the most relevant words, line by line. This can be done by building an LDA or LSA model of the document.
- Use the Google Search API to search for those words and compare the document against the results.
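The first steps of this pipeline can be approximated without any external library. The sketch below is a deliberately simplified stand-in: it uses a tiny hand-written stop-word list instead of NLTK, and Jaccard overlap of the remaining words instead of Gensim topic models or a search API. All names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "on"}


def content_words(text):
    """Tokenize, lowercase, and drop stop words; returns a set of words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS}


def jaccard_similarity(doc_a, doc_b):
    """Shared content words divided by all content words in either document."""
    a, b = content_words(doc_a), content_words(doc_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


score = jaccard_similarity("The cat sat on the mat", "A cat sat on a mat")
```

A score near 1 flags a pair of texts for closer inspection; it is not proof of plagiarism on its own.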

### 24. How will you define the number of clusters in a clustering algorithm?

The optimal number of clusters can be determined with the elbow method:

- Run the clustering algorithm (e.g., k-means clustering) for different values of k.
- For each k, calculate the total within-cluster sum of squares (WSS).
- Plot the curve of WSS against the number of clusters k.
- Choose the k at the bend (the "elbow") of the curve, beyond which adding clusters reduces WSS only slightly.
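The WSS curve can be sketched with a bare-bones one-dimensional k-means. This is a toy implementation with a deterministic initialization chosen for repeatability, not a substitute for a library routine; the data and function names are invented.

```python
def kmeans_1d(points, k, iters=100):
    """Lloyd's algorithm on 1-D data; returns (centers, clusters)."""
    pts = sorted(points)
    # Deterministic start: k evenly spaced values from the sorted data.
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, clusters


def wss(points, k):
    """Total within-cluster sum of squares for k clusters."""
    centers, clusters = kmeans_1d(points, k)
    return sum((p - c) ** 2 for c, cluster in zip(centers, clusters) for p in cluster)


data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
curve = {k: wss(data, k) for k in range(1, 5)}
```

For this data the WSS drops sharply up to k = 3 and barely improves afterwards, which is the elbow the plot would show.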

### 25. Is it possible to perform logistic regression with Microsoft Excel?

Yes, for example with the XLSTAT add-in. To activate the Logistic regression dialog box, start XLSTAT, then select the XLSTAT / Modeling data / Logistic regression function. When you click on the button, the Logistic regression dialog box appears. Then select the data on the Excel sheet.

### 26. What are Eigenvalue and Eigenvector?

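For a square matrix A, an eigenvector is a nonzero vector v whose direction is unchanged when A is applied to it: Av = λv. The scalar λ is the corresponding eigenvalue, the factor by which the eigenvector is stretched or shrunk. Eigenvectors are often normalized to unit vectors, meaning their length or magnitude is equal to 1, and they are often referred to as right vectors, which simply means column vectors.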
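For a symmetric 2×2 matrix the eigenvalues and unit-length eigenvectors have a closed form, which makes for a compact sketch. The function name is made up, and the approach is specific to the 2×2 symmetric case; general matrices call for a numerical library such as NumPy.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]].

    Returns [(eigenvalue, (x, y)), ...] with unit-length eigenvectors.
    """
    if b == 0:
        # Already diagonal: the coordinate axes are the eigenvectors.
        return [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
    center = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    pairs = []
    for lam in (center + radius, center - radius):
        x, y = b, lam - a          # solves (a - lam) * x + b * y = 0
        norm = math.hypot(x, y)
        pairs.append((lam, (x / norm, y / norm)))
    return pairs


pairs = eig_sym_2x2(2, 1, 2)   # the matrix [[2, 1], [1, 2]]
```

Multiplying the matrix by each returned vector gives the same vector scaled by its eigenvalue, which is the defining property.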

### 27. Compare SAS, R, And Python Programming?

Many large IT organizations choose SAS as their data analytics tool. Because R is very good at statistical computation, it is largely used by statisticians and researchers. Startups often prefer Python over the other two due to its lightweight nature, large community, and deep learning capabilities.

### 28. How regularly must an algorithm be updated?

An algorithm should be updated as often as its need, usage, and market growth require. For example, Google is reported to change its search algorithm around 500 to 600 times each year.

### 29. What is the goal of A/B Testing?

The goal of A/B testing is to find the best performing content for a specific goal (or goals). Choosing the goal of your test should be part of your test development process. Most marketers focus on improving one of a few different key performance indicators (KPIs).
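Deciding which variant "performs best" usually involves a significance test. One common choice, sketched here under the assumption that the KPI is a conversion rate, is a two-sided two-proportion z-test. The visitor and conversion counts are invented.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical test: variant A converts 100 of 1000 visitors, B converts 150 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone.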

### 30. What are the feature vectors?

The feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis.
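As a simple illustration, a sentence can be turned into a feature vector by counting how often each word of a fixed vocabulary appears (a bag-of-words representation). The vocabulary and sentence are made up.

```python
def bag_of_words(text, vocabulary):
    """Represent text as a vector of word counts, one entry per vocabulary term."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


vocabulary = ["cat", "dog", "the", "bird"]
vector = bag_of_words("The cat chased the dog", vocabulary)
```

Each position in the vector corresponds to one vocabulary term, so texts of any length map to vectors of the same dimension.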