Capgemini Data Science Interview Questions

Here is a list of Data Science interview questions recently asked at Capgemini. The questions cover both freshers and experienced professionals, and our Data Science Training answers all of them.

1. What do you mean by the term Data Science?

Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. It encompasses preparing data for analysis, including cleansing, aggregating, and manipulating the data to perform advanced data analysis.

2. Explain the term botnet.

A botnet is a collection of internet-connected devices infected by malware that allows hackers to control them. Cyber criminals use botnets to launch botnet attacks, which include malicious activities such as credential leaks, unauthorized access, data theft, and DDoS attacks.

3. What is Data Visualization?

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

4. How can you define data cleaning as a critical part of the process?

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.

5. Differentiate between data modelling and database design.

Data modelling comes first: a data model is a set of constructs (entities, attributes, and relationships) used to describe the data and to produce designs for databases. Database design is the more detailed step that turns the model into a concrete design. That design is stored in the database schema, which is in turn stored in the data dictionary.

7. What are Recommender Systems?

A recommender system is a system that predicts a user's future preference for a set of items and recommends the top items. One key reason we need recommender systems today is that the prevalence of the Internet gives people too many options to choose from.

8. Why does data cleaning play a vital role in analysis?

Data cleaning helps analysis in two ways. Cleaning data from multiple sources transforms it into a format that data analysts or data scientists can work with. It also helps to increase the accuracy of machine learning models.

9. What is Linear Regression?

Linear regression is a statistical technique that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. It can be used to generate insights on consumer behavior and on the factors that influence profitability, and businesses use it to evaluate trends and make estimates or forecasts.
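As a minimal sketch of how such a forecast can be produced, the snippet below fits a one-variable linear regression by ordinary least squares in plain Python. The function name `fit_line` and the spend/sales figures are illustrative assumptions, not part of the original answer.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend vs. sales, made up for illustration.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(spend, sales)
forecast = slope * 5.0 + intercept  # estimated sales at a spend of 5.0
print(slope, intercept, forecast)
```

With real data the points will not fall exactly on a line, so the fitted slope and intercept are estimates rather than exact values.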

10. What do you understand by the term hash table collisions?

A collision occurs when two keys are hashed to the same index in a hash table. Collisions are a problem because every slot in a hash table is supposed to store a single element.
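One common way to cope with collisions is separate chaining, where each slot holds a small list of entries instead of a single element. The class below is a hypothetical sketch of that idea (the name `ChainedHashTable` and the table size are assumptions chosen for illustration); it relies on the fact that small integers hash to themselves in CPython.

```python
class ChainedHashTable:
    """Tiny hash table that resolves collisions by separate chaining."""

    def __init__(self, size=4):
        # Each slot holds a list (a "chain") of (key, value) pairs.
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))  # new key, possibly colliding with others

    def get(self, key):
        for existing_key, value in self.slots[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(size=4)
table.put(1, "a")
table.put(5, "b")  # 1 % 4 == 5 % 4, so both keys land in slot 1
print(table.slots[1])
```

Open addressing (probing for the next free slot) is another widely used strategy; chaining is shown here only because it is short.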

11. What are various steps involved in an analytics project?

The data analytics lifecycle encompasses six phases: data discovery, data aggregation, planning of the data models, data model execution, communication of the results, and operationalization. The phases are iterative, with backward, forward, and sometimes overlapping movement.

12. How can you iterate over a list and also retrieve element indices at the same time?

Use the built-in enumerate function, which yields (index, value) pairs as it walks the list. Its optional start argument is helpful when you need to count from 1 or any other number instead of 0: for index, value in enumerate(numbers, start=1): print('The value at position', index, 'is', value).
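Written out as a runnable Python 3 snippet, the loop looks like this. The list `numbers` is a made-up example.

```python
numbers = [10, 20, 30]

# enumerate yields (index, value) pairs; start=1 begins counting at 1.
for index, value in enumerate(numbers, start=1):
    print("The value at position", index, "is", value)

# Without the start argument, counting begins at 0.
pairs = list(enumerate(numbers))
print(pairs)
```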

13. What is collaborative filtering?

Collaborative filtering is a class of recommenders that leverage only the past user-item interactions in the form of a ratings matrix. It is a technique used by recommender systems.

14. What is boosting?

Boosting is an ensemble learning technique that combines a set of weak learners into a strong learner in order to increase the accuracy of the model. The weak learners are trained one after another, with each new learner concentrating on the examples its predecessors got wrong.

For tuning hyperparameters of your machine learning model, what will be the ideal seed?

There is no ideal seed: the random seed is not a hyperparameter to tune, and it is fixed only to make results reproducible. For the tuning itself, grid search is arguably the most basic method. With this technique, we build a model for each possible combination of the provided hyperparameter values, evaluate each model, and select the configuration that produces the best results.
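The exhaustive loop behind grid search can be sketched with the standard library alone. Here `grid_search`, `toy_score`, and the parameter names are hypothetical stand-ins; in practice the scoring function would train a model and return a validation metric.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function standing in for "train a model and validate it".
def toy_score(learning_rate, depth):
    return -(learning_rate - 0.1) ** 2 - (depth - 3) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of models grows multiplicatively with each added hyperparameter (3 x 3 = 9 here), which is the main cost of this approach.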

15. Compare and contrast R and SAS.

SAS is a commercial programming language and software suite designed primarily for statistical analysis of data from spreadsheets or databases. R is a free, open-source programming language widely used among statisticians and data miners to develop statistical software and perform data analysis.

16. What do you understand by the letter ‘R’?

R (or the R programming language) is free, open-source software used for all kinds of data science, statistics, and visualization projects. It lets you build and run statistical models, and it can be integrated with analytics platforms such as Sisense so that a model updates automatically as new information flows in.

17. What is Interpolation and Extrapolation?

Interpolation is the estimation of a value between two known values in a sequence of values. Extrapolation is the estimation of a value by extending a known sequence of values or facts beyond the range that is actually known.
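A short sketch makes the distinction concrete, assuming a straight line through two known points. The helper `linear_estimate` and the sample points are illustrative only.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Known points: (2, 20) and (4, 40).
inside = linear_estimate(2, 20, 4, 40, 3)   # interpolation: 3 lies between 2 and 4
outside = linear_estimate(2, 20, 4, 40, 6)  # extrapolation: 6 lies beyond 4
print(inside, outside)
```

The arithmetic is identical in both cases; the extrapolated value is simply less trustworthy because nothing observed confirms that the trend continues past the known range.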

18. What is the difference between Cluster and Systematic Sampling?

Systematic sampling selects a random starting point in the population and then takes members at regular fixed intervals, with the interval depending on the population and sample sizes. Cluster sampling divides the population into clusters, randomly selects some of those clusters, and then samples the members of the selected clusters (all of them, or a random subset).
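The two schemes can be sketched as follows, assuming the one-stage form of cluster sampling in which whole clusters are drawn at random and every member of a chosen cluster is kept. The function names and the city-style clusters are hypothetical.

```python
import random

def systematic_sample(population, step, rng):
    """Pick a random start in the first `step` items, then every `step`-th item."""
    start = rng.randrange(step)
    return population[start::step]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every member of the chosen ones."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(20))
print(systematic_sample(population, step=5, rng=rng))

# Hypothetical clusters, e.g. customers grouped by region.
clusters = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8, 9]}
print(cluster_sample(clusters, n_clusters=2, rng=rng))
```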

19. Are expected value and mean value different?

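Mathematically they are the same quantity; the terms differ in context. Expected value refers to a random variable: it is the probability-weighted average of its possible values, which is also its long-run average over a large number of experiments. Mean usually refers to the average of an observed sample or population. A random variable assigns a numeric value to each possible outcome of an experiment.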
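For a concrete case, the sketch below computes the expected value of a fair six-sided die exactly and compares it with the mean of one small sample. The helper `expected_value` and the list of rolls are made up for illustration.

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of outcome * probability over a dict {outcome: probability}."""
    return sum(outcome * prob for outcome, prob in distribution.items())

# A fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
ev = expected_value(die)
print(ev)  # 7/2

# The mean of one particular sample of rolls is generally a bit different.
rolls = [2, 6, 3, 3, 5, 1]
sample_mean = Fraction(sum(rolls), len(rolls))
print(sample_mean)  # 10/3
```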

20. What does P-value signify about the statistical data?

The p-value is the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. The P stands for probability. A small p-value means the observed result would be unlikely if the null hypothesis were true; it is not the probability that the null hypothesis itself is true.
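As a worked example of "equal to or more extreme", the sketch below computes a one-sided p-value for a coin-flip experiment with an exact binomial calculation. The function `binomial_p_value` and the 9-heads-in-10-flips scenario are assumptions chosen for illustration.

```python
from math import comb

def binomial_p_value(n, k, p=0.5):
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, p) under the null hypothesis."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical experiment: 9 heads in 10 flips; null hypothesis: the coin is fair.
p_value = binomial_p_value(n=10, k=9)
print(p_value)  # (10 + 1) / 1024, roughly 0.0107
```

Only two outcomes are at least as extreme as 9 heads (9 heads and 10 heads), which together cover 11 of the 1024 equally likely flip sequences.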

21. What is the goal of A/B Testing?

A/B testing is a basic randomized controlled experiment. Its goal is to compare two versions of a variable in a controlled environment and find out which one performs better.

22. What is an Eigenvalue and Eigenvector?

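An eigenvector of a square matrix A is a non-zero vector v whose direction is unchanged when A is applied to it: A v = λ v. The scalar λ is the corresponding eigenvalue, the factor by which the eigenvector is stretched or shrunk. Eigenvectors are often normalized to unit vectors, meaning their length or magnitude is equal to 1, and they are usually written as right vectors, which simply means column vectors.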
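The defining relation, that applying the matrix to an eigenvector only rescales it by the eigenvalue, can be checked numerically. The sketch below handles only the special case of a symmetric 2x2 matrix [[a, b], [b, d]] with b != 0; the function names are hypothetical, and a numerical library would normally be used for general matrices.

```python
import math

def eigen_2x2_symmetric(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], assuming b != 0."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (A - lam*I) v = 0 has the solution v = (b, lam - a) when b != 0.
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))  # normalize to unit length
    return pairs

def mat_vec(a, b, d, v):
    """Multiply [[a, b], [b, d]] by the column vector v."""
    return (a * v[0] + b * v[1], b * v[0] + d * v[1])

for lam, v in eigen_2x2_symmetric(2, 1, 2):
    print(lam, v, mat_vec(2, 1, 2, v))  # A v should equal lam * v
```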

23. How can you assess a good logistic model?

Plotting the pairs of sensitivity and specificities (or, more often, sensitivity versus one minus specificity) on a scatter plot provides an ROC (Receiver Operating Characteristic) curve. The area under this curve (AUC of the ROC) provides an overall measure of fit of the model.
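The AUC can also be computed without drawing the curve, because it equals the share of (positive, negative) pairs in which the positive example receives the higher score. The sketch below uses that pairwise form; `roc_auc`, the labels, and the scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and predicted probabilities from a logistic model.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ranked correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.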

24. What is Machine Learning?

Machine learning is a branch of computer science that deals with programming systems so that they automatically learn and improve with experience. For example, robots are programmed so that they can carry out tasks based on the data they collect from sensors. In short, machine learning automatically learns programs from data.

25. Why do you want to work as a data scientist?

A strong answer shows a passion for data-driven work: solving problems with an analytical approach and incorporating technology into the work.

26. What is the Law of Large Numbers?

The law of large numbers states that an observed sample average from a large sample will be close to the true population average and that it will get closer the larger the sample.
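A quick simulation illustrates the statement, assuming a fair six-sided die whose true mean is 3.5. The function name and the fixed seed are arbitrary choices made so the sketch is repeatable.

```python
import random

def sample_mean_of_die(n_rolls, seed=0):
    """Average of n_rolls simulated fair-die rolls."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

# The true mean of a fair die is 3.5; larger samples tend to land closer to it.
for n in (10, 1_000, 100_000):
    print(n, sample_mean_of_die(n))
```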

27. Why is data cleaning essential in Data Science?

Clean data increases overall productivity and provides the highest-quality information for your decision-making. The benefits include the removal of errors when multiple sources of data are at play. Fewer errors make for happier clients and less-frustrated employees.

28. What Is K-means? How Can You Select K For K-means?

K-means clustering is a method of vector quantization that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean (the centroid), which serves as a prototype of the cluster. A popular way to choose k is the elbow method: run k-means for a range of k values, plot the within-cluster sum of squares against k, and pick the point where further increases in k bring only small improvements.
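The sketch below shows both ideas on one-dimensional data: a bare-bones k-means loop (Lloyd's algorithm) and the within-cluster sum of squares (WCSS) that the elbow method inspects. The function `kmeans_1d`, its deterministic initialization, and the sample data are simplifying assumptions; real implementations work in many dimensions and use randomized starts.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centroids, within-cluster sum of squares)."""
    ordered = sorted(points)
    # Simple deterministic start: spread initial centroids across the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
    wcss = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, wcss

# Hypothetical data with three obvious groups.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
for k in range(1, 6):
    _, wcss = kmeans_1d(data, k)
    print(k, round(wcss, 3))  # the drop in WCSS flattens after k = 3
```

On this data the WCSS falls sharply up to k = 3 and barely changes afterwards, so the elbow sits at k = 3.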

29. What is underfitting?

Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit model is not suitable, and the problem is usually easy to spot because the model performs poorly even on the training data.
