Capgemini Hadoop Interview Questions

Here is a list of Hadoop interview questions that were recently asked at Capgemini. The questions are suitable for both freshers and experienced professionals. Our Hadoop training covers the answers to all of the questions below.

1. What is serialization?

Serialization is the process of converting an object into a stream of bytes so that the object can be stored or transmitted to memory, a database, or a file. The reverse process, rebuilding the object from those bytes, is called deserialization.
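As a minimal sketch of the idea, the snippet below turns a record into bytes and back again. It uses Python's json module purely for illustration; the record and the helper names to_bytes and from_bytes are hypothetical, and Hadoop itself relies on its own serialization mechanisms rather than this code.

```python
import json

def to_bytes(record: dict) -> bytes:
    """Serialize: turn an in-memory object into a stream of bytes."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def from_bytes(payload: bytes) -> dict:
    """Deserialize: rebuild the object from the stored or transmitted bytes."""
    return json.loads(payload.decode("utf-8"))

# Hypothetical record used only for this illustration.
employee = {"emp_id": 1, "name": "Asha"}

payload = to_bytes(employee)    # bytes that could be written to a file or sent over a network
restored = from_bytes(payload)  # the same data, rebuilt from the bytes
```

The round trip should give back a record equal to the original, which is the property any serialization format has to provide.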

2. How to remove duplicate records from a Hive table?

Use INSERT OVERWRITE with the DISTINCT keyword.
Use a GROUP BY clause on all columns to remove the duplicates.
Use INSERT OVERWRITE with the row_number() analytic function, keeping only the first row of each group.
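The three approaches above can be sketched with Python's built-in sqlite3 module standing in for Hive. This is an assumption made so the example runs anywhere: SQLite has no INSERT OVERWRITE, so the sketch only shows the SELECT part that would feed it, and the emp table with its columns is hypothetical. The window-function query needs SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, name TEXT, updated TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", [
    (1, "Asha", "2024-01-01"),
    (1, "Asha", "2024-01-01"),  # exact duplicate row
    (2, "Ravi", "2024-01-01"),
    (2, "Ravi", "2024-02-01"),  # same id, newer version
])

# 1. DISTINCT: drops rows that are identical in every column.
#    In Hive this SELECT would follow INSERT OVERWRITE TABLE emp.
distinct_rows = conn.execute(
    "SELECT DISTINCT id, name, updated FROM emp ORDER BY id, updated"
).fetchall()

# 2. GROUP BY on all columns: same effect as DISTINCT.
grouped_rows = conn.execute(
    "SELECT id, name, updated FROM emp "
    "GROUP BY id, name, updated ORDER BY id, updated"
).fetchall()

# 3. row_number(): keep one row per key (here, the latest per id).
latest_rows = conn.execute("""
    SELECT id, name, updated FROM (
        SELECT id, name, updated,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated DESC) AS rn
        FROM emp
    ) t
    WHERE rn = 1
    ORDER BY id
""").fetchall()
```

DISTINCT and GROUP BY only remove rows that match in every column, while the row_number() variant also lets you choose which row survives when only the key is duplicated.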

3. How to find the number of delimiters in a file?

Read the first few lines, count the commas and the tabs, and compare the totals. If there are 20 commas and no tabs, the file is CSV. If there are 20 tabs and 2 commas, the file is TSV.
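The counting heuristic described above might look like the following sketch. The function name guess_delimiter and the sample_size parameter are hypothetical, and the "unknown" result for a tie is an added assumption, since the rule above does not say what to do when the counts are equal.

```python
def guess_delimiter(lines, sample_size=5):
    """Count commas and tabs in the first few lines and compare the totals.

    Returns a tuple (kind, comma_count, tab_count).
    """
    sample = lines[:sample_size]
    commas = sum(line.count(",") for line in sample)
    tabs = sum(line.count("\t") for line in sample)

    if commas > tabs:
        kind = "CSV"
    elif tabs > commas:
        kind = "TSV"
    else:
        kind = "unknown"  # assumption: a tie gives no clear answer
    return kind, commas, tabs

csv_guess = guess_delimiter(["id,name,city", "1,Asha,Pune"])
tsv_guess = guess_delimiter(["id\tname\tcity", "1\tRavi\tPune, India"])
```

The second sample contains a stray comma inside a field, which is why the heuristic compares the two counts instead of checking for the mere presence of a character.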

4. How to replace a certain word in a file using Unix?

To change the text in a file under Linux/Unix, use sed:

Use the Stream EDitor (sed) as follows: sed -i 's/old-text/new-text/g' input
The s is the substitute command of sed, used for find and replace. The trailing g replaces every occurrence on a line, not only the first.
It tells sed to find all occurrences of ‘old-text’ and replace them with ‘new-text’ in a file named input. The -i option edits the file in place.
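For comparison, here is a rough Python analogue of that substitution. It is a sketch, not a replacement for sed: it treats ‘old-text’ as literal text (sed would treat it as a regular expression), and the helper name replace_in_file is hypothetical. The demo writes to a temporary directory so it does not touch real files.

```python
import os
import tempfile
from pathlib import Path

def replace_in_file(path, old_text, new_text):
    """Replace every literal occurrence of old_text in the file, in place.

    Returns the number of occurrences that were replaced.
    """
    file_path = Path(path)
    content = file_path.read_text(encoding="utf-8")
    count = content.count(old_text)
    file_path.write_text(content.replace(old_text, new_text), encoding="utf-8")
    return count

# Demo on a throwaway file named "input", mirroring the file name used above.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "input")
    Path(demo_path).write_text("old-text here, old-text there\n", encoding="utf-8")
    replaced = replace_in_file(demo_path, "old-text", "new-text")
    result = Path(demo_path).read_text(encoding="utf-8")
```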

5. How to import a table without a primary key?

Sqoop normally divides an import among its mappers using the table's primary key. For a table without one, add the --split-by option to the sqoop import command to name the column to split on, or run the import with a single mapper (-m 1). Run the command from a terminal. Sqoop then distributes the source table's data across its mappers based on the column specified in the --split-by option.
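To see why a split column matters, the sketch below divides the value range of a numeric column into one contiguous slice per mapper. It is a simplified illustration of range splitting in general, not Sqoop's actual code; the function name split_ranges and its parameters are hypothetical.

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide the inclusive range [min_val, max_val] into contiguous slices.

    Each slice is an inclusive (start, end) pair that one mapper would import.
    """
    total = max_val - min_val + 1
    base, extra = divmod(total, num_mappers)

    ranges = []
    start = min_val
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        if size == 0:
            continue  # more mappers than values: nothing left for this one
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

even_split = split_ranges(1, 100, 4)
uneven_split = split_ranges(1, 10, 3)
```

If the values of the split column are heavily skewed, equal-width slices like these hold very different numbers of rows, which is why a reasonably uniform column is usually the better choice.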

6. What is cogroup in pig?

The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two is that GROUP is normally used with one relation, while COGROUP is used in statements involving two or more relations.
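The behaviour can be imitated in a few lines of Python. This is a conceptual sketch only: the relations students and marks are hypothetical, and each is modelled as a list of tuples whose first field is the grouping key. For every key, the result holds one bag of matching rows per input relation, and a bag is empty when that relation has no rows for the key.

```python
from collections import defaultdict

def cogroup(left, right):
    """Group two relations by their first field.

    Returns a list of (key, left_bag, right_bag) tuples ordered by key.
    """
    bags = defaultdict(lambda: ([], []))
    for row in left:
        bags[row[0]][0].append(row)
    for row in right:
        bags[row[0]][1].append(row)
    return [(key, bags[key][0], bags[key][1]) for key in sorted(bags)]

# Hypothetical relations: (student_id, name) and (student_id, score).
students = [(1, "Asha"), (2, "Ravi")]
marks = [(1, 90), (1, 85), (3, 70)]

grouped = cogroup(students, marks)
```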

7. How to write a UDF in Hive?

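Add the dependency JAR file (hive-exec) to your Eclipse build path. You can get the hive-exec JAR from.
Create a Java class that extends Hive’s “UDF” class.
Export a JAR file from the Eclipse project.
Add the JAR to the Hive session (ADD JAR).
Create a temporary function for the UDF in Hive (CREATE TEMPORARY FUNCTION).
To keep the function permanently, create it together with its JAR (CREATE FUNCTION ... USING JAR).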

8. How can you join two big tables in Hive?

By default, Hive will simply perform a normal join. If both tables have the same number of buckets and the data is sorted by the bucket keys, Hive can perform the faster Sort-Merge Join. To activate it, you have to execute the following commands: set hive.
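The reason sorted data helps can be shown with a small merge join. This is a conceptual sketch of the general sort-merge technique, not Hive's implementation; the function name and the sample rows are hypothetical. Because both inputs are already sorted by key, a single forward pass over each is enough and no shuffle or hash table is needed.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) tuples that are already sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Emit every pairing of the two runs.
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    result.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return result

orders = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
customers = [(2, "x"), (3, "y"), (4, "z")]
joined = sort_merge_join(orders, customers)
```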

9. What is the difference between order by and sort by?

The difference between “order by” and “sort by” is that the former guarantees total order in the output, while the latter only guarantees ordering of the rows within a reducer. If there is more than one reducer, “sort by” may give partially ordered final results.
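A small simulation makes the difference visible. It is a sketch under stated assumptions: the rows are plain integers, and rows are handed to reducers round-robin, which is only a stand-in for however Hive actually distributes them. The function names are hypothetical.

```python
def sort_by(rows, num_reducers):
    """Sort rows within each reducer only, then concatenate the reducer outputs."""
    partitions = [[] for _ in range(num_reducers)]
    for index, row in enumerate(rows):
        # Assumption: round-robin distribution, used here only for illustration.
        partitions[index % num_reducers].append(row)

    output = []
    for partition in partitions:
        output.extend(sorted(partition))  # ordered inside the reducer, not across reducers
    return output

def order_by(rows):
    """Total order over all rows, as if a single reducer saw everything."""
    return sorted(rows)

rows = [5, 3, 8, 1, 9, 2]
partially_ordered = sort_by(rows, num_reducers=2)
totally_ordered = order_by(rows)
```

With a single reducer the two functions return the same list, which matches the rule above: the results only diverge once more than one reducer is involved.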
