IBM Hadoop Interview Questions

Here is a list of Hadoop interview questions that were recently asked at IBM. The questions are suitable for both freshers and experienced professionals. Our Hadoop training answers all of the questions below.

1. What is a Hive variable?

Hive variables are key-value pairs that can be set with the set command (or passed with --hivevar on the command line) and then referenced in Hive scripts and Hive SQL, for example as ${hivevar:name}. Hive substitutes the variable values into the query text before the query is compiled and executed.
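
The substitution step can be pictured as plain text replacement. The Python sketch below is not Hive's implementation; it only mimics how a ${hivevar:name} reference is swapped for its value before the query runs. The function name substitute_hivevars and the sample table and variable names are made up for illustration.

```python
import re

def substitute_hivevars(query: str, variables: dict) -> str:
    """Replace ${hivevar:name} references with their values (illustrative only)."""
    pattern = re.compile(r"\$\{hivevar:([A-Za-z_][A-Za-z0-9_]*)\}")

    def replace(match):
        name = match.group(1)
        # Leave unknown variables untouched rather than guessing a value.
        return variables.get(name, match.group(0))

    return pattern.sub(replace, query)

query = "SELECT * FROM ${hivevar:tbl} WHERE year = ${hivevar:yr}"
print(substitute_hivevars(query, {"tbl": "sales", "yr": "2020"}))
```

Because the replacement is purely textual, the variable value ends up in the query exactly as written, so string values still need their own quotes in the query.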

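2. What is an Object Inspector?

In Hive, an ObjectInspector describes how to read the internal structure of a row object and of its individual columns. It is part of the SerDe layer: the SerDe deserializes a record into some in-memory object, and the ObjectInspector gives Hive a uniform way to access the fields of that object regardless of how the data is actually represented in memory.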

3. Please explain consolidation in Hive.

Consolidation is not a specific Hive feature. It is a technique for merging many small files into fewer, larger files. It is rarely documented, yet it matters a great deal when batch applications read the data, because every small file adds NameNode metadata and typically gets its own map task.
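
As a rough illustration of the idea, the sketch below plans how a run of small files could be grouped into larger merged files. It is a simplified greedy grouping, not what Hive or any particular tool does; the function name plan_merges and the 128 MB target (a common HDFS block size) are assumptions made for the example.

```python
def plan_merges(file_sizes_mb, target_mb=128):
    """Group small files into batches whose combined size stays within target_mb."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

small_files = [10, 20, 100, 5, 64, 64, 30]
merged = plan_merges(small_files)
print(len(small_files), "files ->", len(merged), "files:", merged)
```

Each inner list would become one output file, so a batch job reading the data opens four files instead of seven.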

4. What are the differences between MapReduce and YARN?

MapReduce is a programming model for parallel processing, whereas YARN is the cluster resource-management architecture. Hadoop 2 uses YARN for resource management and job scheduling, and MapReduce runs on top of it as one of the supported processing frameworks.

5. Can you differentiate between Spark and MapReduce?

The primary difference between Spark and MapReduce is that Spark processes data in memory and retains it there for subsequent steps, whereas MapReduce writes intermediate data to disk between steps. As a result, Spark can be up to 100x faster than MapReduce, particularly for smaller workloads that fit in memory.

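6. Explain RDDs and DataFrames in Spark.

An RDD (Resilient Distributed Dataset) is a distributed collection of data elements spread across many machines in the cluster. RDDs hold a set of Java or Scala objects representing the data. A DataFrame is also a distributed collection, but its data is organized into named columns with a schema, much like a table in a relational database, which lets Spark optimize queries over it.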

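7. Can you write the syntax for Sqoop import?

The general form is: $ sqoop import --connect <jdbc-url> --username <user> --password <password> --table <table> --target-dir <hdfs-dir> -m <num-mappers>. After importing the emp table into the HDFS /emp directory, the following command verifies the imported data: $ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-* It shows the emp table data with comma-separated (,) fields.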

8. What do you know about Hive views?

A view allows a query to be saved and treated like a table. It is a logical construct, as it does not store data the way a table does. Logically, you can imagine that Hive executes the view and then uses the results in the rest of the query.

9. What is the difference between a Hive external table and a Hive managed table?

Managed tables are Hive-owned tables where the entire lifecycle of the table's data is managed and controlled by Hive. External tables are tables where Hive is only loosely coupled with the data. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If an external table is dropped, only the metadata is removed and the underlying files stay in place.

10. What are the differences between HBase and Hive?

Hive is a SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. Just as Google is used for search and Facebook for social networking, Hive is used for analytical queries, while HBase is used for real-time querying.

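11. What are ORDER BY, SORT BY, and CLUSTER BY?

ORDER BY produces a total ordering of the result, which comes with some limitations. In strict mode, the ORDER BY clause has to be followed by a LIMIT clause. The LIMIT clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that, in order to impose a total order on all results, a single reducer has to sort the final output. If the number of rows in the output is too large, that single reducer could take a very long time to finish. SORT BY sorts the rows within each reducer only, so the overall output is not guaranteed to be totally ordered. CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same columns: rows with the same key go to the same reducer and are sorted there.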
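
To contrast the single-reducer total order described above with the other two clauses, here is a small Python simulation. SORT BY orders rows only within each reducer, and CLUSTER BY additionally routes equal keys to the same reducer before sorting. The function names, the round-robin assignment, and the modulo routing are simplifications chosen for this sketch, not Hive internals.

```python
def order_by(rows):
    """ORDER BY: one reducer sees every row, so the output is totally ordered."""
    return sorted(rows)

def sort_by(rows, num_reducers):
    """SORT BY: each reducer sorts only the rows it happened to receive."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Row-to-reducer assignment is arbitrary here; round-robin is an assumption.
        reducers[i % num_reducers].append(row)
    return [sorted(r) for r in reducers]

def cluster_by(rows, num_reducers):
    """CLUSTER BY: equal keys land on the same reducer, then each reducer sorts."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Modulo stands in for the hash-based distribution (integer keys assumed).
        reducers[row % num_reducers].append(row)
    return [sorted(r) for r in reducers]

rows = [5, 3, 8, 3, 1, 8, 2]
print(order_by(rows))
print(sort_by(rows, 2))
print(cluster_by(rows, 2))
```

With two reducers, sort_by leaves the two 8s on different reducers, while cluster_by keeps every duplicate key together; neither gives a single globally sorted list the way order_by does.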

12. What is Speculative execution?

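In Hadoop, speculative execution is a technique for dealing with slow-running tasks. When a map or reduce task runs noticeably slower than the other tasks of the same job, for example because of a degraded node, the framework launches a duplicate copy of that task on another node, “speculating” that the copy may finish sooner. The output of whichever attempt finishes first is used, and the remaining attempt is killed. The name is borrowed from CPU design, where processors execute certain instructions ahead of time on the assumption that they will be needed.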

13. Which ALTER commands in Hive have you worked with?

Rename an existing Hive table.
Rename a column of an existing Hive table.
Add new columns to a table.
Add or drop a table partition.
Archive a table partition as a Hadoop archive.

14. What is lazy evaluation in Pig?

In Pig, each processing step results in a new relation, except for the DUMP and STORE operators. No value or operation is evaluated until the value or the transformed data is actually required. The processing of the data starts only when we submit a DUMP or STORE command.
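
The same deferred behaviour can be imitated with Python generators. This is an analogy rather than Pig code: the functions load, filter_rows, and foreach are hypothetical stand-ins for Pig's LOAD, FILTER, and FOREACH, and converting the final generator to a list plays the role of DUMP.

```python
log = []

def load(rows):
    for row in rows:
        log.append(f"load {row}")  # records when a row is really read
        yield row

def filter_rows(rows, predicate):
    for row in rows:
        if predicate(row):
            yield row

def foreach(rows, fn):
    for row in rows:
        yield fn(row)

# Building the pipeline only defines the relations; nothing is read yet.
a = load([1, 2, 3, 4])
b = filter_rows(a, lambda x: x % 2 == 0)
c = foreach(b, lambda x: x * 10)
steps_before_dump = len(log)

# "DUMP": consuming the final relation triggers the whole chain.
result = list(c)
print(steps_before_dump, result, len(log))
```

The log stays empty until the final relation is consumed, which mirrors how Pig builds a plan first and runs it only on DUMP or STORE.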

15. What are dynamic and static partitions in Hive?

With static partitioning, we need to specify the partition column value in each and every LOAD statement. Dynamic partitioning lets us omit the partition column value; Hive derives it from the data being inserted. A common pattern is to create a non-partitioned table t2, load the data into it, and then insert from t2 into the partitioned table.
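
The difference comes down to who supplies the partition value. The Python sketch below models a partitioned table as a dictionary of lists; load_static and insert_dynamic are hypothetical helpers written for this illustration and do not correspond to Hive APIs.

```python
from collections import defaultdict

def load_static(table, rows, partition_value):
    """Static: the caller names the target partition for the whole load."""
    table[partition_value].extend(rows)

def insert_dynamic(table, rows, partition_column):
    """Dynamic: the target partition is read from each row's own column value."""
    for row in rows:
        table[row[partition_column]].append(row)

rows = [
    {"name": "Asha", "country": "IN"},
    {"name": "Bob", "country": "US"},
    {"name": "Chen", "country": "US"},
]

static_table = defaultdict(list)
# One call per partition, each naming its value explicitly.
load_static(static_table, [r for r in rows if r["country"] == "IN"], "IN")
load_static(static_table, [r for r in rows if r["country"] == "US"], "US")

dynamic_table = defaultdict(list)
# One call for everything; partitions are derived from the data.
insert_dynamic(dynamic_table, rows, "country")

print(sorted(dynamic_table))
print(dict(static_table) == dict(dynamic_table))
```

Both approaches end with the same layout; the static version simply needs one explicit load per partition value.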

16. What is the use of partitioning and bucketing in Hive?

Both partitioning and bucketing in Hive improve performance by avoiding full table scans when dealing with large data sets on a Hadoop file system. The major difference between them lies in how they split the data: partitioning creates a separate directory for each distinct value of the partition column, while bucketing hashes a column's values into a fixed number of files.
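
A minimal sketch of the two ways of splitting data: partitioning places rows under a directory named after a column value (Hive uses the column=value naming shown here), while bucketing maps a key into a fixed number of buckets. The table name emp, the sample rows, and the plain modulo used as the hash for integer keys are assumptions for the example.

```python
from collections import defaultdict

def partition_path(table, column, value):
    """Partitioning: one subdirectory per distinct column value."""
    return f"{table}/{column}={value}"

def bucket_index(key: int, num_buckets: int) -> int:
    """Bucketing: map the key into a fixed number of buckets (simplified hash)."""
    return key % num_buckets

rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
    {"id": 6, "country": "IN"},
]

layout = defaultdict(list)
for row in rows:
    path = partition_path("emp", "country", row["country"])
    bucket = bucket_index(row["id"], 2)
    layout[(path, bucket)].append(row["id"])

for key in sorted(layout):
    print(key, layout[key])
```

The number of partitions grows with the number of distinct values, whereas the number of buckets is fixed up front, which is why bucketing suits high-cardinality columns.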

17. Explain the flow of a MapReduce program.

The developer overrides the Mapper according to the business logic, and the Mapper runs in parallel on the machines in the cluster. The intermediate output generated by the Mapper is stored on the local disk, then shuffled and sorted by key and sent to the Reducer, which aggregates the values for each key and writes the final output.
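
The map, shuffle, and reduce stages can be walked through with a word count in plain Python. This single-process sketch only imitates the data flow; in a real job the mapper and reducer calls run on different machines. The function names and sample lines are illustrative.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reducer(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

lines = ["Hadoop runs MapReduce", "MapReduce runs on Hadoop"]
intermediate = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(counts)
```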

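18. What is the default partitioner in MapReduce and how can we override it?

The default partitioner in Hadoop is HashPartitioner. It assigns each key emitted through context.write(key, value) to one of the job's reduce tasks, computed as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. All values with the same key therefore go to the same reducer instance and are handled in a single call to the reduce function. To override it, extend the Partitioner class, implement getPartition(), and register the class with job.setPartitionerClass().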
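
Hadoop's HashPartitioner picks a reducer as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The Python sketch below reproduces that arithmetic under the assumption that the key hashes like a Java String with ASCII characters (Hadoop's Text type computes its hash differently). The first_letter_partition function is a hypothetical custom rule, shown only to illustrate what an overriding partitioner decides.

```python
def java_string_hash(s: str) -> int:
    """Java's String.hashCode() for ASCII input, kept as an unsigned 32-bit value."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # emulate 32-bit overflow
    return h

def hash_partition(key: str, num_reduce_tasks: int) -> int:
    """Mirror (hashCode & Integer.MAX_VALUE) % numReduceTasks."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

def first_letter_partition(key: str, num_reduce_tasks: int) -> int:
    """Hypothetical custom rule: route keys by their first character."""
    return ord(key[0].lower()) % num_reduce_tasks

for key in ["hadoop", "hive", "spark"]:
    print(key, hash_partition(key, 4), first_letter_partition(key, 4))
```

Either way, the number of reducers is fixed by the job configuration; the partitioner only decides which of them receives a given key.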

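19. What is the difference between the key class and the value class in MapReduce?

The map phase is responsible for filtering the data, and it also makes it possible to group the data on the basis of a key. Key: the field, text, or object on which the data is grouped and aggregated at the reducer. Because keys are sorted and compared during the shuffle, the key class must implement WritableComparable. Value: the field, text, or object that each individual call of the reduce method handles. The value class only needs to implement Writable.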

20. What level of subqueries does Hive support?

Up to and including Hive 0.12, Hive supports subqueries only in the FROM clause. The subquery has to be given a name because every table in a FROM clause must have a name. Columns in the subquery select list must have unique names. From Hive 0.13 onward, some types of subqueries are also supported in the WHERE clause.

21. What are transformations and actions in Spark?

Transformations create new RDDs from existing ones, but when we want to work with the actual data, an action is performed. Unlike a transformation, an action does not produce a new RDD when it is triggered. Thus, actions are Spark RDD operations that return non-RDD values.
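
The distinction can be modelled with a toy class in plain Python. MiniRDD is a made-up name and has nothing to do with Spark's real implementation; it only shows that map and filter return another MiniRDD without touching the data, while collect and count run the recorded steps and return ordinary Python values.

```python
class MiniRDD:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps

    # Transformations: return a new MiniRDD, no data is processed.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._steps + (("filter", fn),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: execute the recorded steps and return a non-RDD value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

numbers = MiniRDD([1, 2, 3, 4, 5])
squares_of_odds = numbers.filter(lambda x: x % 2 == 1).map(lambda x: x * x)
print(type(squares_of_odds).__name__, squares_of_odds.collect(), squares_of_odds.count())
```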
