Infosys Hadoop Interview Questions

Here is a list of Hadoop interview questions recently asked at Infosys. The questions cover both freshers and experienced professionals, and all of them are answered in our Hadoop training.

1. Why Hadoop? (Compare to RDBMS)

Hadoop is more flexible than a traditional RDBMS in storing, processing, and managing data. Unlike traditional systems, Hadoop allows multiple analytical processes to run on the same data at the same time.

2. What would happen if NameNode failed? How do you bring it up?

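If the NameNode fails, the whole Hadoop cluster stops working. No data is lost, because the blocks remain on the DataNodes, but the NameNode is the only point of contact for all the DataNodes, so all communication stops when it goes down. To bring it up, start a NameNode again from the most recent FsImage and Edits Log (for example, the checkpoint kept by the Secondary NameNode); in a cluster configured for High Availability, a standby NameNode takes over instead.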

3. What details are in the “fsimage” file?

FsImage is a file stored on the NameNode’s local OS filesystem that contains the complete directory structure (namespace) of HDFS, including file metadata and the list of blocks that make up each file. It does not record which DataNode holds each block; the NameNode rebuilds that mapping from the block reports that the DataNodes send. The NameNode reads this file when it starts.

4. What is Secondary NameNode?

The NameNode keeps the HDFS metadata in two files:
FsImage file: This file is a snapshot of the HDFS metadata at a certain point in time.
Edits Log file: This file records the changes made to the HDFS namespace since that snapshot.
The main function of the Secondary NameNode is to periodically merge the Edits Log into the FsImage (a checkpoint) and to keep the latest copy of the result. It is a checkpointing helper, not a standby NameNode.
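The idea of a snapshot plus a log of later changes can be sketched as a toy model. This is only an illustration: the real FsImage and Edits Log are binary files, and the operation names ("create", "delete", "rename") and the checkpoint function here are hypothetical.

```python
def checkpoint(fsimage, edits):
    """Return a new namespace snapshot with every logged edit applied in order."""
    merged = dict(fsimage)  # copy, so the old snapshot stays intact
    for op, path, *args in edits:
        if op == "create":
            merged[path] = {"replication": args[0]}
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[args[0]] = merged.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Hypothetical snapshot and edit log, for illustration only.
old_image = {"/logs/a.txt": {"replication": 3}}
edits = [
    ("create", "/logs/b.txt", 2),
    ("rename", "/logs/a.txt", "/archive/a.txt"),
    ("delete", "/logs/b.txt"),
]
new_image = checkpoint(old_image, edits)
```

After the merge, the edit log can be truncated, which is why a NameNode that starts from a recent checkpoint has fewer edits to replay.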

5. Explain the MapReduce processing framework? (start to end)

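Hadoop MapReduce is a software framework for distributed processing of large data sets on computing clusters. It is a sub-project of the Apache Hadoop project. In layman’s terms, MapReduce splits the input data set into a number of parts and runs a program on all the parts in parallel. From start to end: the input is divided into splits, a map task turns each split into intermediate key-value pairs, the framework shuffles and sorts those pairs so that all values for a key reach the same reducer, and each reduce task aggregates the values for its keys and writes the final output.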
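The flow from map to shuffle to reduce can be imitated in a single process. The sketch below is not Hadoop code; the function names are hypothetical and the "splits" are just strings, but the three phases mirror what the framework does across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(splits):
    mapped = [pair for line in splits for pair in map_phase(line)]
    grouped = shuffle(mapped)
    # Keys are processed in sorted order, as reducers see them in Hadoop.
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


counts = word_count(["big data big cluster", "data node"])
```

In a real job each split is mapped on a different node and the shuffle moves data over the network; here everything runs in one list comprehension.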

6. What is Combiner? Where does it fit and give an example? Preferably from your project.

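A combiner is an optional step that runs on the output of each map task, before the shuffle, and pre-aggregates it locally so that less data travels over the network to the reducers. A classic example of a combiner in MapReduce is the Word Count program, where the map task tokenizes each line of the input file and emits a (word, 1) pair for each word in the line. The reduce() method simply sums the integer counts associated with each word. Because summing is associative and commutative, the same logic can run as the combiner.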
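A combiner pre-aggregates each map task's output locally before it is sent to the reducers. The sketch below, with hypothetical function names and made-up input, shows that combining shrinks the number of records shuffled without changing the final counts.

```python
from collections import Counter


def map_task(lines):
    """Emit (word, 1) for every word in one input split."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts locally, per map task."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_all(map_outputs):
    """Reducer: sum the counts for each word across all map outputs."""
    totals = Counter()
    for output in map_outputs:
        for word, count in output:
            totals[word] += count
    return dict(totals)


splits = [["to be or not to be"], ["to do or not"]]
raw = [map_task(split) for split in splits]
combined = [combine(pairs) for pairs in raw]

records_without_combiner = sum(len(pairs) for pairs in raw)
records_with_combiner = sum(len(pairs) for pairs in combined)
```

Both reduce_all(raw) and reduce_all(combined) give the same totals; only the number of intermediate records differs.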

7. What is Partitioner? Why do you need it and give an example? Preferably from your project.

The Partitioner in MapReduce controls how the keys of the intermediate mapper output are partitioned. A hash function applied to the key, or to a subset of the key, derives the partition. The total number of partitions equals the number of reduce tasks, so the partitioner decides which reducer receives each key.
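Hash partitioning can be sketched in a few lines. This is an illustration only: zlib.crc32 is a stand-in for the key's hash code (an assumption made so the result is stable between runs), and the partition function name is hypothetical. The masking step keeps the hash non-negative before the modulo, in the same spirit as Hadoop's default hash partitioner.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key: hash it, then take it modulo the reducer count."""
    key_hash = zlib.crc32(key.encode("utf-8"))  # stand-in for the key's hashCode
    return (key_hash & 0x7FFFFFFF) % num_reducers


num_reducers = 3
keys = ["apple", "banana", "cherry", "apple"]
assignments = [(key, partition(key, num_reducers)) for key in keys]
```

The same key always lands on the same reducer, which is what lets one reducer see every value for that key.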

8. Oozie – What are the nodes?

Control nodes define the job chronology, setting the rules for beginning and ending a workflow (start, end and kill nodes). Oozie also controls the workflow execution path with decision, fork and join nodes. Action nodes trigger the execution of tasks. Oozie triggers workflow actions, but Hadoop MapReduce executes them.

9. What are the actions in Action Node?

Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for different types of actions: Hadoop map-reduce, Hadoop file system, Pig, SSH, HTTP, eMail and Oozie sub-workflow.

10. Explain your Pig project?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

11. What log file loaders did you use in Pig?

Piggybank, a repository of user-submitted UDFs, contains a custom loader function, CommonLogLoader, that loads Apache’s Common Log Format files into Pig.
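To show what such a loader has to extract, here is a rough sketch that parses one Common Log Format line into fields. It is not the Piggybank loader's code; the pattern, the field names and the parse_clf function are assumptions made for illustration.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_clf(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no bytes were returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_clf(sample)
```

A loader does this once per line, so the rest of the script can work with named fields instead of raw text.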

12. Hive Joining? What did you join?

The Hive JOIN clause combines specific fields from two or more tables by using values common to each one. It is more or less similar to the SQL JOIN.
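What an inner join on a common column produces can be sketched outside Hive. The tables, column names and inner_join function below are hypothetical; the sketch builds an index on one table and probes it with the other, which is one common way to execute an equality join.

```python
def inner_join(left, right, key):
    """Combine rows from two tables wherever the value of `key` matches."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Rows from `left` with no match in `right` are dropped (inner join).
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


customers = [
    {"cust_id": 1, "name": "Asha"},
    {"cust_id": 2, "name": "Ravi"},
]
orders = [
    {"cust_id": 1, "amount": 250},
    {"cust_id": 1, "amount": 100},
    {"cust_id": 3, "amount": 75},
]
joined = inner_join(customers, orders, "cust_id")
```

Customer 2 has no orders and order 3 has no customer, so neither appears in the result.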

13. Explain Partitioning & Bucketing (based on your project)?

Partitioning divides a table into parts based on the values of the partition keys. The partition keys determine how the data is stored in the table. Bucketing is a technique where tables or partitions are further subdivided into buckets for better data organization and more efficient querying.
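The difference between the two can be sketched as a placement rule: the partition comes from a column's value, the bucket from a hash of another column modulo the bucket count. The column names, the bucket count and the place function are hypothetical, and using the integer itself as its hash is a simplification.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def place(row):
    """Return (partition, bucket) for a row."""
    partition = f"country={row['country']}"   # one partition per key value
    bucket = row["user_id"] % NUM_BUCKETS     # simplified hash: the integer itself
    return partition, bucket


rows = [
    {"user_id": 1, "country": "IN"},
    {"user_id": 5, "country": "IN"},
    {"user_id": 2, "country": "US"},
]

layout = defaultdict(list)
for row in rows:
    layout[place(row)].append(row["user_id"])
```

A query that filters on country only reads one partition, and a join on user_id only needs to compare buckets with the same number.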

14. Why do we need bucketing?

Bucketing in Hive is useful when dealing with large datasets that need to be segregated into clusters for more efficient management and for join queries against other large datasets. The primary use case is joining two large datasets under resource constraints such as memory limits.

15. Did you write any Hive UDFs?

User Defined Functions, also known as UDFs, allow you to create custom functions to process records or groups of records. Hive comes with a comprehensive library of functions. There are, however, some omissions and some specific cases for which UDFs are the solution.

16. Filter – What did you filter out?

Filters temporarily hide some of the data in a table, so you can focus on the data you want to see. In Pig this is done with the FILTER operator, and in Hive with a WHERE clause.

17. HBase?

HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.

18. Flume?

Apache Flume is an open-source, powerful, reliable and flexible system used to collect, aggregate and move large amounts of unstructured data from multiple data sources into stores such as HDFS or HBase in a distributed fashion, via its strong coupling with the Hadoop cluster.

19. Sqoop?

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and external datastores such as relational databases and enterprise data warehouses. Sqoop is used to import data from external datastores into the Hadoop Distributed File System or related Hadoop ecosystem components such as Hive and HBase, and to export data from Hadoop back to those datastores.

20. ZooKeeper?

Apache ZooKeeper provides operational services for a Hadoop cluster. ZooKeeper provides a distributed configuration service, a synchronization service and a naming registry for distributed systems. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
