Barclays Hadoop Interview Questions

Here is a list of Hadoop interview questions recently asked at Barclays. The questions are suitable for both freshers and experienced professionals. Our Hadoop training covers the answers to all of the questions below.

1. How will you initiate the installation process if you have to set up a Hadoop cluster for the first time?

The simplest way is to build the cluster virtually, using VMware or VirtualBox. You need at least 8 GB of RAM and sufficient hard disk space. Create three virtual machines, then make one of them the NameNode and the other two DataNodes by changing their configuration and granting the required privileges. For a multi-node cluster the nodes must be able to reach each other, so add the IP addresses of the DataNodes to the /etc/hosts file on the NameNode machine. Once the connection is established, you can explore how a cluster works.
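As a minimal sketch of the /etc/hosts step, the following Python snippet renders the entries for one NameNode and two DataNodes. The IP addresses and hostnames are hypothetical placeholders, not values from any real setup; replace them with those of your own virtual machines.

```python
# Hypothetical addresses and hostnames; substitute the values of your own VMs.
NODES = {
    "192.168.56.101": "namenode",
    "192.168.56.102": "datanode1",
    "192.168.56.103": "datanode2",
}


def render_hosts_entries(nodes):
    """Return /etc/hosts lines in the form '<ip-address><TAB><hostname>'."""
    return "\n".join(f"{ip}\t{name}" for ip, name in nodes.items())


# Print the lines so they can be reviewed before being added to /etc/hosts.
print(render_hosts_entries(NODES))
```

The output is only printed, so nothing is written to the system file until you copy the lines in yourself.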

2. How will you install a new component or add a service to an existing Hadoop cluster?

Step 1 – Add the service to the cluster through an API call.
Step 2 – Add Components to the service.
Step 3 – Create configuration.
Step 4 – Apply configuration to the cluster.
Step 5 – Create host components.
Step 6 – Install & Start the service.

3. If Hive Metastore service is down, then what will be its impact on the Hadoop cluster?

The Hive Metastore is one of the main services of Hive. If it is down, all Hive queries and jobs will fail because they cannot retrieve the metadata of the underlying tables. Other services that use the Hive Metastore, such as Impala, will not work either. Services that do not depend on Hive, such as HDFS and YARN, are not affected.

4. How will you decide the cluster size when setting up a Hadoop cluster?

For a small cluster of 5 to 50 nodes, 64 GB of RAM should be enough. For medium-to-large clusters of 50 to 1,000 nodes, 128 GB of RAM is recommended. Alternatively, use this formula: Memory amount = HDFS cluster management memory + NameNode memory + OS memory.
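The sizing guidance above can be sketched in Python as follows. The function names are illustrative, the input values in the example call are placeholders rather than recommendations, and treating exactly 50 nodes as a small cluster is an assumption, since the two ranges overlap at that value.

```python
def recommended_ram_gb(node_count):
    """Map a cluster size to the RAM figure quoted in the answer above."""
    if 5 <= node_count <= 50:
        return 64       # small cluster
    if 50 < node_count <= 1000:
        return 128      # medium-to-large cluster
    raise ValueError("the guidance only covers clusters of 5 to 1,000 nodes")


def memory_amount_gb(hdfs_management_gb, namenode_gb, os_gb):
    """Memory amount = HDFS cluster management memory + NameNode memory + OS memory."""
    return hdfs_management_gb + namenode_gb + os_gb


print(recommended_ram_gb(20))        # 64
print(memory_amount_gb(2, 4, 4))     # 10 (placeholder inputs)
```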

5. How can you run Hadoop and real-time processes on the same cluster?

Hadoop was never built for real-time processing. It started with MapReduce, which offers batch processing where queries take hours, minutes or at best seconds. This is, and will remain, a good fit for complex transformations and computations over large data volumes.


6. If you get a connection refused exception when logging on to a machine of the cluster, what could be the reason? How will you solve this issue?

I was trying to get Hadoop working in standalone mode on my MacBook Pro (OS X 10.9.5), but I kept getting “connection refused” errors. As a diagnostic I was told to try telnet localhost 9000, which gives the same error. The same thing happens if I use 127.0.0.1 instead of localhost, or ftp instead of telnet. However, ping localhost and ssh localhost work fine. Since the host itself is reachable, the error means that nothing is accepting connections on the requested port, so check that the Hadoop daemon is running and which address and port it is bound to.
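The telnet diagnostic can also be reproduced with a few lines of Python, which helps on machines where telnet is not installed. This is a minimal sketch: the helper name port_is_open is illustrative, and port 9000 is simply the port from the example above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 9000):
        print("A process is accepting connections on localhost:9000")
    else:
        print("Nothing accepted the connection on localhost:9000; "
              "check that the daemon is running and which address it binds to")
```

A False result combined with a working ping points at the service rather than the network.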

7. How can you identify and troubleshoot a long running job?

At a very high level, follow these steps:

Understand the symptom.
Analyze the situation.
Identify the problem areas.
Propose a solution.

8. How can you decide the heap memory limit for a NameNode and Hadoop Service?

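NameNode heap memory can be estimated from the number of blocks the cluster can hold. For example:

Block size = 128 MB, replication = 1.
Cluster capacity in MB: 200 hosts * 24,000,000 MB each = 4,800,000,000 MB (4,800 TB).
Disk space needed per block: 128 MB per block * 1 = 128 MB of storage per block.
Cluster capacity in blocks: 4,800,000,000 MB / 128 MB = 37,500,000 blocks.
At roughly 1 GB of heap per million blocks, this cluster needs about 37.5 GB of NameNode heap at full capacity.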
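The same estimate can be worked through in Python. This is a sketch under stated assumptions: the function name is illustrative, the 200 is read as a host count with 24,000,000 MB per host, and the heap figure uses the rule of thumb of 1 GB of NameNode memory per million blocks quoted elsewhere in this list.

```python
def estimate_namenode_heap(hosts, mb_per_host, block_size_mb=128,
                           replication=1, gb_per_million_blocks=1.0):
    """Return (block_count, heap_gb) for a cluster filled to capacity."""
    capacity_mb = hosts * mb_per_host
    # Each block occupies block_size * replication on disk.
    storage_per_block_mb = block_size_mb * replication
    blocks = capacity_mb / storage_per_block_mb
    heap_gb = blocks / 1_000_000 * gb_per_million_blocks
    return blocks, heap_gb


blocks, heap_gb = estimate_namenode_heap(hosts=200, mb_per_host=24_000_000)
print(f"{blocks:,.0f} blocks -> about {heap_gb:.1f} GB heap")
# prints: 37,500,000 blocks -> about 37.5 GB heap
```

Raising the replication factor lowers the number of distinct blocks the same disks can hold, and with it the heap estimate.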

9. If the Hadoop services are running slow in a Hadoop cluster, what would be the root cause for it and how will you identify it?

First, check which component is responding slowly (DataNode, NameNode, etc.).
Then get the PID of that process and collect 5-6 thread dumps while the slowness is occurring.
Also check the memory utilization and GC pauses of the affected HDFS components by looking at their GC logs. In addition, the NameNode and DataNode logs contain JvmPauseMonitor messages when a garbage collection pause is too long.
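Scanning daemon logs for long pauses can be automated. The sketch below assumes that pause warnings contain the phrase "pause of approximately <N>ms", which is how Hadoop's JVM pause warnings are commonly worded; verify this against your own logs before relying on it. The function name, the 1,000 ms threshold and the sample lines are hypothetical.

```python
import re

# Assumed wording of a JVM pause warning; adjust it to match your log files.
PAUSE_RE = re.compile(r"pause of approximately (\d+)ms")


def long_pauses(lines, threshold_ms=1000):
    """Return (pause_ms, line) pairs for pauses at or above the threshold."""
    found = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            found.append((int(match.group(1)), line.rstrip()))
    return found


# Hypothetical sample lines, written only to exercise the parser.
sample = [
    "WARN Detected pause in JVM or host machine (eg GC): pause of approximately 4200ms",
    "INFO Block report processed",
    "WARN Detected pause in JVM or host machine (eg GC): pause of approximately 300ms",
]
for pause_ms, line in long_pauses(sample):
    print(pause_ms, line)
```

In practice the lines would come from an open log file, for example by passing the file object directly to long_pauses.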

All daemons are running on one node. Both Master & Slave nodes are on the same machine.

10. How many DataNodes can be run on a single Hadoop cluster?

A good rule of thumb is to assume 1 GB of NameNode memory for every 1 million blocks stored in the distributed file system. With 100 DataNodes in a cluster, 64 GB of RAM on the NameNode provides plenty of room to grow the cluster.
