Amazon – Hadoop Interview Questions
Here is a list of Hadoop interview questions recently asked at Amazon. These questions are suitable for both freshers and experienced professionals.
1. What are the differences between Hadoop and Spark?
The key difference between Hadoop MapReduce and Spark lies in the approach to processing: Spark can process data in-memory, while Hadoop MapReduce has to read from and write to disk. As a result, processing speed differs significantly; Spark may be up to 100 times faster.
2. What are the daemons required to run a Hadoop cluster?
Apache Hadoop consists of the following daemons: NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager. The NameNode, Secondary NameNode, and ResourceManager run on a master system, while the NodeManager and DataNode run on the slave machines.
3. How will you restart a NameNode?
The NameNode can be restarted by the following methods:
- You can stop the NameNode individually using the /sbin/hadoop-daemon.sh stop namenode command, then start it again using /sbin/hadoop-daemon.sh start namenode.
- Use /sbin/stop-all.sh, which stops all the daemons first, and then use /sbin/start-all.sh to start them again.
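The two methods above can be sketched as shell commands (a sketch only, to be run on the NameNode host of a live cluster; the assumption here is a typical installation with HADOOP_HOME pointing at the Hadoop directory, and script locations vary between Hadoop versions):

```shell
# Method 1: restart only the NameNode daemon
$HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode

# Method 2: stop every daemon in the cluster, then start them all again
$HADOOP_HOME/sbin/stop-all.sh
$HADOOP_HOME/sbin/start-all.sh
```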
4. Explain the different schedulers available in Hadoop.
There are three types of schedulers in Hadoop:
- First In First Out Scheduler.
- Capacity Scheduler.
- Fair Scheduler.
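Which of these schedulers YARN uses is chosen via the yarn.resourcemanager.scheduler.class property. As a config sketch, selecting the Fair Scheduler would look like this (the Capacity Scheduler is the default in recent Hadoop releases):

```xml
<!-- yarn-site.xml: select the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```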
5. List few Hadoop shell commands that are used to perform a copy operation.
- ls: Lists all the files in a directory.
- mkdir: Creates a directory.
- touchz: Creates an empty file.
- copyFromLocal (or) put: Copies files/folders from the local file system to HDFS.
- cat: Prints file contents.
- copyToLocal (or) get: Copies files/folders from HDFS to the local file system.
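A short round-trip using the copy commands above might look like this (a sketch that assumes a running HDFS; the /user/demo path and file names are illustrative only):

```shell
# Copy a local file into HDFS (put and copyFromLocal are equivalent here)
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put localfile.txt /user/demo/

# Print the file's contents from HDFS
hdfs dfs -cat /user/demo/localfile.txt

# Copy it back to the local file system (get and copyToLocal are equivalent)
hdfs dfs -get /user/demo/localfile.txt copy-of-localfile.txt
```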
6. What is jps command used for?
The jps command is used to check whether a specific daemon is up or not. It displays all the Java-based processes for a particular user. The command should be run as root to check all the operating nodes on the host.
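In practice this is a single command; the output below is illustrative only (process IDs and the daemons listed depend on what is running on that host):

```shell
# jps ships with the JDK; run it as the user that started the daemons
jps
# Illustrative output on a master node:
#   2345 NameNode
#   2456 SecondaryNameNode
#   2567 ResourceManager
#   2678 Jps
```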
7. What are the important hardware considerations when deploying Hadoop in production environment?
A Hadoop platform should be designed by moving the computing activities to the data, thus achieving scalability and high performance. Capacity: large form factor disks cost less and allow for more storage. Network: two top-of-rack (TOR) switches per rack is ideal, as it provides redundancy.
8. How many NameNodes can you run on a single Hadoop cluster?
Hadoop 2.x (from 2.2 onward) supports two NameNodes in a high-availability configuration: an Active NameNode and a Passive (Standby) NameNode.
9. What happens when the NameNode on the Hadoop cluster goes down?
When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates the checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
10. What is the conf/hadoop-env.sh file and which variable in the file should be set for Hadoop to work?
This file specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop). Since the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables set in hadoop-env.sh is JAVA_HOME.
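A minimal hadoop-env.sh entry might look like this (the JDK path is an example only; adjust it to where Java is installed on your system):

```shell
# etc/hadoop/hadoop-env.sh -- point Hadoop at the JDK to use
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```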
11. Apart from using the jps command, is there any other way to check whether the NameNode is working or not?
To check whether the Hadoop daemons are running, you can simply run the jps command in the shell (make sure the JDK is installed on your system). You can also check whether the daemons are running through their web UIs.
12. Which command is used to verify if the HDFS is corrupt or not?
The hdfs fsck command is used to check the health of the file system and to find missing files, over-replicated, under-replicated, and corrupted blocks.
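Typical invocations look like this (a sketch that assumes a running HDFS cluster):

```shell
# Check the health of the whole file system
hdfs fsck /

# Also list the files being checked, with their blocks and locations
hdfs fsck / -files -blocks -locations

# Show only files with missing or corrupt blocks
hdfs fsck / -list-corruptfileblocks
```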
13. List some use cases of the Hadoop Ecosystem.
Common use cases include log and clickstream analysis, ETL and data-warehouse offload, recommendation engines, fraud detection, and large-scale text or sensor-data processing.
14. Which is the best operating system to run Hadoop?
Linux is the only supported production platform, but other flavors of Unix can be used to run Hadoop for development. Windows is supported only as a development platform and additionally requires Cygwin to run Hadoop.
15. What are the network requirements to run Hadoop?
- HDFS: The NameNode service typically runs on port 8020 (the fs.defaultFS property); the DataNode service runs on port 50010, or 1004 in Kerberos environments (the dfs.datanode.address property).
- WebHDFS: The WebHDFS service typically runs on port 50070 on each NameNode.
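As a config sketch, the two properties named above map to the ports like this (the hostname namenode.example.com is a placeholder for your NameNode host):

```xml
<!-- core-site.xml: NameNode RPC endpoint (port 8020) -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>

<!-- hdfs-site.xml: DataNode data-transfer port (50010; 1004 under Kerberos) -->
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
</property>
```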
16. What is the best practice to deploy a secondary NameNode?
It is always better to deploy the secondary NameNode on a separate standalone machine. When the secondary NameNode runs on a separate machine, it does not interfere with the operations of the primary NameNode.
17. How often should the NameNode be reformatted?
It is required only once, when you set up the cluster. If you format the NameNode again, you will lose all your data.
18. How can you add and remove nodes from the Hadoop cluster?
To add and remove nodes from the Hadoop cluster:
- Shut down the NameNode.
- Set the dfs.hosts.exclude property to point to an exclude file.
- Restart NameNode.
- In the dfs exclude file, specify the nodes using the full hostname or IP or IP:port format.
- Do the same in mapred.exclude.
- Execute bin/hadoop dfsadmin -refreshNodes.
- Execute bin/hadoop mradmin -refreshNodes.
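Once the exclude files list the hosts, the refresh commands tell the masters to re-read them. A sketch (these are the Hadoop 1.x commands named in the steps above; on YARN clusters the equivalent of mradmin is yarn rmadmin -refreshNodes):

```shell
# Re-read the HDFS exclude file and begin decommissioning listed DataNodes
bin/hadoop dfsadmin -refreshNodes

# Re-read the MapReduce exclude file for the JobTracker
bin/hadoop mradmin -refreshNodes
```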
19. Explain about the different configuration files and where are they located.
The main Hadoop configuration files are hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and the masters and slaves (workers) files. They are located in the etc/hadoop directory of the Hadoop installation (conf/ in older versions).
20. What is the role of the namenode?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. When the NameNode goes down, the file system goes offline.
TOP MNC's HADOOP INTERVIEW QUESTIONS & ANSWERS
Here we have listed Hadoop interview questions and answers that are asked in top MNCs. We periodically update this page with recently asked questions, so please visit often to stay up to date on Hadoop.