Adobe Hadoop Interview Questions

Here is a list of Hadoop interview questions that were recently asked at Adobe. The questions are suitable for both freshers and experienced professionals. Our Hadoop training covers all of the questions below.

1. What are fact tables and dimension tables? (Asked after I said that I was familiar with data warehouse concepts.)

A fact table works together with dimension tables. The fact table holds the data to be analyzed, and a dimension table stores data about the ways in which the data in the fact table can be analyzed. Thus, the fact table consists of two types of columns: foreign keys that reference the dimension tables, and measures that hold the numeric values being analyzed.

2. What type of data should we store in a fact table and in a dimension table?

A fact table is defined by its grain, that is, its most atomic level of detail, and it stores the detailed, measurable data. A dimension table should be wordy, descriptive, complete, and quality assured; it stores the descriptive attributes that appear as report labels.
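
As a rough illustration of the split between the two kinds of table, here is a minimal HiveQL sketch. The table and column names (dim_product, fact_sales, and so on) are hypothetical and chosen only for this example.

```sql
-- Hypothetical dimension table: descriptive attributes used as report labels.
CREATE TABLE dim_product (
  product_id   INT,
  product_name STRING,
  category     STRING
);

-- Hypothetical fact table: a foreign key to the dimension plus numeric measures.
CREATE TABLE fact_sales (
  product_id INT,            -- references dim_product.product_id
  quantity   INT,            -- measure
  amount     DECIMAL(10,2)   -- measure
);

-- A typical analysis: measures from the fact table, labels from the dimension.
SELECT d.category, SUM(f.amount) AS total_amount
FROM fact_sales f
JOIN dim_product d ON f.product_id = d.product_id
GROUP BY d.category;
```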

3. There is a string in a Hive column. How will you find the count of a character? For example, if the string is “hdfstutorial”, how do you count the number of ‘t’ characters?

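The INSTR function in Apache Hive finds the position of a substring in a string. It returns only the position of the first occurrence, returns NULL if either argument is NULL, and returns 0 if the substring is not found. INSTR on its own therefore does not count occurrences. A common way to count a character is to remove it from the string and compare the length before and after: for “hdfstutorial”, removing ‘t’ shortens the string from 12 to 10 characters, so the count is 2.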
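
A minimal sketch of the difference between locating and counting a character, using a string literal in place of a real column. For a table you would substitute the column name for the literal. Note that regexp_replace treats its second argument as a regular expression, so characters such as '.' or '*' would need escaping.

```sql
SELECT
  -- Position of the first 't' (1-based): 5
  instr('hdfstutorial', 't') AS first_t_position,
  -- Length before minus length after removing every 't': 12 - 10 = 2
  length('hdfstutorial')
    - length(regexp_replace('hdfstutorial', 't', '')) AS t_count;
```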

4. There is a table in Hive, and the columns are student id, score and year. Find the top 3 students based on the score in each year.

Use a HiveQL query with a window function. The query below uses ROW_NUMBER() to number the students within each year in descending order of score, and the outer query keeps only the rows numbered 1 to 3. Note that ROW_NUMBER() breaks ties between equal scores arbitrarily.

SELECT student_id, score, year
FROM (
  SELECT student_id, score, year,
         ROW_NUMBER() OVER (PARTITION BY year ORDER BY score DESC) AS rn
  FROM student_scores
) ranked_students
WHERE rn <= 3;
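
If students with equal scores should share a place, one possible variation is to use DENSE_RANK() instead of ROW_NUMBER(). This sketch assumes the same student_scores table; it may return more than three rows per year when scores are tied. The alias score_rank is just an illustrative name.

```sql
SELECT student_id, score, year
FROM (
  SELECT student_id, score, year,
         -- Tied scores get the same rank, and no rank values are skipped.
         DENSE_RANK() OVER (PARTITION BY year ORDER BY score DESC) AS score_rank
  FROM student_scores
) ranked_students
WHERE score_rank <= 3;
```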

5. There is a table with 500 million records. You want to copy the data of that table into another table. Which approach would you choose?

Copying a table with 500 million records to another table in Hadoop calls for an approach that handles the data volume efficiently. The important steps are:

Use Hive, as it can handle large datasets and perform the data transfer efficiently.
Optimize the transfer with partitioning and bucketing to improve performance and reduce the time taken.
Increase parallelism by adjusting Hive settings and splitting the transfer into smaller chunks.
To copy data at the HDFS level, use the DistCp tool (hadoop distcp), which is designed for large data sets.
After copying the data, verify that the operation succeeded by comparing row counts and checking data integrity.
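
The Hive-level copy and the row-count check can be sketched as follows. The table names source_table and target_table are hypothetical, and the sketch assumes a non-partitioned table; a partitioned table would also need a PARTITION clause on the insert.

```sql
-- Create an empty table with the same schema as the source.
CREATE TABLE target_table LIKE source_table;

-- Copy all rows; Hive runs this as a distributed job.
INSERT OVERWRITE TABLE target_table
SELECT * FROM source_table;

-- Verify the copy: the two counts should match.
SELECT COUNT(*) AS source_rows FROM source_table;
SELECT COUNT(*) AS target_rows FROM target_table;
```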

6. You have 10 tables with certain join conditions, and the result needs to be updated in another table. How will you do it, and what best practices will you follow?

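Ten tables can be combined in many ways: even with the restriction that each table may appear at most once, there are 2^10 - 1 = 1023 non-empty subsets of tables that could take part in a join. The join plan therefore needs to be chosen deliberately. A practical approach is to filter each table as early as possible, join the tables in stages, store the joined result in a staging table, validate it, and only then write it to the target table.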
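
One way to approach a multi-table join whose result updates another table is to build the result in a staging table first and load the target from it. The sketch below shows only three tables for brevity; all table and column names (orders, customers, products, staging_result, final_result) are hypothetical.

```sql
-- Let Hive convert joins against small tables into map-side joins.
SET hive.auto.convert.join=true;

-- Stage 1: filter early and build the joined result in a staging table.
CREATE TABLE staging_result AS
SELECT o.order_id, c.customer_name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01';

-- Stage 2: after validating the staging data, load the target table.
INSERT OVERWRITE TABLE final_result
SELECT order_id, customer_name, product_name
FROM staging_result;
```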

7. Which analytical functions have you used in Hive?

Hive provides the following analytical functions:

RANK
DENSE_RANK
ROW_NUMBER
PERCENT_RANK
CUME_DIST
NTILE
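
To show how several of these functions behave side by side, here is a small sketch that applies them over the same window. It assumes the student_scores table (student_id, score, year) from question 4; the output column names are illustrative.

```sql
SELECT
  student_id,
  score,
  year,
  -- Rank with gaps after ties (1, 1, 3, ...).
  RANK()         OVER (PARTITION BY year ORDER BY score DESC) AS score_rank,
  -- Relative rank between 0 and 1.
  PERCENT_RANK() OVER (PARTITION BY year ORDER BY score DESC) AS pct_rank,
  -- Fraction of rows at or before this row in the window order.
  CUME_DIST()    OVER (PARTITION BY year ORDER BY score DESC) AS cume_dist_value,
  -- Split each year's students into 4 roughly equal groups.
  NTILE(4)       OVER (PARTITION BY year ORDER BY score DESC) AS quartile
FROM student_scores;
```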

8. Why do we use bucketing?

Bucketing is used in Hive to improve the performance of certain operations, such as:

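Join operations: tables bucketed on the join key can use bucket map joins.
Sampling of data: TABLESAMPLE can read a subset of the buckets instead of the whole table.
Even distribution of data across a fixed number of buckets
File management: the number of files per table or partition is known in advance.
Map tasks: each bucket is a separate file that map tasks can process in parallel.
Pruning: with a filter on the bucketing column, Hive can in some configurations skip buckets, much as partition elimination skips partitions.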

9. What actually happens in bucketing, and when do we apply it?

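Bucketing involves the following steps:

Hashing: Hive computes a hash of the bucketing column and takes it modulo the number of buckets.
Data distribution: each row is placed in the bucket given by its computed hash value, which yields roughly equal-sized buckets when the column values are well spread.
File storage: each bucket is a separate file within the table's (or partition's) directory in HDFS.
Query execution: Hive can use the bucket layout to optimize joins, sampling, and aggregations.

Bucketing is typically applied to a column with many distinct values that is frequently used in joins or sampling.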
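
A minimal sketch of creating, loading, and sampling a bucketed table. The table name student_scores_bucketed, the column types, the choice of 8 buckets, and the ORC format are all assumptions made for illustration.

```sql
-- Rows are assigned to one of 8 bucket files by hashing student_id.
CREATE TABLE student_scores_bucketed (
  student_id INT,
  score      INT,
  year       INT
)
CLUSTERED BY (student_id) INTO 8 BUCKETS
STORED AS ORC;

-- Load through Hive so that rows are hashed into the right buckets.
-- (On Hive 1.x, hive.enforce.bucketing=true must be set first;
--  Hive 2.x and later always enforce bucketing.)
INSERT OVERWRITE TABLE student_scores_bucketed
SELECT student_id, score, year
FROM student_scores;

-- Sample roughly one eighth of the data by reading a single bucket.
SELECT *
FROM student_scores_bucketed TABLESAMPLE (BUCKET 1 OUT OF 8 ON student_id);
```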

10. How is bucketing different from partitioning, and why do we use it?

Bucketing decomposes data into more manageable, roughly equal parts. With partitioning, Hive creates one partition for each distinct value of the partition column, so a column with many distinct values can produce a large number of small partitions. With bucketing, you restrict the data to a fixed number of buckets, and that number is defined when the table is created.
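
The two techniques can also be combined. The following sketch, with a hypothetical table name and column types, partitions by year (one directory per distinct year) and buckets by student_id (a fixed number of files inside each partition).

```sql
CREATE TABLE student_scores_by_year (
  student_id INT,
  score      INT
)
PARTITIONED BY (year INT)                  -- one directory per distinct year
CLUSTERED BY (student_id) INTO 4 BUCKETS   -- 4 files inside each partition
STORED AS ORC;

-- Allow Hive to derive the partition value from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE student_scores_by_year PARTITION (year)
SELECT student_id, score, year
FROM student_scores;
```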

11. If you have a bucketed table, can you load records into it directly with Sqoop?

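Not directly. A Sqoop Hive import writes files into the table's location without hashing the rows into buckets, so you would have to import the data into an intermediate (non-bucketed) table and then insert from it into the bucketed table.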
