Saturday 8 August 2015

Hadoop Interview Questions and Answers Part-1



Why the name Hadoop?

Hadoop is not an acronym; it does not expand to anything. The charming yellow elephant you see is named after Doug Cutting's son's toy elephant!

Why do we need Hadoop?

-- Every day a large amount of unstructured data is dumped into our machines.
-- The major challenge is not storing large data sets in our systems but retrieving and analyzing the big data in organizations, especially when the data sits on different machines at different locations.
-- This is where the need for Hadoop arises.
-- Hadoop can analyze data present on different machines at different locations very quickly and in a very cost-effective way.
-- It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel.
-- This is also known as parallel computing.


What is the Hadoop framework?

Hadoop is an open source framework written in Java by the Apache Software Foundation. It is used to write software applications that process vast amounts of data (it can handle multiple terabytes of data). It works in parallel on large clusters, which can have thousands of computers (nodes), and it processes data in a very reliable and fault-tolerant manner.

Give a brief overview of Hadoop history?

In 2002, Doug Cutting created Nutch, an open source web crawler project.

In 2003-2004, Google published the GFS and MapReduce papers.

In 2006, Doug Cutting developed the open source MapReduce and HDFS components that became Hadoop.

In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark.

In 2009, Facebook launched SQL support for Hadoop (Hive).


What are the two main parts of the Hadoop framework?

Hadoop consists of two main parts:

the Hadoop Distributed File System (HDFS), a distributed file system with high throughput, and

Hadoop MapReduce, a software framework for processing large data sets.



 What problems can Hadoop solve?

The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.


How does Hadoop solve big data problems?

First, consider the challenges of big data.

Big data is a mix of structured, semi-structured and unstructured data. We can't store and process such large data sets in a traditional RDBMS, which can't cope with storing billions of rows of data. So, by using Hadoop we can store and process structured, semi-structured and unstructured data alike.

Hadoop is built to run on a cluster of machines; the actual data is stored on different nodes in the cluster with a very high degree of fault tolerance and high availability.

On what concept does the Hadoop framework work?

It works on the MapReduce concept, which was devised by Google.


What are the core components of Hadoop?

Core components of Hadoop are HDFS and MapReduce. HDFS is basically used to store large data sets and MapReduce is used to process such large data sets.


What is the basic difference between traditional RDBMS and Hadoop?

A traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach for storing huge amounts of data in a distributed file system and processing it. An RDBMS is useful when you want to seek out one record from big data, whereas Hadoop is useful when you want the big data in one shot and will perform analysis on it later.


How does Hadoop MapReduce work?

MapReduce is easiest to explain with a word-count example: during the map phase each map task counts the words in its portion of the documents, while in the reduce phase the counts are aggregated per word across the entire collection. During the map phase the input data is divided into splits, which are analyzed by map tasks running in parallel across the Hadoop cluster.
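As a rough illustration of that flow, here is a minimal word-count sketch using the Hadoop Java MapReduce API; the class and field names are invented for the example and are not part of any answer above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExample {

    // Map phase: emit (word, 1) for every word in the input split assigned to this task.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: aggregate the counts for each word across the whole collection.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}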

What is Distributed Cache in Hadoop?

Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
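As a rough sketch, a driver can register a cache file with the Hadoop 2.x Job API (older releases expose the same facility through the DistributedCache class); the path below is only a placeholder.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-demo");
        // Ship a lookup file to every node before any task of this job starts there.
        job.addCacheFile(new URI("/shared/lookup.txt"));
        // Inside a Mapper or Reducer setup() method the cached files can then be read via:
        // URI[] cached = context.getCacheFiles();
    }
}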

How is the splitting of a file invoked in the Hadoop framework?

It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (such as FileInputFormat) configured by the user.

What happens when a datanode fails?

When a DataNode fails:

-- The JobTracker and NameNode detect the failure.

-- All tasks on the failed node are re-scheduled.

-- The NameNode replicates the user's data to another node.

What is the difference between an InputSplit and a Block?

A block is a physical division of data and does not take into account the logical boundaries of records, meaning you could have a record that starts in one block and ends in another block, whereas an InputSplit considers the logical boundaries of records as well.


Explain what are the basic parameters of a Mapper?

The basic parameters of a Mapper are its input and output key/value types, for example:

LongWritable and Text (input key and value)

Text and IntWritable (output key and value)


Explain what is heartbeat in HDFS?

A heartbeat is a signal sent from a DataNode to the NameNode, and from a TaskTracker to the JobTracker. If the NameNode or JobTracker does not receive the heartbeat, it is assumed that there is some issue with the DataNode or TaskTracker.


What is the relationship between Jobs and Tasks in Hadoop?

 One job is broken down into one or many tasks in Hadoop.


What is PIG?

Pig is an Apache open source project that runs on Hadoop and provides an engine for parallel data flow. It includes a language called Pig Latin for expressing these data flows. It offers operations such as join, sort and filter, plus the ability to write User Defined Functions (UDFs) for processing, reading and writing data. Pig uses both HDFS and MapReduce, i.e. storage and processing.



How do you write a 'foreach' statement for the map datatype in Pig scripts?

For a map we use the hash ('#') operator:

bball = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]);

avg = foreach bball generate bat#'batting_average';


What are the relational operators in Pig?

a) foreach

b) order by

c) filter

d) group

e) distinct

f) join

g) limit

What is the purpose of ‘dump’ keyword in pig?

In Pig, the 'dump' keyword is used to display the output of a relation on the screen.


How does speculative execution work in Hadoop?

The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.


What are the scalar datatypes in PIG?

int - 4 bytes,

float - 4 bytes,

double - 8 bytes,

long - 8 bytes,

chararray,

bytearray

What are the complex datatypes in pig?

Map:

A map in Pig is a chararray-to-data-element mapping, where the element can be any Pig data type, including a complex type.

Example of a map: ['city'#'hyd', 'pin'#500086]

In the above example, city and pin are the keys mapping to the values.

Tuple:

A tuple has a fixed length and can hold fields of any data type. A tuple contains multiple fields, and the fields are ordered.

For example, ('hyd', 500086) contains two fields.

Bag:

A bag is a collection of tuples, and the tuples are unordered. Bag constants are constructed using braces, with the tuples in the bag separated by commas.

For example: {('hyd', 500086), ('chennai', 510071), ('bombay', 500185)}


What is Pig useful for?

Pig is typically used in three categories:

1) ETL data pipelines

2) Research on raw data

3) Iterative processing

The most common use case for Pig is the data pipeline. For example, web-based companies get web logs, and before storing the data into a warehouse they perform operations on it such as cleaning and aggregation, i.e. transformations on the data.


Explain what is distributed Cache in MapReduce Framework?

Distributed Cache is an important feature provided by the MapReduce framework. When you want to share some files across all nodes in a Hadoop cluster, DistributedCache is used. The files could be executable jar files or simple properties files.

How does Pig differ from MapReduce?

In MapReduce, the group-by operation is performed on the reducer side, while filter and projection can be implemented in the map phase. Pig Latin provides the same standard operations, such as order by, filter and group by, but it also lets you analyze the script, understand the data flow and catch errors early. Pig Latin is much lower cost to write and maintain than Java code for MapReduce.



What is difference between PIG and SQL?

-- Pig Latin is a procedural version of SQL.

-- Pig certainly has similarities to SQL, but there are more differences.

-- SQL is a query language: the user asks a question in query form.

-- SQL states what answer is wanted, but does not say how to answer the given question.

-- Suppose a user wants to perform multiple operations on tables; with SQL we have to write multiple queries and use temporary tables for intermediate storage. SQL supports sub-queries, but users often find sub-queries confusing and difficult to form properly, and using them creates an inside-out design where the first step in the data pipeline is the innermost query. Pig is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of sub-queries or to worry about storing data in temporary tables.



What is a Combiner?

The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
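A minimal driver sketch, assuming a word-count style job where the aggregation is commutative and associative, so the reducer class can double as the combiner; it reuses the class names from the word-count sketch earlier in this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner-demo");
        job.setMapperClass(WordCountExample.TokenizerMapper.class);   // emits (word, 1)
        job.setCombinerClass(WordCountExample.IntSumReducer.class);   // local, map-side aggregation
        job.setReducerClass(WordCountExample.IntSumReducer.class);    // final aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}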



What is the purpose of RecordReader in Hadoop?

The InputSplit defines a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.



Consider this scenario: in a MapReduce system, the HDFS block size is 64 MB, the input format is FileInputFormat, and we have 3 files of size 64 KB, 65 MB and 127 MB. How many input splits are made?

Hadoop will make 5 splits, as follows:

- 1 split for the 64 KB file

- 2 splits for the 65 MB file

- 2 splits for the 127 MB file



Explain what the JobTracker is in Hadoop. What actions does it perform?

In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.

The JobTracker performs the following actions:

Client applications submit jobs to the JobTracker.

The JobTracker communicates with the NameNode to determine the data location.

The JobTracker locates TaskTracker nodes with available slots at or near the data.

It submits the work to the chosen TaskTracker nodes.

When a task fails, the JobTracker is notified and decides what to do next.

The TaskTracker nodes are monitored by the JobTracker.






What is the difference between a Hadoop database and Relational Database?

Hadoop is not a database, it is an architecture with a filesystem called HDFS. The data is stored in HDFS which does not have any predefined containers. Relational database stores data in predefined containers.



What happens if mapper output does not match reducer input in Hadoop?

A runtime exception will be thrown and the MapReduce job will fail.



Explain what is Speculative Execution?

In Hadoop, during speculative execution, a certain number of duplicate tasks are launched: multiple copies of the same map or reduce task can be executed on different slave nodes. In simple words, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate task on another node. The copy that finishes the task first is retained, and the copies that do not finish first are killed.



How can we control particular key should go in a specific reducer?

By using a custom partitioner.
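For illustration, here is a hedged sketch of such a custom partitioner; the routing rule (keys starting with A to M go to reducer 0, the rest to reducer 1) and the key/value types are invented for the example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                                   // map-only job, nothing to route
        }
        String k = key.toString();
        char first = k.isEmpty() ? 'Z' : Character.toUpperCase(k.charAt(0));
        int partition = (first >= 'A' && first <= 'M') ? 0 : 1;
        return partition % numReduceTasks;              // stay within the configured reducer count
    }
}

It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class).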






Explain the function of the MapReduce partitioner?

The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps to distribute the map output evenly over the reducers.



Explain what happens in TextInputFormat?

In TextInputFormat, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line; that is, key: LongWritable, value: Text.









What are sequence files and why are they important in Hadoop?

Sequence files are binary format files that are compressed and are splittable. They are often used in high-performance MapReduce jobs.



What is a SequenceFile in Hadoop?

A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type, and each value must be the same type.



What is the Reducer used for?

Reducer is used to combine the multiple outputs of mapper to one.



What if the JobTracker machine goes down?

In Hadoop 1.0, the JobTracker is a single point of failure: if the JobTracker fails, all jobs must restart and the overall execution flow is interrupted. Due to this limitation, in Hadoop 2.0 the JobTracker concept is replaced by YARN.

In YARN, the terms JobTracker and TaskTracker have disappeared entirely. YARN splits the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into two separate daemons (components):

Resource Manager

Node Manager (node specific)



What is the meaning Replication factor?

The replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need 3 times the amount of storage needed to store the data. Each file is split into data blocks, which are spread across the cluster.
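The cluster-wide default is the dfs.replication property in hdfs-site.xml. As a small illustration, the per-file replication can also be inspected and changed from Java through the FileSystem API; the path below is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample.txt");
        short current = fs.getFileStatus(file).getReplication();  // replication of this file
        fs.setReplication(file, (short) 2);                       // lower it for this file only
        System.out.println("Replication changed from " + current + " to 2");
    }
}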



How does the master-slave architecture work in Hadoop?

The MapReduce framework consists of a single master JobTracker and multiple slaves, each cluster-node will have one TaskTracker.

The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.



What is the difference between partitioning and bucketing in hive and when we use bucketing or partitioning?

Basically both Partitioning and Bucketing slice the data for executing the query much more efficiently than on the non-sliced data. The major difference is that the number of slices will keep on changing in the case of partitioning as data is modified, but with bucketing the number of slices are fixed which are specified while creating the table.

Bucketing happens by applying a hash algorithm and then a modulo on the number of buckets, so a row might get inserted into any of the buckets. Bucketing can be used for sampling of data, as well as for joining two data sets much more effectively, and more.



What is the InputSplit in mapreduce?

An inputsplit is the slice of data to be processed by a single Mapper. It generally is of the block size which is stored on the datanode.



What is the default replication factor in HDFS?

Hadoop comes with a default replication factor of 3. You can set the replication level individually for each file in HDFS. Most Hadoop administrators keep the default of three; the main assumption is that if you keep three copies of the data, your data is safe, and this holds true in the big clusters that we manage and operate. In addition to fault tolerance, having replicas allows jobs that consume the same data to be run in parallel. Also, if there are replicas of the data, Hadoop can attempt to run multiple copies of the same task and take whichever finishes first, which is useful if for some reason a machine is being slow.



What is structured and unstructured data?

Structured data is the data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.



What are the primary phases of the Reducer?

Reducer has 3 primary phases: shuffle, sort and reduce.



Explain what is shuffling in MapReduce?

The process by which the system performs the sort and transfers the map outputs to the reducer as inputs is known as the shuffle.






How can we control particular key should go in a specific reducer?

Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.



What are compute and storage nodes?

-- Compute node: the computer or machine where your actual business logic will be executed.

-- Storage node: the computer or machine where your file system resides to store the data being processed.

-- In most cases the compute node and the storage node are the same machine.



Explain what a combiner is and when you should use a combiner in a MapReduce job?

Combiners are used to increase the efficiency of a MapReduce program. A combiner reduces the amount of data that needs to be transferred across to the reducers. If the operation performed is commutative and associative, you can use your reducer code as a combiner. The execution of the combiner is not guaranteed in Hadoop.



Why do we use the metastore in Hive?

The Hive metastore service stores the metadata for Hive tables and partitions in a relational database, and provides clients (including Hive) access to this information via the metastore service API.



What is Hadoop Streaming?

Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.



What is the difference between HDFS and NAS?

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. The following are differences between HDFS and NAS:

-- In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.

-- HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce, since data is stored separately from the computation.

-- HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.



What are the restrictions on the key and value classes?

The key and value classes have to be serialized by the framework. To make them serializable, Hadoop provides the Writable interface. In addition, just as with keys in a Java Map, the key must be comparable, so key classes implement the WritableComparable interface.



What is TaskTracker?

A TaskTracker is a node in the cluster that accepts tasks, such as Map, Reduce and Shuffle operations, from a JobTracker.



What are some typical functions of Job Tracker?

The following are some typical tasks of JobTracker:-

- Accepts jobs from clients

- It talks to the NameNode to determine the location of the data.

- It locates TaskTracker nodes with available slots at or near the data.

- It submits the work to the chosen TaskTracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker.



What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?

Single instance of a Task Tracker is run on each Slave node. Task tracker is run as a separate JVM process.

Single instance of a DataNode daemon is run on each Slave node. DataNode daemon is run as a separate JVM process.

One or Multiple instances of Task Instance is run on each slave node. Each task instance is run as a separate JVM process. The number of Task instances can be controlled by configuration. Typically a high end machine is configured to run more task instances.






Which are the three modes in which Hadoop can be run?

The three modes in which Hadoop can be run are:

1. Standalone (local) mode

2. Pseudo-distributed mode

3. Fully distributed mode



How can you add the arbitrary key-value pairs in your mapper?

You can set arbitrary (key, value) pairs of configuration data in your Job, e.g. with Job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with Context.getConfiguration().get("myKey").

This kind of functionality is typically done in the Mapper's setup() method.
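A small end-to-end sketch of that pattern, with made-up key and class names.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigPassingExample {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String myVal;

        @Override
        protected void setup(Context context) {
            // Read the value the driver placed in the job configuration, once per task.
            myVal = context.getConfiguration().get("myKey", "defaultVal");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(myVal), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "config-demo");
        job.getConfiguration().set("myKey", "myVal");   // driver side
        job.setMapperClass(MyMapper.class);
    }
}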



What is next step after Mapper or MapTask?

The output of the Mapper is sorted, and partitions are created for the output. The number of partitions depends on the number of reducers.



Explain the Reducer’s Sort phase?

The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged (It is similar to merge-sort).



What is Hadoop Map Reduce?

For processing large data sets in parallel across a hadoop cluster, Hadoop MapReduce framework is used. Data analysis uses a two-step map and reduce process.



What is the use of Context object?

The Context object allows the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.



What is the use of Combiner?

It is an optional component or class, and can be specified via Job.setCombinerClass(ClassName), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.



Explain the Reducer’s reduce phase?

In this phase the reduce(MapOutKeyType, Iterable, Context) method is called for each pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType). Applications can use the Context to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.






What is NoSQL?

NoSQL is a whole new way of thinking about a database. NoSQL is not a relational database. The reality is that a relational database model may not be the best solution for all situations. The easiest way to think of NoSQL is as a database which does not adhere to the traditional relational database management system (RDBMS) structure. Sometimes you will also see it referred to as 'not only SQL'.



Why would NoSQL be better than using a SQL Database And how much better is it?

It would be better when your site needs to scale so massively that the best RDBMS running on the best hardware you can afford and optimized as much as possible simply can't keep up with the load. How much better it is depends on the specific use case (lots of update activity combined with lots of joins is very hard on "traditional" RDBMSs) - could well be a factor of 1000 in extreme cases.



Explain what does the conf.setMapper Class do?

Conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.






Explain the input and output data formats of the Hadoop framework?

The MapReduce framework operates exclusively on key/value pairs; that is, the framework views the input to the job as a set of key/value pairs and produces a set of key/value pairs as the output of the job, conceivably of different types.

The flow is: (input) <k1, v1> -> map -> <k2, v2> -> combine/sort -> <k2, v2> -> reduce -> <k3, v3> (output)



What happens if you don’t override the Mapper methods and keep them as it is?

If you do not override any methods (leaving even map as-is), it will act as the identity function, emitting each input record as a separate output.



Which are the methods in the Mapper interface?

The Mapper contains the run() method, which calls its setup() method once, then calls the map() method for each input key/value pair, and finally calls its cleanup() method. All of these methods can be overridden in your code.
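A skeleton of that lifecycle, with illustrative types and comments on when each method runs.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Called once per task before any map() call: open resources, read configuration, etc.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once for every input key/value pair in the task's split.
        context.write(new Text(value.toString()), new IntWritable(1));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task after the last map() call: release resources.
    }
}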



What happens if number of reducers are 0?

In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
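A minimal map-only driver sketch: with zero reducers there is no shuffle and no sort, and the map output lands directly in the output path. The paths are placeholders and the mapper reuses the lifecycle skeleton shown a few answers above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-demo");
        job.setMapperClass(LifecycleMapper.class);   // mapper from the earlier sketch
        job.setNumReduceTasks(0);                    // zero reducers: map output goes straight to the FileSystem
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}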



What is the use of Combiners in the Hadoop framework?

Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper nodes. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed; Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.



What is the Hadoop MapReduce API contract for a key and value Class?

The Key must implement the org.apache.hadoop.io.WritableComparable interface.

The value must implement the org.apache.hadoop.io.Writable interface.
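As an illustration of that contract, here is a hedged sketch of a custom key type; the class and field names are invented for the example.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public YearKey() { }                       // no-arg constructor required by the framework

    public YearKey(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                    // serialization
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                   // deserialization
    }

    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year);   // ordering used by the sort phase
    }

    @Override
    public int hashCode() { return year; }     // used by the default HashPartitioner

    @Override
    public boolean equals(Object o) {
        return o instanceof YearKey && ((YearKey) o).year == year;
    }
}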






How does a NameNode handle the failure of the data nodes?

-- HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

-- In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on.

-- The NameNode and DataNode are pieces of software designed to run on commodity machines.

-- The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.

-- Receipt of a Heartbeat implies that the DataNode is functioning properly.

-- A Blockreport contains a list of all blocks on a DataNode.

-- When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead.

-- Since blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode.

-- The NameNode orchestrates the replication of data blocks from one DataNode to another.

-- The replication data transfer happens directly between DataNodes; the data never passes through the NameNode.



Can Reducers talk to each other?

No, Reducers run in isolation.



Where the Mapper’s Intermediate data will be stored?

The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.



Explain the use of TaskTracker in the Hadoop cluster?

-- A TaskTracker is a slave node in the cluster which accepts tasks from the JobTracker, such as Map, Reduce or Shuffle operations.

-- The TaskTracker also runs in its own JVM process.

-- Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept.

-- The TaskTracker starts separate JVM processes to do the actual work (called Task Instances); this is to ensure that process failure does not take down the TaskTracker.

-- The Tasktracker monitors these task instances, capturing the output and exit codes.

-- When the Task instances finish, successfully or not, the task tracker notifies the JobTracker.

-- The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive.

-- These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.



What do you mean by TaskInstance?

-- Task instances are the actual MapReduce jobs which run on each slave node.

-- The TaskTracker starts separate JVM processes to do the actual work (called Task Instances); this is to ensure that process failure does not take down the entire TaskTracker.

-- Each Task Instance runs on its own JVM process.

-- There can be multiple processes of task instance running on a slave node.

-- This is based on the number of slots configured on task tracker.

-- By default a new task instance JVM process is spawned for a task.



How many daemon processes run on a Hadoop cluster?

Hadoop is comprised of five separate daemons. Each of these daemons runs in its own JVM.

Following 3 Daemons run on Master nodes.

-- NameNode - This daemon stores and maintains the metadata for HDFS.

-- Secondary NameNode - Performs housekeeping functions for the NameNode.

-- JobTracker - Manages MapReduce jobs, distributes individual tasks to machines running the Task Tracker.

Following 2 Daemons run on each Slave nodes

-- DataNode – Stores actual HDFS data blocks.

-- TaskTracker – It is Responsible for instantiating and monitoring individual Map and Reduce tasks.



How many maximum JVM can run on a slave node?

One or Multiple instances of Task Instance can run on each slave node. Each task instance is run as a separate JVM process. The number of Task instances can be controlled by configuration. Typically a high end machine is configured to run more task instances.



What is NAS?

It is a kind of file system where the data resides on one centralized machine and all the cluster members read and write data from that shared storage, which is not as efficient as HDFS.






What is the JobTracker?

The JobTracker is a daemon service which submits and tracks MapReduce jobs on the Hadoop cluster. It runs in its own JVM process, usually on a separate machine, and each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.



What does a JobTracker perform in a Hadoop Cluster?

The JobTracker in Hadoop performs the following actions:

-- Client applications submit jobs to the JobTracker.

-- The JobTracker talks to the NameNode to determine the location of the data.

-- The JobTracker locates TaskTracker nodes with available slots at or near the data.

-- The JobTracker submits the work to the chosen TaskTracker nodes.

-- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.

-- A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.

-- When the work is completed, the JobTracker updates its status.

Client applications can poll the JobTracker for information.



How a task is scheduled by a JobTracker?

-- The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive.

-- These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

-- When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.



How many instances of Tasktracker run on a Hadoop cluster?

There is one TaskTracker daemon process for each slave node in the Hadoop cluster.



How many instances of JobTracker can run on a Hadoop Cluster?

Only one



Is it possible for a job to have 0 reducers?

It is legal to set the number of reduce-tasks to zero if no reduction is desired.



How many Reducers should be configured?

-- The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapreduce.tasktracker.reduce.tasks.maximum).

-- With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish.

-- With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.

-- Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
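A small driver sketch applying the 0.95 heuristic; the node count and slots-per-node are assumed values, not numbers from this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        int nodeCount = 10;           // number of slave nodes (assumption)
        int reduceSlotsPerNode = 2;   // mapreduce.tasktracker.reduce.tasks.maximum (assumption)

        // 0.95 lets all reduces start as soon as the maps finish;
        // 1.75 would instead give faster nodes a second wave for better load balancing.
        int reducers = (int) (0.95 * nodeCount * reduceSlotsPerNode);

        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
        job.setNumReduceTasks(reducers);
        System.out.println("Configured " + reducers + " reduce tasks");
    }
}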



Explain the Reducer’s reduce phase?

-- In this phase the reduce(MapOutKeyType, Iterable, Context) method is called for each pair in the grouped inputs.

-- The output of the reduce task is typically written to the FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType).

-- Applications can use the Context to report progress, set application-level status messages and update Counters, or just indicate that they are alive.

-- The output of the Reducer is not sorted.






Explain the shuffle?

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.



What are the primary phases of the Reducer?

Shuffle, Sort and Reduce



Explain the core methods of the Reducer?

-- The API of Reducer is very similar to that of Mapper, there's a run() method that receives a Context containing the job's configuration as well as interfacing methods that return data from the reducer itself back to the framework.

-- The run() method calls setup() once, reduce() once for each key associated with the reduce task, and cleanup() once at the end.

-- Each of these methods can access the job's configuration data by using Context.getConfiguration().

-- As in the Mapper, any or all of these methods can be overridden with custom implementations. If none of these methods are overridden, the default reducer operation is the identity function; values are passed through without further processing.

The heart of the Reducer is its reduce() method. This is called once per key; the second argument is an Iterable which returns all the values associated with that key.



What is the Reducer used for?

Reducer reduces a set of intermediate values which share a key to a (usually smaller) set of values



How many maps are there in a particular Job?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

Generally it is around 10-100 maps per-node. Task set up takes awhile, so it is best if the maps take at least a minute to execute.

Suppose you expect 10 TB of input data and have a block size of 128 MB; you'll end up with about 82,000 maps (10 TB / 128 MB = 81,920). To control the number of maps you can use the mapreduce.job.maps parameter (which only provides a hint to the framework).

Ultimately, the number of tasks is controlled by the number of splits returned by the InputFormat.getSplits() method (which you can override).












Which object can be used to get the progress of a particular job?

Context



How does Mapper’s run() method works?

The Mapper.run() method then calls map(KeyInType, ValInType, Context) for each key/value pair in the InputSplit for that task









How Mapper is instantiated in a running job?

The Mapper itself is instantiated in the running job, and will be passed a MapContext object which it can use to configure itself.






What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?

In Gen 1 Hadoop, Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as Active and Passive Namenodes kind of a structure. If the active Namenode fails, passive Namenode takes over the charge.



What is a Secondary Namenode Is it a substitute to the Namenode?

The Secondary Namenode constantly reads the data from the RAM of the Namenode and writes it to the hard disk or the file system. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system goes down.



What is a rack?

A rack is a storage area with the datanodes put together; it is a physical collection of datanodes stored at a single location. There can be multiple racks in a single location.



What if rack 2 and a datanode fail?

If both rack 2 and a datanode in rack 1 fail, there is no chance of getting the data back. To avoid such situations, we need to replicate the data more times instead of replicating it only thrice. This can be done by changing the value of the replication factor, which is set to 3 by default.



Do we need to place 2nd and 3rd data in rack 2 only?

Yes, this is to protect against datanode and rack failure.



On what basis data will be stored on a rack?

When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the Namenode and gets 3 datanodes for every block of the file, which indicate where each block should be stored. While placing the replicas, the key rule followed is "for every block of data, two copies will exist in one rack and the third copy in a different rack". This rule is known as the "Replica Placement Policy".



Why do we need a password-less SSH in Fully Distributed environment?

We need password-less SSH in a Fully Distributed environment because when the cluster is live and running in Fully Distributed mode, the communication is very frequent: the JobTracker should be able to send a task to a TaskTracker quickly.



What are the network requirements for Hadoop?

The Hadoop core uses Shell (SSH) to launch the server processes on the slave nodes. It requires password-less SSH connection between the master and all the slaves and the secondary machines.



If datanodes increase, then do we need to upgrade Namenode?

While installing the Hadoop system, the Namenode is sized based on the size of the cluster. Most of the time we do not need to upgrade the Namenode because it does not store the actual data, just the metadata, so such a requirement rarely arises.



If a DataNode is full, how is that identified?

When data is stored in a datanode, the metadata of that data is stored in the Namenode. So the Namenode will identify when a datanode is full.



If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?

In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the Master node will figure out what is the actual amount of space required, how many block are being used, how much space is available, and it will allocate the blocks accordingly.



Replication causes data redundancy, so why is it pursued in HDFS?

HDFS works with commodity hardware (systems with average configurations) that has a high chance of crashing at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored at at least 3 different locations. So, even if one of them is corrupted and another is unavailable for some time for any reason, the data can still be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us attain the Hadoop feature called fault tolerance.



What is Cloudera and why is it used?

Cloudera provides a distribution of Hadoop (CDH). 'cloudera' is also the user created by default on the Cloudera VM. The distribution packages Apache Hadoop and is used for data processing.



Is fs.mapr.working.dir a single directory?

Yes, fs.mapr.working.dir is just one directory.



What is a spill factor with respect to the RAM?

The spill factor is the size after which your in-memory data moves to a temporary file on disk; the Hadoop temp directory is used for this.



What are the port numbers of Namenode, job tracker and task tracker?

The default web UI port number for the Namenode is 50070, for the JobTracker it is 50030 and for the TaskTracker it is 50060.



In which directory Hadoop is installed?

Cloudera and Apache have the same directory structure. Hadoop is installed in /usr/lib/hadoop-0.20/.



What are the features of Fully Distributed mode?

Fully Distributed mode is used in the production environment, where we have ‘n’ number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines. There is one host onto which Namenode is running and another host on which datanode is running and then there are machines on which task tracker is running. We have separate masters and separate slaves in this distribution.



Is it mandatory to set input and output type/format in MapReduce?

No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as ‘text’.



What is the input type/format in MapReduce by default?

By default, the input type in MapReduce is 'text'.



What are the benefits of block transfer?

A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.



What is Streaming?

Streaming is a feature with Hadoop framework that allows us to do programming using MapReduce in any programming language which can accept standard input and can produce standard output. It could be Perl, Python, Ruby and not necessarily be Java. However, customization in MapReduce can only be done using Java and not any other programming language.



Why can't we do aggregation (addition) in a mapper? Why do we require a reducer for that?

We cannot do aggregation (addition) in a mapper because sorting does not happen in the mapper; sorting happens only on the reducer side. Mapper initialization depends on the input split: for each split a new mapper gets initialized, so while doing aggregation we would lose the values from the previous instances and have no track of the rows processed by other mappers.



Does Hadoop follow the UNIX pattern?

-- Yes, Hadoop closely follows the UNIX pattern. Hadoop also has a 'conf' directory, as in the case of UNIX.






Explain what is sqoop in Hadoop?

Sqoop is a tool used to transfer data between a relational database management system (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, as well as exported from HDFS back to an RDBMS.



Explain what is WebDAV in Hadoop?

WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.



How the Client communicates with HDFS?

Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can then talk directly to a DataNode, once the NameNode has provided the location of the data.
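As a rough illustration of a client going through that API, the sketch below opens a file with the FileSystem class (the NameNode resolves the path, and the bytes are then streamed from the DataNodes); the path is a placeholder.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/sample.txt");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);           // data read directly from DataNodes
            }
        }
    }
}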



What is a IdentityMapper and IdentityReducer in MapReduce?

-- org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value.

-- org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.



What is “fsck”?

fsck stands for File System Check.



How would you restart Namenode?

To restart the Namenode, you could either:

switch to the hdfs user (su - hdfs) and run /etc/init.d/hadoop-0.20-namenode stop followed by /etc/init.d/hadoop-0.20-namenode start,

or simply run stop-all.sh and then start-all.sh.



Is there another way to check whether Namenode is working?

Besides the jps command, you can also use: /etc/init.d/hadoop-0.20-namenode status.



What sorts of actions does the job tracker process perform?

-- Client applications send the job tracker jobs.

-- Job tracker determines the location of data by communicating with Namenode.

-- Job tracker finds nodes in task tracker that has open slots for the data.

-- Job tracker submits the job to task tracker nodes.

-- Job tracker monitors the task tracker nodes for signs of activity. If there is not enough activity, job tracker transfers the job to a different task tracker node.

-- Job tracker receives a notification from task tracker if the job has failed. From there, job tracker might submit the job elsewhere, as described above. If it doesn’t do this, it might blacklist either the job or the task tracker.












Mention the main configuration parameters that the user needs to specify to run a MapReduce job?

The user of the MapReduce framework needs to specify:

Job’s input locations in the distributed file system

Job’s output location in the distributed file system

Input format

Output format

Class containing the map function

Class containing the reduce function

JAR file containing the mapper, reducer and driver classes









Explain what is NameNode in Hadoop?

NameNode in Hadoop is the node, where Hadoop stores all the file location information in HDFS (Hadoop Distributed File System). In other words, NameNode is the centrepiece of an HDFS file system. It keeps the record of all the files in the file system, and tracks the file data across the cluster or multiple machines















We already have SQL, then why NoSQL?

-- NoSQL offers high performance with high availability, a rich query language and easy scalability.

-- NoSQL is gaining momentum, and is supported by Hadoop, MongoDB and others.

-- The NoSQL Database site is a good reference for someone looking for more information.



What are the functions of the JobTracker in Hadoop?

-- Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they are running.

-- If a task fails, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.

-- There is only one JobTracker daemon per Hadoop cluster. It is typically run on a server as a master node of the cluster.



What is a Task Instance in Hadoop? Where does it run?

Task Instances are the actual MapReduce tasks which are run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that process failure does not take down the TaskTracker. Each Task Instance runs in its own JVM process. There can be multiple Task Instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new Task Instance JVM process is spawned for each task.



What is Hadoop and where did Hadoop come from?

The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.



What is the partitions formula?

Partitions can be estimated with the following formula:

partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) )

where

nextPrimeNumberAbove(x) is the prime number greater than x, and

K is a multiplier: start with 1 and increase it until join performance starts to degrade.

In addition:

1. This is a formula to estimate Hadoop storage (H):

H = c * r * S / (1 - i)

where:

c = average compression ratio. It depends on the type of compression used (Snappy, LZOP, ...) and the size of the data. When no compression is used, c = 1.

r = replication factor. It is usually 3 in a production cluster.

S = size of the data to be moved to Hadoop. This could be a combination of historical data and incremental data. The incremental data can be daily, for example, and projected over a period of time (3 years, for example).

i = intermediate factor. It is usually 1/3 or 1/4. This is Hadoop's working space dedicated to storing intermediate results of the map phases.

Example: with no compression, i.e. c = 1, a replication factor of 3 and an intermediate factor of 0.25 = 1/4:

H = 1 * 3 * S / (1 - 1/4) = 3S / (3/4) = 4S

With the assumptions above, the Hadoop storage is estimated to be 4 times the size of the initial data.

2. This is the formula to estimate the number of data nodes (n):

n = H / d = c * r * S / ((1 - i) * d)

where d = disk space available per node. All other parameters remain the same as in 1.

Example: if 8 TB is the available disk space per node (10 disks of 1 TB each, with 2 disks excluded for the operating system etc.), and the required Hadoop storage H is 600 TB, then n = 600 / 8 = 75 data nodes are needed.
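A small arithmetic sketch that evaluates both formulas; the raw data size S = 150 TB is an assumed value, chosen so that H works out to the 600 TB used in the example above.

public class ClusterSizingExample {
    public static void main(String[] args) {
        double c = 1.0;    // compression ratio (no compression)
        double r = 3.0;    // replication factor
        double i = 0.25;   // intermediate factor
        double s = 150.0;  // raw data size in TB (assumption)
        double d = 8.0;    // usable disk space per node in TB

        double h = c * r * s / (1 - i);   // required Hadoop storage: 600 TB here
        double n = Math.ceil(h / d);      // data nodes needed: 600 / 8 = 75 here

        System.out.printf("Storage H = %.0f TB, data nodes n = %.0f%n", h, n);
    }
}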


