Install Hadoop/YARN on Ubuntu and VM (VirtualBox) ~ Techie's Notes

This article describes a step-by-step approach to installing Hadoop/YARN on Ubuntu. The example commands download release 2.4.0; for 2.5.1 or a later release, substitute that version number throughout.

Install JDK 7 or JDK 8

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer

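Alternatively, install JDK 8; the same PPA provides an oracle-java8-installer package. If you do, use java-8-oracle instead of java-7-oracle in the symlink step below.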

Verify the Java installation:

$ java -version
java version "1.7.0_55" 
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
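
If later steps are scripted, it can help to check the Java version programmatically rather than by eye. The sketch below is one way to do that; the function names java_major and installed_java_version are made up for this example and are not part of any JDK tooling.

```shell
# Print the major Java version (7, 8, 11, ...) from a version string.
# Handles the legacy "1.7.0_55" scheme and the newer "11.0.2" scheme.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 ;;
  esac
}

# Extract the quoted version string from `java -version`,
# which writes its report to stderr rather than stdout.
installed_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}'
}
```

With the installation shown above, java_major "$(installed_java_version)" should print 7.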

Create a symlink for easier configuration later

$ cd /usr/lib/jvm/
$ sudo ln -s java-7-oracle jdk

Install OpenSSH Server

$ sudo apt-get install openssh-server
$ ssh-keygen -t rsa

Press Enter at every prompt to accept the defaults, including the empty passphrase. Next, to prevent password prompts, append this machine's public key to the authorized_keys file (Hadoop's start and stop scripts use ssh to launch the daemons, even on a single-node cluster).

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

SSH to localhost to test the SSH server and to add localhost to the list of known hosts. The next time you ssh to localhost, there will be no prompts.

$ ssh localhost
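
If ssh still asks for a password, a common cause is loose permissions on ~/.ssh or authorized_keys, which sshd rejects under its default StrictModes setting. The following is a minimal sketch for tightening them and for testing the login without any prompt; both function names are invented for this example.

```shell
# Tighten permissions on an SSH directory and its authorized_keys file.
# The optional argument exists only so the function can be tried on a copy.
secure_ssh_dir() {
  ssh_dir="${1:-$HOME/.ssh}"
  chmod 700 "$ssh_dir" && chmod 600 "$ssh_dir/authorized_keys"
}

# Succeeds only if key-based login to localhost works without a prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
can_ssh_without_password() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true
}
```

Running secure_ssh_dir followed by can_ssh_without_password && echo ok gives a quick yes/no answer before moving on.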

Download Hadoop

Note 1: You should use a mirror URL from the official downloads page.

Note 2: vkhan is my user name as well as my group name on the Ubuntu machine. Please replace it with your own user/group name.

$ cd Downloads/
$ wget http://apache.claz.org/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz
$ tar zxvf hadoop-2.4.0.tar.gz
$ sudo mv hadoop-2.4.0 /usr/local/
$ cd /usr/local
$ sudo ln -s hadoop-2.4.0 hadoop
$ sudo chown -R vkhan:vkhan hadoop-2.4.0
$ sudo chown -R vkhan:vkhan hadoop
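
The version number has to be the same in the download URL, the tarball name, and the extracted directory name. One way to avoid a mismatch is to set it once and derive the names from it. This is only a sketch; HADOOP_VERSION and the two helper functions are names chosen for this example, not something Hadoop itself defines.

```shell
# Set the release once so every later command agrees on it.
# Override by exporting HADOOP_VERSION before sourcing this snippet.
HADOOP_VERSION="${HADOOP_VERSION:-2.4.0}"

# Name of the directory the tarball extracts to, e.g. hadoop-2.4.0
hadoop_dirname() { echo "hadoop-${HADOOP_VERSION}"; }

# Name of the release tarball, e.g. hadoop-2.4.0.tar.gz
hadoop_tarball() { echo "$(hadoop_dirname).tar.gz"; }
```

The commands above then become, for example, tar zxvf "$(hadoop_tarball)" and sudo mv "$(hadoop_dirname)" /usr/local/.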

Environment Configuration

$ cd ~
$ vim .bashrc

Add the following to the end of the .bashrc file:

#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL

Modify hadoop-env.sh

$ cd /usr/local/hadoop/etc/hadoop
$ vim hadoop-env.sh
 
#modify JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/jdk/
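
The same edit can be made non-interactively with sed. The sketch below assumes hadoop-env.sh contains a line starting with "export JAVA_HOME=", as the stock 2.x file does, and that the path contains no | or & characters; set_java_home is a name made up for this example.

```shell
# Point JAVA_HOME in a hadoop-env.sh style file at the given directory.
# Rewrites an existing "export JAVA_HOME=" line, or appends one if none exists.
# A backup of the original is left next to the file with a .bak suffix.
set_java_home() {
  env_file="$1"
  java_home="$2"
  if grep -q '^export JAVA_HOME=' "$env_file"; then
    sed -i.bak "s|^export JAVA_HOME=.*|export JAVA_HOME=${java_home}|" "$env_file"
  else
    printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
  fi
}
```

Usage: set_java_home /usr/local/hadoop/etc/hadoop/hadoop-env.sh /usr/lib/jvm/jdk/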

Verify hadoop installation

$ source ~/.bashrc   # reload the shell so it picks up the configuration changes made above
$ hadoop version
Hadoop 2.4.0
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1583262
Compiled by jenkins on 2014-03-31T08:29Z
Compiled with protoc 2.5.0
From source with checksum 375b2832a6641759c6eaf6e3e998147
This command was run using /usr/local/hadoop-2.4.0/share/hadoop/common/hadoop-common-2.4.0.jar

Hadoop Configuration

$ cd ~
$ mkdir -p mydata/hdfs/namenode
$ mkdir -p mydata/hdfs/datanode

core-site.xml

$ cd /usr/local/hadoop/etc/hadoop/
$ vim core-site.xml

Add the following between the <configuration></configuration> elements

<property>
 <name>fs.defaultFS</name>
 <value>hdfs://localhost:9000</value>
</property>

yarn-site.xml

$ vim yarn-site.xml

Add the following between the <configuration></configuration> elements

<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>
<property>
 <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

mapred-site.xml

$ cp mapred-site.xml.template mapred-site.xml
$ vim mapred-site.xml

Add the following between the <configuration></configuration> elements

<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
</property>

hdfs-site.xml

$ vim hdfs-site.xml

Add the following between the <configuration></configuration> elements. Replace /home/vkhan with your own home directory.

<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>
<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:/home/vkhan/mydata/hdfs/namenode</value>
</property>
<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:/home/vkhan/mydata/hdfs/datanode</value>
</property>
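
To avoid hard-coding a home directory, the whole file can also be generated from the shell. The sketch below writes the same three properties with the storage paths placed under a base directory passed as an argument. Note that it overwrites the target file, and that write_hdfs_site is a hypothetical helper for this example.

```shell
# Write an hdfs-site.xml whose storage paths live under the given base directory.
# WARNING: replaces the target file entirely.
write_hdfs_site() {
  out_file="$1"
  base_dir="$2"
  cat > "$out_file" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:${base_dir}/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:${base_dir}/hdfs/datanode</value>
  </property>
</configuration>
EOF
}
```

Usage: write_hdfs_site /usr/local/hadoop/etc/hadoop/hdfs-site.xml "$HOME/mydata"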

Running Hadoop

Format the namenode

$ hdfs namenode -format
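
Formatting is a one-time step: running it again on a NameNode that already holds data wipes the file system metadata. A formatted NameNode directory contains a current/VERSION file, so a script can check for it first. This is a sketch only, and namenode_is_formatted is a name invented for this example.

```shell
# Succeeds if the given NameNode directory has already been formatted,
# which is detected by the presence of current/VERSION inside it.
namenode_is_formatted() {
  [ -f "$1/current/VERSION" ]
}
```

Usage: namenode_is_formatted ~/mydata/hdfs/namenode || hdfs namenode -format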

Start hadoop

$ start-dfs.sh
$ start-yarn.sh

Verify all services are running

$ jps
5037 SecondaryNameNode
4690 NameNode
5166 ResourceManager
4777 DataNode
5261 NodeManager
5293 Jps
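
Checking the list by eye gets tedious when the cluster is restarted often. The sketch below reads jps output and prints the name of every expected daemon that is missing; it matches the second column exactly so that NameNode is not confused with SecondaryNameNode. The function name missing_daemons is made up for this example.

```shell
# Read `jps` output on stdin and print each expected daemon that is absent.
# Prints nothing when all five single-node daemons are running.
missing_daemons() {
  jps_output="$(cat)"
  for daemon in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
    echo "$jps_output" \
      | awk -v d="$daemon" '$2 == d {found=1} END {exit !found}' \
      || echo "$daemon"
  done
}
```

Usage: jps | missing_daemons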

Check web interfaces of different services

Namenode: http://localhost:50070

YARN: http://localhost:8088

Run a hadoop example MR job

$ cd /usr/local/hadoop
$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar pi 2 5
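
In the pi example, the two arguments are the number of map tasks (2) and the number of samples per map (5). The jar file name contains the Hadoop version, so the command above has to be edited for every release. A small sketch that looks the jar up by pattern instead is shown below; find_examples_jar is a hypothetical helper, and the default install path is the one used in this article.

```shell
# Print the path of the MapReduce examples jar under a Hadoop install directory.
# Source jars are skipped; only the first match is printed.
# Returns non-zero if no jar is found.
find_examples_jar() {
  hadoop_home="${1:-/usr/local/hadoop}"
  for jar in "$hadoop_home"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar; do
    case "$jar" in
      *-sources.jar) continue ;;
    esac
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  return 1
}
```

Usage: hadoop jar "$(find_examples_jar)" pi 2 5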

Start and stop scripts

start-all.sh and stop-all.sh: Start and stop all Hadoop daemons at once. Issued on the master machine, they start/stop the daemons on all the nodes of a cluster. These scripts are deprecated.
start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh: Same as above, but start/stop the HDFS and YARN daemons separately on all the nodes from the master machine. Prefer these over start-all.sh and stop-all.sh.
hadoop-daemon.sh namenode/datanode and yarn-daemon.sh resourcemanager: Start individual daemons manually on a single machine. You need to log in to the particular node and issue these commands there.
Use case: Suppose you have added a new DataNode (DN) to your cluster and need to start the DataNode daemon on that machine only:

$ hadoop-daemon.sh start datanode

Note: Passwordless ssh must be set up if you want to start all the daemons on all the nodes from one machine.
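
When wrapping the per-service scripts in your own script, the order matters a little. The sketch below assumes the conventional order: HDFS before YARN when starting, as in the commands earlier in this article, and the reverse when stopping. That stop order is my assumption rather than something Hadoop enforces, and cluster_scripts is a name invented for this example. It only prints the script names, so nothing is started or stopped until you run them.

```shell
# Print the scripts to run, one per line, for starting or stopping the cluster.
# Start: HDFS first, then YARN. Stop: YARN first, then HDFS (assumed order).
cluster_scripts() {
  case "$1" in
    start) printf '%s\n' start-dfs.sh start-yarn.sh ;;
    stop)  printf '%s\n' stop-yarn.sh stop-dfs.sh ;;
    *)     echo "usage: cluster_scripts start|stop" >&2; return 2 ;;
  esac
}
```

Usage: for script in $(cluster_scripts stop); do "$script"; done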

References

http://codesfusion.blogspot.in/2013/10/setup-hadoop-2x-220-on-ubuntu.html

http://stackoverflow.com/questions/17569423/what-is-best-way-to-start-and-stop-hadoop-ecosystem

Saturday, 8 August 2015
