
Installing Hadoop on a Single Node Cluster – A Walkthrough


This article gives a step-by-step walkthrough for creating a single-node Hadoop cluster. It is a hands-on tutorial, so even a novice user can follow the steps and build the cluster.

The author of this article, Sumod Pawgi, is a project manager and technology enthusiast working at Persistent Systems. This article can also be found on his blog: http://spawgi.wordpress.com/

Setup – Ubuntu 12.04 VM
Details of VM
- VirtualBox, 32-bit
- Ubuntu 12.04
- RAM: 1 GB
- HDD: 40 GB

Details of Java on VM
- OpenJDK 1.6 – IcedTea
Reference Used
Michael Noll’s tutorial

My login was 'sumod', and I am part of the 'root' group. We will create a dedicated 'hadoop' user who belongs to a 'hadoop' group.


sumod@sumod-hadoop:~$ sudo groupadd hadoop
sumod@sumod-hadoop:~$ sudo useradd -g hadoop hadoop
sumod@sumod-hadoop:~$ sudo passwd hadoop
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
sumod@sumod-hadoop:~$

Note that you may need to create a home directory for the 'hadoop' user by giving the -d option. You may also need to change the user's login shell to one of your choice.
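For example, a minimal sketch; the home directory location and bash shell here are just one reasonable choice, not something the walkthrough requires:

sudo useradd -g hadoop -m -d /home/hadoop -s /bin/bash hadoop
# If the user already exists, the shell can be changed separately:
sudo chsh -s /bin/bash hadoop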

Enable passwordless SSH for your Hadoop nodes.
If SSH is not already installed on your system, you can install it using the command
sudo apt-get install openssh-server

Create an SSH key for the 'hadoop' user:
sumod@sumod-hadoop:~$ su -l hadoop
Password:
hadoop@sumod-hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
d9:22:2b:d2:04:f4:f8:97:d7:8c:52:19:76:7e:8d:d6 hadoop@sumod-hadoop
The key's randomart image is:
(You will see an ASCII randomart image here.)

We are creating an RSA key, as indicated by the '-t' flag. Normally we should not leave the passphrase empty; it is done here so that the Hadoop processes can interact with the node without prompting for a password.

We need to indicate that the public key is authorized for SSH access. This is done with the following command:
hadoop@sumod-hadoop:~/.ssh$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
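If the passwordless login still prompts for a password later, the usual culprit is permissions on the .ssh directory; a hedged fix (these are standard OpenSSH requirements rather than anything Hadoop-specific):

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys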

Let’s now test our setup.
hadoop@sumod-hadoop:~/.ssh$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is f9:be:8b:17:5a:8a:95:13:fa:96:22:c2:45:2b:08:cf.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-27-generic-pae i686)
This gives a warning about the 'unknown' host. If you accept it and go ahead, the host is added to the known_hosts file in your .ssh directory. After this, you can verify again that you are able to log in as the 'hadoop' user without being asked for a password.

Disable IPv6
I wanted to disable IPv6 only for Hadoop and not for the whole system, so I chose to update the hadoop-env.sh file later, after installing Hadoop.
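For reference, the change made there later is the commonly used JVM flag that makes Hadoop prefer IPv4 (the same Hadoop-only approach suggested in Michael Noll's tutorial); add this line to conf/hadoop-env.sh:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true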

Hadoop Installation
We will use the Apache Hadoop 1.0.x line, which was the latest stable release at the time of writing.
This was the mirror suggested to me -
http://apache.techartifact.com/mirror/hadoop/common/
We will select version 1.0.3 in the tar.gz file format.
The complete link location is -
http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
We will put it in /usr/local directory.
These are the commands in sequence; they could easily be put into a script (see the sketch after these commands).
As my 'hadoop' user was not in the sudoers list but 'sumod' was, I used the 'sumod' user to fetch the tar.gz file.
sumod@sumod-hadoop:/usr/local$ sudo wget http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
We will untar the archive.
sumod@sumod-hadoop:/usr/local$ sudo tar -zxvf hadoop-1.0.3.tar.gz
We now have the directory hadoop-1.0.3. I will not rename it so that I always know the version number.
Let’s change the ownership of the installation.
sumod@sumod-hadoop:/usr/local$ sudo chown -R hadoop:hadoop hadoop-1.0.3
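As mentioned above, these steps fit naturally into a small script. A minimal sketch using the same version and mirror as above (both may need updating):

#!/bin/bash
# Fetch, extract, and hand ownership of Hadoop to the 'hadoop' user.
set -e
HADOOP_VERSION=1.0.3
MIRROR=http://apache.techartifact.com/mirror/hadoop/common
cd /usr/local
sudo wget ${MIRROR}/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
sudo tar -zxvf hadoop-${HADOOP_VERSION}.tar.gz
sudo chown -R hadoop:hadoop hadoop-${HADOOP_VERSION}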

We will now set HADOOP_HOME, JAVA_HOME and add HADOOP_HOME to the path by editing .bashrc of the ‘hadoop’ user.
#Add HADOOP_HOME, JAVA_HOME and update PATH
export HADOOP_HOME="/usr/local/hadoop-1.0.3"
export JAVA_HOME="/usr/lib/jvm/java-6-openjdk-i386"
export PATH=$PATH:$HADOOP_HOME/bin

If these changes do not take effect when you switch user to hadoop or when you ssh in, add the following line to the .bash_profile file in your home directory (create the file if it does not exist):
source $HOME/.bashrc

Configuration
We need to configure the JAVA_HOME variable for the Hadoop environment as well. The configuration files live in the 'conf' subdirectory of the installation, while the executables are in the 'bin' subdirectory.
The important files in the 'conf' directory are hadoop-env.sh, hdfs-site.xml, core-site.xml, and mapred-site.xml.

hadoop-env.sh – Open the hadoop-env.sh file. It says at the top that Hadoop-specific environment variables are stored here. The only required variable is JAVA_HOME. In this file the variable is already defined, but the line is commented out. Uncomment and edit the line to set JAVA_HOME. In our case:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386

conf/*-site.xml – The earlier hadoop-site.xml file is now replaced with three separate settings files – core-site.xml, hdfs-site.xml, and mapred-site.xml. The main parameters that you need to refer to or modify in these three files are:
core-site.xml – hadoop.tmp.dir, fs.default.name
hdfs-site.xml – dfs.replication
mapred-site.xml – mapred.job.tracker

hadoop.tmp.dir is used as the base temporary directory for both the local file system and HDFS. We will use the directory '/app/hadoop/tmp' (same as Michael Noll). We need to create the directory and change its ownership.
sumod@sumod-hadoop:~$ sudo mkdir -p /app/hadoop/tmp
[sudo] password for sumod:
sumod@sumod-hadoop:~$ sudo chown hadoop:hadoop /app/hadoop/tmp
sumod@sumod-hadoop:~$ 
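You may also want to tighten the permissions on this directory so that only the 'hadoop' user and group can use it; an optional step:

sudo chmod 750 /app/hadoop/tmp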

In the configuration files, add the properties mentioned above.

conf/core-site.xml


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>Default file system URI.
URI:scheme://authority/path
scheme:method of access
authority:host,port etc.</description>
</property>
</configuration>

conf/hdfs-site.xml


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. Usually 3;
1 in our case.
</description>
</property>
</configuration>

conf/mapred-site.xml


<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>Host and port of the JobTracker. As we use localhost,
all map and reduce tasks run on this single node.</description>
</property>
</configuration>

Take some time to think about why we are using these parameters and what their purpose is. Remember that HDFS is a virtual file system on top of the actual local file system: virtual in the sense that, to the user, the different nodes do not appear separately. To the end user, HDFS appears as one homogeneous file system.

Now that we have downloaded, extracted, and configured Hadoop, it is time to start using the installation. The first step is to format the NameNode. This initializes the filesystem metadata (FSNamesystem) in the directory specified by the 'dfs.name.dir' property. It also writes a VERSION file that records the namespace ID of this instance, the creation time (ctime), and the layout version. If you format the NameNode, you also have to clean up the DataNodes; note that if you are just adding new DataNodes to the cluster, you do not need to format the NameNode.
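For reference, the VERSION file ends up under the name directory (with our hadoop.tmp.dir, that is /app/hadoop/tmp/dfs/name/current/VERSION) and looks roughly like the sketch below; the namespaceID and layoutVersion values are placeholders, not taken from this run:

namespaceID=123456789
cTime=0
storageType=NAME_NODE
layoutVersion=-32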

Format HDFS system via NameNode

I gave the command – $hadoop namenode -format
I got the warning '$HADOOP_HOME is deprecated', so I am going to make the following change in the hadoop-env.sh file.
export HADOOP_HOME_WARN_SUPPRESS="TRUE"
If you get any exceptions related to the XML files, check that you have properly closed all the tags.
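One quick way to catch such XML mistakes, assuming the xmllint tool (from the libxml2-utils package) is installed:

xmllint --noout $HADOOP_HOME/conf/core-site.xml
xmllint --noout $HADOOP_HOME/conf/hdfs-site.xml
xmllint --noout $HADOOP_HOME/conf/mapred-site.xml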
This is the output on my machine.
hadoop@sumod-hadoop:~$ hadoop namenode -format
12/09/08 01:36:15 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = sumod-hadoop/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.3
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012
************************************************************/
12/09/08 01:36:15 INFO util.GSet: VM type = 32-bit
12/09/08 01:36:15 INFO util.GSet: 2% max memory = 19.33375 MB
12/09/08 01:36:15 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/09/08 01:36:15 INFO util.GSet: recommended=4194304, actual=4194304
12/09/08 01:36:16 INFO namenode.FSNamesystem: fsOwner=hadoop
12/09/08 01:36:16 INFO namenode.FSNamesystem: supergroup=supergroup
12/09/08 01:36:16 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/09/08 01:36:16 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/09/08 01:36:16 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/09/08 01:36:16 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/09/08 01:36:16 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/09/08 01:36:16 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
12/09/08 01:36:16 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at sumod-hadoop/127.0.1.1
************************************************************/
hadoop@sumod-hadoop:~$

Start your cluster
If everything has gone well so far, start the single-node cluster.
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ ./start-all.sh
starting namenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-namenode-sumod-hadoop.out
localhost: starting datanode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-datanode-sumod-hadoop.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-secondarynamenode-sumod-hadoop.out
starting jobtracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-jobtracker-sumod-hadoop.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-tasktracker-sumod-hadoop.out
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$

Use jps to make sure all services are running as expected.
Note that if jps is not found in your version of OpenJDK, you can install the full JDK and then use jps: run 'sudo apt-get install openjdk-6-jdk'. I updated my JDK while Hadoop was running and Hadoop was not affected, but I do not advise that.
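If you would rather not touch the JDK while the cluster is running, a rough stand-in for jps is to grep the process list for the daemon class names (a workaround, not part of the original walkthrough):

ps aux | grep -E 'NameNode|DataNode|JobTracker|TaskTracker' | grep -v grep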
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ jps
9855 Jps
9488 SecondaryNameNode
9575 JobTracker
9810 TaskTracker
9266 DataNode
9053 NameNode

We can also use netstat to make sure the Hadoop Java processes are listening on their configured ports.
sumod@sumod-hadoop:~$ sudo netstat -nlp | grep java | grep 54310
tcp6 0 0 127.0.0.1:54310 :::* LISTEN 3366/java
sumod@sumod-hadoop:~$ sudo netstat -nlp | grep java | grep 54311
tcp6 0 0 127.0.0.1:54311 :::* LISTEN 3908/java
sumod@sumod-hadoop:~$

Stopping the cluster
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ ./stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$

In this part, we will see how to run a sample MapReduce (MR) job. We will run the WordCount example, which counts the number of times each word appears in the input and writes the counts out as text files.

We will download books from Project Gutenberg to serve as inputs. I have selected the following books and downloaded them in Plain Text UTF-8 format (a hedged download sketch follows the list).
1. The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
2. Pride and Prejudice by Jane Austen
3. Ulysses by James Joyce
4. War and Peace by graf Leo Tolstoy
5. Anna Karenina by graf Leo Tolstoy
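A sketch of how these can be fetched from the command line; the ebook numbers match the file names used below, but the exact download URLs on Project Gutenberg may differ, so treat the URL pattern as an assumption:

mkdir -p /tmp/wcinput && cd /tmp/wcinput
for id in 1342 1399 1661 2600 4300; do
wget http://www.gutenberg.org/cache/epub/${id}/pg${id}.txt
done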

I have created the input directory at '/tmp/wcinput'. The total size of the files is 7.9M:
sumod@sumod-hadoop:/tmp/wcinput$ ls -lh
total 7.9M
-rw-rw-r-- 1 sumod sumod 701K Sep 8 16:22 pg1342.txt
-rw-rw-r-- 1 sumod sumod 2.0M Sep 8 16:26 pg1399.txt
-rw-rw-r-- 1 sumod sumod 581K Sep 8 16:20 pg1661.txt
-rw-rw-r-- 1 sumod sumod 3.2M Sep 8 16:25 pg2600.txt
-rw-rw-r-- 1 sumod sumod 1.6M Sep 8 16:23 pg4300.txt
sumod@sumod-hadoop:/tmp/wcinput$

hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ ./start-all.sh

Copy Local Files to HDFS
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ hadoop fs -mkdir /user/hadoop/wcinput
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ hadoop fs -put /tmp/wcinput/pg*.txt /user/hadoop/wcinput
hadoop@sumod-hadoop:/tmp/wcinput$ hadoop fs -ls /user/hadoop/wcinput
Found 5 items
-rw-r--r-- 1 hadoop supergroup 717571 2012-09-08 16:36 /user/hadoop/wcinput/pg1342.txt
-rw-r--r-- 1 hadoop supergroup 2039777 2012-09-08 16:36 /user/hadoop/wcinput/pg1399.txt
-rw-r--r-- 1 hadoop supergroup 594933 2012-09-08 16:36 /user/hadoop/wcinput/pg1661.txt
-rw-r--r-- 1 hadoop supergroup 3288746 2012-09-08 16:36 /user/hadoop/wcinput/pg2600.txt
-rw-r--r-- 1 hadoop supergroup 1573150 2012-09-08 16:36 /user/hadoop/wcinput/pg4300.txt
hadoop@sumod-hadoop:/tmp/wcinput$

Run the MapReduce job
We will run the MR job using the examples jar file, with WordCount as the main class. The command format to run a Hadoop job is:
$ hadoop jar <jar-file> <main-class> <input-path> <output-path>

hadoop@sumod-hadoop:~$ hadoop jar $HADOOP_HOME/hadoop*examples*.jar wordcount /user/hadoop/wcinput /user/hadoop/wcoutput
12/09/08 16:49:02 INFO input.FileInputFormat: Total input paths to process : 5
12/09/08 16:49:02 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/09/08 16:49:02 WARN snappy.LoadSnappy: Snappy native library not loaded
12/09/08 16:49:03 INFO mapred.JobClient: Running job: job_201209081630_0001
12/09/08 16:49:04 INFO mapred.JobClient: map 0% reduce 0%
12/09/08 16:49:26 INFO mapred.JobClient: map 28% reduce 0%
12/09/08 16:49:29 INFO mapred.JobClient: map 40% reduce 0%
12/09/08 16:49:44 INFO mapred.JobClient: map 80% reduce 0%
12/09/08 16:49:50 INFO mapred.JobClient: map 100% reduce 13%
12/09/08 16:49:59 INFO mapred.JobClient: map 100% reduce 100%
12/09/08 16:50:04 INFO mapred.JobClient: Job complete: job_201209081630_0001
12/09/08 16:50:04 INFO mapred.JobClient: Counters: 29
12/09/08 16:50:04 INFO mapred.JobClient: Job Counters
12/09/08 16:50:04 INFO mapred.JobClient: Launched reduce tasks=1
12/09/08 16:50:04 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=69131
12/09/08 16:50:04 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/08 16:50:04 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/09/08 16:50:04 INFO mapred.JobClient: Launched map tasks=5
12/09/08 16:50:04 INFO mapred.JobClient: Data-local map tasks=5
12/09/08 16:50:04 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=28198
12/09/08 16:50:04 INFO mapred.JobClient: File Output Format Counters
12/09/08 16:50:04 INFO mapred.JobClient: Bytes Written=1089803
12/09/08 16:50:04 INFO mapred.JobClient: FileSystemCounters
12/09/08 16:50:04 INFO mapred.JobClient: FILE_BYTES_READ=4523145
12/09/08 16:50:04 INFO mapred.JobClient: HDFS_BYTES_READ=8214767
12/09/08 16:50:04 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6994566
12/09/08 16:50:04 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1089803
12/09/08 16:50:04 INFO mapred.JobClient: File Input Format Counters
12/09/08 16:50:04 INFO mapred.JobClient: Bytes Read=8214177
12/09/08 16:50:04 INFO mapred.JobClient: Map-Reduce Framework
12/09/08 16:50:04 INFO mapred.JobClient: Map output materialized bytes=2341906
12/09/08 16:50:04 INFO mapred.JobClient: Map input records=168458
12/09/08 16:50:04 INFO mapred.JobClient: Reduce shuffle bytes=2341906
12/09/08 16:50:04 INFO mapred.JobClient: Spilled Records=468932
12/09/08 16:50:04 INFO mapred.JobClient: Map output bytes=13667264
12/09/08 16:50:04 INFO mapred.JobClient: CPU time spent (ms)=12680
12/09/08 16:50:04 INFO mapred.JobClient: Total committed heap usage (bytes)=818434048
12/09/08 16:50:04 INFO mapred.JobClient: Combine input records=1478686
12/09/08 16:50:04 INFO mapred.JobClient: SPLIT_RAW_BYTES=590
12/09/08 16:50:04 INFO mapred.JobClient: Reduce input records=159771
12/09/08 16:50:04 INFO mapred.JobClient: Reduce input groups=97063
12/09/08 16:50:04 INFO mapred.JobClient: Combine output records=220205

Let’s check the result of the run.
hadoop@sumod-hadoop:/tmp/wcinput$ hadoop fs -ls /user/hadoop/wcoutput
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2012-09-08 16:49 /user/hadoop/wcoutput/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2012-09-08 16:49 /user/hadoop/wcoutput/_logs
-rw-r--r-- 1 hadoop supergroup 1089803 2012-09-08 16:49 /user/hadoop/wcoutput/part-r-00000
hadoop@sumod-hadoop:/tmp/wcinput$

You can see that the job run was a success: there is one output file (part-r-00000), a _logs directory, and an empty _SUCCESS marker file that indicates the job completed successfully.

Note the way I ran the jar file. Sometimes people run the job from the Hadoop folder and give only the name of the jar. I have chosen to run the job from my home directory and reference $HADOOP_HOME so that Hadoop can locate the jar file correctly.

You can specify job parameters on the command line using the '-D' option followed by the key=value format.
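For example, to ask for two reduce tasks (the wcoutput2 path is just a hypothetical second run; in Hadoop 1.x the WordCount example passes such generic -D options through to the job configuration):

hadoop jar $HADOOP_HOME/hadoop*examples*.jar wordcount -D mapred.reduce.tasks=2 /user/hadoop/wcinput /user/hadoop/wcoutput2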

View the result of the MR job using HDFS
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ hadoop fs -cat /user/hadoop/wcoutput/part-r-00000
This is a sample of the output we can see on screen.
" 1
"'A 1
"'About 1
"'Absolute 1
"'After 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Anna, 1
"'Arthur!' 1

Note that the quotes do not have much significance from Hadoop's point of view; they are simply a result of how the string tokenizer splits the text.
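If you want the complete result as a local file instead of scrolling it on screen, one option is to copy it out of HDFS (the local path here is just an example):

hadoop fs -get /user/hadoop/wcoutput/part-r-00000 /tmp/wordcounts.txt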

Hadoop Web Interfaces
According to Michael Noll's tutorial, the web interface settings are documented in the file conf/hadoop-default.xml. However, in my particular setup the WebUI settings for the NameNode and JobTracker daemons were found under src/packages/templates/conf, in hdfs-site.xml and mapred-site.xml respectively, and the setting for the TaskTracker daemon was found in src/mapred/mapred-default.xml. The web URLs are:
NameNode daemon – http://localhost:50070/ (http://localhost:50070/dfshealth.jsp)
JobTracker daemon – http://localhost:50030/ (http://localhost:50030/jobtracker.jsp)
TaskTracker daemon – http://localhost:50060/ (http://localhost:50060/tasktracker.jsp)

Screenshots for the web interfaces:
[Screenshot of the NameNode web interface]
[Screenshot of the JobTracker web interface]
[Screenshot of the TaskTracker web interface]

Using the NameNode web interface, we can browse the Hadoop file system and logs; this is the HDFS layer of the system. Using the JobTracker we can see the job history, and using the TaskTracker web interface we can view the log files; JobTracker and TaskTracker belong to the MapReduce layer. We can also view the number of map and reduce tasks scheduled. Using the NameNode, we can view the input and output files and the status of the nodes. In my setup the default block size is 64 MB, which is the standard default for Hadoop 1.x; later Hadoop versions moved to a 128 MB default.
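If you ever need a different block size, it can be overridden in conf/hdfs-site.xml with the dfs.block.size property (the Hadoop 1.x name); the 128 MB value below is only an illustration, not something this walkthrough requires:

<property>
<name>dfs.block.size</name>
<value>134217728</value>
<description>HDFS block size in bytes (128 MB here).</description>
</property>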

Well, that was pretty much everything about setting up Hadoop on a single-node Ubuntu cluster. Thanks to Michael Noll for his helpful tutorial, which is a fantastic reference. My goal is to provide more of a workshop than a tutorial, so I plan to experiment with the system further and update the blog. Thanks for reading!

Next topics here:

Map Reduce Introduction and internals

Word Count Without Mapper and Reducer

WordCount With Custom Mapper and Reducer


