
Installing Hadoop on a Single Node Cluster – A Walkthrough


This article gives a step-by-step walkthrough for creating a single-node Hadoop cluster. It is a hands-on tutorial, so even a novice user can follow the steps and build the cluster.

The author of this article, Sumod Pawgi, is a project manager and technology enthusiast working at Persistent Systems. This article can also be found on his blog: http://spawgi.wordpress.com/

Setup – Ubuntu 12.04 VM
Details of VM
- VirtualBox, 32-bit
- Ubuntu 12.04
- RAM: 1 GB
- HDD: 40 GB

Details of Java on VM
- OpenJDK 1.6 – IcedTea
Reference Used
Michael Noll’s tutorial

My login was 'sumod', and I am part of the 'root' group. We will create a dedicated 'hadoop' user who belongs to a 'hadoop' group.


sumod@sumod-hadoop:~$ sudo groupadd hadoop
sumod@sumod-hadoop:~$ sudo useradd -g hadoop hadoop
sumod@sumod-hadoop:~$ sudo passwd hadoop
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
sumod@sumod-hadoop:~$

Note that you may need to create a home directory for the 'hadoop' user by giving the -d option. You may also need to change the user's login shell to one of your choice.
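For example, a minimal sketch; the home directory location and bash shell here are just one reasonable choice, not something the walkthrough requires:

sudo useradd -g hadoop -m -d /home/hadoop -s /bin/bash hadoop
# If the user already exists, the shell can be changed separately:
sudo chsh -s /bin/bash hadoop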

Enable passwordless SSH for your Hadoop nodes.
If SSH is not already installed on your system, you can install it using the command
sudo apt-get install openssh-server

Create an SSH key for the 'hadoop' user:
sumod@sumod-hadoop:~$ su -l hadoop
Password:
hadoop@sumod-hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
d9:22:2b:d2:04:f4:f8:97:d7:8c:52:19:76:7e:8d:d6 hadoop@sumod-hadoop
The key's randomart image is:
(You will see an ASCII randomart image here.)

We are creating an RSA key, as indicated by the '-t' flag. Normally we should not leave the passphrase empty; it is done here so that the Hadoop processes can interact with the node without prompting for a password.

We need to indicate that the public key is authorized for SSH access. This is done with the following command:
hadoop@sumod-hadoop:~/.ssh$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
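If the passwordless login still prompts for a password later, the usual culprit is permissions on the .ssh directory; a hedged fix (these are standard OpenSSH requirements rather than anything Hadoop-specific):

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys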

Let’s now test our setup.
hadoop@sumod-hadoop:~/.ssh$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is f9:be:8b:17:5a:8a:95:13:fa:96:22:c2:45:2b:08:cf.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-27-generic-pae i686)
This gives a warning about the 'unknown' host. If you accept it and go ahead, the host is added to the known_hosts file in your .ssh directory. After this, you can verify again that you are able to log in as the 'hadoop' user without being asked for a password.

Disable IPv6
I wanted to disable IPv6 only for Hadoop and not for the whole system, so I chose to update the hadoop-env.sh file later, after installing Hadoop.
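For reference, the change made there later is the commonly used JVM flag that makes Hadoop prefer IPv4 (the same Hadoop-only approach suggested in Michael Noll's tutorial); add this line to conf/hadoop-env.sh:

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true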

Hadoop Installation
We will use the Apache Hadoop 1.0.x line, which was the latest stable release at the time of writing.
This was the mirror suggested to me -
http://apache.techartifact.com/mirror/hadoop/common/
We will select version 1.0.3 in the tar.gz file format.
The complete link location is -
http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
We will put it in /usr/local directory.
These are the commands in sequence; they could easily be put into a script (see the sketch after these commands).
As my 'hadoop' user was not in the sudoers list but 'sumod' was, I used the 'sumod' user to fetch the tar.gz file.
sumod@sumod-hadoop:/usr/local$ sudo wget http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
We will untar the archive.
sumod@sumod-hadoop:/usr/local$ sudo tar -zxvf hadoop-1.0.3.tar.gz
We now have the directory hadoop-1.0.3. I will not rename it so that I always know the version number.
Let’s change the ownership of the installation.
sumod@sumod-hadoop:/usr/local$ sudo chown -R hadoop:hadoop hadoop-1.0.3
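As mentioned above, these steps fit naturally into a small script. A minimal sketch using the same version and mirror as above (both may need updating):

#!/bin/bash
# Fetch, extract, and hand ownership of Hadoop to the 'hadoop' user.
set -e
HADOOP_VERSION=1.0.3
MIRROR=http://apache.techartifact.com/mirror/hadoop/common
cd /usr/local
sudo wget ${MIRROR}/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
sudo tar -zxvf hadoop-${HADOOP_VERSION}.tar.gz
sudo chown -R hadoop:hadoop hadoop-${HADOOP_VERSION}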

We will now set HADOOP_HOME, JAVA_HOME and add HADOOP_HOME to the path by editing .bashrc of the ‘hadoop’ user.
#Add HADOOP_HOME, JAVA_HOME and update PATH
export HADOOP_HOME="/usr/local/hadoop-1.0.3"
export JAVA_HOME="/usr/lib/jvm/java-6-openjdk-i386"
export PATH=$PATH:$HADOOP_HOME/bin

If these changes do not take effect when you switch user to hadoop or when you ssh in, add the following line to the .bash_profile file in your home directory (create the file if it does not exist):
source $HOME/.bashrc

Configuration
We need to configure the JAVA_HOME variable for the Hadoop environment as well. The configuration files live in the 'conf' subdirectory of the installation, while the executables are in the 'bin' subdirectory.
The important files in the 'conf' directory are hadoop-env.sh, hdfs-site.xml, core-site.xml, and mapred-site.xml.

hadoop-env.sh – Open the hadoop-env.sh file. It says at the top that Hadoop-specific environment variables are stored here. The only required variable is JAVA_HOME. In this file the variable is already defined, but the line is commented out. Uncomment and edit the line to set JAVA_HOME. In our case:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386

conf/*-site.xml – The earlier hadoop-site.xml file is now replaced with three separate settings files – core-site.xml, hdfs-site.xml, and mapred-site.xml. The main parameters that you need to refer to or modify in these three files are:
core-site.xml – hadoop.tmp.dir, fs.default.name
hdfs-site.xml – dfs.replication
mapred-site.xml – mapred.job.tracker

hadoop.tmp.dir is used as the base temporary directory for both the local file system and HDFS. We will use the directory '/app/hadoop/tmp' (same as Michael Noll). We need to create the directory and change its ownership.
sumod@sumod-hadoop:~$ sudo mkdir -p /app/hadoop/tmp
[sudo] password for sumod:
sumod@sumod-hadoop:~$ sudo chown hadoop:hadoop /app/hadoop/tmp
sumod@sumod-hadoop:~$ 
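You may also want to tighten the permissions on this directory so that only the 'hadoop' user and group can use it; an optional step:

sudo chmod 750 /app/hadoop/tmp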

In the configuration files, add the properties mentioned above.

conf/core-site.xml


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>Default file system URI.
URI:scheme://authority/path
scheme:method of access
authority:host,port etc.</description>
</property>
</configuration>

conf/hdfs-site.xml


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. Usually 3;
1 in our case.
</description>
</property>
</configuration>

conf/mapred-site.xml


<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>Host and port of the JobTracker. As we use localhost,
all map and reduce tasks run on this single node.</description>
</property>
</configuration>

Take some time to think about why we are using these parameters and what their purpose is. Remember that HDFS is a virtual file system on top of the actual local file system: virtual in the sense that, to the user, the different nodes do not appear separately. To the end user, HDFS appears as one homogeneous file system.

Now that we have downloaded, extracted, and configured Hadoop, it is time to start using the installation. The first step is to format the NameNode. This initializes the filesystem metadata (FSNamesystem) in the directory specified by the 'dfs.name.dir' property. It also writes a VERSION file that records the namespace ID of this instance, the creation time (ctime), and the layout version. If you format the NameNode, you also have to clean up the DataNodes; note that if you are just adding new DataNodes to the cluster, you do not need to format the NameNode.
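For reference, the VERSION file ends up under the name directory (with our hadoop.tmp.dir, that is /app/hadoop/tmp/dfs/name/current/VERSION) and looks roughly like the sketch below; the namespaceID and layoutVersion values are placeholders, not taken from this run:

namespaceID=123456789
cTime=0
storageType=NAME_NODE
layoutVersion=-32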

Format HDFS system via NameNode

I gave the command – $hadoop namenode -format
I got the warning '$HADOOP_HOME is deprecated', so I am going to make the following change in the hadoop-env.sh file.
export HADOOP_HOME_WARN_SUPPRESS="TRUE"
If you get any exceptions related to the XML files, check that you have properly closed all the tags.
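One quick way to catch such XML mistakes, assuming the xmllint tool (from the libxml2-utils package) is installed:

xmllint --noout $HADOOP_HOME/conf/core-site.xml
xmllint --noout $HADOOP_HOME/conf/hdfs-site.xml
xmllint --noout $HADOOP_HOME/conf/mapred-site.xml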
This is the output on my machine.
hadoop@sumod-hadoop:~$ hadoop namenode -format
12/09/08 01:36:15 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = sumod-hadoop/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.3
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012
************************************************************/
12/09/08 01:36:15 INFO util.GSet: VM type = 32-bit
12/09/08 01:36:15 INFO util.GSet: 2% max memory = 19.33375 MB
12/09/08 01:36:15 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/09/08 01:36:15 INFO util.GSet: recommended=4194304, actual=4194304
12/09/08 01:36:16 INFO namenode.FSNamesystem: fsOwner=hadoop
12/09/08 01:36:16 INFO namenode.FSNamesystem: supergroup=supergroup
12/09/08 01:36:16 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/09/08 01:36:16 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/09/08 01:36:16 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/09/08 01:36:16 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/09/08 01:36:16 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/09/08 01:36:16 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
12/09/08 01:36:16 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at sumod-hadoop/127.0.1.1
************************************************************/
hadoop@sumod-hadoop:~$

Start your cluster
If everything has gone well so far, start the single-node cluster.
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ ./start-all.sh
starting namenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-namenode-sumod-hadoop.out
localhost: starting datanode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-datanode-sumod-hadoop.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-secondarynamenode-sumod-hadoop.out
starting jobtracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-jobtracker-sumod-hadoop.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.0.3/libexec/../logs/hadoop-hadoop-tasktracker-sumod-hadoop.out
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$

Use jps to make sure all services are running as expected.
Note that if jps is not found in your version of OpenJDK, you can install the full JDK and then use jps: run 'sudo apt-get install openjdk-6-jdk'. I updated my JDK while Hadoop was running and Hadoop was not affected, but I do not advise that.
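If you would rather not touch the JDK while the cluster is running, a rough stand-in for jps is to grep the process list for the daemon class names (a workaround, not part of the original walkthrough):

ps aux | grep -E 'NameNode|DataNode|JobTracker|TaskTracker' | grep -v grep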
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ jps
9855 Jps
9488 SecondaryNameNode
9575 JobTracker
9810 TaskTracker
9266 DataNode
9053 NameNode

We can also use netstat to make sure the Hadoop Java processes are listening on their configured ports.
sumod@sumod-hadoop:~$ sudo netstat -nlp | grep java | grep 54310
tcp6 0 0 127.0.0.1:54310 :::* LISTEN 3366/java
sumod@sumod-hadoop:~$ sudo netstat -nlp | grep java | grep 54311
tcp6 0 0 127.0.0.1:54311 :::* LISTEN 3908/java
sumod@sumod-hadoop:~$

Stopping the cluster
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ ./stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$

In this part, we will see how to run a sample MapReduce (MR) job. We will run the WordCount example, which counts the number of times each word appears in the input and writes the counts out as text files.

We will download books from Project Gutenberg to serve as inputs. I have selected the following books and downloaded them in Plain Text UTF-8 format (a hedged download sketch follows the list).
1. The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
2. Pride and Prejudice by Jane Austen
3. Ulysses by James Joyce
4. War and Peace by graf Leo Tolstoy
5. Anna Karenina by graf Leo Tolstoy
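A sketch of how these can be fetched from the command line; the ebook numbers match the file names used below, but the exact download URLs on Project Gutenberg may differ, so treat the URL pattern as an assumption:

mkdir -p /tmp/wcinput && cd /tmp/wcinput
for id in 1342 1399 1661 2600 4300; do
wget http://www.gutenberg.org/cache/epub/${id}/pg${id}.txt
done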

I have created the input directory at '/tmp/wcinput'. The total size of the files is 7.9M:
sumod@sumod-hadoop:/tmp/wcinput$ ls -lh
total 7.9M
-rw-rw-r-- 1 sumod sumod 701K Sep 8 16:22 pg1342.txt
-rw-rw-r-- 1 sumod sumod 2.0M Sep 8 16:26 pg1399.txt
-rw-rw-r-- 1 sumod sumod 581K Sep 8 16:20 pg1661.txt
-rw-rw-r-- 1 sumod sumod 3.2M Sep 8 16:25 pg2600.txt
-rw-rw-r-- 1 sumod sumod 1.6M Sep 8 16:23 pg4300.txt
sumod@sumod-hadoop:/tmp/wcinput$

hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ ./start-all.sh

Copy Local Files to HDFS
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ hadoop fs -mkdir /user/hadoop/wcinput
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ hadoop fs -put /tmp/wcinput/pg*.txt /user/hadoop/wcinput
hadoop@sumod-hadoop:/tmp/wcinput$ hadoop fs -ls /user/hadoop/wcinput
Found 5 items
-rw-r--r-- 1 hadoop supergroup 717571 2012-09-08 16:36 /user/hadoop/wcinput/pg1342.txt
-rw-r--r-- 1 hadoop supergroup 2039777 2012-09-08 16:36 /user/hadoop/wcinput/pg1399.txt
-rw-r--r-- 1 hadoop supergroup 594933 2012-09-08 16:36 /user/hadoop/wcinput/pg1661.txt
-rw-r--r-- 1 hadoop supergroup 3288746 2012-09-08 16:36 /user/hadoop/wcinput/pg2600.txt
-rw-r--r-- 1 hadoop supergroup 1573150 2012-09-08 16:36 /user/hadoop/wcinput/pg4300.txt
hadoop@sumod-hadoop:/tmp/wcinput$

Run the MapReduce job
We will run the MR job using the examples jar file, with WordCount as the main class. The command format to run a Hadoop job is:
$ hadoop jar <jar-file> <main-class> <input-path> <output-path>

hadoop@sumod-hadoop:~$ hadoop jar $HADOOP_HOME/hadoop*examples*.jar wordcount /user/hadoop/wcinput /user/hadoop/wcoutput
12/09/08 16:49:02 INFO input.FileInputFormat: Total input paths to process : 5
12/09/08 16:49:02 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/09/08 16:49:02 WARN snappy.LoadSnappy: Snappy native library not loaded
12/09/08 16:49:03 INFO mapred.JobClient: Running job: job_201209081630_0001
12/09/08 16:49:04 INFO mapred.JobClient: map 0% reduce 0%
12/09/08 16:49:26 INFO mapred.JobClient: map 28% reduce 0%
12/09/08 16:49:29 INFO mapred.JobClient: map 40% reduce 0%
12/09/08 16:49:44 INFO mapred.JobClient: map 80% reduce 0%
12/09/08 16:49:50 INFO mapred.JobClient: map 100% reduce 13%
12/09/08 16:49:59 INFO mapred.JobClient: map 100% reduce 100%
12/09/08 16:50:04 INFO mapred.JobClient: Job complete: job_201209081630_0001
12/09/08 16:50:04 INFO mapred.JobClient: Counters: 29
12/09/08 16:50:04 INFO mapred.JobClient: Job Counters
12/09/08 16:50:04 INFO mapred.JobClient: Launched reduce tasks=1
12/09/08 16:50:04 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=69131
12/09/08 16:50:04 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/08 16:50:04 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/09/08 16:50:04 INFO mapred.JobClient: Launched map tasks=5
12/09/08 16:50:04 INFO mapred.JobClient: Data-local map tasks=5
12/09/08 16:50:04 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=28198
12/09/08 16:50:04 INFO mapred.JobClient: File Output Format Counters
12/09/08 16:50:04 INFO mapred.JobClient: Bytes Written=1089803
12/09/08 16:50:04 INFO mapred.JobClient: FileSystemCounters
12/09/08 16:50:04 INFO mapred.JobClient: FILE_BYTES_READ=4523145
12/09/08 16:50:04 INFO mapred.JobClient: HDFS_BYTES_READ=8214767
12/09/08 16:50:04 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6994566
12/09/08 16:50:04 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1089803
12/09/08 16:50:04 INFO mapred.JobClient: File Input Format Counters
12/09/08 16:50:04 INFO mapred.JobClient: Bytes Read=8214177
12/09/08 16:50:04 INFO mapred.JobClient: Map-Reduce Framework
12/09/08 16:50:04 INFO mapred.JobClient: Map output materialized bytes=2341906
12/09/08 16:50:04 INFO mapred.JobClient: Map input records=168458
12/09/08 16:50:04 INFO mapred.JobClient: Reduce shuffle bytes=2341906
12/09/08 16:50:04 INFO mapred.JobClient: Spilled Records=468932
12/09/08 16:50:04 INFO mapred.JobClient: Map output bytes=13667264
12/09/08 16:50:04 INFO mapred.JobClient: CPU time spent (ms)=12680
12/09/08 16:50:04 INFO mapred.JobClient: Total committed heap usage (bytes)=818434048
12/09/08 16:50:04 INFO mapred.JobClient: Combine input records=1478686
12/09/08 16:50:04 INFO mapred.JobClient: SPLIT_RAW_BYTES=590
12/09/08 16:50:04 INFO mapred.JobClient: Reduce input records=159771
12/09/08 16:50:04 INFO mapred.JobClient: Reduce input groups=97063
12/09/08 16:50:04 INFO mapred.JobClient: Combine output records=220205

Let’s check the result of the run.
hadoop@sumod-hadoop:/tmp/wcinput$ hadoop fs -ls /user/hadoop/wcoutput
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2012-09-08 16:49 /user/hadoop/wcoutput/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2012-09-08 16:49 /user/hadoop/wcoutput/_logs
-rw-r--r-- 1 hadoop supergroup 1089803 2012-09-08 16:49 /user/hadoop/wcoutput/part-r-00000
hadoop@sumod-hadoop:/tmp/wcinput$

You can see that the job run was a success: there is one output file (part-r-00000), a _logs directory, and an empty _SUCCESS marker file that indicates the job completed successfully.

Note the way I ran the jar file. Sometimes people run the job from the Hadoop folder and give only the name of the jar. I have chosen to run the job from my home directory and reference $HADOOP_HOME so that Hadoop can locate the jar file correctly.

You can specify job parameters on the command line using the '-D' option followed by the key=value format.
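For example, to ask for two reduce tasks (the wcoutput2 path is just a hypothetical second run; in Hadoop 1.x the WordCount example passes such generic -D options through to the job configuration):

hadoop jar $HADOOP_HOME/hadoop*examples*.jar wordcount -D mapred.reduce.tasks=2 /user/hadoop/wcinput /user/hadoop/wcoutput2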

View the result of the MR job using HDFS
hadoop@sumod-hadoop:/usr/local/hadoop-1.0.3/bin$ hadoop fs -cat /user/hadoop/wcoutput/part-r-00000
This is a sample of the output we can see on screen.
" 1
"'A 1
"'About 1
"'Absolute 1
"'After 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Anna, 1
"'Arthur!' 1

Note that the quotes do not have much significance from Hadoop's point of view; they are simply a result of how the string tokenizer splits the text.
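If you want the complete result as a local file instead of scrolling it on screen, one option is to copy it out of HDFS (the local path here is just an example):

hadoop fs -get /user/hadoop/wcoutput/part-r-00000 /tmp/wordcounts.txt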

Hadoop Web Interfaces
According to Michael Noll's tutorial, the web interface settings are documented in the file conf/hadoop-default.xml. However, in my particular setup the WebUI settings for the NameNode and JobTracker daemons were found under src/packages/templates/conf, in hdfs-site.xml and mapred-site.xml respectively, and the setting for the TaskTracker daemon was found in src/mapred/mapred-default.xml. The web URLs are:
NameNode daemon – http://localhost:50070/ (http://localhost:50070/dfshealth.jsp)
JobTracker daemon – http://localhost:50030/ (http://localhost:50030/jobtracker.jsp)
TaskTracker daemon – http://localhost:50060/ (http://localhost:50060/tasktracker.jsp)

Screenshots for the web interfaces:
[Screenshot of the NameNode web interface]
[Screenshot of the JobTracker web interface]
[Screenshot of the TaskTracker web interface]

Using the NameNode web interface, we can browse the Hadoop file system and logs; this is the HDFS layer of the system. Using the JobTracker we can see the job history, and using the TaskTracker web interface we can view the log files; JobTracker and TaskTracker belong to the MapReduce layer. We can also view the number of map and reduce tasks scheduled. Using the NameNode, we can view the input and output files and the status of the nodes. In my setup the default block size is 64 MB, which is the standard default for Hadoop 1.x; later Hadoop versions moved to a 128 MB default.
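If you ever need a different block size, it can be overridden in conf/hdfs-site.xml with the dfs.block.size property (the Hadoop 1.x name); the 128 MB value below is only an illustration, not something this walkthrough requires:

<property>
<name>dfs.block.size</name>
<value>134217728</value>
<description>HDFS block size in bytes (128 MB here).</description>
</property>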

Well, that was pretty much everything about setting up Hadoop on a single-node Ubuntu cluster. Thanks to Michael Noll for his helpful tutorial, which is a fantastic reference. My goal is to provide more of a workshop than a tutorial, so I plan to experiment with the system further and update the blog. Thanks for reading!

Next topics here:

Map Reduce Introduction and internals

Word Count Without Mapper and Reducer

WordCount With Custom Mapper and Reducer


