In this tutorial I am going to walk you through the process of launching a Hadoop cluster. To keep it simple, I am going to launch a small cluster of just two nodes: one master and one worker. I am using Hadoop 3.0.0 for this tutorial. Let’s get started.
Provision EC2 instances
The number of instances depends on your requirements; you can provision as many as your scenario calls for. However, it is a good idea to start with two and, once the setup is complete, use the worker to create an AMI (Amazon Machine Image) from which you can launch as many workers as you like.
I am going to provision two EC2 instances for this tutorial: an m4.xlarge instance for the master and an i2.2xlarge instance for the worker. Again, choose instance types according to your requirements, e.g. how much storage space you need for the data (consider the replication factor when estimating storage) and how much computational power you need to process it.
Install Java and Hadoop
First, update your system. It’s always a good idea!
sudo apt-get update && sudo apt-get dist-upgrade
Next, install Java.
sudo apt-get install openjdk-8-jdk
Now download and install Hadoop.
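If the tarball isn’t already on the instance, it can be fetched from the Apache release archive (URL assumed here for the 3.0.0 release):

```shell
# Download the Hadoop 3.0.0 release tarball from the Apache archive
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
```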
sudo tar zxvf hadoop-3.0.0.tar.gz -C /usr/local/
Rename the directory for easy access (optional)
sudo mv /usr/local/hadoop-3.0.0 /usr/local/hadoop
Setting up Environmental Variables
readlink -f $(which java)
This prints the resolved path to the Java binary; stripping the trailing /bin/java gives you the Java home directory. Let’s set these environment variables permanently so that you don’t have to look them up again.
Add the following to ~/.bashrc and save.
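For example (the Java path below is what readlink typically reports for openjdk-8 on Ubuntu; adjust it to match your own output):

```shell
# Hadoop and Java environment variables for ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
# Put the hadoop binaries and cluster scripts on the PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Run source ~/.bashrc afterwards so the variables take effect in the current shell.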
Configure the Cluster
sudo vim core-site.xml
Make sure that the following properties are set:
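The property you will almost certainly need in core-site.xml is fs.defaultFS, which tells every node where the NameNode lives (the hostname below is a placeholder, following the same convention as the other config snippets in this post; port 9000 is a common choice):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://public_dns_of_the_NameNode:9000</value>
  </property>
</configuration>
```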
sudo vim yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>public_dns_of_the_Resource_Manager</value>
    <description>The hostname of the Resource Manager.</description>
  </property>
</configuration>
Update hadoop-env.sh. We need to set JAVA_HOME and HADOOP_HOME here.
sudo vim hadoop-env.sh
Look for export JAVA_HOME and update it to your java home dir.
Similarly, find export HADOOP_HOME and update it to point to the hadoop directory.
sudo vim hdfs-site.xml
Adjust the replication factor according to the number of nodes you have and the number of replicas you want to keep. The minimum is 1, so setting it to anything below 1 will produce an error. Since we have two nodes and I decided to keep two replicas, I am setting it to 2.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/data/hdfs/namenode</value>
  </property>
</configuration>
The dfs.namenode.name.dir property specifies the path on the local filesystem where the NameNode persistently stores the namespace and transaction logs. Let’s create that directory, or you can point it at an existing directory if you like.
sudo mkdir -p $HADOOP_HOME/data/hdfs/namenode
We need to set up the masters and workers files appropriately. Create masters if it doesn’t already exist. Add the hostname of the master node to the masters file, and the hostname(s) of the DataNode(s) to the workers file.
sudo vim $HADOOP_HOME/etc/hadoop/masters
sudo vim $HADOOP_HOME/etc/hadoop/workers
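For the two-node setup in this tutorial, the files would contain one hostname each (the names below are placeholders for your instances’ hostnames):

```
# masters
hostname_of_the_master

# workers
hostname_of_the_worker
```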
Change the owner of $HADOOP_HOME
sudo chown -R ubuntu $HADOOP_HOME
Setup the MapReduce configuration.
sudo vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following properties:
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>
Create an AMI
Now let’s create an AMI (Amazon Machine Image) to simplify provisioning of the worker nodes.
- Go to your EC2 console.
- Select the instance (master).
- From the Actions menu, choose Image, then Create Image.
Launch Worker Nodes
Launch a worker node from the AMI created in the previous step. Now we need to configure access between nodes so that the nodes can communicate with each other.
To launch an instance from an AMI:
- Click on AMIs under IMAGES in the left menu.
- Select the AMI you want to launch the instance from.
- From the Actions menu, choose Launch.
Configure Access between Cluster Nodes
We can achieve this by generating a public/private key pair on the master.
ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
Here, we used ssh-keygen to generate the key pair. The -f option specifies the output file for the key, -t specifies the type of key to generate, and -P sets the passphrase, which is empty here, hence the "".
Append the generated public key to the authorized_keys file on the master and on every worker node. This is necessary for the master to be able to log in to the workers. On the master itself:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
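The command above only authorizes the key on the master. One way to get it onto a worker is ssh-copy-id (dnode1 is a placeholder hostname; this assumes you can already reach the worker, e.g. with your EC2 key pair):

```shell
# Append the master's public key to the worker's ~/.ssh/authorized_keys
ssh-copy-id -i ~/.ssh/id_rsa.pub ubuntu@dnode1
```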
Configure SSH config
To make things a bit easier in the long run, let’s set up an SSH config to allow seamless login into the worker nodes from the master. SSH programs receive their configuration either from the command line or from configuration files; in general, command-line parameters take precedence, ~/.ssh/config is consulted next, and the global /etc/ssh/ssh_config is checked last.
Host nnode
    HostName public_dns_of_the_NameNode
    User ubuntu
    IdentityFile ~/.ssh/id_rsa

Host dnode1
    HostName public_dns_of_the_DataNode
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
You should now be able to log in without a password by simply doing:
ssh nnode
The first time you bring up HDFS, it must be formatted. Format a new distributed filesystem as hdfs:
hdfs namenode -format
On the worker node, create a directory and change its owner to ubuntu.
sudo mkdir -p $HADOOP_HOME/data/hdfs/datanode
sudo chown -R ubuntu $HADOOP_HOME
Launch Hadoop Cluster
To start a Hadoop cluster you will need to start both the HDFS and YARN cluster. Run the following commands on the master.
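Both daemons are started by the standard scripts shipped in $HADOOP_HOME/sbin:

```shell
# Start the HDFS daemons (NameNode, SecondaryNameNode, DataNodes)
$HADOOP_HOME/sbin/start-dfs.sh
# Start the YARN daemons (ResourceManager, NodeManagers)
$HADOOP_HOME/sbin/start-yarn.sh
```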
or you can simply run start-dfs.sh if you have added $HADOOP_HOME/sbin to your $PATH.
Let’s also start the MapReduce job history daemon, since we’ll run a simple example soon to test that everything is working.
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
We can run the Java Virtual Machine Process Status Tool (jps) to make sure that the services we want are up and running:
jps
On the master you should see daemons such as NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer; on the worker, DataNode and NodeManager.
If some services are missing, it’s a good time to dig into the logs and see what went wrong. Hadoop stores its log files in the logs directory under $HADOOP_HOME.
If you have successfully completed all the steps above and can see the Hadoop and MapReduce processes in the jps output, let’s now run a simple MapReduce example to see things in action.
Let’s generate some data with teragen (200,000 rows of 100 bytes each, i.e. about 20 MB) and store the output on HDFS in /teraInput.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar teragen 200000 /teraInput
Sort the data generated in the last step.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar terasort /teraInput /teraOutput
Validate the results
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar teravalidate /teraOutput /teraValidate
List the content on hdfs.
hadoop fs -ls hdfs:///teraValidate
And that’s all for this post!