How to set up a Hadoop (v. 3.0.0) Cluster using AWS EC2?

In this tutorial I am going to walk you through the process of launching a Hadoop cluster. To keep it simple, I am going to launch a small cluster comprising only two nodes: one master and one worker. I am using Hadoop 3.0.0 for this tutorial. Let's get started.

Provision EC2 instances

The number of instances depends on your requirements. You can provision as many instances as your particular scenario calls for. However, it might be a good idea to start with two; once the setup is complete, one of the nodes can be used to create an AMI (Amazon Machine Image) and launch as many additional workers as you like.

I am going to provision two EC2 instances for this tutorial: an m4.xlarge instance for the master and an i2.2xlarge for the worker. Again, choose instance types according to your requirements, for example how much storage space you need for your data (remember to account for the replication factor) and how much computational power you need to process the data at hand.
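If you prefer the command line over the console, a minimal sketch of launching an instance with the AWS CLI looks like this; the AMI ID, key pair name, and security group are placeholders you would replace with your own.

# Launch one worker-sized instance (ami-xxxxxxxx, my-keypair and sg-xxxxxxxx are placeholders)
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type i2.2xlarge \
    --key-name my-keypair \
    --security-group-ids sg-xxxxxxxx \
    --count 1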

Install Java and Hadoop

First, update your system. It's always a good idea!

sudo apt-get update && sudo apt-get dist-upgrade

Next, install Java.

sudo apt-get install openjdk-8-jdk

Now download and install Hadoop.

wget http://apache.mirrors.tds.net/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz

Extract…

sudo tar zxvf hadoop-3.0.0.tar.gz -C /usr/local/

Rename the extracted directory for easier access (optional).

sudo mv /usr/local/hadoop-3.0.0 /usr/local/hadoop

Setting up Environment Variables

Find out where Java is installed:

readlink -f $(which java)

Let's set up these environment variables in ~/.bashrc so that you don't have to set them again in every session.

vim ~/.bashrc

Add the following to ~/.bashrc and save.

Setup JAVA_HOME

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

Setup HADOOP_HOME

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Setup HADOOP_CONF_DIR

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Setup HADOOP_MAPREDUCE_EXAMPLES

export HADOOP_MAPREDUCE_EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce/
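
Reload the shell configuration and do a quick sanity check; hadoop version should report 3.0.0 if the paths are correct.

source ~/.bashrc
echo $JAVA_HOME
hadoop version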


Configure the Cluster

cd $HADOOP_CONF_DIR
sudo vim core-site.xml

At a minimum, core-site.xml needs fs.defaultFS set so that clients and workers know where the NameNode lives (9000 is the port commonly used; adjust it if you prefer another):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://public_dns_of_the_namenode:9000</value>
  </property>
</configuration>

Next, configure YARN.

sudo vim yarn-site.xml

Make sure that the following properties are set:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>public_dns_of_the_Resource_Manager</value>
    <description>The hostname of the Resource Manager.</description>
  </property>
</configuration>

Update hadoop-env.sh. We need to set up JAVA_HOME and HADOOP_HOME here.

sudo vim hadoop-env.sh

Look for export JAVA_HOME and update it to point to your Java home directory.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Similarly, find export HADOOP_HOME and update it to point to the hadoop directory.

export HADOOP_HOME=/usr/local/hadoop


sudo vim hdfs-site.xml

Adjust the replication factor according to the number of nodes you have and the number of replicas you want to keep. The minimum is 1, so setting it to anything below that will result in an error. Since we have two nodes and I want to keep two replicas, I am setting it to 2.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/data/hdfs/namenode</value>
  </property>
</configuration>

The dfs.namenode.name.dir property specifies the path on the local filesystem where the NameNode persistently stores the namespace and transaction logs. Let's create that directory, or specify an existing one if you prefer.

sudo mkdir -p $HADOOP_HOME/data/hdfs/namenode

We also need to set up the masters and workers files. Create masters if it doesn't already exist. Add the hostname of the master node to the masters file, and add the hostname(s) of the DataNode(s) to the workers file; a sketch of both files follows the commands below.

sudo vim $HADOOP_HOME/etc/hadoop/masters
sudo vim $HADOOP_HOME/etc/hadoop/workers
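
As a sketch, with placeholder hostnames in place of your nodes' actual (private) DNS names, the two files could look like this:

# $HADOOP_HOME/etc/hadoop/masters
master_hostname

# $HADOOP_HOME/etc/hadoop/workers
worker1_hostname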

Change the owner of $HADOOP_HOME

sudo chown -R ubuntu $HADOOP_HOME

Set up the MapReduce configuration.

sudo vim $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following properties:

<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>public_dns_of_the_master:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>

Create an AMI

Now let's create an AMI (Amazon Machine Image) to simplify provisioning of the worker nodes.

  • Go to your EC2 console.
  • Select the instance (master).
  • From the Actions menu, choose Image and then Create Image.
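
The same can be done from the AWS CLI; the instance ID and image name below are placeholders.

aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "hadoop-node-image"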

Launch Worker Nodes

Launch a worker node from the AMI created in the previous step. Now we need to configure access between nodes so that the nodes can communicate with each other.

To launch an instance from an AMI:

  • Click on AMI under IMAGES in the left menu.
  • Select the AMI you want to launch the instance from.
  • From the Actions menu, hit Launch

Configure Access between Cluster Nodes

We can achieve this by generating a public/private key pair on the master.

ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""

Here, we used ssh-keygen to generate the key pair. The -f option specifies the output key file, -t specifies the type of key to generate, and -P sets the passphrase, which is empty here, hence the "".

Append the generated public key to ~/.ssh/authorized_keys on the master itself and on every worker node. This is necessary for the master to be able to log in to the workers without a password. On the master:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
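
To get the key onto a worker, one option is to append it over SSH, assuming you can still reach the worker with the EC2 key pair it was launched with (mykey.pem and worker_public_dns are placeholders):

cat ~/.ssh/id_rsa.pub | ssh -i ~/mykey.pem ubuntu@worker_public_dns 'cat >> ~/.ssh/authorized_keys'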

Configure SSH config

To make things a bit easier in the long run, let's configure SSH to allow seamless login to the worker nodes from the master. SSH programs take their configuration either from the command line or from configuration files; in general, command-line parameters take precedence, then ~/.ssh/config, and finally the global /etc/ssh/ssh_config. You can read more about this in the ssh_config man page. Open ~/.ssh/config on the master and add entries like the following, filling in the public DNS of each node:

Host nnode
  HostName public_dns_of_the_namenode
  User ubuntu
  IdentityFile ~/.ssh/id_rsa

Host dnode1
  HostName public_dns_of_the_datanode
  User ubuntu
  IdentityFile ~/.ssh/id_rsa

You should now be able to log in without a password by simply doing

ssh nnode
ssh dnode1

The first time you bring up HDFS, it must be formatted. On the master, format the new distributed filesystem:

hdfs namenode -format

On the worker node, create a directory and change its owner to ubuntu.

sudo mkdir -p $HADOOP_HOME/data/hdfs/datanode
sudo chown -R ubuntu $HADOOP_HOME

Launch Hadoop Cluster

To start the Hadoop cluster, you need to start both the HDFS and YARN daemons. Run the following commands on the master.

$HADOOP_HOME/sbin/start-dfs.sh

or you can simply run start-dfs.sh if you add $HADOOP_HOME/sbin to your $PATH.
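
For example, you could append this to ~/.bashrc alongside the earlier exports:

export PATH=$PATH:$HADOOP_HOME/sbin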

$HADOOP_HOME/sbin/start-yarn.sh

Let's also start the MapReduce job history daemon, since we'll soon run a simple example to test that everything is working.

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

We can run Java Virtual Machine Process Status Tool (jps) to make sure that the services we want are up and running.

jps

And the output should be similar to the screenshots below on the master and the worker, respectively.

[Screenshot: jps output on the master]
[Screenshot: jps output on the worker]
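
With this two-node setup you would typically expect to see daemons along these lines in the jps output (process IDs will of course differ):

# Master:  NameNode, SecondaryNameNode, ResourceManager, JobHistoryServer, Jps
# Worker:  DataNode, NodeManager, Jps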

If you are missing some services, it's a good time to dig into the logs and see what went wrong. Hadoop writes a number of log files to the logs directory under $HADOOP_HOME.

$HADOOP_HOME/logs/
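
For example, to peek at the NameNode log on the master (log file names follow the hadoop-<user>-<daemon>-<hostname>.log pattern):

tail -n 50 $HADOOP_HOME/logs/hadoop-ubuntu-namenode-*.log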

If you have successfully completed all the steps above and can see the HDFS and YARN processes in the jps output, let's now run a simple MapReduce example to see things in action.

Let's generate some test data using teragen and store the output in HDFS under /teraInput. Each teragen row is 100 bytes, so 200,000 rows comes to roughly 20 MB.

hadoop jar $HADOOP_MAPREDUCE_EXAMPLES/hadoop-mapreduce-examples-3.0.0.jar teragen 200000 /teraInput

Sort the data generated in the last step.

hadoop jar $HADOOP_MAPREDUCE_EXAMPLES/hadoop-mapreduce-examples-3.0.0.jar terasort /teraInput /teraOutput

Validate the results

hadoop jar $HADOOP_MAPREDUCE_EXAMPLES/hadoop-mapreduce-examples-3.0.0.jar teravalidate /teraOutput /teraValidate

List the content on hdfs.

hadoop fs -ls hdfs:///teraValidate
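
You can also print the validation report itself; if the sort was correct, it should contain a checksum line and no error records.

hadoop fs -cat /teraValidate/*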

and that's all for this post…

Cheers!!!
