How to move a file from S3 to HDFS with S3DistCp

This tutorial walks through moving a file from S3 to HDFS with S3DistCp, first using the AWS Management Console and then using the AWS CLI.

Using the AWS Management Console

Add a step to your cluster through the console as follows:

  1. Go to Services > EMR > Clusters
  2. Click on your Cluster Name
  3. Go to the Steps tab
  4. Click Add Step
    • Step Type: Custom JAR
    • Name: S3DistCP
    • JAR Location: command-runner.jar
    • Arguments: s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://bucket_name/path/to/src_folder/ --dest=hdfs:///dataset/copied_data --srcPattern=.*[a-zA-Z,]+
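
The --srcPattern argument is a regular expression that limits the copy to a subset of the objects under --src. The following is a minimal Python sketch of which paths the pattern above would select. It assumes S3DistCp matches the pattern against the whole path, and it uses Python's re.fullmatch as an approximation of Java regex matching. The helper name is_selected and the sample paths are hypothetical and only serve as an illustration.

```python
import re

# Pattern taken from the step arguments above.
SRC_PATTERN = r".*[a-zA-Z,]+"


def is_selected(path: str, pattern: str = SRC_PATTERN) -> bool:
    """Return True if the whole path matches the pattern.

    Assumption: S3DistCp applies --srcPattern to the full path.
    """
    return re.fullmatch(pattern, path) is not None


# Hypothetical object paths, for illustration only.
sample_paths = [
    "s3://bucket_name/path/to/src_folder/data.csv",
    "s3://bucket_name/path/to/src_folder/part-00000",
]

for path in sample_paths:
    print(path, is_selected(path))
```

Under these assumptions the pattern selects any path that ends in a letter or a comma, so a file whose name ends in a digit would be skipped.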

Using the AWS CLI

Alternatively, you can add a step using the AWS CLI following the steps below:

  1. Create a JSON file to store the step information. Let’s name it step.json.
  2. Add the following to step.json and save:

    [
        {
            "Name":"S3DistCp step",
            "Args":["s3-dist-cp","--s3Endpoint=s3.amazonaws.com","--src=s3://bucket_name/path/to/folder/","--dest=hdfs:///output","--srcPattern=.*[a-zA-Z,]+"],
            "ActionOnFailure":"CONTINUE",
            "Type":"CUSTOM_JAR",
            "Jar":"command-runner.jar"
        }
    ]

  3. Issue the following command to add the step to your cluster:

    aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://./step.json

The command returns the ID of the newly added step.
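
If you add this kind of step often, you may prefer to generate step.json rather than edit it by hand. The following is a minimal Python sketch that builds the same step definition from a source, a destination and an optional pattern. The function names build_s3distcp_step and step_file_text are hypothetical, and the default values simply mirror the example in this tutorial.

```python
import json


def build_s3distcp_step(src, dest, src_pattern=None,
                        endpoint="s3.amazonaws.com"):
    """Build one step definition with the same fields as step.json above."""
    args = [
        "s3-dist-cp",
        f"--s3Endpoint={endpoint}",
        f"--src={src}",
        f"--dest={dest}",
    ]
    if src_pattern:
        # Only add the filter when a pattern is given.
        args.append(f"--srcPattern={src_pattern}")
    return {
        "Name": "S3DistCp step",
        "Args": args,
        "ActionOnFailure": "CONTINUE",
        "Type": "CUSTOM_JAR",
        "Jar": "command-runner.jar",
    }


def step_file_text(step):
    """Render the step as the JSON list expected in step.json."""
    return json.dumps([step], indent=4)


step = build_s3distcp_step(
    src="s3://bucket_name/path/to/folder/",
    dest="hdfs:///output",
    src_pattern=".*[a-zA-Z,]+",
)
print(step_file_text(step))
```

Redirecting the printed output into step.json gives a file that can be passed to aws emr add-steps with --steps file://./step.json.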

You can check the progress of your step in the EMR Management Console.

  1. Go to Services > EMR > Clusters > Your Cluster Name.
  2. Select the Steps tab.
  3. If the step is still running, the Status will be set to Running.

Once the step completes its execution successfully, the Status will be set to Completed.

If something goes wrong, the Status is set to Failed. In that case, refer to the Reason provided and the Log File.
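
The same status check can also be scripted. The following is a minimal Python sketch that polls a step until it reaches a final state. It assumes a boto3 EMR client is passed in and relies on its describe_step call. The function name wait_for_step and the polling interval are hypothetical choices for illustration.

```python
import time

# Step states after which the status no longer changes.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(emr_client, cluster_id, step_id,
                  poll_seconds=30, sleep=time.sleep):
    """Poll a step until it reaches a terminal state.

    Returns the final state and the failure details, which are
    an empty dict when the response carries none.
    """
    while True:
        response = emr_client.describe_step(
            ClusterId=cluster_id, StepId=step_id
        )
        status = response["Step"]["Status"]
        state = status["State"]
        if state in TERMINAL_STATES:
            return state, status.get("FailureDetails", {})
        sleep(poll_seconds)


# Example usage (requires boto3 and AWS credentials):
# import boto3
# state, details = wait_for_step(
#     boto3.client("emr"), "j-XXXXXXXXXXXXX", "s-XXXXXXXXXXXXX"
# )
# print(state, details.get("Reason"), details.get("LogFile"))
```

When a step fails, the returned details may include the reason and the log file location that the console shows.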