MOBI BOOT CAMP CORP. Learning Buddy

Hands-on with Hadoop

This guide walks through running MapReduce jobs on a standalone Hadoop installation. By default, Hadoop is configured to run in non-distributed mode, as a single Java process on one machine, which makes it well suited to learning, testing, and debugging.

Prerequisites:

  • Java 8 or Java 11 must be installed. Hadoop 3.3.x is documented as supporting these two versions; newer JDKs may not work.
  • The JAVA_HOME environment variable must be set to your JDK's installation path.

You can verify your JAVA_HOME setting with: echo $JAVA_HOME

1. Running a Pre-packaged Hadoop Example

Hadoop comes with several pre-packaged example jobs, making it easy to test your installation. We'll run a grep (search) example.

Steps:

  1. Download and Extract Hadoop: First, download a recent Hadoop binary release and extract it.

    # Download Hadoop (check the Apache Hadoop website for the latest version).
    # Older releases such as 3.3.1 are served from the Apache archive.
    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
    
    # Extract the archive
    tar xzf hadoop-3.3.1.tar.gz
    
    # For convenience, export a variable for the Hadoop home directory
    export HADOOP_HOME="$(pwd)/hadoop-3.3.1"
    
  2. Prepare Input Data: We'll use Hadoop's own configuration files as our sample data.

    # Create a project directory and an input directory
    mkdir my-hadoop-project && cd my-hadoop-project
    mkdir input
    
    # Copy Hadoop's XML configuration files to use as input
    cp $HADOOP_HOME/etc/hadoop/*.xml input
    
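  3. Run the MapReduce Job: Now, execute the example grep job. It searches all the input files for strings matching a regular expression, here 'dfs[a-z.]+' (the letters dfs followed by one or more lowercase letters or dots). The output directory must not already exist, or the job will fail.

    # Run the job
    # Usage: hadoop jar <jar_file> <example_name> <input_dir> <output_dir> <regex>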
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
    
  4. View the Results: The job writes its results to the output directory. Each line holds a match count followed by the matched string. You can view the results with cat.

    # View the output
    cat output/*
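
To make the grep job's behavior concrete, here is a minimal plain-Java sketch (no Hadoop dependencies) of what the example computes: it extracts every substring that matches the regular expression, counts how often each one occurs, and lists the matches by descending count. The class name GrepSketch and the sample lines are hypothetical, chosen only for illustration; the real example reads the files in the input directory and runs as MapReduce jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepSketch {

  // Count every substring of the given lines that matches the regex.
  public static Map<String, Integer> countMatches(List<String> lines, String regex) {
    Pattern pattern = Pattern.compile(regex);
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : lines) {
      Matcher m = pattern.matcher(line);
      while (m.find()) {
        counts.merge(m.group(), 1, Integer::sum);
      }
    }
    return counts;
  }

  // Order the matches by descending count (ties stay in alphabetical order).
  public static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
    return entries;
  }

  public static void main(String[] args) {
    // Hypothetical sample lines standing in for the XML files in input/
    List<String> lines = Arrays.asList(
        "<name>dfs.replication</name>",
        "<name>dfs.namenode.name.dir</name>",
        "<description>see dfs.replication</description>");
    for (Map.Entry<String, Integer> e : byCountDescending(countMatches(lines, "dfs[a-z.]+"))) {
      System.out.println(e.getValue() + "\t" + e.getKey());
    }
  }
}
```

With these sample lines, dfs.replication is listed first with a count of 2, followed by dfs.namenode.name.dir with a count of 1.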
    

2. Writing and Running Your Own Word Count Program

Now, let's build and run the classic "Hello, World!" of Big Data: the Word Count program.

The Java Code (WordCount.java):

The program consists of three main parts:

  • TokenizerMapper: This class reads lines of text, splits them into words (tokens), and emits a (word, 1) pair for each word.
  • IntSumReducer: This class receives a word and a list of all its associated counts (e.g., ("hello", [1, 1, 1])) and sums them up to get the final count.
  • main method: This configures the job (mapper, combiner, reducer, output types, and input and output paths) and launches it.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // Optional: pre-aggregates counts on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
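
The setCombinerClass call in main reuses IntSumReducer to sum counts on the map side before they reach the reducer. This is safe for word count because addition is associative and commutative: adding partial sums gives the same total as adding the individual 1s. The plain-Java sketch below (no Hadoop dependencies; the class name CombinerSketch and the sample values are hypothetical) illustrates that equivalence for a single key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {

  // Same arithmetic as IntSumReducer.reduce: add up all values seen for one key.
  public static int sum(List<Integer> values) {
    int total = 0;
    for (int v : values) {
      total += v;
    }
    return total;
  }

  public static void main(String[] args) {
    // Hypothetical values emitted for the key "hello" by two map tasks
    List<Integer> fromMapTask1 = Arrays.asList(1, 1, 1);
    List<Integer> fromMapTask2 = Arrays.asList(1, 1);

    // Without a combiner, the reducer receives all five 1s.
    List<Integer> all = new ArrayList<>(fromMapTask1);
    all.addAll(fromMapTask2);
    int direct = sum(all);

    // With a combiner, each map task's values are summed first,
    // so the reducer receives only the partial sums [3, 2].
    int combined = sum(Arrays.asList(sum(fromMapTask1), sum(fromMapTask2)));

    System.out.println(direct + " " + combined); // prints "5 5"
  }
}
```

Hadoop may run the combiner zero, one, or several times for a given key, which is why the operation has to give the same result in every case.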

Steps:

  1. Compile and Package the Code: Save the code above as WordCount.java. Now, compile it and package it into a JAR file. You'll need to include Hadoop's libraries in the classpath.

    # Set the Hadoop classpath
    export CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
    
    # Compile the Java file (quoted so the shell does not expand wildcard entries)
    javac -classpath "$CLASSPATH" WordCount.java
    
    # Create a JAR file
    jar cf wc.jar WordCount*.class
    
  2. Prepare Input Data: Create an input directory and add some text files.

    mkdir -p wc-input
    echo "Hello World hello hello there" > wc-input/doc1.txt
    echo "Hi hi there World" > wc-input/doc2.txt
    
  3. Run the Job: Submit your custom JAR to Hadoop. The wc-output directory must not exist before the job runs.

    # Usage: hadoop jar <your_jar> <main_class> <input_dir> <output_dir>
    $HADOOP_HOME/bin/hadoop jar wc.jar WordCount wc-input wc-output
    
  4. Check the Output: The results will be in a file named part-r-00000 inside the wc-output directory.

    cat wc-output/part-r-00000
    

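    You should see one line per distinct word, with the word and its count separated by a tab. Counting is case-sensitive, so "Hello" and "hello" are reported separately.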
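
As a cross-check, the following plain-Java sketch (no Hadoop dependencies; the class name WordCountSketch is hypothetical) applies the same tokenizing and summing logic in memory to the two sample lines from step 2. Its output should match the contents of part-r-00000, assuming the job ran on exactly those two files. A TreeMap is used here to imitate the sorted key order of the job output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {

  // Tokenize on whitespace (as TokenizerMapper does) and add 1 for each token.
  public static Map<String, Integer> count(List<String> lines) {
    Map<String, Integer> counts = new TreeMap<>(); // sorted keys, uppercase before lowercase
    for (String line : lines) {
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        counts.merge(itr.nextToken(), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // The contents of wc-input/doc1.txt and wc-input/doc2.txt
    List<String> lines = Arrays.asList("Hello World hello hello there", "Hi hi there World");
    count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    // Prints, with a tab between word and count:
    // Hello 1
    // Hi    1
    // World 2
    // hello 2
    // hi    1
    // there 2
  }
}
```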
