Hands-on with Hadoop

This guide walks through running MapReduce jobs on a standalone Hadoop installation. By default, Hadoop runs in non-distributed (standalone) mode as a single Java process on one machine, reading and writing the local filesystem, which makes it well suited to learning and debugging.

Prerequisites:

Java 8 or Java 11 must be installed (these are the runtimes Hadoop 3.3.x supports; newer JDKs may not work).
The JAVA_HOME environment variable must be set to your JDK's installation path.

You can verify your JAVA_HOME setting with: echo $JAVA_HOME
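
As a slightly stronger check than echo, the sketch below also confirms that a java executable sits under JAVA_HOME. The function name check_java_home is our own (it is not part of Hadoop or the JDK), and it assumes the conventional JDK layout with the launcher at bin/java.

```shell
# Hypothetical helper: verify that JAVA_HOME is set and points at a JDK/JRE.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "No executable found at $JAVA_HOME/bin/java" >&2
    return 1
  fi
  # Print the version so you can confirm it is one Hadoop supports
  "$JAVA_HOME/bin/java" -version
}
```

Call check_java_home with no arguments; a non-zero exit status means the variable needs fixing before Hadoop will start.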

1. Running a Pre-packaged Hadoop Example

Hadoop comes with several pre-packaged example jobs, making it easy to test your installation. We'll run a grep (search) example.

Steps:

Download and Extract Hadoop: First, download a recent Hadoop binary release and extract it.

# Download Hadoop 3.3.1 (older releases are served from archive.apache.org;
# check the Apache Hadoop website for the latest version)
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

# Extract the archive
tar xzf hadoop-3.3.1.tar.gz

# For convenience, export a variable pointing at the Hadoop home directory
export HADOOP_HOME="$(pwd)/hadoop-3.3.1"
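
Before going further, it can help to confirm that the archive extracted where you expect. The following is a minimal sketch; check_hadoop_home is a name we made up, and it only looks for the two paths this guide relies on (bin/hadoop and etc/hadoop).

```shell
# Hypothetical helper: sanity-check the extracted Hadoop directory.
# Uses the first argument if given, otherwise $HADOOP_HOME.
check_hadoop_home() {
  home=${1:-${HADOOP_HOME:-}}
  [ -n "$home" ] || { echo "HADOOP_HOME is not set" >&2; return 1; }
  [ -x "$home/bin/hadoop" ] || { echo "Missing launcher: $home/bin/hadoop" >&2; return 1; }
  [ -d "$home/etc/hadoop" ] || { echo "Missing config dir: $home/etc/hadoop" >&2; return 1; }
  echo "Hadoop layout looks OK under $home"
}
```

If the check passes, running "$HADOOP_HOME"/bin/hadoop version should print the release you downloaded.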

Prepare Input Data: We'll use Hadoop's own configuration files as our sample data.

# Create a project directory and an input directory
mkdir my-hadoop-project && cd my-hadoop-project
mkdir input

# Copy Hadoop's XML configuration files to use as input
cp "$HADOOP_HOME"/etc/hadoop/*.xml input

Run the MapReduce Job: Now, execute the example grep job. It finds every match of a regular expression in the input files and counts how often each distinct match occurs.

# Run the job (the output directory must not already exist; Hadoop refuses to overwrite it)
# Usage: hadoop jar <jar_file> <example_name> <input_dir> <output_dir> <regex>
"$HADOOP_HOME"/bin/hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'

View the Results: The job writes its results to the output directory, one line per distinct match, with the count first and the matched text after a tab. You can view them with cat.
```
# View the output
cat output/*
```
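
To cross-check what the job computed, you can approximate it with standard command-line tools. This is only a sketch under a few assumptions: local_grep_count is our own name, your grep supports -o and -h (GNU and BSD grep do), the pattern means the same thing as an extended regular expression as it does as a Java regex (true for 'dfs[a-z.]+'), and matches contain no spaces. The order of entries with equal counts may differ from Hadoop's.

```shell
# Hypothetical helper: count regex matches across all files in a directory.
# $1 = input directory, $2 = extended regular expression
local_grep_count() {
  grep -ohE -- "$2" "$1"/* |      # print each match on its own line, without file names
    LC_ALL=C sort |               # group identical matches together
    uniq -c |                     # count each distinct match
    LC_ALL=C sort -rn |           # highest count first
    awk '{ printf "%s\t%s\n", $1, $2 }'   # count <TAB> match
}
```

For the data prepared above, local_grep_count input 'dfs[a-z.]+' should list the same matches and counts as cat output/*.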

2. Writing and Running Your Own Word Count Program

Now, let's build and run the classic "Hello, World!" of Big Data: the Word Count program.

The Java Code (WordCount.java):

The program consists of three main parts:

TokenizerMapper: This class reads lines of text, splits them into words (tokens), and emits a (word, 1) pair for each word.
IntSumReducer: This class receives a word and a list of all its associated counts (e.g., ("hello", [1, 1, 1])) and sums them up to get the final count.
main method: This configures and launches the MapReduce job.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

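  // Mapper: emits (word, 1) for every whitespace-delimited token in an input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text(); // reused across calls to avoid reallocating

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {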
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums all counts received for one word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // Optional: pre-sums counts on the map side to shrink shuffle data
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Steps:

Compile and Package the Code: Save the code above as WordCount.java. Now, compile it and package it into a JAR file. You'll need to include Hadoop's libraries in the classpath.

# Put Hadoop's libraries on the classpath
export CLASSPATH="$("$HADOOP_HOME"/bin/hadoop classpath)"

# Compile the Java file (quote the classpath: it contains * wildcards the shell should not expand)
javac -classpath "$CLASSPATH" WordCount.java

# Create a JAR file
jar cf wc.jar WordCount*.class

Prepare Input Data: Create an input directory and add some text files.

mkdir -p wc-input
echo "Hello World hello hello there" > wc-input/doc1.txt
echo "Hi hi there World" > wc-input/doc2.txt

Run the Job: Submit your custom JAR to Hadoop.

# Usage: hadoop jar <your_jar> <main_class> <input_dir> <output_dir>
$HADOOP_HOME/bin/hadoop jar wc.jar WordCount wc-input wc-output
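
If you run the job a second time, Hadoop will fail with an error saying the output directory already exists, because it never overwrites existing output. A small wrapper can clear the stale directory first. This is a sketch: run_fresh is a hypothetical helper of ours, and it assumes standalone mode, where the output lives on the local filesystem (on a cluster you would remove the directory with hadoop fs -rm -r instead).

```shell
# Hypothetical helper: delete a stale local output directory, then run a command.
# $1 = output directory; remaining arguments = the command to run
run_fresh() {
  out_dir=$1
  shift
  rm -rf -- "$out_dir"   # safe in standalone mode: output is an ordinary local directory
  "$@"
}
```

For example: run_fresh wc-output "$HADOOP_HOME"/bin/hadoop jar wc.jar WordCount wc-input wc-output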

Check the Output: The results will be in a file named part-r-00000 inside the wc-output directory.
```
cat wc-output/part-r-00000
```
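You should see one line per distinct word, with the word and its count separated by a tab and the lines sorted by word. Because the mapper splits only on whitespace and compares words case-sensitively, Hello, hello, Hi, and hi are all counted separately.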
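
To double-check the result without Hadoop, the same map, sort, and reduce steps can be mimicked with a shell pipeline. This is a rough analogue rather than what Hadoop does internally, and local_word_count is a name we chose for illustration. Sorting with LC_ALL=C matches the byte-wise key order of Hadoop's Text keys, so uppercase words come before lowercase ones.

```shell
# Hypothetical helper: word count over all files in a directory.
# $1 = input directory
local_word_count() {
  cat "$1"/* |
    tr -s '[:space:]' '\n' |      # "map": one whitespace-delimited token per line
    sed '/^$/d' |                 # drop empty lines left by leading whitespace
    LC_ALL=C sort |               # "shuffle": bring identical words together
    LC_ALL=C uniq -c |            # "reduce": count each distinct word
    awk '{ printf "%s\t%s\n", $2, $1 }'   # word <TAB> count
}
```

Running local_word_count wc-input on the two sample files gives Hello 1, Hi 1, World 2, hello 2, hi 1, there 2, which should match the contents of part-r-00000.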