Hands-on with Hadoop
This guide is a practical walkthrough for running a MapReduce job on a standalone Hadoop installation. By default, Hadoop is configured to run in non-distributed mode, as a single Java process on one machine, which makes it well suited to learning and testing.
Prerequisites:
- Java 8 or later must be installed; check which Java versions your Hadoop release supports.
- The JAVA_HOME environment variable must be set to your JDK's installation path.
You can verify your JAVA_HOME setting with:
echo $JAVA_HOME
1. Running a Pre-packaged Hadoop Example
Hadoop comes with several pre-packaged example jobs, making it easy to test your installation. We'll run a grep (search) example.
Steps:
Download and Extract Hadoop: First, download a recent Hadoop binary release and extract it.
# Download Hadoop (check for the latest version on the Apache Hadoop website)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
# Extract the archive
tar xzf hadoop-3.3.1.tar.gz
# For convenience, let's create a variable for the Hadoop home directory
HADOOP_HOME=$(pwd)/hadoop-3.3.1
Prepare Input Data: We'll use Hadoop's own configuration files as our sample data.
# Create a project directory and an input directory
mkdir my-hadoop-project && cd my-hadoop-project
mkdir input
# Copy Hadoop's XML configuration files to use as input
cp $HADOOP_HOME/etc/hadoop/*.xml input
Run the MapReduce Job: Now, execute the example grep job. This job searches all the input files for matches of a regular expression. Hadoop creates the output directory itself and fails if it already exists.
# Run the job
# Usage: hadoop jar <jar_file> <example_name> <input_dir> <output_dir> <regex>
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
View the Results: The output will be in the output directory. You can view it with cat.
# View the output
cat output/*
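To get a feel for what the regular expression 'dfs[a-z.]+' picks up, here is a minimal plain-Java sketch that tallies regex matches in a few lines of text. It is not the source of the Hadoop example; the class name GrepSketch, the method countMatches, and the sample lines are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; this is not the Hadoop example's source code.
class GrepSketch {

    // Tally every substring of the given lines that matches the regex.
    static Map<String, Integer> countMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines standing in for the copied XML files.
        List<String> lines = List.of(
            "<name>dfs.replication</name>",
            "<description>dfs.replication sets the block copies</description>",
            "<name>dfs.namenode.name.dir</name>");
        // Print "count<TAB>match" for each distinct match.
        countMatches(lines, "dfs[a-z.]+")
            .forEach((match, n) -> System.out.println(n + "\t" + match));
    }
}
```

The sketch keeps matches in alphabetical order for simplicity; the ordering of the real job's output may differ.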
2. Writing and Running Your Own Word Count Program
Now, let's build and run the classic "Hello, World!" of Big Data: the Word Count program.
The Java Code (WordCount.java):
The program consists of three main parts:
- TokenizerMapper: This class reads lines of text, splits them into words (tokens), and emits a (word, 1) pair for each word.
- IntSumReducer: This class receives a word and a list of all its associated counts (e.g., ("hello", [1, 1, 1])) and sums them up to get the final count.
- main method: This configures and launches the MapReduce job.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits a (word, 1) pair for every whitespace-separated token in a line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums all counts received for one word and emits (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // args[0] is the input directory, args[1] is the output directory.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // Optional optimization
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with 0 on success and 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
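The job registers IntSumReducer as both combiner and reducer. That works because addition gives the same total whether the counts are summed all at once or pre-summed in groups first. The following is a minimal plain-Java sketch of that idea, with no Hadoop classes involved; the class name CombinerSketch and the per-task count lists are hypothetical.

```java
import java.util.List;

// Plain-Java illustration only; no Hadoop classes are involved.
class CombinerSketch {

    // Same logic as IntSumReducer.reduce: add up every count for one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical counts for one word, as emitted by two separate map tasks.
        List<Integer> fromMapTask1 = List.of(1, 1, 1);
        List<Integer> fromMapTask2 = List.of(1, 1);

        // Without a combiner: the reducer receives all five 1s.
        int direct = sum(List.of(1, 1, 1, 1, 1));

        // With a combiner: each map task pre-sums locally,
        // then the reducer sums the two partial totals.
        int combined = sum(List.of(sum(fromMapTask1), sum(fromMapTask2)));

        System.out.println(direct + " == " + combined);
    }
}
```

Because both paths give the same total, the combiner only reduces the amount of data passed to the reducer; it should not change the result.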
Steps:
Compile and Package the Code: Save the code above as WordCount.java. Now, compile it and package it into a JAR file. You'll need to include Hadoop's libraries in the classpath.
# Set the Hadoop classpath
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
# Compile the Java file (quote the classpath so the shell does not expand its wildcards)
javac -classpath "$CLASSPATH" WordCount.java
# Create a JAR file
jar cf wc.jar WordCount*.class
Prepare Input Data: Create an input directory and add some text files.
mkdir -p wc-input
echo "Hello World hello hello there" > wc-input/doc1.txt
echo "Hi hi there World" > wc-input/doc2.txt
Run the Job: Submit your custom JAR to Hadoop.
# Usage: hadoop jar <your_jar> <main_class> <input_dir> <output_dir>
$HADOOP_HOME/bin/hadoop jar wc.jar WordCount wc-input wc-output
Check the Output: The results will be in a file named part-r-00000 inside the wc-output directory.
cat wc-output/part-r-00000
You should see one line per distinct word, with the word and its count separated by a tab. Counting is case-sensitive, so Hello and hello are listed separately.
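As a cross-check on the result, the counts for the two sample files can be reproduced without Hadoop. The sketch below tokenizes the same two lines with StringTokenizer, as the mapper does, and tallies them in a TreeMap. The class name ExpectedCounts is hypothetical, and the TreeMap is an assumption standing in for Hadoop's sorting of keys.

```java
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java illustration only; it mimics the word count without running Hadoop.
class ExpectedCounts {

    // Split each line on whitespace and count every token, case-sensitively.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The contents of wc-input/doc1.txt and wc-input/doc2.txt.
        List<String> lines = List.of(
            "Hello World hello hello there",
            "Hi hi there World");
        // Print "word<TAB>count", one pair per line.
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

For these inputs the sketch reports Hello 1, Hi 1, World 2, hello 2, hi 1, and there 2, with capitalized words sorted ahead of lowercase ones. The job's part-r-00000 file should contain the same pairs.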