Google Cloud Dataproc
Google Cloud Dataproc is a fully managed cloud service for running Apache Spark, Apache Hadoop, and other open-source data processing frameworks. It allows you to create and manage clusters quickly, so you can focus on your data processing jobs rather than on infrastructure administration.
Key Features of Dataproc
- Fast and Easy: You can spin up a complete, ready-to-use Hadoop and Spark cluster in minutes.
- Managed Service: Google handles the complexities of cluster provisioning, management, monitoring, and teardown.
- Cost-Effective: Dataproc is billed by the second, and it supports both ephemeral clusters (created for a specific job and deleted when the job completes) and autoscaling, which help you keep costs under control.
- Integrated: It is tightly integrated with other Google Cloud services like Google Cloud Storage (GCS), BigQuery, and Cloud Logging. It's common practice to store your data in a GCS bucket and use a Dataproc cluster to process it.
- Customizable: You can easily customize your cluster's hardware (machine types, disk size) and software components to fit your specific workload.
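To make the per-second billing and ephemeral-cluster point concrete, here is a rough arithmetic sketch. It assumes a one-minute minimum charge and a flat per-vCPU hourly fee; the rate used below is an illustrative placeholder, not a quoted price, and the function names are hypothetical. The Dataproc fee is charged on top of the underlying Compute Engine resources, which this sketch ignores, so check the current pricing page before relying on any figure.

```python
def billed_seconds(runtime_seconds, minimum_seconds=60):
    """Seconds charged for a cluster, assuming a one-minute minimum (assumption)."""
    return max(int(runtime_seconds), minimum_seconds)


def dataproc_fee(vcpus, runtime_seconds, rate_per_vcpu_hour):
    """Estimate the Dataproc service fee only (VM, disk, and network costs excluded)."""
    hours = billed_seconds(runtime_seconds) / 3600
    return vcpus * hours * rate_per_vcpu_hour


if __name__ == "__main__":
    ILLUSTRATIVE_RATE = 0.01   # placeholder rate per vCPU-hour, not an official price
    TOTAL_VCPUS = 12           # e.g. three hypothetical 4-vCPU nodes

    # Ephemeral cluster that exists only for a 20-minute job.
    ephemeral = dataproc_fee(TOTAL_VCPUS, 20 * 60, ILLUSTRATIVE_RATE)
    # The same cluster left running for a full day.
    always_on = dataproc_fee(TOTAL_VCPUS, 24 * 3600, ILLUSTRATIVE_RATE)

    print(f"ephemeral: {ephemeral:.2f}  always-on: {always_on:.2f}")
```

Under these assumptions the fee scales linearly with how long the cluster exists, which is why deleting a cluster as soon as its job finishes tends to pay off.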
Interacting with a Dataproc Cluster
Once you have created a Dataproc cluster, you can connect to the master node via SSH directly from the Google Cloud Console or using the gcloud command-line tool. From the master node's terminal, you can use the big data tools installed on the cluster through familiar commands:
- pyspark: Starts the interactive Python shell for Spark.
- spark-shell: Starts the interactive Scala shell for Spark.
- spark-submit <your-script.py>: Submits a Spark application as a batch job.
- hdfs dfs -ls /: Interacts with the Hadoop Distributed File System.
- hive: Starts the interactive Hive shell for running SQL-like queries.
- pig: Starts the interactive Pig shell.
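For the gcloud route to the master node, the following is a minimal sketch that only builds the command line rather than running it. It assumes a standard single-master cluster, whose master VM is named after the cluster with an "-m" suffix; the cluster name and zone are placeholders, and `master_ssh_command` is a hypothetical helper, not part of any Google library.

```python
import shlex


def master_ssh_command(cluster_name, zone, project=None):
    """Build the gcloud argv for SSH access to a single-master cluster's master node."""
    # Assumption: single-master clusters name the master VM "<cluster>-m".
    argv = ["gcloud", "compute", "ssh", f"{cluster_name}-m", f"--zone={zone}"]
    if project:
        argv.append(f"--project={project}")
    return argv


if __name__ == "__main__":
    # "example-cluster" and "us-central1-a" are placeholder values.
    print(shlex.join(master_ssh_command("example-cluster", "us-central1-a")))
```

The printed command can be pasted into a terminal where gcloud is installed and authenticated; high-availability clusters use different master names, so adjust the suffix accordingly.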
Example: Running a Word Count Job on Dataproc
Here’s how you would run the classic Hadoop MapReduce word count example on a live Dataproc cluster.
SSH into the Master Node: Use the Google Cloud Console to SSH into your Dataproc cluster's master node.
Create a Sample File: Create a text file on the master node's local file system.
    echo "Hello Spark Hello Hadoop" > sample.txt
Upload the File to HDFS: Dataproc clusters come with HDFS. Copy the local file into HDFS so it can be accessed by the distributed job.
    hdfs dfs -put sample.txt /
Run the Hadoop MapReduce Job: Dataproc includes the pre-packaged Hadoop example jobs. Execute the wordcount job, pointing it to your input file in HDFS and specifying an output directory.
    # The JAR path may vary slightly with different Dataproc image versions
    yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /sample.txt /output
View the Results: The job will run, and the results will be stored in the /output directory in HDFS. You can view the output file's contents.
    # View the output
    hdfs dfs -cat /output/part-r-00000
You should see the following result:
    Hadoop 1
    Hello 2
    Spark 1
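To see why the output looks the way it does, here is a small single-process sketch of the map, shuffle, and reduce phases behind a word count. It is a plain-Python illustration, not the code inside the Hadoop examples JAR; the function names are hypothetical, and whitespace tokenization is assumed.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each word, visiting keys in sorted order."""
    return {word: sum(counts) for word, counts in sorted(grouped.items())}


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    for word, count in word_count(["Hello Spark Hello Hadoop"]).items():
        print(f"{word}\t{count}")
```

Keys reach the reducer in sorted order, which is why Hadoop is listed before Hello and Spark in the result.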