
Google Cloud Dataproc

Google Cloud Dataproc is a fully managed cloud service for running Apache Spark, Apache Hadoop, and other open-source data processing frameworks. It allows you to create and manage clusters quickly, so you can focus on your data processing jobs rather than on infrastructure administration.

Key Features of Dataproc

  • Fast and Easy: You can spin up a complete, ready-to-use Hadoop and Spark cluster in minutes.
  • Managed Service: Google handles the complexities of cluster provisioning, management, monitoring, and teardown.
  • Cost-Effective: Dataproc is billed per second (with a one-minute minimum). It also supports ephemeral clusters (creating a cluster for a specific job and deleting it upon completion) and autoscaling, both of which help keep costs down.
  • Integrated: It is tightly integrated with other Google Cloud services like Google Cloud Storage (GCS), BigQuery, and Cloud Logging. It's common practice to store your data in a GCS bucket and use a Dataproc cluster to process it.
  • Customizable: You can easily customize your cluster's hardware (machine types, disk size) and software components to fit your specific workload.
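
The ephemeral-cluster pattern described above can be sketched with the gcloud CLI. This is a minimal illustration, not a production script: the cluster name, region, and job script path are hypothetical placeholders, and the `run` helper only prints each command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch of an ephemeral Dataproc cluster lifecycle: create, run one job, delete.
# All values below are hypothetical placeholders; substitute your own.
CLUSTER="${CLUSTER:-demo-cluster}"
REGION="${REGION:-us-central1}"
JOB_SCRIPT="${JOB_SCRIPT:-gs://my-bucket/jobs/wordcount.py}"

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

ephemeral_job() {
  local job_status
  # 1. Create a small cluster (2 workers) for this job only.
  run gcloud dataproc clusters create "$CLUSTER" \
    --region="$REGION" --num-workers=2 || return 1
  # 2. Submit the PySpark script stored in Cloud Storage.
  run gcloud dataproc jobs submit pyspark "$JOB_SCRIPT" \
    --cluster="$CLUSTER" --region="$REGION"
  job_status=$?
  # 3. Delete the cluster even if the job failed, so billing stops.
  #    --quiet skips the interactive confirmation prompt.
  run gcloud dataproc clusters delete "$CLUSTER" --region="$REGION" --quiet
  return "$job_status"
}

ephemeral_job
```

By default the script prints the three gcloud commands without contacting Google Cloud. Running it with `DRY_RUN=0` executes them against whichever project gcloud is configured to use.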

Interacting with a Dataproc Cluster

Once you have created a Dataproc cluster, you can connect to the master node via SSH, either directly from the Google Cloud Console or with the gcloud command-line tool. From the master node's terminal, the standard Hadoop and Spark command-line tools are available:

  • pyspark: Starts the interactive Python shell for Spark.
  • spark-shell: Starts the interactive Scala shell for Spark.
  • spark-submit <your-script.py>: Submits a Spark application as a batch job.
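  • hdfs dfs -ls /: Lists the contents of the root directory in the Hadoop Distributed File System (HDFS). Other hdfs dfs subcommands (-put, -cat, -rm) copy, read, and delete files.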
  • hive: Starts the interactive Hive shell for running SQL-like queries.
  • pig: Starts the interactive Pig shell.
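
To reach that terminal from your own machine rather than the Console, one option is `gcloud compute ssh`. The sketch below only builds and prints the command. It assumes a standard single-master cluster, where Dataproc names the master VM `<cluster-name>-m`; the cluster name and zone are hypothetical placeholders.

```shell
# Sketch: build the SSH command for a Dataproc master node.
# On a standard (single-master) cluster the master VM is named "<cluster-name>-m".
CLUSTER="${CLUSTER:-demo-cluster}"   # hypothetical cluster name
ZONE="${ZONE:-us-central1-a}"        # hypothetical zone

master_ssh_cmd() {
  echo "gcloud compute ssh ${CLUSTER}-m --zone=${ZONE}"
}

# Print the command so it can be reviewed before being run.
master_ssh_cmd
```

Running the printed command opens a shell on the master node, where the tools listed above are on the PATH.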

Example: Running a Word Count Job on Dataproc

Here’s how you would run the classic Hadoop MapReduce word count example on a live Dataproc cluster.

  1. SSH into the Master Node: Use the Google Cloud Console to SSH into your Dataproc cluster's master node.

  2. Create a Sample File: Create a text file on the master node's local file system.

    echo "Hello Spark Hello Hadoop" > sample.txt
    
  3. Upload the File to HDFS: Dataproc clusters come with HDFS. Copy the local file into HDFS so it can be accessed by the distributed job.

    hdfs dfs -put sample.txt /
    
  4. Run the Hadoop MapReduce Job: Dataproc images ship with the standard Hadoop example jobs. Run the wordcount job, passing the input file in HDFS and an output directory. The output directory must not already exist, or the job will fail.

    # The JAR path may vary slightly with different Dataproc image versions
    yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /sample.txt /output
    
  5. View the Results: When the job finishes, its results are written to the /output directory in HDFS. Each reducer writes one part-r-NNNNN file there; with a single reducer, all counts are in part-r-00000.

    # View the output
    hdfs dfs -cat /output/part-r-00000
    

    You should see the following result:

    Hadoop  1
    Hello   2
    Spark   1
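
As a quick sanity check on the expected counts, the same result can be reproduced locally with standard Unix tools. This is a sketch that mimics the job's output format (word, tab, count, sorted by word); it does not run Hadoop, and `local_wordcount` is a hypothetical helper name.

```shell
# Local sanity check of the expected word counts; no Hadoop involved.
local_wordcount() {
  # Put one word per line, drop empty lines, sort in byte order
  # (LC_ALL=C), count duplicates, then print "word<TAB>count".
  tr -s '[:space:]' '\n' < "$1" \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}

echo "Hello Spark Hello Hadoop" > sample.txt
local_wordcount sample.txt
```

For the sample file this prints the same three lines as the job output shown above.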
    