YARN: The Operating System for Hadoop

If HDFS is the distributed file system for a Hadoop cluster, then YARN (Yet Another Resource Negotiator) is its distributed operating system. Introduced in Hadoop 2.0, YARN is the architectural center of Hadoop, responsible for managing and allocating cluster resources, chiefly memory and CPU (virtual cores).

[Figure: YARN, the distributed OS]

What is YARN?

Before YARN, Hadoop could run only MapReduce batch jobs, because resource management and job scheduling were built into the MapReduce framework itself. YARN transformed Hadoop into a true multi-purpose data platform by decoupling resource management from the processing framework.

Think of it like this: a computer's OS (like Windows or Linux) manages which applications get to use the CPU and memory. YARN does the same for a cluster, allowing many different kinds of applications, including batch (MapReduce), interactive (Spark), streaming (Storm), and long-running services (HBase), to run simultaneously on the same cluster, sharing the same data in HDFS.

This capability is what enables the modern "data lake" architecture, where a single, central Hadoop cluster can serve a wide variety of workloads and users.

[Figure: YARN's Role in the Hadoop Ecosystem]

YARN's Core Components

YARN has a master/slave architecture with three primary components:

ResourceManager (Master): Each cluster has one active ResourceManager (a standby can be configured for high availability). It is the ultimate authority that arbitrates resources among all the applications in the cluster. It has two main sub-components: the Scheduler, which allocates resources to running applications, and the ApplicationsManager, which accepts job submissions and negotiates the first container for each application's ApplicationMaster.
NodeManager (Slave): A NodeManager runs on each worker node in the cluster. It is responsible for launching the containers where the application's work is done, monitoring their resource usage, and reporting that usage to the ResourceManager through periodic heartbeats.
ApplicationMaster (Per-Application Master): Each application that runs on YARN gets its own dedicated ApplicationMaster. This component is responsible for negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the application's tasks.

[Figure: YARN's Major Components]
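To make the division of labor between the ResourceManager and the NodeManagers more concrete, here is a minimal toy model in Python. It is only a sketch, not YARN's actual API: the class names mirror the components described above, but every method name (`heartbeat`, `on_heartbeat`, `allocate`) and the first-fit placement policy are assumptions made for illustration. The ApplicationMaster is left out to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Resource:
    """A resource amount in two dimensions: memory and virtual cores."""
    memory_mb: int
    vcores: int

    def fits_in(self, other: Resource) -> bool:
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other: Resource) -> Resource:
        return Resource(self.memory_mb - other.memory_mb, self.vcores - other.vcores)


@dataclass
class NodeManager:
    """Toy per-node agent: knows its free capacity and reports it."""
    node_id: str
    available: Resource

    def heartbeat(self) -> tuple:
        # Hypothetical heartbeat payload: (node id, currently free resources).
        return self.node_id, self.available


@dataclass
class ResourceManager:
    """Toy cluster-wide authority: tracks node capacity, grants containers."""
    node_view: dict = field(default_factory=dict)

    def on_heartbeat(self, node_id: str, available: Resource) -> None:
        self.node_view[node_id] = available

    def allocate(self, request: Resource) -> Optional[str]:
        # First-fit over node ids in sorted order (an assumed, simplistic policy).
        for node_id in sorted(self.node_view):
            free = self.node_view[node_id]
            if request.fits_in(free):
                self.node_view[node_id] = free.minus(request)
                return node_id
        return None  # nothing fits right now


rm = ResourceManager()
for nm in (NodeManager("node-1", Resource(2048, 2)),
           NodeManager("node-2", Resource(8192, 8))):
    rm.on_heartbeat(*nm.heartbeat())

print(rm.allocate(Resource(4096, 4)))  # node-2, because node-1 is too small
```

The point of the sketch is that the ResourceManager never runs work itself: it only keeps a view of free capacity, fed by NodeManager reports, and decides where a container may go.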

How a YARN Application Runs

Here is a step-by-step breakdown of a typical job execution flow in YARN:

1. Job Submission: A client submits an application (e.g., a MapReduce job) to the ResourceManager.
2. ApplicationMaster Launch: The ResourceManager allocates a container (a specific amount of resources on a worker node) and instructs a NodeManager to launch the ApplicationMaster in it.
3. Resource Negotiation: The ApplicationMaster, now running, calculates the resources it needs for its tasks and requests them from the ResourceManager.
4. Container Allocation: The ResourceManager's Scheduler finds available resources on various NodeManagers and grants the containers to the ApplicationMaster.
5. Task Execution: The ApplicationMaster then contacts the NodeManagers directly to launch its tasks within the allocated containers.
6. Monitoring: Throughout the process, the ApplicationMaster monitors the status of its tasks and reports progress back to the client. The NodeManagers monitor their resource usage and report back to the ResourceManager.

[Figure: YARN Application Flow]

This architecture is highly scalable because the central ResourceManager only handles scheduling, while the application-specific logic and task management are distributed to the individual ApplicationMasters.
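The execution flow described above can be walked through with a small simulation. This is a hedged sketch rather than real YARN client code: the classes `Node`, `ResourceManager`, and `ApplicationMaster`, their methods, and the memory-only, first-fit allocation are all hypothetical simplifications. Notice that the ResourceManager only reserves capacity, while the ApplicationMaster decides what to run and talks to the nodes itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy NodeManager: tracks free memory and the containers it launched."""
    name: str
    free_mb: int
    running: list = field(default_factory=list)

    def launch(self, label: str) -> None:
        self.running.append(label)


class ApplicationMaster:
    """Toy per-application master: requests containers and launches tasks."""

    def __init__(self, app_name: str, rm: ResourceManager) -> None:
        self.app_name = app_name
        self.rm = rm

    def run(self, task_mem_mb: int, num_tasks: int) -> list:
        placements = []
        for i in range(num_tasks):
            # Resource negotiation and container allocation.
            node = self.rm.allocate(task_mem_mb)
            if node is None:
                break  # a real AM would wait and retry; this sketch stops
            # Task execution: the AM contacts the node directly.
            node.launch(f"{self.app_name}/task-{i}")
            placements.append(node.name)
        # Monitoring: here we simply report where the tasks were placed.
        return placements


class ResourceManager:
    """Toy ResourceManager: only does capacity bookkeeping."""

    def __init__(self, nodes: list) -> None:
        self.nodes = nodes

    def allocate(self, mem_mb: int) -> Optional[Node]:
        for node in self.nodes:  # first-fit, an assumed policy
            if node.free_mb >= mem_mb:
                node.free_mb -= mem_mb
                return node
        return None

    def submit(self, app_name: str, am_mem_mb: int,
               task_mem_mb: int, num_tasks: int) -> list:
        # Job submission and ApplicationMaster launch.
        am_node = self.allocate(am_mem_mb)
        if am_node is None:
            raise RuntimeError("no capacity for the ApplicationMaster")
        am_node.launch(f"{app_name}/AM")
        return ApplicationMaster(app_name, self).run(task_mem_mb, num_tasks)


nodes = [Node("node-1", 4096), Node("node-2", 4096)]
rm = ResourceManager(nodes)
print(rm.submit("wordcount", am_mem_mb=1024, task_mem_mb=2048, num_tasks=3))
# ['node-1', 'node-2', 'node-2']
```

The ApplicationMaster itself occupies a container (1024 MB on node-1 here), which is why only one task fits on that node.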

Key Improvements in Hadoop 3

YARN Timeline Service v.2: Provides a more scalable and reliable system for storing and retrieving application information than its predecessor, making it easier to debug and monitor a wide variety of frameworks running on YARN.
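Opportunistic Containers: Allows YARN to dispatch containers to a NodeManager even when that node has no free resources at the moment. Such containers wait in a queue on the node until resources become available, and they can be preempted to make room for guaranteed containers. This improves overall cluster utilization by using idle resources without compromising the guarantees for critical applications.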
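The difference between guaranteed and opportunistic containers can be sketched on a single node. The model below is an illustration under stated assumptions, not the NodeManager's real implementation: `ToyNodeManager`, `Container`, and their methods are hypothetical names, memory is the only resource tracked, and the eviction order (newest opportunistic container first) is an arbitrary choice.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    mem_mb: int
    guaranteed: bool  # False means opportunistic


@dataclass
class ToyNodeManager:
    """Hypothetical single-node model; memory is the only resource tracked."""
    capacity_mb: int
    running: list = field(default_factory=list)
    queued: deque = field(default_factory=deque)
    killed: list = field(default_factory=list)

    def free_mb(self) -> int:
        return self.capacity_mb - sum(c.mem_mb for c in self.running)

    def start(self, c: Container) -> None:
        if not c.guaranteed:
            # Opportunistic: run if there is room now, otherwise queue.
            if c.mem_mb <= self.free_mb():
                self.running.append(c)
            else:
                self.queued.append(c)
            return
        guaranteed_mb = sum(r.mem_mb for r in self.running if r.guaranteed)
        if c.mem_mb > self.capacity_mb - guaranteed_mb:
            raise ValueError("guaranteed container does not fit on this node")
        # Guaranteed: evict opportunistic containers until the new one fits.
        for victim in [r for r in reversed(self.running) if not r.guaranteed]:
            if c.mem_mb <= self.free_mb():
                break
            self.running.remove(victim)
            self.killed.append(victim)
        self.running.append(c)

    def finish(self, name: str) -> None:
        self.running = [c for c in self.running if c.name != name]
        # Freed resources go to queued opportunistic containers, in FIFO order.
        while self.queued and self.queued[0].mem_mb <= self.free_mb():
            self.running.append(self.queued.popleft())


nm = ToyNodeManager(capacity_mb=4096)
nm.start(Container("g1", 2048, guaranteed=True))
nm.start(Container("o1", 2048, guaranteed=False))  # fits in idle memory
nm.start(Container("o2", 1024, guaranteed=False))  # node is full, so it queues
nm.start(Container("g2", 2048, guaranteed=True))   # preempts o1
nm.finish("g1")                                    # frees room for queued o2

print([c.name for c in nm.running])  # ['g2', 'o2']
print([c.name for c in nm.killed])   # ['o1']
```

In this sketch the opportunistic containers soak up memory that would otherwise sit idle, yet the guaranteed container `g2` still starts immediately when it arrives.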