Apache Hive: SQL on Hadoop

Apache Hive is a data warehouse infrastructure built on top of Hadoop. Its primary purpose is to provide a familiar, SQL-like interface, called HiveQL, for querying and analyzing massive datasets stored in distributed file systems like HDFS.

In essence, Hive projects a table-like structure onto data that may be stored in flat files (like CSVs or JSON). This allows analysts and developers who are comfortable with SQL to run ad-hoc queries, data summarization, and analysis on big data without having to write complex MapReduce or Spark code in Java or Python.

Hive provides a SQL interface for Hadoop

How Hive Works: Metastore and Execution Engines

Hive's architecture consists of two key components:

The Hive Metastore: This is the central repository for all of Hive's metadata. When you define a table in Hive, the schema (column names, data types, etc.), the location of the data files, and the storage format are all stored in the Metastore. This separation of metadata from data is what allows Hive to manage data stored in various formats across HDFS.
Execution Engines: Hive itself does not process the data. Instead, it acts as a translator. When you submit a HiveQL query, Hive's driver parses it, optimizes it, and then translates it into a job for a distributed execution engine. Users can choose which engine to use:
- Apache Hadoop MapReduce (Legacy): The original, batch-oriented engine.
- Apache Tez: A more advanced, faster framework that improves upon MapReduce by using Directed Acyclic Graphs (DAGs).
- Apache Spark: Hive can also use Spark as its execution engine, leveraging Spark's in-memory processing for faster query performance.

Hive translates SQL to jobs for different execution engines

The process looks like this: a user submits a HiveQL query, and under the hood, Hive generates and executes a series of MapReduce, Tez, or Spark tasks to get the result.

A Hive query being executed as a MapReduce job

Hive Tables: Managed vs. External

This is the most critical concept in Hive. When you define a table, you must decide if Hive will manage the data's lifecycle.

Managed (Internal) Tables

This is the default table type. When you create a managed table, you are telling Hive to take ownership of the data.

Data Location: Hive moves the data into its own warehouse directory (e.g., /user/hive/warehouse/).
Lifecycle: If you DROP a managed table, Hive deletes both the metadata and the underlying data files.

CREATE TABLE users (
  user_id INT,
  username STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Defining a managed table

External Tables

With an external table, you are telling Hive that you want to manage the data's lifecycle yourself.

Data Location: The data can reside in any location in HDFS that you specify with the LOCATION keyword. Hive does not move it.
Lifecycle: If you DROP an external table, Hive deletes only the metadata. The underlying data files remain untouched in their original location.

This makes external tables ideal for pointing Hive at data that is generated by another process or is shared among multiple tools.

CREATE EXTERNAL TABLE logs (
  timestamp BIGINT,
  level STRING,
  message STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/application_logs/';

Defining an external table with a location

HCatalog: The Shared Metastore

Apache HCatalog is a component of Hive that exposes the Hive Metastore as a central, shareable service for the entire Hadoop ecosystem. By using HCatalog, other tools like Spark, Pig, and custom MapReduce jobs can easily read and write data to Hive tables without having to worry about the underlying file format or storage location. It effectively turns the Hive Metastore into a universal metadata catalog for your data lake.

HCatalog provides a shared table abstraction