Normalization

Normalization is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy and improve data integrity. It involves dividing larger tables into smaller, well-structured tables and defining relationships between them.

The primary goals of normalization are to:

Eliminate Redundant Data: Storing the same piece of data in multiple places is wasteful and can lead to inconsistencies.
Prevent Data Anomalies: Anomalies are problems that can occur during data insertion, updates, or deletion. For example, if a customer's address is stored with every order they place, changing their address would require updating multiple records, and it would be easy to miss one.
Improve Data Integrity: By ensuring that data is stored logically and consistently, normalization makes the database more reliable.

Normalization is achieved by following a set of rules called "normal forms." Let's walk through the most common ones with a Example.

Example: A Sales Order Table

Imagine we start with a single, unnormalized table to track sales orders.

Unnormalized Table

OrderID	OrderDate	CustomerID	CustomerName	CustomerEmail	ProductID, ProductName, Quantity, Price
101	2025-09-24	55	Alice Smith	alice@email.com	(P1, Apple, 10, 0.50), (P2, Banana, 5, 0.25)
102	2025-09-24	82	Bob Johnson	bob@email.com	(P3, Carrot, 20, 0.10)
103	2025-09-25	55	Alice Smith	alice@email.com	(P1, Apple, 15, 0.50)

This table has several problems:

The last column contains multiple values (a repeating group), which is hard to query.
Customer information (Name, Email) is repeated for every order they place.

First Normal Form (1NF): Eliminate Repeating Groups

Rule: Each cell must hold a single, atomic value, and each record must be unique.

To achieve 1NF, we break the repeating product information into separate rows.

1NF Orders Table

OrderID	OrderDate	CustomerID	CustomerName	CustomerEmail	ProductID	ProductName	Quantity	Price
101	2025-09-24	55	Alice Smith	alice@email.com	P1	Apple	10	0.50
101	2025-09-24	55	Alice Smith	alice@email.com	P2	Banana	5	0.25
102	2025-09-24	82	Bob Johnson	bob@email.com	P3	Carrot	20	0.10
103	2025-09-25	55	Alice Smith	alice@email.com	P1	Apple	15	0.50

Problem Solved: Each cell now has a single value. New Problem: We have a lot of redundant data. CustomerName, OrderDate, etc., are repeated for each product in an order. Also, ProductName and Price are repeated every time a product is ordered.

Second Normal Form (2NF): Remove Partial Dependencies

Rule: Must be in 1NF, and all non-key attributes must be fully dependent on the entire primary key. (This rule only applies when there is a composite primary key).

Our primary key for the 1NF table is (OrderID, ProductID).

OrderDate, CustomerID, CustomerName, and CustomerEmail depend only on OrderID (a part of the primary key). This is a partial dependency.
ProductName and Price depend only on ProductID (another part of the primary key). This is also a partial dependency.

To achieve 2NF, we split the table into three separate tables:

Orders Table

OrderID	OrderDate	CustomerID
101	2025-09-24	55
102	2025-09-24	82
103	2025-09-25	55

Order_Items Table

OrderID	ProductID	Quantity
101	P1	10
101	P2	5
102	P3	20
103	P1	15

Products Table

ProductID	ProductName	Price
P1	Apple	0.50
P2	Banana	0.25
P3	Carrot	0.10

Problem Solved: We've eliminated the partial dependencies. New Problem: In the Orders table, we still have a transitive dependency.

Third Normal Form (3NF): Remove Transitive Dependencies

Rule: Must be in 2NF, and every non-key attribute must depend only on the primary key, not on other non-key attributes.

In our new Orders table, the primary key is OrderID.

CustomerID depends on OrderID.
However, CustomerName and CustomerEmail depend on CustomerID, which is a non-key attribute. This is a transitive dependency.

To achieve 3NF, we split the Orders table again:

Customers Table

CustomerID	CustomerName	CustomerEmail
55	Alice Smith	alice@email.com
82	Bob Johnson	bob@email.com

Orders Table (Final)

OrderID	OrderDate	CustomerID
101	2025-09-24	55
102	2025-09-24	82
103	2025-09-25	55

Our database is now in 3NF, with four well-structured tables (Customers, Products, Orders, Order_Items). Data redundancy is minimized.

ERD for Normalized Database (3NF)

The Modern Trend: Denormalization for Performance

Historically, the primary driver for normalization was to save disk space, which was incredibly expensive. Today, storage is cheap, but computational power (CPU) and query response time are often more critical.

A highly normalized database often requires complex JOIN operations to retrieve data. For example, to display a full order summary, we would need to join all four of our tables. For a database under heavy load, performing thousands of these complex joins per second can be very slow.

This has led to the practice of denormalization: the process of intentionally adding redundant data back into a database to improve query performance.

Why Denormalize? The goal is to reduce the number of joins needed for frequent queries.

Example: When displaying a customer's order history, we almost always need the product's name and price. In our 3NF design, getting this requires joining Order_Items with the Products table.

A denormalized approach might be to add ProductName and Price back into the Order_Items table:

Order_Items Table (Denormalized)

OrderID	ProductID	ProductName	Quantity	PriceAtTimeOfSale
101	P1	Apple	10	0.50
101	P2	Banana	5	0.25

Advantages:

Faster Reads: To get an order's details, we now only need to query this one table. No join is needed, which is a huge performance win for a very common operation.
Historical Accuracy: Storing the price in this table is actually a good thing. What if the price of an Apple changes in the Products table tomorrow? This denormalized table preserves the price at the time of the sale.

Disadvantages:

Increased Storage: We are now storing the product name and price multiple times.
More Complex Updates: If a product's name is corrected (e.g., a typo fix), we would theoretically need to update it in every Order_Items record. In practice, this is often deemed an acceptable trade-off.

Today, database design is a balancing act. While normalization provides a crucial foundation for data integrity, designers often selectively denormalize their databases to optimize for the most frequent and critical read operations, accepting a small amount of redundancy in exchange for speed. This philosophy is also a core principle behind many NoSQL database designs.