Introduction

The relational database management system (RDBMS) was proposed by Edgar Codd of IBM Research around 1969. For several decades after that, RDBMS became the de facto standard for storing and retrieving data. Today, there are many RDBMS products like Oracle, IBM DB2, and Microsoft SQL Server that are commercial RDBMS solutions. There are also many open-source solutions, like MySQL, and also embedded databases like SQLite and Apache Derby.

Why is it called Relational?

A relational database organizes data in tables. A table is made up of rows and columns. A table is also called an entity. Here is an example of a database table called 'Customers' which has two rows and four columns:

Table (Entity) name: Customers

firstname	lastname	address	gender
John	Doe	123, anywhere street, anycity, anycountry	male
Jane	Doe	123, anywhere street, anycity, anycountry	female

A row is also called a record, and a column is also called a field. The four columns are 'firstname', 'lastname', 'address', and 'gender'.

What is SQL?

The language to create, read, update, and delete (a.k.a. CRUD) data and tables is called Structured Query Language (SQL).

Throughout the years, SQL has evolved, and to ensure some standards in SQL across the different vendors and to define some best practices, the American National Standards Institute (ANSI) established standards for SQL, which is called ANSI SQL. If a vendor is ANSI SQL-compliant, then a query that runs and fetches results on one database, like DB2, will also run the same way against the same table in another ANSI SQL-compliant database, like MySQL. Reference: https://www.ansi.org/

How are relationships formed?

Multiple tables can have relationships with each other. Establishing relationships between tables helps to store data efficiently. Let us take an example. Let us assume the customer table that we created earlier belongs to an eCommerce website like Amazon, and we also need to hold the order information of their purchases. Let us say the first customer purchased two orders over a period of a month, and the second customer placed 3 orders during the same time. So let us add the extra fields to hold the order information:

firstName	lastName	address	gender	orderID	itemsPurchased	date
John	Doe	123, anywhere street, anycity, anycountry	male	1	'sql book'	12/01/2019
Jane	Doe	123, anywhere street, anycity, anycountry	female	3	'html book'	12/02/2019
John	Doe	123, anywhere street, anycity, anycountry	male	2	'java book'	12/05/2019
Jane	Doe	123, anywhere street, anycity, anycountry	female	4	'css book'	12/06/2019
Jane	Doe	123, anywhere street, anycity, anycountry	female	5	'javascript book'	12/08/2019

Is this an efficient way of storing data? What is the problem?

The firstName, lastName, address, and gender are repeating in multiple rows. For a few rows, it does not matter if the data is repeating, but for large datasets, it is indeed not an efficient way of storing the data. How can we solve this? We can solve this by creating another table called 'Orders' and then creating a relationship between the two tables. And this is how you would do it:

Table: Customers

customerID	firstName	lastName	address	gender
1	John	Doe	123, anywhere street, anycity, anycountry	male
3	Jane	Doe	123, anywhere street, anycity, anycountry	female

Table: Orders

customerID	orderID	itemsPurchased	date
1	1	'sql book'	12/01/2019
2	3	'html book'	12/02/2019
1	2	'java book'	12/05/2019
2	4	'css book'	12/06/2019
2	5	'javascript book'	12/08/2019

By arranging the tables this way, we do not have the customer's firstName, lastName, address, and gender repeating in multiple rows. And we had to add an extra column called customerID that corresponds to a unique ID of a specific customer. And this ID is again used in the Orders table to identify which order belongs to which customer.

In other words, we created two tables and formed a relationship between them. The customerID is a column that is identified as the Primary Key. A Primary Key column is not allowed to have duplicate values. Each primary key identifies a unique customer record. This primary key, which identifies a unique customer record, is used in the Orders table to help associate a specific order that the customer with that primary key placed. The customerID column in the Orders table can have duplicate values as the same customer may be placing multiple orders. When this customerID in the Orders table is constructed such that it is restricted to only contain one of the values of the customerID key in the Customers table, we say that there is a relationship formed between the two tables. The customerID key in the Orders table is called the 'foreign key' as it contains only those values that are present in the customerID column of the Customers table.

erd diagram

The above diagram is called an Entity Relationship Diagram (ERD). This relationship is a 'one-to-many' type of a relationship because for every one primary key value in the Customers table, there may be zero to many foreign keys associated with it in the Orders table.

This way of arranging the table and data has reduced the need to store the same piece of data in more than one record or field. Duplicate data not only needs unnecessary extra storage space but can also give room for data inconsistencies, as it is hard to maintain changes in all copies without missing any.

Properties of the Primary Key

In the relational model, a table cannot contain duplicate rows, because that would create ambiguities in retrieval. To ensure uniqueness, each table should have a column (or a set of columns), called the primary key, that uniquely identifies every record of the table. For example, a unique number customerID can be used as the primary key for the Customers table; productID for the Products table; isbn for the Books table. A primary key is called a simple key if it is a single column; it is called a composite key if it is made up of several columns.

When a column is identified as a primary key or a unique key, an index is built to help in search and retrieval.

A designer chooses the primary key based on these principles:

The values of the primary key must be unique (i.e., no duplicate values). For example, customerName may not be appropriate to be used as the primary key for the Customers table, as there could be two customers with the same name.
The primary key must always have a valid value and should not contain NULL.

Typically:

The primary key should be simple and familiar, e.g., employeeID for an employees table and isbn for a books table.
The value of the primary key should not change. Primary key values are used as Foreign Keys in other child tables. Changing a primary key value will necessitate changing values in the child table. For example, phoneNumber may not be appropriate to be used as a primary key to identify a unique customer in a Customers table, because a phone number can be reassigned to another person by the telephone company in the future.
The primary key often uses an integer (or number) type. But it could also be other types, such as texts. However, it is best to use a numeric column as the primary key for efficiency.
Most databases support auto-increment, a built-in function that generates a contiguous number that can be used for Primary Keys, where the current value + 1 is assigned to the new record. This arbitrary number does not have a domain significance, unlike domain-specific values like a phone number. Although autoincrement numbers are still a convenient way of assigning primary keys.
A simple primary key is best, although composite primary keys are also in vogue. You should use as few columns as possible.

What is Normalization?

The above step, where we reduced data redundancy while still ensuring the database is structurally correct but optimally stored, is called Normalizing a database.

When you go down the path of Normalization, there are several forms that you would encounter, and here are the traditionally recognized normal forms:

First Normal Form (1NF): A table is 1NF if every cell contains a single value, not a list of values. 1NF also prohibits a repeating group of columns such as item1, item2,.., itemN. Instead, you should create another table using a one-to-many relationship.

Second Normal Form (2NF): A table is 2NF if it is 1NF and every non-key column is fully dependent on the primary key. If the primary key is made up of several columns, every non-key column must depend on the entire set and not part of it. For example, if the primary key of the Orders table is made up of the orderID and productID columns, and if the discountRate on a product is dependent only on productID and not on productID and orderID, then the discountRate column must not be kept in the Orders table and should be moved to the Products table.

Third Normal Form (3NF): A table is 3NF if it is 2NF and the non-key columns are independent of each other and are only dependent on the primary key and nothing else. For example, suppose that we have a Products table with columns productID (primary key), name, and unitPrice. The column discountRate must not belong to the Products table if it is also dependent on the unitPrice, which is not part of the primary key.

Higher Normal Forms: While it is more popular to keep tables up to the 3rd Normal Form, you can still take it beyond the 3rd Normal Form, although that is beyond the scope of this course.

Relational Database Design Process

Database design is more art than science and will take a few iterations before you can narrow down on the best design. Since every application is different, there is no standard way of designing your database. Every ERD will be uniquely addressing the specific business entities.

The following steps broadly outline the steps in the process:

Step 1: Organize Data in Tables by Identifying the Primary Keys for Each Table

Organize the data into entity-based tables. Choose one or more columns as the primary key, that uniquely identifies every row.

Step 2: Create Relationships among Tables

Now that we know the concept of a Primary Key and the corresponding Foreign key is the fundamental technique to create relationships between tables, let us now understand the different types of relationships:

Some of the most common relationships are:

one-to-many

one to many

We have already seen this relationship in the example above. For every primary key value in the Customers table, we can have more than one row associated with that primary key in the Orders table, as a particular Customer may place more than one order over a period of time.

many-to-many In our ecommerce database, a customer may have placed an order for one or more products, and at the same time, a particular product can appear in multiple orders of the same or different customers. This kind of relationship between the Orders table and the Products table is a many-to-many relationship. However, this kind of many-to-many relationship cannot be shown directly. Instead, you have to create an 'associative' or 'junction' table which has a one-to-many relationship with both these tables. This one-to-many relationship is through a primary-key/foreign-key tie-up, just like the one-to-many relationship shown above.

many-to-many

In the above diagram, OrderDetails is the junction table that has a one-to-many relationship with both the Products and Orders tables.

one-to-one With a one-to-one relationship, as the name suggests, there is exactly one foreign key value in the child table for every one primary key value in the parent table. This is useful when you want to split the row into two tables because of security reasons or if one set of data pertaining to the row is often queried compared to other fields, etc.

In the below example, the LoginDetails table has the sensitive password information which should not appear in every select query of the customer. Although the password fields would be encrypted and then saved in the database, it is still not information that needs to be in the query results for every customer query, and hence a separate LoginDetails table that holds the user's login information is more apt.

You could use a one-to-one relationship to split the data into two tables. A one-to-one relationship is also useful for storing certain sensitive data in a secure table, while the non-sensitive ones are in the main table.

one to one

Final SQL Script

use ecommerce;
create table Customers (
    customerID int NOT NULL auto_increment,
    firstName varchar(100),
    lastName varchar(100),
    gender varchar(1),
    PRIMARY KEY (customerID)
);

create table Orders (
    orderID int NOT NULL auto_increment,
    customerID int NOT NULL,
    purchasedDate date,
    price int NOT NULL,
    PRIMARY KEY (orderID),
    FOREIGN KEY (customerID) REFERENCES Customers(customerID)
);
create table Products (
    productID int NOT NULL auto_increment,
    productName varchar(255) NOT NULL,
    availableStockQty int,
    PRIMARY KEY (productID)
);

create table OrderDetails (
    orderID int NOT NULL,
    productID int NOT NULL,
    qty int NOT NULL,
    PRIMARY KEY (orderID, productID),
    FOREIGN KEY (orderID) REFERENCES Orders(orderID),
    FOREIGN KEY (productID) REFERENCES Products(productID)
);

create table LoginDetails (
    customerID int NOT NULL,
    loginId varchar(45) NOT NULL,
    password varchar(255) NOT NULL,
	PRIMARY KEY (customerID),
    FOREIGN KEY (customerID) REFERENCES Customers(customerID)
);

create table Address (
    addressID int NOT NULL auto_increment,
    address varchar(255),
	customerID int NOT NULL,
    addressType varchar(1),
    PRIMARY KEY (addressID),
    FOREIGN KEY (customerID) REFERENCES Customers(customerID)
);

final erd

Integrity Rules

You should also apply the integrity rules to check the integrity of your design:

Entity Integrity Rule: The primary key cannot contain NULL. Otherwise, it cannot uniquely identify the row. For a composite key made up of several columns, none of the columns can contain NULL. Most RDBMSs check and enforce this rule.

Referential Integrity Rule: Each foreign key value must be matched to a primary key value in the table referenced (or parent table).

You can insert a row with a foreign key in the child table only if the value exists in the parent table. If the value of the key changes in the parent table (e.g., the row is updated or deleted), all rows with this foreign key in the child table(s) must be handled accordingly. You could either (a) disallow the changes; (b) cascade the change (or delete the records) in the child tables accordingly; or (c) set the key value in the child tables to NULL. Most RDBMSs can be set up to perform the check and ensure referential integrity in the specified manner.

Business Logic Integrity: Besides the above two general integrity rules, there could be integrity (validation) pertaining to the business logic, e.g., a zip code must be 5 digits within a certain range, delivery date and time must fall in the business hours; quantity ordered must be equal to or less than the quantity in stock, etc. These could be carried out in a validation rule (for the specific column) or in the programming logic.

Column Indexing

You can create an index on selected column(s) to facilitate data searching and retrieval. An index is a structured file that speeds up data access for SELECT but may slow down INSERT, UPDATE, and DELETE. Without an index structure, to process a SELECT query with a matching criterion (e.g., SELECT * FROM Customers WHERE name='Tan Ah Teck'), the database engine needs to compare every record in the table. A specialized index (e.g., in a B-TREE structure) can reach the record without comparing every record. However, the index needs to be rebuilt whenever a record is changed, which results in overhead associated with using indexes.

An index can be defined on a single column, a set of columns (called a concatenated index), or part of a column (e.g., the first 10 characters of a VARCHAR(100)) (called a partial index). You can build more than one index in a table. For example, if you often search for a customer using either customerName or phoneNumber, you could speed up the search by building an index on the column customerName, as well as phoneNumber. Most RDBMSs build an index on the primary key automatically.

Interactive SQL references

Basic concepts: https://mystery.knightlab.com/walkthrough.html
Advanced concepts: https://selectstarsql.com/
More practice: https://mystery.knightlab.com/