MOBI BOOT CAMP CORP. logoLearning Buddy
  • SIGN IN
  • Introduction
  • 1: Data Visualization with ggplot2
  • 2: Data Transformation with dplyr
  • 3: Data Tidying & Joins
    • Reshaping Tables: Pivots
    • Relational Data: Joins
  • 4: Exploratory Data Analysis
  • 5: Statistical Modeling
  • 6: Database Queries & SQL
  • 7: Interactive Dashboards
  • 8. Bad Visualization Examples
  • 9. Glossary

Relational Data: Joins

Why Learn Joins in Exploratory Data Analysis?

In professional databases, data is rarely stored in a single table. To prevent duplication and improve speed, databases normalize data across multiple tables.

Imagine you are analyzing flights:

  • Table A (Flights): Contains flight details like flight number, carrier code (e.g. "UA"), and departure delay.
  • Table B (Airlines): Maps carrier codes to full names (e.g. "UA" maps to "United Airlines").

To present a readable report, you need to combine these tables so you can print the full airline name next to the flight delay.

In dplyr, we combine tables using Joins. Joins use matching values in shared columns (known as Keys) to align rows. Let's learn mutating joins, filtering joins, and set operations.


1. Keys and Relational Data

  • Primary Key: A column (or set of columns) that uniquely identifies each observation in its own table (e.g. Carrier in the Airlines table).
  • Foreign Key: A column in one table that references the primary key of another table (e.g. carrier column in the Flights table).

2. Mutating Joins

Mutating joins add columns from a secondary table y to a primary table x based on matching keys.

Left Join: left_join(x, y, by = "key")

Keeps all observations in the left table x, regardless of whether they have a match in the right table y. This is the most commonly used join in data science. Mismatches are filled with NA.

library(tidyverse)

# Create two tables
flights <- tibble(flight_id = 1:3, carrier = c("UA", "AA", "DL"), delay = c(15, 30, 0))
airlines <- tibble(carrier = c("UA", "AA"), name = c("United Airlines", "American Airlines"))

# Join full names into flight records
flights_joined <- left_join(flights, airlines, by = "carrier")
print(flights_joined)
# Note: DL flight has NA in the 'name' column since it wasn't present in airlines!

Inner Join: inner_join(x, y, by = "key")

Keeps only rows that have matching keys in both tables. Rows with no matches are completely dropped:

# Drops the DL flight because "DL" is missing from the airlines table
inner_joined <- inner_join(flights, airlines, by = "carrier")
print(inner_joined)

Full Join: full_join(x, y, by = "key")

Keeps all rows from both tables. Mismatches on either side are filled with NA:

# Keeps all flight records and all airline records, regardless of matches
full_joined <- full_join(flights, airlines, by = "carrier")

Right Join: right_join(x, y, by = "key")

Keeps all observations in the right table y. It is equivalent to a left join with the tables reversed: right_join(x, y) is identical to left_join(y, x).


3. Filtering Joins

Filtering joins check for matches between tables, but they never add columns from the second table. They simply filter rows of the first table.

Semi Join: semi_join(x, y, by = "key")

Keeps all rows in x that have a match in y. It is useful to filter a dataset based on whether records exist in a secondary lookup table:

# Keeps only flights operated by carriers present in the airlines table (UA, AA)
flights_valid <- semi_join(flights, airlines, by = "carrier")
print(flights_valid)

Anti Join: anti_join(x, y, by = "key")

Keeps all rows in x that do not have a match in y. This is an extremely powerful debugging tool to find mismatch anomalies or invalid foreign keys:

# Find all flights that have a carrier code not defined in our airlines database
mismatched_flights <- anti_join(flights, airlines, by = "carrier")
print(mismatched_flights) # DL flight

4. Set Operations

Set operations compare rows of two tables. Both tables must have the exact same columns:

  • intersect(x, y): Returns only rows that appear in both tables.
  • union(x, y): Returns all unique rows from both tables, removing duplicate rows.
  • setdiff(x, y): Returns rows that are present in x but not in y.
t1 <- tibble(Name = c("Alice", "Bob"), Age = c(25, 30))
t2 <- tibble(Name = c("Bob", "Charlie"), Age = c(30, 22))

print(intersect(t1, t2)) # Returns Bob (30)
print(union(t1, t2))     # Returns Alice, Bob, Charlie (3 rows)
print(setdiff(t1, t2))   # Returns Alice (1 row)

Hands-on Exercises

Exercise 1: Mapping Flight Airports

You have two tables: flights <- tibble(ID = 101:103, Origin = c("JFK", "LAX", "ORD"), Dep_Time = c(900, 1430, 1800)) airports <- tibble(Code = c("JFK", "ORD"), Airport_Name = c("John F. Kennedy", "O'Hare")) Write R code using left_join() to:

  1. Join the airports table into the flights table based on origin codes. (Hint: Since columns have different names, use by = join_by(Origin == Code) or by = c("Origin" = "Code")).
  2. Print the joined table.
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)

flights <- tibble(ID = 101:103, Origin = c("JFK", "LAX", "ORD"), Dep_Time = c(900, 1430, 1800))
airports <- tibble(Code = c("JFK", "ORD"), Airport_Name = c("John F. Kennedy", "O'Hare"))

left_join(flights, airports, by = c("Origin" = "Code"))

Exercise 2: Auditing Missing Registrations

Find all students who submitted homework but have not registered in the class database. homeworks <- tibble(Email = c("alice@umich.edu", "bob@umich.edu", "charlie@umich.edu")) roster <- tibble(Email = c("alice@umich.edu", "charlie@umich.edu"), Registered = TRUE) Write R code using an anti_join() to:

  1. Identify students in the homeworks table whose emails are not present in the roster table.
  2. Print the resulting audited email vector.
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)

homeworks <- tibble(Email = c("alice@umich.edu", "bob@umich.edu", "charlie@umich.edu"))
roster <- tibble(Email = c("alice@umich.edu", "charlie@umich.edu"), Registered = TRUE)

# Anti-join filters homework entries that have no match in roster
unregistered <- anti_join(homeworks, roster, by = "Email")
print(unregistered) # bob@umich.edu
Privacy Policy | Terms & Conditions