Reshaping Tables: Pivots
Why Learn Reshaping in Exploratory Data Analysis?
In real-world data collection, tables are often designed for humans to read easily, not for computers to analyze.
Imagine you have a dataset of monthly company profits:
- Human-readable layout (Wide): Columns are named
Company,Jan_Profit,Feb_Profit,Mar_Profit, and so on. - Problem: If you want to plot a line chart of profit over time,
ggplot2requires a single column forMonthand a single column forProfit. In this wide format, you cannot easily map these columns to your visual variables!
You need to reshape the table, turning column headers (Jan, Feb, Mar) into row values in a single Month column.
In the Tidyverse, we use the tidyr package to transform data between wide and long structures and tidy up messy cells. Let's learn how to restructure our tables.
1. The Rules of Tidy Data
Hadley Wickham defined three rules that make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
2. Pivoting Longer: pivot_longer()
pivot_longer() reduces the number of columns and increases the number of rows. It "melts" wide tables into long tables.
Key Arguments
cols: The columns you want to pivot (can use selection helpers).names_to: The name of the new column that will store the pivoted column headers (as character strings).values_to: The name of the new column that will store the cell values inside those headers.
library(tidyverse)
# Create a wide table
wide_data <- tibble(
Country = c("USA", "Canada"),
`2020` = c(45000, 38000),
`2021` = c(48000, 41000)
)
print(wide_data)
# Reshape into tidy (long) format
long_data <- wide_data |>
pivot_longer(
cols = c(`2020`, `2021`),
names_to = "Year",
values_to = "GDP"
)
print(long_data)
3. Pivoting Wider: pivot_wider()
pivot_wider() increases the number of columns and decreases the number of rows. It is the opposite of pivot_longer(). Use it when a single observation is scattered across multiple rows.
Key Arguments
names_from: The column that contains the category values which will become the new column headers.values_from: The column containing the cell values that will populate those new columns.
# Create a long table where variables (TMAX, TMIN) are values in a column
long_weather <- tibble(
City = c("Boston", "Boston", "Chicago", "Chicago"),
Metric = c("TMAX", "TMIN", "TMAX", "TMIN"),
Temp = c(75, 45, 82, 50)
)
print(long_weather)
# Reshape back into wide format where TMAX and TMIN are distinct columns
wide_weather <- long_weather |>
pivot_wider(
names_from = Metric,
values_from = Temp
)
print(wide_weather)
4. Splitting and Merging Columns: separate_wider_*
Sometimes a single column contains multiple variables merged together (e.g. a date column containing "2026/06/20", or a location coordinate "42.36,-71.05").
Splitting Columns: separate_wider_delim()
separate_wider_delim(cols, delim, names) splits a column into multiple columns based on a separator delimiter:
# Table containing combined coordinates
gps_data <- tibble(City = "Boston", LatLong = "42.36,-71.05")
# Separate into Latitude and Longitude columns
clean_gps <- gps_data |>
separate_wider_delim(
cols = LatLong,
delim = ",",
names = c("Latitude", "Longitude")
)
print(clean_gps)
Expanding Cells to Rows: separate_longer_delim()
If a single cell contains a comma-separated list of items, you can split it, creating a separate row for each item:
# One row containing multiple tags
tags_data <- tibble(Article = "Intro to R", Tags = "DataWrangling,ggplot2,Tidyverse")
# Expand to multiple rows
expanded_tags <- tags_data |>
separate_longer_delim(cols = Tags, delim = ",")
print(expanded_tags)
5. Handling Missing Values in Tables
Wrangling tables often exposes missing values (NA). tidyr provides functions to clean them:
drop_na(..., cols): Removes all rows that containNAin the specified columns. If no columns are specified, it drops rows with anyNAanywhere in the table.replace_na(replace = list(col = value)): ReplacesNAvalues in specific columns with a constant default:
messy_table <- tibble(
ID = 1:3,
Sales = c(100, NA, 300),
Region = c("East", "West", NA)
)
# 1. Drop rows where Sales is missing
clean_sales <- messy_table |> drop_na(Sales)
# 2. Replace missing Region with "Unknown" and missing Sales with 0
filled_table <- messy_table |>
replace_na(list(
Sales = 0,
Region = "Unknown"
))
print(filled_table)
Hands-on Exercises
Exercise 1: Reshaping Sales Years
A company records sales for different divisions over three years in wide format:
divisions <- tibble(Division = c("HR", "Tech"), Yr2023 = c(500, 1200), Yr2024 = c(550, 1350))
Write R code using pivot_longer() to:
- Reshape the table into long format.
- Group the years (
Yr2023,Yr2024) into a new column calledYear. - Save the values into a column called
Revenue. - Print the final reshaped long table.
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)
divisions <- tibble(
Division = c("HR", "Tech"),
Yr2023 = c(500, 1200),
Yr2024 = c(550, 1350)
)
divisions |>
pivot_longer(
cols = starts_with("Yr"),
names_to = "Year",
values_to = "Revenue"
)
Exercise 2: Splitting Combined Codes
A dataset contains transaction keys that pack product ID and region code together with an underscore:
logs <- tibble(Log_ID = c(101, 102), Product_Region = c("PROD1_East", "PROD2_West"))
Write R code to:
- Separate
Product_Regioninto two new columns named"ProductID"and"Region". - Filter the result to show only rows where
Regionis"West".
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)
logs <- tibble(
Log_ID = c(101, 102),
Product_Region = c("PROD1_East", "PROD2_West")
)
logs |>
separate_wider_delim(
cols = Product_Region,
delim = "_",
names = c("ProductID", "Region")
) |>
filter(Region == "West")