Grouping & Summarizing
Why Learn Grouping and Summarization in Exploratory Data Analysis?
Summary metrics for an entire column (like the average price of all transactions) are helpful, but they hide how subgroups differ from one another.
Imagine you are analyzing credit card defaults across different age brackets.
- If you calculate the average default rate across the entire table, you get a single number.
- To discover risk profiles, you need to group the data by age bracket, and then summarize the default rate for each bracket.
In dplyr, we perform this "split-apply-combine" pattern using the group_by() and summarize() verbs. Let's learn how to compute statistics across subgroups, count categories, and apply calculations to multiple columns at once using the across() helper.
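As a minimal sketch of the credit card scenario above, the snippet below compares a table-wide default rate with a per-bracket one. The `defaults` table, its column names, and its six rows are hypothetical and exist only for illustration.

```r
library(tidyverse)

# Hypothetical data: six made-up customers, for illustration only
defaults <- tibble(
  age_bracket = c("18-29", "18-29", "30-49", "30-49", "50+", "50+"),
  defaulted   = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# One number for the whole table
overall <- defaults |>
  summarize(default_rate = mean(defaulted))

# One number per age bracket
by_bracket <- defaults |>
  group_by(age_bracket) |>
  summarize(default_rate = mean(defaulted))

overall
by_bracket
```

Taking mean() of a logical column gives the share of TRUE values, which is why it works as a rate here.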
1. Summarizing Columns: summarize()
On an ungrouped data frame, summarize() collapses all of the rows into a single row of summary values.
library(tidyverse)
# Calculate summary stats for the whole mpg dataset
mpg |>
  summarize(
    avg_hwy = mean(hwy),
    max_hwy = max(hwy),
    total_cars = n() # Counts the number of rows
  )
Core Aggregation Functions
- mean(x)/median(x): Central tendencies.
- sd(x)/var(x): Standard deviation and variance.
- min(x)/max(x): Range limits.
- n(): Returns the count of rows/elements.
- n_distinct(x): Returns the count of unique values.
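A short sketch that combines several of these functions in one summarize() call on mpg. The output column names are arbitrary choices for this example.

```r
library(tidyverse)

spread_stats <- mpg |>
  summarize(
    median_hwy = median(hwy),                   # central tendency
    sd_hwy = sd(hwy),                           # spread
    min_hwy = min(hwy),                         # lower range limit
    n_manufacturers = n_distinct(manufacturer)  # number of unique manufacturers
  )

spread_stats
```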
If a column contains NA values, statistical functions like mean() return NA. To compute the statistic while ignoring missing values, pass the argument na.rm = TRUE (e.g. mean(amount, na.rm = TRUE)).
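The effect of na.rm can be seen on a small made-up table. The `transactions` tibble and its `amount` values below are hypothetical. mpg is not used here because the examples in this lesson run mean() on it without na.rm.

```r
library(tidyverse)

# Hypothetical transactions; one amount is missing
transactions <- tibble(amount = c(10, 20, NA, 30))

na_demo <- transactions |>
  summarize(
    with_na = mean(amount),                   # NA, because one value is missing
    without_na = mean(amount, na.rm = TRUE),  # mean of the three known values
    n_missing = sum(is.na(amount))            # how many values were skipped
  )

na_demo
```

Reporting the number of missing values next to the statistic makes it visible how much data the summary ignored.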
2. Grouped Aggregations: group_by()
When combined with group_by(), summarize() performs calculations within each group/category instead of the entire table:
# Group by car class, then calculate the average highway mileage for each class
mpg |>
  group_by(class) |>
  summarize(
    avg_hwy = mean(hwy),
    count = n()
  )
You can group by multiple variables, which produces one summary row for each combination of values that occurs in the data:
# Group by manufacturer AND class
manufacturer_summary <- mpg |>
  group_by(manufacturer, class) |>
  summarize(avg_cty = mean(cty))
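One detail worth checking with multi-variable grouping is which grouping survives the summary. The sketch below uses group_vars() to inspect the result, and the .groups argument of summarize() to drop all grouping in one step. The object names are arbitrary.

```r
library(tidyverse)

# By default, summarize() removes only the last grouping variable (class)
still_grouped <- mpg |>
  group_by(manufacturer, class) |>
  summarize(avg_cty = mean(cty))

group_vars(still_grouped)  # grouping variables that remain

# .groups = "drop" returns an ungrouped result directly
fully_dropped <- mpg |>
  group_by(manufacturer, class) |>
  summarize(avg_cty = mean(cty), .groups = "drop")

group_vars(fully_dropped)
```

Both tables contain the same rows; they differ only in the grouping metadata that later verbs will respect.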
The Golden Rule: Always ungroup()
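A data frame stays grouped after group_by() until the grouping is removed. mutate() and filter() keep every grouping variable, and summarize() by default drops only the last one, so a summary built from group_by(manufacturer, class) is still grouped by manufacturer. Any later modification or filtering steps are then evaluated within groups rather than across the whole table, which leads to hard-to-detect bugs.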
Always add ungroup() at the end of your pipeline once you are done with grouped calculations:
# Correct Practice: Add ungroup() to release the group constraints
class_summary <- mpg |>
  group_by(class) |>
  summarize(avg_hwy = mean(hwy)) |>
  ungroup() # Releases groups
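To make the risk concrete, here is a small sketch in which the same filter() gives different results depending on whether the data is still grouped. The object names are arbitrary.

```r
library(tidyverse)

grouped <- mpg |>
  group_by(class)

# Evaluated within each class: keeps the most efficient car(s) of every class
best_per_class <- grouped |>
  filter(hwy == max(hwy))

# Evaluated over the whole table: keeps only the most efficient car(s) overall
best_overall <- grouped |>
  ungroup() |>
  filter(hwy == max(hwy))

nrow(best_per_class)
nrow(best_overall)
```

The filter expression is identical in both pipelines; only the leftover grouping changes what max(hwy) refers to.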
Shortcut for Categorical Counting: count()
If you only need to count the frequency of categories, you can bypass group_by() and summarize(n()) by using the shortcut verb count():
# Count the number of cars per class, sorted with the most common class first
mpg |>
  count(class, sort = TRUE)
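For comparison, the sketch below writes out the longer pipeline that count() replaces, and shows that count() also accepts more than one variable. The object names are arbitrary.

```r
library(tidyverse)

# Shortcut
by_count <- mpg |>
  count(class, sort = TRUE)

# Longer equivalent: group, count rows, then sort in descending order
by_hand <- mpg |>
  group_by(class) |>
  summarize(n = n()) |>
  arrange(desc(n))

# Counting combinations of two variables
class_by_drive <- mpg |>
  count(class, drv)

by_count
class_by_drive
```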
3. Operations Across Columns: across()
If you want to apply the same calculation (like mean()) to several columns at once, instead of copying and pasting, use the across() helper function inside summarize() or mutate().
Syntax
across(columns_to_select, function_to_apply)
Examples
# Calculate the mean of city and highway mileage columns at the same time
mpg |>
  group_by(class) |>
  summarize(across(c(cty, hwy), mean)) |>
  ungroup()
# Apply mean() to all numeric columns in the dataset
mpg |>
  group_by(class) |>
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) |>
  ungroup()
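across() can also apply several functions at once when given a named list. The sketch below goes one step beyond the examples above and uses the .names argument to control the output column names. The naming pattern is just one possible choice.

```r
library(tidyverse)

multi_stats <- mpg |>
  group_by(class) |>
  summarize(
    across(
      c(cty, hwy),
      list(mean = mean, max = max),  # named list: one output column per function
      .names = "{.fn}_{.col}"        # e.g. mean_cty, max_hwy
    )
  ) |>
  ungroup()

names(multi_stats)
```

Naming the list elements matters because those names are what {.fn} expands to in the column names.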
Hands-on Exercises
Exercise 1: Fuel Performance by Engine Size
Analyze the performance of vehicles grouped by their number of cylinders (cyl).
Write R code to:
- Start with the mpg dataset.
- Group the data by cyl.
- Summarize:
  - Calculate the average displacement avg_displ.
  - Calculate the average highway mileage avg_hwy.
  - Count the total number of cars car_count.
- Ungroup the data.
- Print the summary table.
Answer
library(tidyverse)
mpg |>
  group_by(cyl) |>
  summarize(
    avg_displ = mean(displ),
    avg_hwy = mean(hwy),
    car_count = n()
  ) |>
  ungroup()
Exercise 2: Multi-Column Summary
Using the mpg dataset, write R code with across() to:
- Group the data by manufacturer.
- Compute the maximum value of both cty and hwy mileage columns simultaneously.
- Ungroup the data.
- Print the top 6 rows of the resulting table.
Answer
library(tidyverse)
mpg |>
  group_by(manufacturer) |>
  summarize(across(c(cty, hwy), max)) |>
  ungroup() |>
  head()