Grouping & Summarizing

Why Learn Grouping and Summarization in Exploratory Data Analysis?

Summary metrics for an entire column (like the average price of all transactions) are helpful, but they don't show how subgroups within the data differ from one another.

Imagine you are analyzing credit card defaults across different age brackets.

If you calculate the average default rate across the entire table, you get a single number.
To discover risk profiles, you need to group the data by age bracket, and then summarize the default rate for each bracket.

In dplyr, we perform this "split-apply-combine" pattern using the group_by() and summarize() verbs. Let's learn how to compute statistics across subgroups, count categories, and apply calculations to multiple columns at once using the across() helper.
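As a minimal sketch of the credit card example above, assume a hypothetical table named cards with an age_bracket column and a logical defaulted column. The table, its column names, and its values are invented for illustration. The first summary gives one number for the whole table, the second gives one number per bracket:

```r
library(dplyr)

# Hypothetical toy data: one row per customer (names and values are made up)
cards <- tibble(
  age_bracket = c("18-29", "18-29", "30-49", "30-49", "30-49", "50+"),
  defaulted   = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Whole-table summary: a single default rate
overall <- cards |>
  summarize(default_rate = mean(defaulted))

# Grouped summary: one default rate per age bracket
by_bracket <- cards |>
  group_by(age_bracket) |>
  summarize(default_rate = mean(defaulted), customers = n())

overall
by_bracket
```

Taking mean() of a logical column gives the share of TRUE values, which is why it works as a rate here.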

1. Summarizing Columns: summarize()

summarize() collapses a data frame containing many rows into a single row of statistical values.

library(tidyverse)

# Calculate summary stats for the whole mpg dataset
mpg |>
  summarize(
    avg_hwy = mean(hwy),
    max_hwy = max(hwy),
    total_cars = n() # Counts the number of rows
  )

Core Aggregation Functions

mean(x) / median(x): Central tendencies.
sd(x) / var(x): Standard deviation and variance.
min(x) / max(x): Range limits.
n(): Returns the number of rows in the current group (it takes no arguments).
n_distinct(x): Returns the count of unique values.
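The functions listed above can be combined in a single summarize() call. The sketch below applies them to the hwy column of mpg; the output column names are arbitrary choices for this example:

```r
library(tidyverse)

hwy_stats <- mpg |>
  summarize(
    mean_hwy   = mean(hwy),        # central tendency
    median_hwy = median(hwy),
    sd_hwy     = sd(hwy),          # spread
    min_hwy    = min(hwy),         # range limits
    max_hwy    = max(hwy),
    rows       = n(),              # number of rows in the table
    classes    = n_distinct(class) # number of unique car classes
  )

hwy_stats
```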

Always Handle Missing Values in Summaries

If a column contains NA values, functions such as mean(), sum(), and max() return NA by default. To compute the statistic from the non-missing values only, pass the argument na.rm = TRUE (e.g. mean(amount, na.rm = TRUE)). Note that n() takes no arguments and counts every row, including rows where the value is missing.
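A small sketch of the difference, using a hypothetical transactions table with one missing amount (the table and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical data: three transactions, one with a missing amount
transactions <- tibble(amount = c(10, NA, 30))

na_demo <- transactions |>
  summarize(
    naive_mean = mean(amount),                # NA, because one value is missing
    clean_mean = mean(amount, na.rm = TRUE),  # 20, the mean of 10 and 30
    n_rows     = n(),                         # 3, rows are counted regardless of NA
    n_missing  = sum(is.na(amount))           # 1
  )

na_demo
```

Reporting the number of missing values next to the cleaned statistic makes it visible how much data the summary actually rests on.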

2. Grouped Aggregations: group_by()

When combined with group_by(), summarize() performs calculations within each group/category instead of the entire table:

# Group by car class, then calculate the average highway mileage for each class
mpg |>
  group_by(class) |>
  summarize(
    avg_hwy = mean(hwy),
    count = n()
  )

You can group by multiple variables, which produces one summary row for each combination of values that occurs in the data:

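# Group by manufacturer AND class
# summarize() drops only the last grouping level (class), so the result
# is still grouped by manufacturer until ungroup() is called
manufacturer_summary <- mpg |>
  group_by(manufacturer, class) |>
  summarize(avg_cty = mean(cty))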

The Golden Rule: Always ungroup()

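A data frame stays grouped after group_by() until the groups are removed. Verbs such as mutate() and filter() keep every grouping variable, and summarize() drops only the last one, so a result grouped by two variables is still grouped by the first. Any later step then evaluates within groups rather than across the whole table, leading to hard-to-detect bugs.

Always add ungroup() at the end of your pipeline once you are done with grouped calculations. With a single grouping variable, summarize() already returns an ungrouped result, so ungroup() is a harmless safeguard there; it matters whenever groups remain: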

# Correct Practice: Add ungroup() to release the group constraints
class_summary <- mpg |>
  group_by(class) |>
  summarize(avg_hwy = mean(hwy)) |>
  ungroup() # Releases groups
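To check what grouping a result still carries, dplyr's group_vars() returns the names of the active grouping variables. The sketch below inspects a two-variable summary before and after ungroup(), and also shows the .groups = "drop" argument of summarize() as an alternative way to get an ungrouped result (object names are arbitrary):

```r
library(tidyverse)

# Two grouping variables: summarize() drops only the last one (class)
two_level <- mpg |>
  group_by(manufacturer, class) |>
  summarize(avg_cty = mean(cty))

group_vars(two_level)   # still grouped by "manufacturer"

# ungroup() removes the remaining group
released <- two_level |>
  ungroup()

group_vars(released)    # no grouping variables left

# Alternative: ask summarize() to drop all groups itself
dropped <- mpg |>
  group_by(manufacturer, class) |>
  summarize(avg_cty = mean(cty), .groups = "drop")

group_vars(dropped)     # no grouping variables left
```

The first summarize() call also prints a message naming the grouping it kept, which is a useful reminder when reading pipeline output.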

Shortcut for Categorical Counting: count()

If you only need to count the frequency of categories, you can bypass group_by() and summarize(n()) by using the shortcut verb count():

# Count the number of cars per class, sorted with the most common class first
mpg |>
  count(class, sort = TRUE)
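To make the shortcut concrete, the sketch below builds the same frequency table both ways and compares the columns. It omits sort = TRUE so that both results come back in the same order:

```r
library(tidyverse)

# Shortcut: count() names its result column n
via_count <- mpg |>
  count(class)

# Long form: group, then count rows per group
via_summarize <- mpg |>
  group_by(class) |>
  summarize(n = n())

# Both approaches produce the same categories and the same counts
identical(via_count$class, via_summarize$class)
identical(via_count$n, via_summarize$n)
```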

3. Operations Across Columns: across()

If you want to apply the same calculation (like mean()) to several columns at once, instead of copying and pasting, use the across() helper function inside summarize() or mutate().

Syntax

across(columns_to_select, function_to_apply)

Examples

# Calculate the mean of city and highway mileage columns at the same time
mpg |>
  group_by(class) |>
  summarize(across(c(cty, hwy), mean)) |>
  ungroup()

# Apply mean() to all numeric columns in the dataset
mpg |>
  group_by(class) |>
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) |>
  ungroup()
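across() also accepts a named list of functions, and its .names argument controls how the output columns are labeled. This goes slightly beyond the two examples above and is offered as a sketch; the names mean and max in the list are labels chosen for this example:

```r
library(tidyverse)

mileage_summary <- mpg |>
  group_by(class) |>
  summarize(
    across(
      c(cty, hwy),
      list(mean = mean, max = max),  # two functions applied to each column
      .names = "{.col}_{.fn}"        # e.g. cty_mean, cty_max
    )
  ) |>
  ungroup()

names(mileage_summary)
```

The result has one column per combination of input column and function: cty_mean, cty_max, hwy_mean, and hwy_max, alongside class.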

Hands-on Exercises

Exercise 1: Fuel Performance by Engine Size

Analyze the performance of vehicles grouped by their number of cylinders (cyl). Write R code to:

Start with the mpg dataset.
Group the data by cyl.
Summarize:
- Calculate the average displacement avg_displ.
- Calculate the average highway mileage avg_hwy.
- Count the total number of cars car_count.
Ungroup the data.
Print the summary table.

Answer

library(tidyverse)

mpg |>
  group_by(cyl) |>
  summarize(
    avg_displ = mean(displ),
    avg_hwy = mean(hwy),
    car_count = n()
  ) |>
  ungroup()

Exercise 2: Multi-Column Summary

Using the mpg dataset, write R code with across() to:

Group the data by manufacturer.
Compute the maximum value of both cty and hwy mileage columns simultaneously.
Ungroup the data.
Print the top 6 rows of the resulting table.

Answer

library(tidyverse)

mpg |>
  group_by(manufacturer) |>
  summarize(across(c(cty, hwy), max)) |>
  ungroup() |>
  head()