Window Functions & Ranks
Why Learn Advanced Transformations in Exploratory Data Analysis?
Sometimes you need to compare observations not against an aggregate summary, but against other neighboring observations.
Imagine you are analyzing stock prices over a 12-month period:
- You want to calculate the price change from yesterday to today. To do this, you must shift your columns vertically to align successive dates.
- You want to identify the top 3 highest-valued stock days. You must rank the rows in order, handling instances where two days have the exact same price.
- You want to compute a running total of transactions over time.
In dplyr, we achieve this using Window Functions. Unlike aggregation functions (which collapse many rows into one), window functions take values and return values, operating on elements relative to their positions. Let's learn sorting, relocating, ranking, and lag operations.
1. Sorting Rows: arrange()
arrange() reorders the rows of a table based on values in one or more columns. By default, it sorts in ascending order. Wrap the column in desc() to sort in descending order.
library(tidyverse)
# Sort cars by highway mileage (ascending)
sorted_hwy <- mpg |> arrange(hwy)
# Sort cars by displacement (descending)
sorted_displ <- mpg |> arrange(desc(displ))
# Sort by manufacturer (alphabetical), then by hwy mileage (highest first)
complex_sort <- mpg |> arrange(manufacturer, desc(hwy))
2. Reordering and Renaming Columns: relocate() and rename()
To make tables more readable in reports or console views:
Moving Columns: relocate()
Use relocate() to shift columns to the front, or position them relative to other variables using .before or .after:
# Move 'class' and 'hwy' to the very beginning of the table
mpg |> relocate(class, hwy)
# Move 'displ' to immediately follow 'manufacturer'
mpg |> relocate(displ, .after = manufacturer)
Renaming Columns: rename()
To rename columns, use the syntax rename(new_name = old_name):
# Rename hwy to highway_mpg
mpg |> rename(highway_mpg = hwy)
3. Ranking Functions: min_rank(), dense_rank(), and row_number()
Ranking functions assign integers representing ordering positions to values in a vector. R provides different ranking options to handle duplicate values (ties):
min_rank(): Standard competitive ranking (e.g., 1st, 1st, 3rd—skips a rank for ties). This is the default in R4DS.dense_rank(): Dense ranking (e.g., 1st, 1st, 2nd—no gaps in rank numbers).row_number(): Assigns a sequential row index (ties are ordered arbitrarily based on their physical row sequence).
# Illustrating tie handling
scores <- c(90, 90, 80, 70)
print(min_rank(desc(scores))) # 1 1 3 4 (Standard)
print(dense_rank(desc(scores))) # 1 1 2 3 (Dense)
print(row_number(desc(scores))) # 1 2 3 4 (Arbitrary tie breaker)
We can combine ranking with filter() to find top items:
# Find the top 3 most fuel-efficient cars in the dataset
mpg |>
mutate(rank = min_rank(desc(hwy))) |>
filter(rank <= 3) |>
arrange(rank)
4. Shifts and Offsets: lead() and lag()
To compare a value in a row with values in preceding or succeeding rows (highly common in time-series stock tracking or daily weather changes):
lag(x, n): Shifts elements down, returning the valuensteps before (default is 1). The first element becomesNA.lead(x, n): Shifts elements up, returning the valuensteps after. The last element becomesNA.
values <- c(10, 15, 20, 25)
print(lag(values)) # NA 10 15 20
print(lead(values)) # 15 20 25 NA
5. Cumulative Summaries
Cumulative functions compute running statistics along a sequence:
cumsum(x): Running total sum.cummean(x): Running average.cummax(x)/cummin(x): Running limits.
sales <- c(100, 150, 200)
print(cumsum(sales)) # 100 250 450
Hands-on Exercises
Exercise 1: Finding the Least Efficient Models
Identify the cars with the lowest city mileage (cty).
Write R code using pipes to:
- Sort the
mpgdataset byctyin ascending order. - Select only columns:
manufacturer,model,cty. - Filter the rows to print only the top 5 records showing the lowest city fuel mileage.
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)
mpg |>
arrange(cty) |>
select(manufacturer, model, cty) |>
head(5)
Exercise 2: Ranks within Subgroups
For each car manufacturer, find their single most fuel-efficient highway car model. Write R code using pipes to:
- Group the data by
manufacturer. - Create a rank column
hwy_rankusingmin_rank(desc(hwy))to rank highway mileage (highest mileage gets rank 1). - Filter the rows to keep only models with
hwy_rank == 1. - Relocate the
hwy_rankandhwycolumns to the beginning of the table. - Ungroup the data.
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)
mpg |>
group_by(manufacturer) |>
mutate(hwy_rank = min_rank(desc(hwy))) |>
filter(hwy_rank == 1) |>
relocate(hwy_rank, hwy) |>
ungroup()