Window Functions & Ranks

Why Learn Advanced Transformations in Exploratory Data Analysis?

Sometimes you need to compare observations not against an aggregate summary, but against other neighboring observations.

Imagine you are analyzing stock prices over a 12-month period:

You want to calculate the price change from yesterday to today. To do this, you must shift your columns vertically to align successive dates.
You want to identify the top 3 highest-valued stock days. You must rank the rows in order, handling instances where two days have the exact same price.
You want to compute a running total of transactions over time.

In dplyr, we achieve this using Window Functions. Unlike aggregation functions (which collapse many rows into one), window functions take $N$ values and return $N$ values, operating on elements relative to their positions. Let's learn sorting, relocating, ranking, and lag operations.

1. Sorting Rows: arrange()

arrange() reorders the rows of a table based on values in one or more columns. By default, it sorts in ascending order. Wrap the column in desc() to sort in descending order.

library(tidyverse)

# Sort cars by highway mileage (ascending)
sorted_hwy <- mpg |> arrange(hwy)

# Sort cars by displacement (descending)
sorted_displ <- mpg |> arrange(desc(displ))

# Sort by manufacturer (alphabetical), then by hwy mileage (highest first)
complex_sort <- mpg |> arrange(manufacturer, desc(hwy))

2. Reordering and Renaming Columns: relocate() and rename()

To make tables more readable in reports or console views:

Moving Columns: relocate()

Use relocate() to shift columns to the front, or position them relative to other variables using .before or .after:

# Move 'class' and 'hwy' to the very beginning of the table
mpg |> relocate(class, hwy)

# Move 'displ' to immediately follow 'manufacturer'
mpg |> relocate(displ, .after = manufacturer)

Renaming Columns: rename()

To rename columns, use the syntax rename(new_name = old_name):

# Rename hwy to highway_mpg
mpg |> rename(highway_mpg = hwy)

3. Ranking Functions: min_rank(), dense_rank(), and row_number()

Ranking functions assign integers representing ordering positions to values in a vector. R provides different ranking options to handle duplicate values (ties):

min_rank(): Standard competitive ranking (e.g., 1st, 1st, 3rd—skips a rank for ties). This is the default in R4DS.
dense_rank(): Dense ranking (e.g., 1st, 1st, 2nd—no gaps in rank numbers).
row_number(): Assigns a sequential row index (ties are ordered arbitrarily based on their physical row sequence).

# Illustrating tie handling
scores <- c(90, 90, 80, 70)

print(min_rank(desc(scores)))   # 1 1 3 4 (Standard)
print(dense_rank(desc(scores))) # 1 1 2 3 (Dense)
print(row_number(desc(scores))) # 1 2 3 4 (Arbitrary tie breaker)

We can combine ranking with filter() to find top items:

# Find the top 3 most fuel-efficient cars in the dataset
mpg |>
  mutate(rank = min_rank(desc(hwy))) |>
  filter(rank <= 3) |>
  arrange(rank)

4. Shifts and Offsets: lead() and lag()

To compare a value in a row with values in preceding or succeeding rows (highly common in time-series stock tracking or daily weather changes):

lag(x, n): Shifts elements down, returning the value n steps before (default is 1). The first element becomes NA.
lead(x, n): Shifts elements up, returning the value n steps after. The last element becomes NA.

values <- c(10, 15, 20, 25)

print(lag(values))  # NA 10 15 20
print(lead(values)) # 15 20 25 NA

5. Cumulative Summaries

Cumulative functions compute running statistics along a sequence:

cumsum(x): Running total sum.
cummean(x): Running average.
cummax(x) / cummin(x): Running limits.

sales <- c(100, 150, 200)
print(cumsum(sales)) # 100 250 450

Hands-on Exercises

Exercise 1: Finding the Least Efficient Models

Identify the cars with the lowest city mileage (cty). Write R code using pipes to:

Sort the mpg dataset by cty in ascending order.
Select only columns: manufacturer, model, cty.
Filter the rows to print only the top 5 records showing the lowest city fuel mileage.

# Write your code below and click Run Code

Click to view Answer

library(tidyverse)

mpg |>
  arrange(cty) |>
  select(manufacturer, model, cty) |>
  head(5)

Exercise 2: Ranks within Subgroups

For each car manufacturer, find their single most fuel-efficient highway car model. Write R code using pipes to:

Group the data by manufacturer.
Create a rank column hwy_rank using min_rank(desc(hwy)) to rank highway mileage (highest mileage gets rank 1).
Filter the rows to keep only models with hwy_rank == 1.
Relocate the hwy_rank and hwy columns to the beginning of the table.
Ungroup the data.

# Write your code below and click Run Code

Click to view Answer

library(tidyverse)

mpg |>
  group_by(manufacturer) |>
  mutate(hwy_rank = min_rank(desc(hwy))) |>
  filter(hwy_rank == 1) |>
  relocate(hwy_rank, hwy) |>
  ungroup()