Core Verbs: Select, Filter & Mutate
Why Learn Data Transformation in Exploratory Data Analysis?
A raw dataset is almost never in the exact shape you need for plotting or modeling.
Imagine you are analyzing a massive airline flights dataset containing 30 columns and 300,000 rows. You are tasked with answering a specific question: "What is the average departure delay of United Airlines (UA) flights departing from JFK airport in January?"
To answer this question, you need to:
- Narrow down the table from 30 columns to just a few relevant columns (like flight number, carrier, origin airport, and departure delay).
- Sift through 300,000 rows to extract only JFK departures operated by United Airlines in January.
- Compute the delays.
If you write this in base R using row indices and dollar-sign indicators, your code will be cluttered and hard to maintain. In the Tidyverse, we use the dplyr package, which provides a cohesive grammar for data transformation. Let's learn the three fundamental verbs: select(), filter(), and mutate().
1. Selecting Columns: select()
select() lets you pick specific columns from a data frame, reducing the table's width.
library(tidyverse)
# Select columns by name
small_table <- select(mpg, model, year, hwy)
print(head(small_table))
Advanced Selection Options
Columns Range: Select all columns from
modeltohwyinclusive:select(mpg, model:hwy)Excluding Columns: Drop specific columns using the exclamation mark
!or minus sign-:select(mpg, !c(model, year))Selection Helpers: Find columns matching string patterns:
starts_with("abc"): Columns starting with "abc".ends_with("xyz"): Columns ending with "xyz".contains("efg"): Columns containing "efg".everything(): Selects all remaining columns (useful for relocating columns to the front).
# Select columns starting with "c" and everything else select(mpg, starts_with("c"), everything())
2. Filtering Rows: filter()
filter() extracts a subset of rows based on one or more logical conditions.
# Filter for cars with city mileage greater than 30 MPG
efficient_cars <- filter(mpg, cty > 30)
print(efficient_cars)
Combining Conditions
You can combine multiple criteria using logical operators:
- AND (
&or comma,): Both conditions must beTRUE. - OR (
|): At least one condition must beTRUE. - NOT (
!): Reverses the condition.
# AND: Cars that are compact AND have engine displacement greater than 2L
filter(mpg, class == "compact", displ > 2)
# OR: Cars that have 4 cylinders OR 5 cylinders
filter(mpg, cyl == 4 | cyl == 5)
Set Membership: %in%
To check if a column's value matches any element in a list/vector, use the %in% operator:
# Filter for cars that are compact, subcompact, or 2seater
filter(mpg, class %in% c("compact", "subcompact", "2seater"))
Handling Missing Values (NA)
filter() automatically drops rows where the condition evaluates to NA (missing). If you want to explicitly check for missing values, use is.na() or !is.na():
# Filter for rows where city mileage is NOT missing
# filter(mpg, !is.na(cty))
3. Creating/Modifying Columns: mutate()
mutate() adds new columns to your dataset while preserving the existing columns. If the new column name matches an existing name, it overwrites it.
# Create a new column "displacement_to_hwy_ratio" by dividing displ by hwy
mpg_updated <- mutate(mpg, displ_hwy_ratio = displ / hwy)
print(head(select(mpg_updated, displ, hwy, displ_hwy_ratio)))
You can define multiple columns in a single mutate() statement, and refer to columns you created earlier in that same statement:
# Create two columns sequentially
mpg_converted <- mutate(mpg,
hwy_metric = hwy * 0.425, # converts miles/gal to km/litre
hwy_rounded = round(hwy_metric, 1)
)
4. Combining Operations: The Pipe Operator (|>)
Instead of writing nested functions (like mutate(filter(select(...)))) or saving intermediate data variables, we use R's native pipe operator |> (or the magrittr pipe %>%).
The pipe takes the output of the expression on its left and passes it as the first argument to the function on its right:
# Without pipes (hard to read)
step1 <- select(mpg, manufacturer, model, hwy)
step2 <- filter(step1, hwy > 30)
final_result <- mutate(step2, hwy_doubled = hwy * 2)
# With pipes (clean, reads left-to-right / top-to-bottom)
final_result <- mpg |>
select(manufacturer, model, hwy) |>
filter(hwy > 30) |>
mutate(hwy_doubled = hwy * 2)
print(final_result)
Hands-on Exercises
Exercise 1: Narrowing Down the Fleet
Using the mpg dataset:
Write R code to:
- Start with the
mpgdataset and pipe it. - Select only the
manufacturer,model,cty, andhwycolumns. - Filter the rows to keep only cars manufactured by
"honda"or"toyota". - Filter for cars where highway mileage is greater than 30.
- Print the final filtered table.
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)
mpg |>
select(manufacturer, model, cty, hwy) |>
filter(manufacturer %in% c("honda", "toyota"), hwy > 30)
Exercise 2: Fuel Cost Calculator
Assume the cost of fuel is $3.50 per gallon. Calculate the average fuel cost for driving 100 miles in the city.
Formula: Fuel_Cost = (100 / cty) * 3.50
Write R code using pipes to:
- Start with the
mpgtable. - Filter for cars with an engine size (
displ) less than or equal to 2.0L. - Create a new column called
fuel_cost_100using the formula above. - Select only the columns:
model,displ,cty, andfuel_cost_100. - Print the top 6 rows.
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)
mpg |>
filter(displ <= 2.0) |>
mutate(fuel_cost_100 = (100 / cty) * 3.50) |>
select(model, displ, cty, fuel_cost_100) |>
head()