Core Verbs: Select, Filter & Mutate

Why Learn Data Transformation in Exploratory Data Analysis?

A raw dataset is almost never in the exact shape you need for plotting or modeling.

Imagine you are analyzing a massive airline flights dataset containing 30 columns and 300,000 rows. You are tasked with answering a specific question: "What is the average departure delay of United Airlines (UA) flights departing from JFK airport in January?"

To answer this question, you need to:

Narrow down the table from 30 columns to just a few relevant columns (like flight number, carrier, origin airport, and departure delay).
Sift through 300,000 rows to extract only JFK departures operated by United Airlines in January.
Compute the delays.

If you write this in base R using row indices and dollar-sign indicators, your code will be cluttered and hard to maintain. In the Tidyverse, we use the dplyr package, which provides a cohesive grammar for data transformation. Let's learn the three fundamental verbs: select(), filter(), and mutate().

1. Selecting Columns: select()

select() lets you pick specific columns from a data frame, reducing the table's width.

library(tidyverse)

# Select columns by name
small_table <- select(mpg, model, year, hwy)
print(head(small_table))

Advanced Selection Options

Columns Range: Select all columns from model to hwy inclusive:
```
select(mpg, model:hwy)
```
Excluding Columns: Drop specific columns using the exclamation mark ! or minus sign -:
```
select(mpg, !c(model, year))
```
Selection Helpers: Find columns matching string patterns:
- starts_with("abc"): Columns starting with "abc".
- ends_with("xyz"): Columns ending with "xyz".
- contains("efg"): Columns containing "efg".
- everything(): Selects all remaining columns (useful for relocating columns to the front).
```
# Select columns starting with "c" and everything else
select(mpg, starts_with("c"), everything())
```

2. Filtering Rows: filter()

filter() extracts a subset of rows based on one or more logical conditions.

# Filter for cars with city mileage greater than 30 MPG
efficient_cars <- filter(mpg, cty > 30)
print(efficient_cars)

Combining Conditions

You can combine multiple criteria using logical operators:

AND (& or comma ,): Both conditions must be TRUE.
OR (|): At least one condition must be TRUE.
NOT (!): Reverses the condition.

# AND: Cars that are compact AND have engine displacement greater than 2L
filter(mpg, class == "compact", displ > 2)

# OR: Cars that have 4 cylinders OR 5 cylinders
filter(mpg, cyl == 4 | cyl == 5)

Set Membership: %in%

To check if a column's value matches any element in a list/vector, use the %in% operator:

# Filter for cars that are compact, subcompact, or 2seater
filter(mpg, class %in% c("compact", "subcompact", "2seater"))

Handling Missing Values (NA)

filter() automatically drops rows where the condition evaluates to NA (missing). If you want to explicitly check for missing values, use is.na() or !is.na():

# Filter for rows where city mileage is NOT missing
# filter(mpg, !is.na(cty))

3. Creating/Modifying Columns: mutate()

mutate() adds new columns to your dataset while preserving the existing columns. If the new column name matches an existing name, it overwrites it.

# Create a new column "displacement_to_hwy_ratio" by dividing displ by hwy
mpg_updated <- mutate(mpg, displ_hwy_ratio = displ / hwy)
print(head(select(mpg_updated, displ, hwy, displ_hwy_ratio)))

You can define multiple columns in a single mutate() statement, and refer to columns you created earlier in that same statement:

# Create two columns sequentially
mpg_converted <- mutate(mpg,
  hwy_metric = hwy * 0.425, # converts miles/gal to km/litre
  hwy_rounded = round(hwy_metric, 1)
)

4. Combining Operations: The Pipe Operator (|>)

Instead of writing nested functions (like mutate(filter(select(...)))) or saving intermediate data variables, we use R's native pipe operator |> (or the magrittr pipe %>%).

The pipe takes the output of the expression on its left and passes it as the first argument to the function on its right:

# Without pipes (hard to read)
step1 <- select(mpg, manufacturer, model, hwy)
step2 <- filter(step1, hwy > 30)
final_result <- mutate(step2, hwy_doubled = hwy * 2)

# With pipes (clean, reads left-to-right / top-to-bottom)
final_result <- mpg |>
  select(manufacturer, model, hwy) |>
  filter(hwy > 30) |>
  mutate(hwy_doubled = hwy * 2)

print(final_result)

Hands-on Exercises

Exercise 1: Narrowing Down the Fleet

Using the mpg dataset: Write R code to:

Start with the mpg dataset and pipe it.
Select only the manufacturer, model, cty, and hwy columns.
Filter the rows to keep only cars manufactured by "honda" or "toyota".
Filter for cars where highway mileage is greater than 30.
Print the final filtered table.

# Write your code below and click Run Code

Click to view Answer

library(tidyverse)

mpg |>
  select(manufacturer, model, cty, hwy) |>
  filter(manufacturer %in% c("honda", "toyota"), hwy > 30)

Exercise 2: Fuel Cost Calculator

Assume the cost of fuel is $3.50 per gallon. Calculate the average fuel cost for driving 100 miles in the city. Formula: Fuel_Cost = (100 / cty) * 3.50 Write R code using pipes to:

Start with the mpg table.
Filter for cars with an engine size (displ) less than or equal to 2.0L.
Create a new column called fuel_cost_100 using the formula above.
Select only the columns: model, displ, cty, and fuel_cost_100.
Print the top 6 rows.

# Write your code below and click Run Code

Click to view Answer

library(tidyverse)

mpg |>
  filter(displ <= 2.0) |>
  mutate(fuel_cost_100 = (100 / cty) * 3.50) |>
  select(model, displ, cty, fuel_cost_100) |>
  head()