Linear & Multiple Regression
Why Learn Statistical Modeling in Exploratory Data Analysis?
Visualization reveals patterns, but statistical models quantify the strength and predictability of those patterns.
Imagine you are studying house prices:
- You notice that houses with larger areas generally cost more.
- To make a business decision (like estimating the value of a new house), you need to answer a precise question: "For every additional square foot of space, how many dollars does the price increase?"
- You also want to check if the location (a categorical variable) changes this growth rate.
In R, statistical modeling is built-in. We can fit linear trends using the lm() function and binary classification trends using the glm() function. Let's learn how to fit models, inspect summaries, make predictions, and diagnose errors.
1. Simple Linear Regression: lm()
Linear regression models the relationship between a numerical response variable y and a predictor variable x with a straight line: .
R Formula Syntax
In R, formulas are written as response ~ predictor (read as "response is modeled by predictor").
library(tidyverse)
# Fit a model predicting highway mileage (hwy) based on engine size (displ)
model <- lm(hwy ~ displ, data = mpg)
# Inspect the coefficients and statistics of the fit
summary(model)
Understanding summary() Output
- Coefficients:
(Intercept): The value of when .displ: The slope. A value of-3.53means that for every 1-litre increase in engine size, highway mileage decreases by 3.53 MPG on average.
- Pr(>|t|): The p-value. If it is less than
0.05, the relationship is statistically significant. - Multiple R-squared: The proportion of variance explained by the model (e.g.
0.586means the engine size explains 58.6% of the variation in highway mileage).
2. Multiple Linear Regression
To improve our model, we can add multiple predictors using the plus symbol +:
# Predict highway mileage based on engine size AND cylinder count
multi_model <- lm(hwy ~ displ + cyl, data = mpg)
summary(multi_model)
Handling Categorical Predictors
If you pass a categorical character or factor column (like class) into lm(), R automatically converts it into a set of numeric indicator variables (known as dummy variables) behind the scenes:
# R automatically creates dummy variables for each vehicle class
class_model <- lm(hwy ~ class, data = mpg)
summary(class_model)
Interaction Terms (*)
If you believe the effect of one variable depends on the value of another (e.g. the effect of displacement on mileage changes depending on whether the car is front-wheel or 4WD), model this interaction using the asterisk * operator:
# Fits model: hwy = Intercept + displ + drv + (displ * drv)
interaction_model <- lm(hwy ~ displ * drv, data = mpg)
3. Predictions, Residuals & modelr Helpers
Once a model is fit, we can extract its predictions and errors (residuals):
predict(model): Computes the predicted response values.residuals(model): Computes the actual value minus the predicted value.
The modelr Pipeline Helpers
In a piped workflow, using base functions inside mutate() can be verbose. The tidyverse modelr package provides helpers to append columns directly:
add_predictions(df, model): Appends a column of predicted values (defaults topred).add_residuals(df, model): Appends a column of residuals (defaults toresid).
library(modelr)
# Add both predictions and residuals to our data frame in a single pipeline
mpg_diagnostics <- mpg |>
add_predictions(model) |>
add_residuals(model)
# Plot residuals vs. predicted values to verify that errors look like pure noise
ggplot(mpg_diagnostics, aes(x = pred, y = resid)) +
geom_point(position = "jitter", alpha = 0.5) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(title = "Residuals vs. Fitted Values", x = "Predicted MPG", y = "Error (Residual)") +
theme_minimal()
3B. Model Selection & Feature Engineering
When building multiple regression models, how do we choose which predictors to keep?
The Significance Rule
Start with a list of predictors selected based on domain knowledge. Then:
- Examine the coefficients' summary.
- Drop any predictor that is insignificant (p-value ).
- Re-fit the model and repeat until all remaining predictors are significant.
Updating Models: update()
Instead of writing the entire lm() formula from scratch, use the update(model, . ~ . - variable) function. The . symbol means "keep current side of formula":
# Fit a model with three predictors
full_model <- lm(hwy ~ displ + cyl + year, data = mpg)
# If 'year' has an insignificant p-value, drop it using update()
# (. ~ . - year means: keep hwy on the left, keep everything except year on the right)
updated_model <- update(full_model, . ~ . - year)
summary(updated_model)
4. Logistic Regression for Classification: glm()
If your response variable is binary (e.g. 0 or 1, TRUE or FALSE—such as default vs. non-default, or survived vs. died), you must use Logistic Regression instead of linear regression.
We fit this using the Generalized Linear Model glm() function and specify family = "binomial":
# Predict whether a car is fuel-efficient (hwy > 25) based on engine size
# 1. Create a binary logical column
mpg_binary <- mpg |>
mutate(is_efficient = if_else(hwy > 25, 1, 0))
# 2. Fit logistic model
logit_model <- glm(is_efficient ~ displ, family = "binomial", data = mpg_binary)
summary(logit_model)
Extracting Probabilities
By default, predict(logit_model) returns values on the log-odds scale. To get actual probability values (between 0 and 1), pass the argument type = "response":
# Predict probability of being efficient
mpg_binary <- mpg_binary |>
mutate(prob_efficient = predict(logit_model, type = "response"))
head(select(mpg_binary, model, displ, is_efficient, prob_efficient))
Hands-on Exercises
Exercise 1: Estimating Mileage Decline
Fit a linear regression model using the mpg dataset to measure how city mileage (cty) declines as engine size (displ) increases.
Write R code to:
- Fit a linear model predicting
ctybased ondisplusinglm(). - Save the model in a variable called
city_model. - Print the
summary(city_model). - Read the coefficient slope of
displin the output to check the MPG rate decline per litre.
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)
# Fit linear model
city_model <- lm(cty ~ displ, data = mpg)
# Inspect summary
summary(city_model)
# The coefficient slope of displ is roughly -2.63 MPG decline per litre.
Exercise 2: Binary Drivetrain Classifier
Predict whether a vehicle is a front-wheel drive (drv == "f") based on its city mileage (cty) and engine displacement (displ).
Write R code to:
- Create a binary indicator
is_front <- if_else(drv == "f", 1, 0)inside thempgtable usingmutate(). Save this asmpg_logit. - Fit a logistic regression model predicting
is_frontbased onctyanddisplusingglm(..., family = "binomial"). - Print the summary of the model.
# Write your code below and click Run Code
Click to view Answer
library(tidyverse)
# Prepare binary column
mpg_logit <- mpg |>
mutate(is_front = if_else(drv == "f", 1, 0))
# Fit model
classifier <- glm(is_front ~ cty + displ, family = "binomial", data = mpg_logit)
# Inspect model summary
summary(classifier)