Linear & Multiple Regression

Why Learn Statistical Modeling in Exploratory Data Analysis?

Visualization reveals patterns, but statistical models quantify the strength and predictability of those patterns.

Imagine you are studying house prices:

You notice that houses with larger areas generally cost more.
To make a business decision (like estimating the value of a new house), you need to answer a precise question: "For every additional square foot of space, how many dollars does the price increase?"
You also want to check if the location (a categorical variable) changes this growth rate.

In R, statistical modeling is built-in. We can fit linear trends using the lm() function and binary classification trends using the glm() function. Let's learn how to fit models, inspect summaries, make predictions, and diagnose errors.

1. Simple Linear Regression: lm()

Linear regression models the relationship between a numerical response variable y and a predictor variable x with a straight line: $y = \beta_0 + \beta_1 x + \epsilon$ .

R Formula Syntax

In R, formulas are written as response ~ predictor (read as "response is modeled by predictor").

library(tidyverse)

# Fit a model predicting highway mileage (hwy) based on engine size (displ)
model <- lm(hwy ~ displ, data = mpg)

# Inspect the coefficients and statistics of the fit
summary(model)

Understanding summary() Output

Coefficients:
- (Intercept): The value of $y$ when $x = 0$ .
- displ: The slope. A value of -3.53 means that for every 1-litre increase in engine size, highway mileage decreases by 3.53 MPG on average.
Pr(>|t|): The p-value. If it is less than 0.05, the relationship is statistically significant.
Multiple R-squared: The proportion of variance explained by the model (e.g. 0.586 means the engine size explains 58.6% of the variation in highway mileage).

2. Multiple Linear Regression

To improve our model, we can add multiple predictors using the plus symbol +:

# Predict highway mileage based on engine size AND cylinder count
multi_model <- lm(hwy ~ displ + cyl, data = mpg)
summary(multi_model)

Handling Categorical Predictors

If you pass a categorical character or factor column (like class) into lm(), R automatically converts it into a set of numeric indicator variables (known as dummy variables) behind the scenes:

# R automatically creates dummy variables for each vehicle class
class_model <- lm(hwy ~ class, data = mpg)
summary(class_model)

Interaction Terms (*)

If you believe the effect of one variable depends on the value of another (e.g. the effect of displacement on mileage changes depending on whether the car is front-wheel or 4WD), model this interaction using the asterisk * operator:

# Fits model: hwy = Intercept + displ + drv + (displ * drv)
interaction_model <- lm(hwy ~ displ * drv, data = mpg)

3. Predictions, Residuals & modelr Helpers

Once a model is fit, we can extract its predictions and errors (residuals):

predict(model): Computes the predicted response values.
residuals(model): Computes the actual value minus the predicted value.

The modelr Pipeline Helpers

In a piped workflow, using base functions inside mutate() can be verbose. The tidyverse modelr package provides helpers to append columns directly:

add_predictions(df, model): Appends a column of predicted values (defaults to pred).
add_residuals(df, model): Appends a column of residuals (defaults to resid).

library(modelr)

# Add both predictions and residuals to our data frame in a single pipeline
mpg_diagnostics <- mpg |>
  add_predictions(model) |>
  add_residuals(model)

# Plot residuals vs. predicted values to verify that errors look like pure noise
ggplot(mpg_diagnostics, aes(x = pred, y = resid)) +
  geom_point(position = "jitter", alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residuals vs. Fitted Values", x = "Predicted MPG", y = "Error (Residual)") +
  theme_minimal()

3B. Model Selection & Feature Engineering

When building multiple regression models, how do we choose which predictors to keep?

The Significance Rule

Start with a list of predictors selected based on domain knowledge. Then:

Examine the coefficients' summary.
Drop any predictor that is insignificant (p-value $> 0.05$ ).
Re-fit the model and repeat until all remaining predictors are significant.

Updating Models: update()

Instead of writing the entire lm() formula from scratch, use the update(model, . ~ . - variable) function. The . symbol means "keep current side of formula":

# Fit a model with three predictors
full_model <- lm(hwy ~ displ + cyl + year, data = mpg)

# If 'year' has an insignificant p-value, drop it using update()
# (. ~ . - year means: keep hwy on the left, keep everything except year on the right)
updated_model <- update(full_model, . ~ . - year)
summary(updated_model)

4. Logistic Regression for Classification: glm()

If your response variable is binary (e.g. 0 or 1, TRUE or FALSE—such as default vs. non-default, or survived vs. died), you must use Logistic Regression instead of linear regression.

We fit this using the Generalized Linear Model glm() function and specify family = "binomial":

# Predict whether a car is fuel-efficient (hwy > 25) based on engine size
# 1. Create a binary logical column
mpg_binary <- mpg |>
  mutate(is_efficient = if_else(hwy > 25, 1, 0))

# 2. Fit logistic model
logit_model <- glm(is_efficient ~ displ, family = "binomial", data = mpg_binary)
summary(logit_model)

Extracting Probabilities

By default, predict(logit_model) returns values on the log-odds scale. To get actual probability values (between 0 and 1), pass the argument type = "response":

# Predict probability of being efficient
mpg_binary <- mpg_binary |>
  mutate(prob_efficient = predict(logit_model, type = "response"))

head(select(mpg_binary, model, displ, is_efficient, prob_efficient))

Hands-on Exercises

Exercise 1: Estimating Mileage Decline

Fit a linear regression model using the mpg dataset to measure how city mileage (cty) declines as engine size (displ) increases. Write R code to:

Fit a linear model predicting cty based on displ using lm().
Save the model in a variable called city_model.
Print the summary(city_model).
Read the coefficient slope of displ in the output to check the MPG rate decline per litre.

# Write your code below and click Run Code

Click to view Answer

library(tidyverse)

# Fit linear model
city_model <- lm(cty ~ displ, data = mpg)

# Inspect summary
summary(city_model)
# The coefficient slope of displ is roughly -2.63 MPG decline per litre.

Exercise 2: Binary Drivetrain Classifier

Predict whether a vehicle is a front-wheel drive (drv == "f") based on its city mileage (cty) and engine displacement (displ). Write R code to:

Create a binary indicator is_front <- if_else(drv == "f", 1, 0) inside the mpg table using mutate(). Save this as mpg_logit.
Fit a logistic regression model predicting is_front based on cty and displ using glm(..., family = "binomial").
Print the summary of the model.

# Write your code below and click Run Code

Click to view Answer

library(tidyverse)

# Prepare binary column
mpg_logit <- mpg |>
  mutate(is_front = if_else(drv == "f", 1, 0))

# Fit model
classifier <- glm(is_front ~ cty + displ, family = "binomial", data = mpg_logit)

# Inspect model summary
summary(classifier)