Hypothesis Testing & Statistical Tests
Introduction to Hypothesis Testing
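Exploratory Data Analysis (EDA) lets you discover patterns; statistical testing lets you assess whether those patterns are statistically significant or could plausibly be random noise. A test never proves a pattern is real; it quantifies how surprising the data would be if there were no effect.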
In statistical testing, we formulate two statements:
- Null Hypothesis ($H_0$): A statement that there is no effect or no difference in the population (e.g. "Diamond cut has no effect on price").
- Alternative Hypothesis ($H_1$): A statement that there is an effect or difference (e.g. "Diamond cut affects price").
The p-value
We execute a statistical test to calculate a p-value (probability value).
- Definition: The probability of observing our sample data (or something more extreme) assuming the Null Hypothesis ($H_0$) is true.
- Interpretation: A small p-value (typically $p < 0.05$) is evidence against the null hypothesis, allowing us to reject $H_0$ in favor of the alternative hypothesis. It is not the probability that $H_0$ is true.
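To make the definition concrete, here is a minimal simulation sketch in which the null hypothesis is true by construction: both samples are drawn from the same normal distribution. The sample size of 30, the 2,000 repetitions, and the seed are arbitrary choices made for illustration only.

```r
# Simulate many experiments in which the null hypothesis is TRUE:
# both groups come from the same normal distribution.
set.seed(42)  # arbitrary seed, used only for reproducibility

p_values <- replicate(2000, {
  group_a <- rnorm(30)
  group_b <- rnorm(30)
  t.test(group_a, group_b)$p.value
})

# Share of experiments that would be declared "significant" at 0.05
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

The printed proportion should land close to 0.05: even with no real effect, roughly 1 test in 20 crosses the threshold by chance.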
1. Comparing Two Means: The T-test (t.test())
Use a two-sample t-test to compare the means of a continuous variable between two distinct groups.
Example: Prices of Ideal vs. Premium Diamonds
Let's test if there is a significant difference in average price between "Ideal" and "Premium" cut diamonds in the diamonds dataset.
library(tidyverse)
# Filter for the two groups
ideal_prices <- diamonds |> filter(cut == "Ideal") |> pull(price)
premium_prices <- diamonds |> filter(cut == "Premium") |> pull(price)
# Perform a two-sample t-test (two-sided, unequal variances)
t_result <- t.test(ideal_prices, premium_prices)
print(t_result)
The output displays the t-statistic, degrees of freedom, p-value, 95% confidence interval, and sample means. A p-value less than 0.05 indicates a statistically significant difference in mean prices.
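The individual pieces of the printed output can also be pulled out of the result object, which is useful when you want to report or reuse them. The sketch below rebuilds the two price vectors with base R subsetting so that it runs on its own; the `alpha` and `is_significant` names are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

ideal_prices   <- diamonds$price[diamonds$cut == "Ideal"]
premium_prices <- diamonds$price[diamonds$cut == "Premium"]
t_result <- t.test(ideal_prices, premium_prices)

# Components of the test result
t_result$statistic  # t-statistic
t_result$parameter  # degrees of freedom (Welch approximation)
t_result$p.value    # p-value
t_result$conf.int   # 95% CI for the difference in means (Ideal minus Premium)
t_result$estimate   # the two sample means

# Compare the p-value to the chosen significance level
alpha <- 0.05
is_significant <- t_result$p.value < alpha
print(is_significant)
```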
2. Comparing Three or More Means: ANOVA (aov())
When you want to compare means of a continuous variable across three or more groups (e.g., comparing diamond prices across all 5 cuts: Fair, Good, Very Good, Premium, and Ideal), a t-test is insufficient. Running multiple pairwise t-tests increases the chance of a false positive. Instead, use ANOVA (Analysis of Variance).
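A rough calculation shows how quickly the false-positive risk grows. The sketch below assumes each pairwise test is run at a 0.05 level and, as a simplification, treats the tests as independent (pairwise tests that share groups are not strictly independent, so read this as an approximation).

```r
n_groups <- 5     # Fair, Good, Very Good, Premium, Ideal
alpha    <- 0.05  # significance level of each individual test

# Number of distinct pairs of groups
n_comparisons <- choose(n_groups, 2)
print(n_comparisons)               # 10

# Probability of at least one false positive across all comparisons,
# under the simplifying independence assumption
familywise_error <- 1 - (1 - alpha)^n_comparisons
print(round(familywise_error, 3))  # 0.401
```

With ten comparisons, the chance of at least one spurious "significant" result is around 40% rather than 5%.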
In R, fit an ANOVA model using aov() and inspect it with summary():
library(tidyverse)
# Fit ANOVA model: numeric_var ~ categorical_group_var
anova_model <- aov(price ~ cut, data = diamonds)
# Inspect the ANOVA table
summary(anova_model)
Reading the ANOVA Table:
- F value: The ratio of variance between groups to variance within groups. A high F-value suggests the group means differ significantly.
- Pr(>F): The p-value. If this is below 0.05, at least one group mean is significantly different from the others. ANOVA alone does not tell you which groups differ.
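A significant ANOVA is often followed by a post-hoc comparison to see which pairs of groups differ. The sketch below uses Tukey's Honest Significant Difference via `TukeyHSD()` from base R; this follow-up step is an addition to the workflow described above, shown as one common option rather than the only one.

```r
library(ggplot2)  # provides the diamonds dataset

anova_model <- aov(price ~ cut, data = diamonds)

# Extract the overall p-value from the ANOVA table
anova_table <- summary(anova_model)[[1]]
overall_p <- anova_table[["Pr(>F)"]][1]
print(overall_p)

# Pairwise comparisons with p-values adjusted for multiple testing
tukey_result <- TukeyHSD(anova_model)
print(tukey_result)
```

Each row of the Tukey output reports the difference in means for one pair of cuts, a confidence interval, and an adjusted p-value.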
3. Relationships Between Categorical Variables: Chi-squared Test (chisq.test())
The Chi-squared ($\chi^2$) test of independence is used to determine if there is a significant association between two categorical variables.
It works by comparing:
- Observed Frequencies: The actual counts in your dataset cross-tabulated.
- Expected Frequencies: The counts we would expect to observe if the two variables were completely independent.
How Expected Frequencies are Calculated
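Assuming independence, the joint probability of two events is the product of their individual probabilities:
$$P(A \cap B) = P(A)\,P(B)$$
Applied to a contingency table with $n$ observations, the expected count for the cell in row $i$ and column $j$ is therefore:
$$E_{ij} = n \cdot \frac{R_i}{n} \cdot \frac{C_j}{n} = \frac{R_i \, C_j}{n}$$
where $R_i$ is the total of row $i$ and $C_j$ is the total of column $j$.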
Example in R:
Let's build a contingency table of diamond cut vs. color and run a Chi-squared test:
library(tidyverse)
# 1. Create a contingency table of observed counts
contingency_table <- table(diamonds$cut, diamonds$color)
print(contingency_table)
# 2. Perform the Chi-squared test
chi_result <- chisq.test(contingency_table)
print(chi_result)
# 3. View expected counts generated by the test
print(chi_result$expected)
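The expected counts can be reproduced by hand from the row and column totals, which is a useful check on what the test is doing. This is a minimal sketch; `manual_expected` and `manual_statistic` are names introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

contingency_table <- table(diamonds$cut, diamonds$color)
chi_result <- chisq.test(contingency_table)

# Expected count per cell = (row total * column total) / grand total
n_total <- sum(contingency_table)
manual_expected <- outer(rowSums(contingency_table),
                         colSums(contingency_table)) / n_total

# Should agree with the expected counts computed by chisq.test()
all.equal(manual_expected, chi_result$expected, check.attributes = FALSE)

# Test statistic = sum over cells of (observed - expected)^2 / expected
manual_statistic <- sum((contingency_table - manual_expected)^2 / manual_expected)
print(manual_statistic)
print(unname(chi_result$statistic))
```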
4. Testing Linear Relationships: Correlation Test (cor.test())
To evaluate the strength and direction of a linear relationship between two continuous variables, use a correlation test. It calculates Pearson's correlation coefficient ($r$) and tests the null hypothesis that the true correlation is zero ($\rho = 0$, no correlation).
library(tidyverse)
# Test correlation between diamond carat and price
cor_result <- cor.test(diamonds$carat, diamonds$price)
print(cor_result)
- Correlation ($r$): Varies from $-1$ (perfect negative correlation) to $+1$ (perfect positive correlation).
- p-value: If $p < 0.05$, the correlation is statistically significant.
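As with the t-test, the numbers in the printed output can be extracted from the result object. The sketch below also confirms that the point estimate matches what `cor()` returns on its own; the variable name `r` is chosen here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

cor_result <- cor.test(diamonds$carat, diamonds$price)

# Pearson's correlation coefficient
r <- unname(cor_result$estimate)
print(r)

# 95% confidence interval and p-value
print(cor_result$conf.int)
print(cor_result$p.value)

# cor() gives the same point estimate, without the test
all.equal(r, cor(diamonds$carat, diamonds$price))
```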
Summary of Test Selection
| Variable Types | Goal | R Function |
|---|---|---|
| 1 Continuous vs. 2 Groups | Compare Means | t.test(group1, group2) |
| 1 Continuous vs. 3+ Groups | Compare Means | aov(numeric ~ category) |
| 2 Categorical Variables | Test Association | chisq.test(table(var1, var2)) |
| 2 Continuous Variables | Test Linear Relation | cor.test(var1, var2) |
Hands-on Exercises
Exercise 1: T-test on Vehicle Drivetrain
Determine if there is a statistically significant difference in city fuel efficiency (cty) between front-wheel drive ("f") and rear-wheel drive ("r") vehicles in the mpg dataset.
Write R code to:
- Filter the mpg dataset to extract cty for front-wheel drive (drv == "f").
- Filter the mpg dataset to extract cty for rear-wheel drive (drv == "r").
- Perform a two-sample t.test() and print the results.
Answer:
library(tidyverse)
# Filter groups
f_drive <- mpg |> filter(drv == "f") |> pull(cty)
r_drive <- mpg |> filter(drv == "r") |> pull(cty)
# Perform T-test
drv_t_test <- t.test(f_drive, r_drive)
print(drv_t_test)
Exercise 2: Association of Vehicle Class and Drivetrain
Test if there is a significant association between vehicle class (class, e.g. compact, suv) and drivetrain layout (drv) in the mpg dataset.
Write R code to:
- Create a contingency table of class vs. drv using table().
- Run chisq.test() on the contingency table and print the result.
- Check the p-value to decide whether there is evidence against independence.
Answer:
library(tidyverse)
# Create table
vehicle_table <- table(mpg$class, mpg$drv)
print(vehicle_table)
# Run Chi-squared test
vehicle_chi <- chisq.test(vehicle_table)
print(vehicle_chi)
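With this table, R warns that the Chi-squared approximation may be incorrect, because some expected counts are small (the dataset has only a handful of 2seater vehicles). The sketch below counts the cells with expected values under 5 and then reruns the test with a simulated p-value, using the `simulate.p.value` and `B` arguments of `chisq.test()`. The seed and the number of replicates are arbitrary choices for illustration.

```r
library(ggplot2)  # provides the mpg dataset

vehicle_table <- table(mpg$class, mpg$drv)

# Warning suppressed here only so we can inspect the expected counts
vehicle_chi <- suppressWarnings(chisq.test(vehicle_table))
small_cells <- sum(vehicle_chi$expected < 5)
print(small_cells)

# Monte Carlo p-value, which does not rely on the large-sample approximation
set.seed(1)  # arbitrary seed, used only for reproducibility
vehicle_chi_sim <- chisq.test(vehicle_table, simulate.p.value = TRUE, B = 10000)
print(vehicle_chi_sim)
```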
Exercise 3: Engine Displacement vs. City Mileage
Test if there is a significant linear correlation between a vehicle's engine size (displ) and its city fuel efficiency (cty).
Write R code to:
- Perform a correlation test using cor.test() on mpg$displ and mpg$cty.
- Print the results to examine the correlation coefficient ($r$) and p-value.
Answer:
library(tidyverse)
# Perform correlation test
engine_cor <- cor.test(mpg$displ, mpg$cty)
print(engine_cor)