Hypothesis Testing & Statistical Tests

Introduction to Hypothesis Testing

Exploratory Data Analysis (EDA) helps you discover patterns, while statistical testing helps you assess whether those patterns are statistically significant or could plausibly be random noise. A test quantifies evidence; it does not prove that a pattern is real.

In statistical testing, we formulate two statements:

Null Hypothesis ( $H_0$ ): A statement that there is no effect or no difference in the population (e.g. "Diamond cut has no effect on price").
Alternative Hypothesis ( $H_1$ ): A statement that there is an effect or difference (e.g. "Diamond cut affects price").

The p-value

We execute a statistical test to calculate a p-value (probability value).

Definition: The probability of observing data at least as extreme as our sample, assuming the Null Hypothesis ( $H_0$ ) is true. It is not the probability that $H_0$ is true.
Interpretation: A small p-value (conventionally $< 0.05$ ) means the observed data would be unlikely if $H_0$ were true, so we reject $H_0$ in favor of the alternative hypothesis. A large p-value means we fail to reject $H_0$ ; it does not show that $H_0$ is true.
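
To make the definition concrete, the following minimal sketch computes a two-sided p-value by hand and checks it against t.test(). It uses a one-sample t-test only because that is the simplest case; the data vector x and the hypothesized mean mu0 are made-up values for illustration.

```r
# Hypothetical sample and hypothesized population mean (illustrative values)
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.2, 5.7)
mu0 <- 5

n <- length(x)

# Test statistic: how many standard errors the sample mean lies from mu0
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value: probability, under H0, of a t statistic at least this extreme
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# The built-in test should report the same p-value
p_builtin <- t.test(x, mu = mu0)$p.value
print(c(manual = p_manual, builtin = p_builtin))
```

The two numbers should agree, which shows that the p-value is simply a tail probability of the test statistic's distribution under $H_0$ .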

1. Comparing Two Means: The T-test (t.test())

Use a two-sample t-test to compare the means of a continuous variable between two distinct groups.

Example: Prices of Ideal vs. Premium Diamonds

Let's test if there is a significant difference in average price between "Ideal" and "Premium" cut diamonds in the diamonds dataset.

library(tidyverse)

# Filter for the two groups
ideal_prices <- diamonds |> filter(cut == "Ideal") |> pull(price)
premium_prices <- diamonds |> filter(cut == "Premium") |> pull(price)

# Perform a two-sample t-test (two-sided, unequal variances)
t_result <- t.test(ideal_prices, premium_prices)
print(t_result)

The output displays the t-statistic, degrees of freedom, p-value, 95% confidence interval, and sample means. A p-value less than $0.05$ indicates a statistically significant difference in mean prices.
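
Each of the quantities listed above can also be pulled out of the result object instead of being read from the printed output. The sketch below uses simulated data (the vectors group_a and group_b are hypothetical, not the diamonds prices) so that it runs without extra packages.

```r
# Simulated data for two hypothetical groups (not the diamonds dataset)
set.seed(42)
group_a <- rnorm(50, mean = 100, sd = 15)
group_b <- rnorm(50, mean = 110, sd = 15)

# Welch two-sample t-test (the default in t.test)
res <- t.test(group_a, group_b)

res$statistic  # t-statistic
res$parameter  # degrees of freedom (Welch approximation)
res$p.value    # p-value
res$conf.int   # 95% confidence interval for the difference in means
res$estimate   # sample means of the two groups
```

Extracting values this way is convenient when the p-value or confidence interval feeds into a later step, such as a report or a table.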

2. Comparing Three or More Means: ANOVA (aov())

When you want to compare means of a continuous variable across three or more groups (e.g., comparing diamond prices across all 5 cuts: Fair, Good, Very Good, Premium, and Ideal), a t-test is insufficient. Running multiple pairwise t-tests increases the chance of a false positive. Instead, use ANOVA (Analysis of Variance).
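
As a rough illustration of why repeated t-tests inflate false positives, the sketch below counts the pairwise comparisons among 5 groups and estimates the chance of at least one false positive at a 0.05 threshold. It assumes the tests are independent, which pairwise tests that share groups are not, so treat the result as an approximation rather than an exact rate.

```r
alpha <- 0.05     # significance threshold for each individual test
n_groups <- 5     # e.g. the five diamond cuts

# Number of distinct pairs of groups
n_pairs <- choose(n_groups, 2)

# Probability of at least one false positive across all pairs,
# assuming (as a simplification) that the tests are independent
fwer <- 1 - (1 - alpha)^n_pairs

print(n_pairs)         # 10
print(round(fwer, 3))  # 0.401
```

Under this simplification, ten tests at the 5% level give roughly a 40% chance of at least one spurious "significant" result, which is the problem a single ANOVA avoids.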

In R, fit an ANOVA model using aov() and inspect it with summary():

library(tidyverse)

# Fit ANOVA model: numeric_var ~ categorical_group_var
anova_model <- aov(price ~ cut, data = diamonds)

# Inspect the ANOVA table
summary(anova_model)

Reading the ANOVA Table:

F value: The ratio of variance between groups to variance within groups. A large F value suggests that at least some group means differ.
Pr(>F): The p-value. If this is $< 0.05$ , we conclude that at least one group mean differs from the others. The test does not say which groups differ.
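
The F value can be reproduced by hand from the between-group and within-group variation. The sketch below does this for the built-in PlantGrowth dataset, chosen here only because it ships with base R and is small; it is not part of the original example.

```r
# PlantGrowth (base R): plant weight for 3 treatment groups
weight <- PlantGrowth$weight
group <- PlantGrowth$group

n <- length(weight)    # total number of observations
k <- nlevels(group)    # number of groups

grand_mean <- mean(weight)
group_mean_per_obs <- ave(weight, group)  # each observation's own group mean

# Sums of squares between and within groups
ss_between <- sum((group_mean_per_obs - grand_mean)^2)
ss_within <- sum((weight - group_mean_per_obs)^2)

# F = mean square between / mean square within
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual <- pf(f_manual, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Compare with the ANOVA table produced by aov()
anova_table <- summary(aov(weight ~ group, data = PlantGrowth))[[1]]
print(c(manual = f_manual, aov = anova_table[["F value"]][1]))
```

The manual F value and p-value should match the "F value" and "Pr(>F)" columns of the table.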

3. Relationships Between Categorical Variables: Chi-squared Test (chisq.test())

The Chi-squared ( $\chi^2$ ) test of independence is used to determine if there is a significant association between two categorical variables.

It works by comparing:

Observed Frequencies: The actual counts in the cross-tabulation of your data.
Expected Frequencies: The counts we would expect to observe if the two variables were completely independent.

How Expected Frequencies are Calculated

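Under independence, the joint probability of a cell is the product of its row and column probabilities, $\frac{\text{Row Total}}{\text{Grand Total}} \times \frac{\text{Column Total}}{\text{Grand Total}}$ . Multiplying this probability by the grand total gives the expected count: $\text{Expected Count} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$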
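
The formula can be checked by hand on a small table. In the sketch below the observed counts are made up for illustration (the labels "A", "B", "low", "mid", "high" are hypothetical); the manual expected counts and test statistic are then compared with what chisq.test() reports.

```r
# Hypothetical 2 x 3 table of observed counts
observed <- rbind(c(20, 30, 50),
                  c(30, 30, 40))
dimnames(observed) <- list(group = c("A", "B"),
                           level = c("low", "mid", "high"))

# Expected count = row total * column total / grand total, for every cell
expected_manual <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(expected_manual)  # each row: 25, 30, 45

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
stat_manual <- sum((observed - expected_manual)^2 / expected_manual)

# Compare with the built-in test
chi <- chisq.test(observed)
print(all.equal(expected_manual, chi$expected, check.attributes = FALSE))
print(c(manual = stat_manual, builtin = unname(chi$statistic)))
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.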

Example in R:

Let's build a contingency table of diamond cut vs. color and run a Chi-squared test:

library(tidyverse)

# 1. Create a contingency table of observed counts
contingency_table <- table(diamonds$cut, diamonds$color)
print(contingency_table)

# 2. Perform the Chi-squared test
chi_result <- chisq.test(contingency_table)
print(chi_result)

# 3. View expected counts generated by the test
print(chi_result$expected)

4. Testing Linear Relationships: Correlation Test (cor.test())

To evaluate the strength and direction of a linear relationship between two continuous variables, use a correlation test. It calculates Pearson's correlation coefficient ( $r$ ) from the sample and tests the null hypothesis that the true (population) correlation is zero.

library(tidyverse)

# Test correlation between diamond carat and price
cor_result <- cor.test(diamonds$carat, diamonds$price)
print(cor_result)

Correlation ( $r$ ): Varies from $-1$ (perfect negative correlation) to $+1$ (perfect positive correlation).
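p-value: If $< 0.05$ , the correlation is statistically significant, meaning it differs significantly from zero. With large samples, even a weak correlation can be significant, so read the p-value together with the size of $r$ .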
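
The p-value reported for Pearson's correlation comes from a t statistic with $n - 2$ degrees of freedom, $t = r\sqrt{n - 2} / \sqrt{1 - r^2}$ . The sketch below reproduces it by hand using the built-in mtcars dataset, chosen here only to keep the example self-contained.

```r
# mtcars (base R): car weight vs. miles per gallon
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Sample correlation coefficient
r <- cor(x, y)

# t statistic and two-sided p-value for H0: population correlation is zero
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

# Compare with the built-in test
ct <- cor.test(x, y)
print(c(r = r, t_manual = t_manual, t_builtin = unname(ct$statistic)))
print(c(p_manual = p_manual, p_builtin = ct$p.value))
```

Because the statistic grows with $\sqrt{n - 2}$ , the same $r$ yields a smaller p-value as the sample size increases.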

Summary of Test Selection

| Variable Types | Goal | R Function |
| --- | --- | --- |
| 1 Continuous vs. 2 Groups | Compare Means | `t.test(group1, group2)` |
| 1 Continuous vs. 3+ Groups | Compare Means | `aov(numeric ~ category)` |
| 2 Categorical Variables | Test Association | `chisq.test(table(var1, var2))` |
| 2 Continuous Variables | Test Linear Relation | `cor.test(var1, var2)` |

Hands-on Exercises

Exercise 1: T-test on Vehicle Drivetrain

Determine if there is a statistically significant difference in city fuel efficiency (cty) between front-wheel drive ("f") and rear-wheel drive ("r") vehicles in the mpg dataset. Write R code to:

Filter the mpg dataset to extract cty for front-wheel drive (drv == "f").
Filter the mpg dataset to extract cty for rear-wheel drive (drv == "r").
Perform a two-sample t.test() and print the results.

Answer:

library(tidyverse)

# Filter groups
f_drive <- mpg |> filter(drv == "f") |> pull(cty)
r_drive <- mpg |> filter(drv == "r") |> pull(cty)

# Perform T-test
drv_t_test <- t.test(f_drive, r_drive)
print(drv_t_test)

Exercise 2: Association of Vehicle Class and Drivetrain

Test if there is a significant association between vehicle class (class, e.g. compact, suv) and drivetrain layout (drv) in the mpg dataset. Write R code to:

Create a contingency table of class vs. drv using table().
Run chisq.test() on the contingency table and print the result.
Check the p-value to decide whether there is evidence against independence of the two variables.

Answer:

library(tidyverse)

# Create table
vehicle_table <- table(mpg$class, mpg$drv)
print(vehicle_table)

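# Run Chi-squared test
# Note: some class/drv cells have expected counts below 5, so R warns that
# the chi-squared approximation may be incorrect; interpret the p-value with care.
vehicle_chi <- chisq.test(vehicle_table)
print(vehicle_chi)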

Exercise 3: Engine Displacement vs. City Mileage

Test if there is a significant linear correlation between a vehicle's engine size (displ) and its city fuel efficiency (cty). Write R code to:

Perform a correlation test using cor.test() on mpg$displ and mpg$cty.
Print the results to examine the correlation coefficient ( $r$ ) and p-value.

Answer:

library(tidyverse)

# Perform correlation test
engine_cor <- cor.test(mpg$displ, mpg$cty)
print(engine_cor)