EDA Case Study: Variation & Covariation
What is Exploratory Data Analysis (EDA)?
EDA is not a formal set of rules; it is a creative mindset. Your goal is to understand the structure of your data, find anomalies, and formulate questions.
To explore data systematically, we investigate two primary behaviors:
- Variation: How the values of a single variable differ from one observation to the next (its range, distribution, and patterns).
- Covariation: How the values of two or more variables change together (does one tend to change when the other changes?).
Let's explore techniques to analyze variation and covariation using visualization and data transformation together.
1. Exploring Variation
Variation is the tendency of a variable's values to change from measurement to measurement.
Continuous Variables: Histograms & Densities
To see the distribution of continuous variables, use a Histogram (geom_histogram) or a Density Plot (geom_density):
library(tidyverse)

# Histogram of engine size (displ) with custom binwidth
ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 0.5, fill = "steelblue", color = "white")
Always try different binwidth settings. A binwidth that is too wide hides detailed trends, while a binwidth that is too narrow creates visual noise.
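To see what a given binwidth does to the data, it can help to compute the bin counts directly. The sketch below assumes the same mpg data. It first draws the density plot with geom_density(), then counts observations per 0.5-litre bin with ggplot2's cut_width(). The name displ_bins is only illustrative, and the hand-computed bins should be read as an approximation of the bars geom_histogram() draws, not a guaranteed match.

```r
library(tidyverse)

# Density plot of engine size: a smoothed alternative to the histogram
ggplot(mpg, aes(x = displ)) +
  geom_density(fill = "steelblue", alpha = 0.4)

# Bin counts computed by hand: cut displ into 0.5-wide intervals, then count
# (displ_bins is an illustrative name, not part of the original tutorial)
displ_bins <- mpg |>
  count(bin = cut_width(displ, width = 0.5))

displ_bins
```

Changing width here and re-running is a quick way to see which bins merge or split before committing to a binwidth in the plot.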
Categorical Variables: Bar Charts
To see the distribution of categorical variables, use a Bar Chart (geom_bar):
# Count of vehicles in each class
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "darkgreen")
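The bar heights can also be obtained as a table, which makes exact comparisons easier. The following is a minimal sketch, assuming the same mpg data: count() tabulates each class, and fct_infreq() from forcats (loaded with tidyverse) reorders the bars by frequency. The name class_counts is illustrative.

```r
library(tidyverse)

# Tabulate vehicles per class, most frequent first
class_counts <- mpg |>
  count(class, sort = TRUE)

class_counts

# Same bar chart, with bars ordered from most to least frequent class
ggplot(mpg, aes(x = fct_infreq(class))) +
  geom_bar(fill = "darkgreen") +
  labs(x = "class")
```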
Finding Anomalies (Outliers)
Anomalies are data points that lie far outside the main cluster of values. We can find them visually or filter them with logical thresholds:
# Keep only vehicles with unusually high highway mileage (hwy above 40 mpg)
mpg |>
  filter(hwy > 40) |>
  select(manufacturer, model, hwy)
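The 40 mpg cutoff above is a hand-picked threshold. A common alternative, which the original example does not use, is the 1.5 × IQR rule that boxplots apply when flagging points. The sketch below assumes that rule is acceptable for hwy; hwy_bounds and hwy_outliers are illustrative names.

```r
library(tidyverse)

# Lower and upper fences of the 1.5 * IQR rule for highway mileage
hwy_bounds <- mpg |>
  summarise(
    q1 = quantile(hwy, 0.25, names = FALSE),
    q3 = quantile(hwy, 0.75, names = FALSE),
    lower = q1 - 1.5 * (q3 - q1),
    upper = q3 + 1.5 * (q3 - q1)
  )

# Rows whose hwy falls outside the fences
hwy_outliers <- mpg |>
  filter(hwy < hwy_bounds$lower | hwy > hwy_bounds$upper) |>
  select(manufacturer, model, hwy)

hwy_outliers
```

Whether a flagged row is an error or simply an unusual vehicle is a judgment call, so it is worth inspecting the rows before dropping anything.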
2. Exploring Covariation
Covariation describes the association between two or more variables. We look at different variable combinations:
A. Categorical vs. Continuous Variables
Compare distributions of a continuous variable across categorical groups using Boxplots (geom_boxplot) or Violin Plots (geom_violin):
# Comparing highway mileage distribution across drive trains (drv)
ggplot(mpg, aes(x = drv, y = hwy, fill = drv)) +
  geom_violin(alpha = 0.6)
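The boxplot version of the same comparison, paired with a grouped summary, puts numbers next to the shapes. This is a sketch under the assumption that the median and IQR are reasonable summaries for hwy; hwy_by_drv is an illustrative name.

```r
library(tidyverse)

# Boxplot of highway mileage for each drive train
ggplot(mpg, aes(x = drv, y = hwy, fill = drv)) +
  geom_boxplot(alpha = 0.6)

# Numeric summary per drive train: group size, median, and spread
hwy_by_drv <- mpg |>
  group_by(drv) |>
  summarise(
    n = n(),
    median_hwy = median(hwy),
    iqr_hwy = IQR(hwy)
  )

hwy_by_drv
```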
B. Categorical vs. Categorical Variables
To see how observations are distributed across two categorical columns, use geom_count() or calculate a grouped frequency and plot it as a Heatmap (geom_tile()):
# Method 1: geom_count (size of point represents frequency)
ggplot(mpg, aes(x = drv, y = class)) +
  geom_count()

# Method 2: Grouped summary with geom_tile (Heatmap)
mpg |>
  count(drv, class) |>
  ggplot(aes(x = drv, y = class, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "lightblue", high = "darkblue")
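Raw counts can be misleading when the groups differ in size, because a large drive-train group will look darker across the board. One possible variation, not part of the original example, is to fill the tiles with the share of each class within its drive train. The names class_share and prop are illustrative.

```r
library(tidyverse)

# Share of each class within its drive train
class_share <- mpg |>
  count(drv, class) |>
  group_by(drv) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Heatmap filled by within-drv proportion instead of raw count
ggplot(class_share, aes(x = drv, y = class, fill = prop)) +
  geom_tile() +
  scale_fill_gradient(low = "lightblue", high = "darkblue")
```

Within each drv column the proportions sum to 1, so columns can be compared even though they hold different numbers of vehicles.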
C. Continuous vs. Continuous Variables
Use a Scatter Plot (geom_point) to inspect correlations, and add a straight regression line using geom_smooth(method = "lm") to see the direction of association:
# City mileage vs. Highway mileage with a linear trend line
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(position = "jitter", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") # Adds a straight linear trend line
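The trend line can be backed up with numbers. The sketch below assumes a plain linear relationship is an adequate description: cor() gives the Pearson correlation, and lm() fits the same kind of straight line that geom_smooth(method = "lm") draws. The names cty_hwy_cor and fit are illustrative.

```r
library(tidyverse)

# Pearson correlation between city and highway mileage
cty_hwy_cor <- cor(mpg$cty, mpg$hwy)
cty_hwy_cor

# Straight-line fit of hwy on cty; the cty coefficient is the slope
fit <- lm(hwy ~ cty, data = mpg)
coef(fit)
```

A positive slope and a correlation close to 1 would both indicate that vehicles with higher city mileage tend to have higher highway mileage.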
Hands-on Exercises
Exercise 1: Fuel Volatility Check
Investigate the distribution of city mileage (cty).
Write R code to:
- Create a histogram of cty using geom_histogram().
- Set the binwidth to 2.
- Set the bar fill color to "purple" and borders to "white".
- Apply theme_minimal().
Answer:
library(tidyverse)

ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 2, fill = "purple", color = "white") +
  theme_minimal()
Exercise 2: Heatmap of Cylinders vs. Drive Trains
Determine how cylinder counts (cyl) are distributed across drive train types (drv).
Write R code to:
- Group and count observations by drv and cyl. (Hint: Use count(drv, cyl)).
- Pipe the result into ggplot().
- Create a heatmap using geom_tile(), mapping drv to the x-axis, factor(cyl) to the y-axis, and the count n to the fill color.
- Scale the fill color using scale_fill_gradient(low = "yellow", high = "red").
Answer:
library(tidyverse)

mpg |>
  count(drv, cyl) |>
  ggplot(aes(x = drv, y = factor(cyl), fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "yellow", high = "red") +
  labs(title = "Car Count by Drivetrain and Cylinders", y = "Cylinder Count")