EDA Case Study: Variation & Covariation

What is Exploratory Data Analysis (EDA)?

EDA is not a formal set of rules; it is a creative mindset. Your goal is to understand the structure of your data, find anomalies, and formulate questions.

To explore data systematically, we investigate two primary behaviors:

Variation: How a single variable behaves within itself (its range, distribution, and patterns).
Covariation: How different variables behave in relation to one another (does one change when the other changes?).

Let's explore techniques to analyze variation and covariation using visualization and data transformation together.

1. Exploring Variation

Variation is the tendency of a variable's values to change from measurement to measurement.

Continuous Variables: Histograms & Densities

To see the distribution of continuous variables, use a Histogram (geom_histogram) or a Density Plot (geom_density):

library(tidyverse)

# Histogram of engine displacement (displ, in liters) with a custom binwidth
ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 0.5, fill = "steelblue", color = "white")
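
The density plot mentioned above can be drawn the same way. This is a minimal sketch on the same mpg data; the fill color, the transparency, and the object name p_density are illustrative choices, not part of the original example.

```r
library(ggplot2)

# Density plot of engine displacement (displ): a smoothed
# alternative to the histogram of the same variable
p_density <- ggplot(mpg, aes(x = displ)) +
  geom_density(fill = "steelblue", alpha = 0.4)

p_density
```

A density curve does not depend on a binwidth, but it has its own smoothing setting (the adjust argument of geom_density), so it is worth varying that as well.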

Experiment with Binwidths

Always try different binwidth settings. A binwidth that is too wide hides detailed trends, while a binwidth that is too narrow creates visual noise.
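
One way to see the effect is to draw the same variable with two very different settings. The values 2 and 0.05 below are arbitrary choices for illustration, and p_wide and p_narrow are hypothetical names.

```r
library(ggplot2)

# Same variable, two binwidths
p_wide <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white")

p_narrow <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white")

p_wide    # a handful of broad bars: detail is lost
p_narrow  # many thin bars: gaps and spikes dominate

# layer_data() returns the computed bins, one row per bin
nrow(layer_data(p_wide))
nrow(layer_data(p_narrow))
```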

Categorical Variables: Bar Charts

To see the distribution of categorical variables, use a Bar Chart (geom_bar):

# Count of vehicles in each class
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "darkgreen")
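
The heights of those bars can also be computed directly, which is useful when exact numbers are needed. A short sketch using dplyr's count(); the name class_counts is only illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Frequency table behind the bar chart, largest class first
class_counts <- mpg |>
  count(class, sort = TRUE)

class_counts
```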

Finding Anomalies (Outliers)

Anomalies are data points that lie far outside the main cluster of values. We can find them visually or filter them with logical thresholds:

# Filter for extreme outliers (highway mileage greater than 40 MPG)
mpg |>
  filter(hwy > 40) |>
  select(manufacturer, model, hwy)
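
The threshold of 40 above was chosen by eye. As an alternative, a cutoff can be derived from the data itself. The sketch below uses the common 1.5 * IQR convention (the same rule boxplots use for their whiskers); this is a swapped-in technique for illustration, and the names upper_fence and hwy_outliers are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# First and third quartiles of highway mileage
q <- quantile(mpg$hwy, c(0.25, 0.75))

# Upper fence: third quartile plus 1.5 times the interquartile range
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Keep only the rows above the fence
hwy_outliers <- mpg |>
  filter(hwy > upper_fence) |>
  select(manufacturer, model, hwy)

hwy_outliers
```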

2. Exploring Covariation

Covariation describes the association between two or more variables. The appropriate plot depends on whether each variable is categorical or continuous:

A. Categorical vs. Continuous Variables

Compare distributions of a continuous variable across categorical groups using Boxplots (geom_boxplot) or Violin Plots (geom_violin):

# Comparing highway mileage distribution across drive trains (drv)
ggplot(mpg, aes(x = drv, y = hwy, fill = drv)) +
  geom_violin(alpha = 0.6)
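
The boxplot version of the same comparison, together with a grouped summary that puts numbers on what the plot shows. This is a sketch; the object names p_box and hwy_by_drv and the choice of summary statistics are illustrative assumptions.

```r
library(dplyr)
library(ggplot2)

# Boxplot of highway mileage for each drive train
p_box <- ggplot(mpg, aes(x = drv, y = hwy, fill = drv)) +
  geom_boxplot(alpha = 0.6)

p_box

# Numeric companion: median and interquartile range per drive train
hwy_by_drv <- mpg |>
  group_by(drv) |>
  summarise(
    median_hwy = median(hwy),
    iqr_hwy = IQR(hwy),
    n = n()
  )

hwy_by_drv
```

A boxplot summarizes each group with its quartiles and flags points beyond the whiskers, while a violin plot shows the full shape of the distribution. Looking at both usually gives a fuller picture.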

B. Categorical vs. Categorical Variables

To see how observations are distributed across two categorical columns, use geom_count() or calculate a grouped frequency and plot it as a Heatmap (geom_tile()):

# Method 1: geom_count (size of point represents frequency)
ggplot(mpg, aes(x = drv, y = class)) +
  geom_count()

# Method 2: Grouped summary with geom_tile (Heatmap)
mpg |>
  count(drv, class) |>
  ggplot(aes(x = drv, y = class, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "lightblue", high = "darkblue")

C. Continuous vs. Continuous Variables

Use a Scatter Plot (geom_point) to inspect correlations, and add a straight regression line using geom_smooth(method = "lm") to see the direction of association:

# City mileage vs. Highway mileage with a linear trend line
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(position = "jitter", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") # Adds a straight linear trend line
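
The trend line can be backed up with numbers. The sketch below computes the Pearson correlation with cor() and fits the same straight line with lm(); the names r, fit, and slope are illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Pearson correlation between city and highway mileage
r <- cor(mpg$cty, mpg$hwy)

# The straight line drawn by geom_smooth(method = "lm")
fit <- lm(hwy ~ cty, data = mpg)
slope <- coef(fit)[["cty"]]

r      # strength and direction of the linear association
slope  # change in hwy for each additional MPG of cty
```

A positive slope and a correlation close to 1 indicate that cars with better city mileage also tend to have better highway mileage.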

Hands-on Exercises

Exercise 1: Fuel Volatility Check

Investigate the distribution of city mileage (cty). Write R code to:

Create a histogram of cty using geom_histogram().
Set the binwidth to 2.
Set the bar fill color to "purple" and borders to "white".
Apply theme_minimal().


Answer

library(tidyverse)

ggplot(mpg, aes(x = cty)) +
  geom_histogram(binwidth = 2, fill = "purple", color = "white") +
  theme_minimal()

Exercise 2: Heatmap of Cylinders vs. Drive Trains

Determine how cylinder counts (cyl) are distributed across drive train types (drv). Write R code to:

Group and count observations by drv and cyl. (Hint: Use count(drv, cyl)).
Pipe the result into ggplot().
Create a heatmap using geom_tile(), mapping drv to the x-axis, factor(cyl) to the y-axis, and the count n to the fill color.
Scale the fill color using scale_fill_gradient(low = "yellow", high = "red").


Answer

library(tidyverse)

mpg |>
  count(drv, cyl) |>
  ggplot(aes(x = drv, y = factor(cyl), fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "yellow", high = "red") +
  # Optional: a title and a readable y-axis label (not required by the exercise)
  labs(title = "Car Count by Drivetrain and Cylinders", y = "Cylinder Count")