Aesthetic Mappings & Geoms

Why Learn Data Visualization in Exploratory Data Analysis?

Before running statistical tests or building machine learning models, you must look at your data.

Imagine you are analyzing a dataset of used cars. You want to know whether larger engines (greater displacement) generally come with poorer fuel efficiency (lower highway miles per gallon). Scanning a spreadsheet of 10,000 rows will not reveal the trend.

If you instead draw a scatter plot with engine size on the x-axis and miles per gallon on the y-axis, the relationship becomes clear at a glance: as engine size increases, fuel efficiency tends to decrease.

In R, data visualization is built on the Grammar of Graphics, implemented in the ggplot2 package (part of the tidyverse). In this lesson, we will explore how to map data columns to visual attributes and create core geometric charts.

1. The Grammar of Graphics

The Grammar of Graphics is a systematic way of building charts layer-by-layer. A ggplot2 chart consists of:

Data: The data frame/tibble containing the variables.
Aesthetic Mapping (aes): Connecting columns to visual variables (x position, y position, color, shape, size).
Geometric Layers (geom_*): The actual marks drawn on the screen (points, lines, bars, boxes).
Statistical Transformations (stat_*): Calculations performed on the data before plotting (e.g. counting rows for bars, calculating quartiles for boxplots).
Coordinate Systems: How points are arranged spatially (e.g. Cartesian, Polar).
Facets: Breaking a single plot into multiple side-by-side subplots.
Themes: Text, labels, background colors, and style properties.
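As a rough illustration of how these components stack, the sketch below combines one of each in a single plot of the mpg dataset. The particular choices (a mean line, facets by drv, theme_minimal(), the y-axis limits) and the variable name layered_plot are assumptions picked for the example, not a recommended recipe.

```r
library(ggplot2)

layered_plot <- ggplot(data = mpg,                           # data
                       mapping = aes(x = displ, y = hwy)) +  # aesthetic mapping
  geom_point(alpha = 0.5) +                   # geometric layer: raw observations
  stat_summary(fun = mean, geom = "line") +   # statistical transformation: mean hwy per displ
  coord_cartesian(ylim = c(10, 45)) +         # coordinate system: Cartesian, zoomed on y
  facet_wrap(~ drv) +                         # facets: one panel per drive train
  theme_minimal()                             # theme: non-data styling

layered_plot
```

Each line after the first adds exactly one component, so any of them can be removed without breaking the rest of the plot.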

2. Creating Your First ggplot: Scatter Plots

To create a plot, call ggplot(data, mapping = aes(x, y)) to initialize the canvas, then add a geometric layer using the + operator.

Use the Plus `+` Operator

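In ggplot2, layers are added to a plot object with the + operator, not the pipe operator |>. The pipe can pass a data frame into ggplot(), but once the plot object exists, each new layer is attached with +.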

library(tidyverse)

# Initialize canvas with mpg dataset (bundled with ggplot2)
# and add a scatter plot layer of engine size (displ) vs highway mileage (hwy)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
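Because + returns a new plot object, a plot can be stored in a variable and extended later. A minimal sketch (the variable names base_plot, scatter, and scatter_with_trend are made up for this example; geom_smooth() is covered later in this lesson):

```r
library(ggplot2)

# An empty canvas: data and mappings are set, but nothing is drawn yet
base_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Each + returns a new plot with one more layer; base_plot itself is unchanged
scatter <- base_plot + geom_point()
scatter_with_trend <- scatter + geom_smooth(method = "lm", formula = y ~ x)

scatter_with_trend
```

Storing intermediate plots this way makes it easy to try different layers on top of the same canvas.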

Alternative Syntax Options

R developers often write the data first or pipe it directly into ggplot():

# Positional arguments: data is first, aes mapping is second
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Piping data frame into ggplot
mpg |> 
  ggplot(aes(x = displ, y = hwy)) + 
  geom_point()
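Piping is most useful when the data needs some preparation before plotting. The sketch below assumes dplyr is available (it is attached by library(tidyverse)) and filters to SUVs before plotting; the subset chosen is arbitrary. Note the switch from |> to + once ggplot() has been called.

```r
library(ggplot2)
library(dplyr)

suv_plot <- mpg |>
  filter(class == "suv") |>            # data steps are chained with the pipe
  ggplot(aes(x = displ, y = hwy)) +    # plot layers are chained with +
  geom_point()

suv_plot
```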

3. Aesthetic Mappings: Color, Shape, and Size

Aesthetic mappings are placed inside aes(). They tell ggplot2 to vary the color, shape, or size of points according to the values in a given column.

# Color points based on the car class (categorical variable)
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Vary size of points based on the number of cylinders (cyl)
ggplot(mpg, aes(x = displ, y = hwy, color = class, size = cyl)) +
  geom_point()
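The same pattern works for shape, which the examples above do not show. A small sketch mapping the drive train column (drv) to point shape; the choice of drv is just for illustration. Shape suits categorical variables with only a few levels.

```r
library(ggplot2)

# Each drive train type (4, f, r) gets its own point shape and a legend entry
shape_plot <- ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point()

shape_plot
```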

The Categorical Factor Level Rule

By default, ggplot2 treats numeric columns (like cyl, which takes the values 4, 5, 6, and 8) as continuous, so mapping one to color produces a gradient scale. To treat a numeric column as a set of distinct categories, convert it to a factor using factor():

# Treat cylinders (cyl) as discrete categories rather than a numeric scale
ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()

Global Aesthetics vs. Constant Overrides

If you want to apply a style to all points (e.g., making all dots green and larger), pass these style parameters outside of the aes() parentheses directly inside the geom function:

# Correct: Constant overrides belong outside aes()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "darkgreen", size = 3, alpha = 0.5)

# Incorrect: Do not place constants inside aes()
# ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) + geom_point()
# (This creates a dummy category named "blue" rather than coloring points blue!)
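One way to see the difference is to inspect the data ggplot2 actually draws. The sketch below uses layer_data(), a ggplot2 helper that returns the computed data for a layer, to compare the two versions; the variable names are made up for this example.

```r
library(ggplot2)

# Constant outside aes(): every point is drawn in the color requested
constant_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Constant inside aes(): "blue" is treated as a data value with one level,
# so the color scale assigns it a color from the default palette
mapped_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

unique(layer_data(constant_plot)$colour)  # the color set in the geom
unique(layer_data(mapped_plot)$colour)    # whatever the default scale assigned
```

The second plot also gains a legend with a single entry labeled "blue", which is usually the first visible sign of this mistake.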

4. Jittering: Handling Overplotting

When a dataset contains integer values or rounded coordinates (e.g., engine sizes rounded to one decimal place, or mileage recorded as whole numbers), many points share identical coordinates and stack directly on top of each other. This is called overplotting.

To reveal overlapping data points, add a small amount of random noise to each point's position using position = "jitter" or the geom_jitter() geom:

# Jittered scatter plot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter", alpha = 0.6)
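geom_jitter() also exposes width and height arguments that control how far points may move. A sketch with arbitrary values for those arguments, plus a seed passed to position_jitter() so the noise is reproducible (the variable names are made up for this example):

```r
library(ggplot2)

# geom_jitter(): width/height set the maximum random shift in data units
jitter_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_jitter(width = 0.05, height = 0.3, alpha = 0.6)

# position_jitter(seed = ...) makes the random noise repeatable across renders
seeded_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.05, height = 0.3, seed = 42),
             alpha = 0.6)

seeded_plot
```

Keeping the jitter small relative to the spacing of the data avoids suggesting precision or variation that is not really there.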

5. Other Key Geoms: Lines, Boxplots, and Bar Charts

Depending on the types of columns (categorical vs. quantitative), you need different geometries:

Line Charts (geom_line)

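Ideal for trends over time or along any ordered continuous variable. geom_line() connects observations in order of x, so when several rows share the same x value, summarize them first.

# Line plot of average highway mileage at each engine displacement value
# (stat = "summary" with fun = mean averages hwy per displ before the line is drawn;
# plain geom_line() would connect every car and produce a jagged zigzag)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_line(stat = "summary", fun = mean)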
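For an actual time trend, the economics dataset (also bundled with ggplot2) has one row per month, so geom_line() can connect the observations directly. A minimal sketch; the choice of the unemploy column and the variable name are arbitrary.

```r
library(ggplot2)

# One observation per date, so the line traces the series without any summarizing
unemployment_plot <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

unemployment_plot
```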

Box Plots (geom_boxplot)

Perfect for comparing the distribution of a continuous variable across different categories:

# Highway mileage distribution by car class
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()
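Boxplots are often easier to compare when the categories are sorted by their typical value. One possible approach uses base R's reorder() inside aes() to order the classes by median highway mileage; labs() is added only to keep the axis title readable.

```r
library(ggplot2)

# reorder() sorts the class levels by the median of hwy within each class
ordered_boxplot <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(x = "class")

ordered_boxplot
```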

Bar Charts (geom_bar)

By default, geom_bar counts the frequency of categorical items (its default statistical transformation is stat_count):

# Counts of cars in each class category
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue")
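When the counts have already been computed, geom_col() draws bars from the values as given instead of counting rows. A sketch assuming dplyr is available (it is attached by library(tidyverse)); count() produces a column named n, which is mapped to y. The names class_counts and col_plot are made up for this example.

```r
library(ggplot2)
library(dplyr)

# Pre-compute the number of cars per class: one row per class, count in column n
class_counts <- count(mpg, class)

# geom_col() uses the y values directly (no stat_count step)
col_plot <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_col(fill = "steelblue")

col_plot
```

The result should look the same as the geom_bar() version; the difference is only in where the counting happens.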

6. On-the-fly Statistical Summaries

ggplot2 allows you to overlay computed statistics directly onto the chart.

Trend Lines: geom_smooth()

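To draw a fitted trend line, use geom_smooth(). By default it fits a smoothed local regression (loess) for datasets with fewer than 1,000 rows; pass method = "lm" for a straight regression line: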

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter", alpha = 0.3) +
  geom_smooth() # Adds a smooth trend line with confidence intervals
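geom_smooth() accepts a method argument when a specific model is wanted. A sketch fitting a straight line with method = "lm" and hiding the confidence band with se = FALSE; both settings are illustrative choices rather than defaults, and the variable name is made up.

```r
library(ggplot2)

linear_trend_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  # method = "lm" fits hwy ~ displ by linear regression; se = FALSE drops the ribbon
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

linear_trend_plot
```

A straight line is easier to describe than a loess curve, but it can hide curvature in the data, so it is worth comparing both during exploration.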

Custom Summaries: stat_summary()

In addition to boxplots, you can overlay custom summaries computed from the data (for example, a large red point marking the mean of each category):

ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  # Add mean points computed dynamically on the fly
  stat_summary(fun = mean, size = 3, color = "red", geom = "point")
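stat_summary() can also compute a range. A sketch using its fun, fun.min, and fun.max arguments (available in ggplot2 3.3.0 and later) to draw the median with a line from the minimum to the maximum of each class; when no geom is specified, stat_summary() draws a point-range. The variable name range_plot is made up for this example.

```r
library(ggplot2)

range_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  stat_summary(
    fun = median,    # position of the point
    fun.min = min,   # lower end of the vertical line
    fun.max = max,   # upper end of the vertical line
    color = "red"
  )

range_plot
```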

Hands-on Exercises

Exercise 1: Fuel Economy Exploration

Use the mpg dataset. Write R code to:

Initialize a ggplot using the mpg table.
Create a scatter plot of city mileage (cty on the x-axis) vs. highway mileage (hwy on the y-axis).
Map the points' color to the car's drive train type (drv column, representing front-wheel, rear-wheel, or 4WD).
Increase the size of all points to 3 and set transparency (alpha) to 0.7 to reduce the effect of overplotting.

Answer

library(tidyverse)

ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_point(size = 3, alpha = 0.7)

Exercise 2: Comparing Distributions

Compare highway mileage (hwy) across different car drive trains (drv) using boxplots. Write R code to:

Create a boxplot chart showing drv on the x-axis and hwy on the y-axis.
Fill the boxes with colors matching the drive train category (fill = drv).
Add a dynamic statistical summary layer overlaying a red point for the median value of each drive train. (Hint: Use stat_summary(fun = median, color = "red", size = 2, geom = "point")).

Answer

library(tidyverse)

ggplot(mpg, aes(x = drv, y = hwy, fill = drv)) +
  geom_boxplot() +
  stat_summary(fun = median, color = "red", size = 2, geom = "point")