Aesthetic Mappings & Geoms
Why Learn Data Visualization in Exploratory Data Analysis?
Before running statistical tests or building machine learning models, you must look at your data.
Imagine you are analyzing a dataset of used cars. You want to understand whether larger engines (larger displacement) always result in poorer fuel efficiency (lower highway miles per gallon). Scanning a spreadsheet of 10,000 rows will not reveal the trend.
If you construct a scatter plot placing engine size on the x-axis and miles per gallon on the y-axis, the relationship becomes instantly clear: as engine size increases, fuel efficiency decreases.
In R, data visualization is built on the Grammar of Graphics, implemented in the ggplot2 package (part of the tidyverse). In this lesson, we will explore how to map data columns to visual attributes and create core geometric charts.
1. The Grammar of Graphics
The Grammar of Graphics is a systematic way of building charts layer-by-layer. A ggplot2 chart consists of:
- Data: The data frame/tibble containing the variables.
- Aesthetic Mapping (aes): Connecting columns to visual variables (x position, y position, color, shape, size).
- Geometric Layers (geom_*): The actual marks drawn on the screen (points, lines, bars, boxes).
- Statistical Transformations (stat_*): Calculations performed on the data before plotting (e.g. counting rows for bars, calculating quartiles for boxplots).
- Coordinate Systems: How points are arranged spatially (e.g. Cartesian, Polar).
- Facets: Breaking a single plot into multiple side-by-side subplots.
- Themes: Text, labels, background colors, and style properties.
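As a rough sketch of how these components fit together, the plot below uses one piece of each in a single chart. The specific choices (a linear smoother, faceting by drv, the y-axis limits, theme_minimal()) are illustrative assumptions rather than anything the rest of this lesson depends on.

```r
library(ggplot2)

p_grammar <- ggplot(data = mpg,                             # data
                    mapping = aes(x = displ, y = hwy)) +    # aesthetic mapping
  geom_point() +                                            # geometric layer
  stat_smooth(method = "lm", formula = y ~ x) +             # statistical transformation
  coord_cartesian(ylim = c(10, 45)) +                       # coordinate system
  facet_wrap(~ drv) +                                       # facets (one panel per drive train)
  theme_minimal()                                           # theme

p_grammar
```

Each line after the first adds exactly one component, so any of them can be removed or swapped without touching the others.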
2. Creating Your First ggplot: Scatter Plots
To create a plot, call ggplot(data, mapping = aes(x, y)) to initialize the canvas, then add a geometric layer using the + operator.
Layers are added to a ggplot object with the + operator, not the pipe operator |>. The pipe can feed a data frame into ggplot(), but every layer after that must be joined with +.
library(tidyverse)
# Initialize canvas with mpg dataset (bundled with ggplot2)
# and add a scatter plot layer of engine size (displ) vs highway mileage (hwy)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
Alternative Syntax Options
R developers often pass the data and the aes() mapping as positional arguments, or pipe the data frame directly into ggplot():
# Positional arguments: data is first, aes mapping is second
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
# Piping data frame into ggplot
mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()
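Because a ggplot is an ordinary R object, it can also be stored in a variable and extended later with +. A minimal sketch (the names base_plot and scatter are only illustrative):

```r
library(ggplot2)

# Store the canvas (data + mapping) without any layers yet
base_plot <- ggplot(mpg, aes(x = displ, y = hwy))

# Add a layer later with +; base_plot itself is left unchanged
scatter <- base_plot + geom_point()

# Printing the object draws the plot
scatter
```

This pattern is handy when several charts share the same data and axes but differ in their layers.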
3. Aesthetic Mappings: Color, Shape, and Size
Aesthetic mappings are placed inside aes(). They instruct R to vary the color, shape, or size of points based on values in a specific column.
# Color points based on the car class (categorical variable)
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
# Vary size of points based on the number of cylinders (cyl)
ggplot(mpg, aes(x = displ, y = hwy, color = class, size = cyl)) +
  geom_point()
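The examples above cover color and size; shape works the same way. The sketch below maps shape to drv rather than class, on the assumption that a variable with few categories is the better fit: ggplot2's default shape palette handles at most six discrete values, and class has seven.

```r
library(ggplot2)

# Map the drive train (drv: "4", "f", "r") to point shape
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point(size = 2)

p_shape
```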
The Categorical Factor Level Rule
By default, ggplot2 treats numeric columns (like cyl, which takes the values 4, 5, 6, and 8) as continuous and displays them on a gradient scale. To treat a numeric column as distinct categories, convert it to a factor using factor():
# Treat cylinders (cyl) as discrete categories rather than a numeric scale
ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()
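To see why the conversion matters, it can help to inspect the column before plotting. A small sketch using only base R functions on the mpg data:

```r
library(ggplot2)  # provides the mpg dataset

# cyl is stored as a number, so ggplot2 picks a continuous scale for it
is.numeric(mpg$cyl)

# After factor(), each distinct value becomes its own category (level)
cyl_factor <- factor(mpg$cyl)
levels(cyl_factor)

# Number of cars per cylinder category
table(cyl_factor)
```

Each level of the factor gets its own legend entry and its own discrete color, instead of a position on a gradient.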
Global Aesthetics vs. Constant Overrides
If you want to apply a fixed style to all points (e.g., making every dot green and larger), pass those values as arguments to the geom function, outside aes():
# Correct: Constant overrides belong outside aes()
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "darkgreen", size = 3, alpha = 0.5)
# Incorrect: Do not place constants inside aes()
# ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) + geom_point()
# (This maps color to a dummy category named "blue" and adds a legend for it;
# the points are drawn in the default palette color, not blue!)
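One way to check the difference without relying on the rendered image is layer_data(), which returns the values ggplot2 actually draws with. The sketch below compares the two forms; the variable names are illustrative.

```r
library(ggplot2)

# Constant outside aes(): every point gets the literal colour "blue"
p_constant <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Constant inside aes(): "blue" is treated as a one-level categorical variable
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

# Colours used for drawing in each plot
unique(layer_data(p_constant)$colour)  # the literal "blue"
unique(layer_data(p_mapped)$colour)    # a default palette colour instead
```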
4. Jittering: Handling Overplotting
When many observations share the same coordinates (e.g., integer-valued mileage, or engine sizes rounded to one decimal place), points stack directly on top of each other. This is called overplotting.
To reveal overlapping data points, add random noise using position = "jitter" or the geom_jitter() geom:
# Jittered scatter plot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter", alpha = 0.6)
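Jitter is random, so the plot changes slightly on every run. If that is a problem, position_jitter() accepts width, height, and seed arguments. A hedged sketch follows; the amounts 0.05 and 0.3 and the seed 42 are arbitrary illustrative choices, not recommended defaults.

```r
library(ggplot2)

p_jitter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(
    # Move each point by at most 0.05 horizontally and 0.3 vertically;
    # a fixed seed makes the noise identical on every run
    position = position_jitter(width = 0.05, height = 0.3, seed = 42),
    alpha = 0.6
  )

p_jitter
```

Keeping the jitter small relative to the spacing of the data avoids giving the impression that the values are more precise or more spread out than they are.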
5. Other Key Geoms: Lines, Boxplots, and Bar Charts
Depending on the types of columns (categorical vs. quantitative), you need different geometries:
Line Charts (geom_line)
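Ideal for trends over time or across an ordered continuous variable.
# Line plot of average highway mileage at each engine displacement value
# (several cars share each displ value, so hwy is averaged first;
# a bare geom_line() would zigzag through every individual car)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_line(stat = "summary", fun = mean)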
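The mpg data has no time axis, so for an actual time series here is a sketch using the economics dataset, which is also bundled with ggplot2 but is not otherwise used in this lesson. Its date column holds monthly dates and unemploy holds the number of unemployed people in thousands.

```r
library(ggplot2)

# One row per month, so a plain geom_line() traces a single clean path
p_time <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

p_time
```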
Box Plots (geom_boxplot)
Perfect for comparing the distribution of a continuous variable across different categories:
# Highway mileage distribution by car class
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()
Bar Charts (geom_bar)
By default, geom_bar counts the frequency of categorical items (its default statistical transformation is stat_count):
# Counts of cars in each class category
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue")
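If the counts already exist in a table, geom_col() draws bars from the given heights without counting anything itself. The sketch below assumes dplyr is available (it is part of the tidyverse) and uses an illustrative variable name, class_counts.

```r
library(ggplot2)
library(dplyr)

# Count the cars in each class ahead of time; count() names the result column n
class_counts <- count(mpg, class)

# geom_col() uses the y values as bar heights, with no statistical transformation
p_col <- ggplot(class_counts, aes(x = class, y = n)) +
  geom_col(fill = "steelblue")

p_col
```

The result should look the same as the geom_bar() chart; the difference is only in where the counting happens.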
6. On-the-fly Statistical Summaries
ggplot2 allows you to overlay computed statistics directly onto the chart.
Trend Lines: geom_smooth()
To draw a fitted trend line, add geom_smooth(). By default it fits a smoothed local regression (loess) for a dataset of this size; pass method = "lm" for a straight regression line:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter", alpha = 0.3) +
  geom_smooth() # Adds a smooth trend line with a confidence band
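For comparison, a sketch of the straight-line variant. Setting se = FALSE hides the confidence band, and spelling out formula = y ~ x is optional but avoids the message ggplot2 prints when it picks the formula itself.

```r
library(ggplot2)

p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  # Ordinary least-squares line of hwy on displ, without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_lm
```

A straight line is easier to describe, while the default smoother can reveal curvature that a linear fit would hide.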
Custom Summaries: stat_summary()
Alongside (or instead of) boxplots, you can display custom summary statistics, such as a single large red point marking the mean value of each category:
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  # Add mean points computed on the fly
  stat_summary(fun = mean, geom = "point", size = 3, color = "red")
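To make "computed on the fly" concrete, the sketch below produces the same red points the long way: the means are calculated first with base R's aggregate() and then drawn from that separate table. The name class_means is illustrative.

```r
library(ggplot2)

# Mean highway mileage per class, computed ahead of time
class_means <- aggregate(hwy ~ class, data = mpg, FUN = mean)

p_means <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  # This layer reads from class_means instead of mpg;
  # it inherits the x = class, y = hwy mapping from ggplot()
  geom_point(data = class_means, size = 3, color = "red")

p_means
```

stat_summary() saves the intermediate table, which keeps the summary in step with the plotted data if the data changes.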
Hands-on Exercises
Exercise 1: Fuel Economy Exploration
Use the mpg dataset. Write R code to:
- Initialize a ggplot using the mpg table.
- Create a scatter plot of city mileage (cty on the x-axis) vs. highway mileage (hwy on the y-axis).
- Map the points' color to the car's drive train type (drv column, representing front-wheel, rear-wheel, or 4WD).
- Increase the size of all points to 3 and set transparency (alpha) to 0.7 to reduce overplotting.
Answer:
library(tidyverse)
ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_point(size = 3, alpha = 0.7)
Exercise 2: Comparing Distributions
Compare highway mileage (hwy) across different car drive trains (drv) using boxplots.
Write R code to:
- Create a boxplot chart showing drv on the x-axis and hwy on the y-axis.
- Fill the boxes with colors matching the drive train category (fill = drv).
- Add a dynamic statistical summary layer overlaying a red point for the median value of each drive train. (Hint: Use stat_summary(fun = median, color = "red", size = 2, geom = "point")).
Answer:
library(tidyverse)
ggplot(mpg, aes(x = drv, y = hwy, fill = drv)) +
  geom_boxplot() +
  stat_summary(fun = median, color = "red", size = 2, geom = "point")