Bad Visualization Examples
Why Learn Data Ethics and Visualization Pitfalls?
Data visualization is a double-edged sword. It has the power to simplify complex trends, but it also has the power to mislead, distort, and bias the reader's interpretation.
Sometimes, charts are designed poorly due to mechanical mistakes (like forgetting to treat category numbers as factors). Other times, charts are manipulated intentionally (like starting a bar chart's axis at 80% to exaggerate a minor 2% sales gain).
In data science, we must build charts that are both honest and readable. In this chapter, we will inspect six common visualization mistakes taught in the course and learn how to write correct ggplot2 code to fix them.
1. Truncated Y-Axis (Misleading Heights)
The Mistake
In a bar chart, the height of the bar represents the magnitude of the value. If you truncate the vertical axis (e.g. starting the y-axis at 80 instead of 0), you distort the proportion, making a tiny difference look massive.
library(tidyverse)
# Misleading: Exaggerates differences by starting y-axis at 20
# ggplot(mpg, aes(x = class, y = hwy)) +
# geom_bar(stat = "summary", fun = "mean") +
# coord_cartesian(ylim = c(20, 35)) # BAD! Exaggerates heights
The Rule
- Bar charts must always start at 0 on the value axis.
- Scatter plots or line charts can start at non-zero values if the goal is to examine fine variation, but bar charts are strict.
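As a minimal sketch of a chart that follows this rule, the version below plots the same class means from mpg with geom_col(), which draws every bar from a zero baseline by default. The object name p_zero, the column name mean_hwy, and the axis labels are illustrative choices, not part of the course material.

```r
library(tidyverse)

# Compute the mean highway mileage per vehicle class first,
# then let geom_col() draw each bar from 0 up to that mean.
p_zero <- mpg |>
  group_by(class) |>
  summarise(mean_hwy = mean(hwy)) |>
  ggplot(aes(x = class, y = mean_hwy)) +
  geom_col(fill = "steelblue") +
  labs(x = "Vehicle Class", y = "Mean Highway MPG")

p_zero
```

Because no axis limits are set, the bar heights stay proportional to the values they represent.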
2. Pie Chart Abuse (Visual Overload)
The Mistake
Human eyes are poor at comparing angles and area sizes, especially when a pie chart is split into more than 3 or 4 slices. If you have 10 categories, a pie chart becomes a colorful, unreadable wheel.
# Bad: Pie chart showing many slices
# (Do not use coord_polar on stacked bars with high cardinality!)
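For contrast only, here is a sketch of how such an overloaded pie chart comes about. It assumes the manufacturer column of mpg (15 distinct values) as the high-cardinality category. The object name pie_bad is illustrative.

```r
library(tidyverse)

# Bad on purpose: a single stacked bar bent into a circle.
# With 15 manufacturers, the slices are too many to compare by eye.
pie_bad <- mpg |>
  count(manufacturer) |>
  ggplot(aes(x = "", y = n, fill = manufacturer)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

pie_bad
```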
The Fix
Replace pie charts with sorted horizontal bar charts. Bar charts line up categories along a single straight baseline, allowing the eye to immediately compare lengths:
# Good: Horizontal bar chart sorted by frequency
mpg |>
  count(class) |>
  ggplot(aes(x = reorder(class, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Vehicle Class", y = "Count")
3. Spaghetti Plots (Line Overlap)
The Mistake
Plotting 15 lines of time-series data on a single line chart creates a messy web. The reader cannot track individual trends.
# Bad: Too many lines smudged together
# ggplot(many_countries_data, aes(x = Year, y = GDP, color = Country)) + geom_line()
The Fix
- Faceting: Split the lines into a grid of small subplots using facet_wrap().
- Highlighting: Draw the target line in a bold, vibrant color and draw all other background comparison lines in light grey (color = "grey80").
# Good: Use faceting to split lines cleanly
# ggplot(many_countries_data, aes(x = Year, y = GDP)) +
# geom_line() +
# facet_wrap(~ Country)
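The highlighting approach can be sketched with runnable data as well. This example assumes the txhousing dataset that ships with ggplot2 (monthly housing data for several dozen Texas cities) in place of the hypothetical many_countries_data. The target city "Austin" is an arbitrary choice for illustration. The linewidth argument assumes ggplot2 3.4.0 or later.

```r
library(tidyverse)

# Assumption: "Austin" is an arbitrary city picked for illustration
target <- "Austin"

highlight_plot <- ggplot(txhousing, aes(x = date, y = median, group = city)) +
  # Background: every city drawn in light grey
  geom_line(color = "grey80", na.rm = TRUE) +
  # Foreground: only the city of interest, drawn on top in a strong color
  geom_line(
    data = filter(txhousing, city == target),
    color = "firebrick", linewidth = 1, na.rm = TRUE
  ) +
  labs(x = "Year", y = "Median Sale Price", title = paste("Highlighted:", target))

highlight_plot
```

The grey lines still show the overall spread, while the reader's eye follows a single trend.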
4. Continuous Scales on Categorical Numbers
The Mistake
If you map a category stored as numbers (like cylinder counts 4, 5, 6, 8) to color, ggplot2 treats the column as continuous and draws a gradient legend. The gradient implies that intermediate values such as 7 cylinders exist, which is false for this data.
# Bad: Continuous gradient legend for categorical numbers
ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point(size = 3)
The Fix
Wrap the column in factor() to force discrete category legend groupings:
# Good: Distinct color levels for categories
ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(color = "Cylinders")
5. Misleading Stacked Bar Baselines
The Mistake
Stacked bar charts stack segments on top of each other. Except for the very bottom segment (which rests on the zero line), the baseline for the middle and top segments shifts from bar to bar. This makes it hard to compare their heights accurately.
# Bad: Hard to compare the middle 'class' segments across years
ggplot(mpg, aes(x = factor(year), fill = class)) +
  geom_bar(position = "stack")
The Fix
Use side-by-side bar charts by setting position = "dodge":
# Good: Elements stand next to each other on a shared baseline
ggplot(mpg, aes(x = factor(year), fill = class)) +
  geom_bar(position = "dodge") +
  labs(x = "Year")
6. Double Y-Axes (Implied Associations)
The Mistake
A chart with two separate vertical axes lets the author scale each series arbitrarily. By adjusting the scaling factors, you can make two completely unrelated trends line up, implying a false causal relationship.
The Fix
Draw two separate subplots stacked vertically, sharing a common horizontal axis. This allows the reader to inspect correlation without misleading scaling overlaps.
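One way to sketch this fix, assuming the economics dataset bundled with ggplot2 and two of its columns (unemploy and psavert) chosen purely for illustration, is to reshape the data to long form and facet into a single column. The object name stacked_panels is illustrative.

```r
library(tidyverse)

# Reshape two series into long form, then draw one panel per series.
# ncol = 1 stacks the panels vertically so they share the date axis;
# scales = "free_y" gives each panel its own clearly labelled y scale.
stacked_panels <- economics |>
  select(date, unemploy, psavert) |>
  pivot_longer(-date, names_to = "series", values_to = "value") |>
  ggplot(aes(x = date, y = value)) +
  geom_line() +
  facet_wrap(~ series, ncol = 1, scales = "free_y") +
  labs(x = "Date", y = NULL)

stacked_panels
```

Each series keeps its own axis, so nothing invites the reader to compare the two lines on a shared, hand-tuned scale.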
Hands-on Exercises
Exercise 1: Fixing a Gradient Legend
Identify the mistake in the code below and write the corrected ggplot2 code.
# Messy: color gradient scale for cylinders
ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) + geom_point()
Fix it by treating the cylinder column as a discrete category.
Answer
library(tidyverse)
# Fix: Wrap cyl in factor()
ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point() +
  labs(color = "Cylinder Count")
Exercise 2: Dodging Stacked Bars
The bar chart below stacks drive train types (drv) within each year.
ggplot(mpg, aes(x = factor(year), fill = drv)) + geom_bar(position = "stack")
Write R code to modify this chart, positioning the drive train bars side-by-side (position = "dodge") to allow direct visual comparison of their counts.
Answer
library(tidyverse)
ggplot(mpg, aes(x = factor(year), fill = drv)) +
  geom_bar(position = "dodge") +
  labs(x = "Year", fill = "Drivetrain")