Visualizations

'A picture is worth a thousand words' is more than an idiom. In data analytics, a well-chosen plot often communicates faster and more clearly than prose. A complex relationship between multiple variables can be conveyed with a single plot, and a correlation graph conveys the essence of a relationship more effectively than a lengthy description does. In this lesson, you will explore how to create meaningful visualizations using the Matplotlib and Seaborn libraries.

Simple Visualizations With Matplotlib

Although we tend to use Seaborn for sophisticated visualizations, you can use Matplotlib for simple diagrams or when you need more flexibility with your diagrams. Matplotlib is a low-level visualization library for Python that provides more fine-grained access to functions to help customize your diagrams.

Here are a few examples of simple plots using Matplotlib:

Line charts

A basic line plot just needs the 'x' and/or 'y' numerical values for plotting. If only 'y' is given, the index values of 'y' are plotted on the 'x'-axis. Here is an example where we explicitly define both to show a simple mathematical relationship:

import matplotlib.pyplot as plt
import numpy as np

# Data
x = np.arange(1, 10)
y = x * 2

plt.figure(figsize=(10, 6))
plt.plot(x, y, marker='o', color='#3498db', linewidth=2, linestyle='-')
plt.title('Linear Growth: y = 2x')
plt.xlabel('Input Value (x)')
plt.ylabel('Doubled Value (2x)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In the code above, we first import the `pyplot` module. We set a custom color and marker to make the data points distinct. The title and axis labels clearly explain the relationship shown.
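The y-only form mentioned earlier can be sketched as follows. This is a minimal illustration, not part of the original example; the variable names (line, x_used) are chosen here for demonstration only.

```python
import matplotlib.pyplot as plt
import numpy as np

# Only y is supplied; Matplotlib uses the positions 0, 1, 2, ... as x values.
y = np.arange(1, 10) * 2

fig, ax = plt.subplots(figsize=(6, 4))
(line,) = ax.plot(y, marker='o')
ax.set_xlabel('Index of y')
ax.set_ylabel('y')
ax.set_title('Plotting y only')

# Inspect the x data that Matplotlib generated for the line
x_used = line.get_xdata()
print(x_used)
plt.show()
```

Printing x_used is a quick way to confirm which x positions were assigned when none were passed.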

To change the line style further, check the examples in the 'Styling in Matplotlib' section below.

Histogram

Suppose you want to see the frequency distribution of the continuous values of age in the sample data of Titanic travelers. You could plot a histogram and add a text annotation to highlight the most significant age group:

import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
titanic = sns.load_dataset('titanic')

plt.figure(figsize=(10, 6))
# Capture return values to customize if needed
n, bins, patches = plt.hist(titanic.age.dropna(), bins=20, color='#2ecc71', edgecolor='white', alpha=0.8)

plt.title('Age Distribution of Travelers')
plt.xlabel("Age")
plt.ylabel("Number of Travelers")

# Storytelling: Highlight the peak
plt.text(20, 45, 'Peak: Young Adults\n(20-30 years)', 
         fontsize=12, bbox=dict(facecolor='white', alpha=0.9, edgecolor='gray', boxstyle='round,pad=0.5'))
plt.show()

Output: histogram

Note that we dropped all the NaN values of age by calling dropna() on the age column. By adding a text box, we immediately guide the viewer to the most important part of the distribution. Here we asked for 20 bins, and Matplotlib computed equal-width bin edges across the range of the data; if you omit bins, Matplotlib falls back to its default of 10. You can also pass an explicit sequence of bin edges when you need full control. Refer to the official documentation for details: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html
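If you want to control the bins yourself, one option is to pass a sequence of bin edges instead of a bin count. The sketch below uses randomly generated ages as a hypothetical stand-in for titanic.age.dropna(), so that it runs without downloading the dataset; the 10-year bin width is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical ages standing in for titanic.age.dropna()
rng = np.random.default_rng(seed=0)
ages = rng.uniform(0, 80, size=200)

# Explicit bin edges: eight 10-year-wide bins from 0 to 80
edges = np.arange(0, 81, 10)

plt.figure(figsize=(8, 5))
n, bins, patches = plt.hist(ages, bins=edges, color='#2ecc71', edgecolor='white')
plt.title('Age Distribution With Explicit Bin Edges')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
```

The returned n holds the count per bin and bins echoes the edges that were used, which is handy when you want to annotate a specific bar.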

Bar chart

Suppose you want to see the total number of people who embarked from different towns. You can use different colors for each bar and label the maximum value directly on the chart:

# Explicitly calculating counts
count_by_embark = titanic.groupby('embark_town').size()

# Distinct colors
colors = ['#34495e', '#e74c3c', '#3498db']

plt.figure(figsize=(10, 6))
bars = plt.bar(count_by_embark.index, count_by_embark.values, color=colors)

plt.title('Passenger Count by Embarkation Town')
plt.ylabel("Total Travelers")
plt.xlabel("Embarkation Town")

# Annotate the max
max_val = count_by_embark.max()
max_town = count_by_embark.idxmax()

# Find the bar corresponding to max_town
for rect, label in zip(bars, count_by_embark.index):
    if label == max_town:
        height = rect.get_height()
        plt.text(rect.get_x() + rect.get_width() / 2.0, height + 5, f"{int(height)}",
                 ha='center', va='bottom', fontweight='bold', fontsize=12)

plt.show()

Output: embarked towns

This visual quickly shows that Southampton was by far the busiest embarkation point. The code for the bar chart is very similar to the histogram. Note that you first use the groupby function to group the rows by the distinct values of embark_town, and then use the index of the result for the 'x'-axis and the group sizes for the 'y'-axis. Refer to the documentation for details: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html

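Instead of using groupby, you can also use the value_counts() function on the embark_town column. Here is an alternative way of getting an equivalent chart (the bars are ordered by count rather than alphabetically):

counts = titanic['embark_town'].value_counts()
plt.bar(counts.index, counts.values)
plt.title('Embarked Towns')
plt.ylabel("Total Travelers")
plt.show()

Note that value_counts() excludes NaN by default, so no explicit dropna() is needed, and taking both the labels and the heights from the same Series keeps them aligned. Avoid pairing unique() with value_counts(): unique() returns values in order of appearance (and includes NaN unless you drop it), while value_counts() sorts by descending count, so the labels can end up on the wrong bars or the lengths can mismatch.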
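A small sketch on a hypothetical DataFrame shows how the two counting approaches treat ordering and missing values. The data here is made up for illustration and only mimics the embark_town column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing town
towns = pd.DataFrame({
    'embark_town': ['Southampton', 'Cherbourg', 'Southampton',
                    np.nan, 'Queenstown', 'Southampton']
})

# groupby().size(): NaN keys are dropped, index is sorted alphabetically
by_group = towns.groupby('embark_town').size()

# value_counts(): NaN is dropped, rows are sorted by descending count
by_counts = towns['embark_town'].value_counts()

print(by_group)
print(by_counts)
```

Both results contain the same counts; only the row order differs, which determines the left-to-right order of the bars.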

Scatter plot

Scatter plots are excellent for identifying outliers. For instance, you can plot the fare paid by passengers older than 50 and use an annotation to point out the highest spenders:

subset = titanic[titanic.age > 50]

plt.figure(figsize=(10, 6))
plt.scatter(subset.age, subset.fare, alpha=0.7, color='#9b59b6', edgecolors='white', s=80)
plt.title('Fare vs Age (Passengers > 50)')
plt.ylabel("Fare (£)")
plt.xlabel("Age")

# Highlight wealthy outlier
wealthy = subset[subset.fare > 200]
if not wealthy.empty:
    hf_pt = wealthy.iloc[0]
    plt.annotate('Wealthy Passengers', xy=(hf_pt.age, hf_pt.fare), xytext=(65, 400),
                    arrowprops=dict(facecolor='#34495e', shrink=0.05),
                    fontsize=12, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="#34495e", alpha=0.9))

plt.show()

Output: scatter plot

Subplots in Matplotlib

Plots in Matplotlib are drawn within a 'Figure' object. To create multiple subplots (a grid of charts), we first get a handle on this Figure object using plt.figure().

Here is an example showing how to define a 2x2 layout:

fig = plt.figure(figsize=(10, 8))

for i in range(1, 5):
    # (rows, cols, index)
    ax = fig.add_subplot(2, 2, i)
    ax.text(0.5, 0.5, f'Subplot {i}', ha='center', va='center', fontsize=18, color='gray')
    ax.set_xticks([])
    ax.set_yticks([])

plt.suptitle("Matplotlib Subplots Layout", fontsize=20)
plt.show()

Output: subplots

The arguments 2, 2, 1 for the first subplot tell the figure to create a layout of 2 rows and 2 columns and then assign the first position to this subplot.
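As an alternative sketch, plt.subplots() can create the figure and the whole grid of axes in one call. This is not the approach used in the example above, just another common way to get the same 2x2 layout.

```python
import matplotlib.pyplot as plt

# Create the figure and a 2x2 array of Axes in one call
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

# axes.flat iterates over the grid row by row
for i, ax in enumerate(axes.flat, start=1):
    ax.text(0.5, 0.5, f'Subplot {i}', ha='center', va='center',
            fontsize=18, color='gray')
    ax.set_xticks([])
    ax.set_yticks([])

fig.suptitle('Layout Created With plt.subplots()', fontsize=20)
plt.show()
```

Here axes[0, 0] corresponds to position 1 and axes[1, 1] to position 4 in the add_subplot numbering.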

Now, let's fill these with actual data to tell a complete story about the Titanic dataset:

fig = plt.figure(figsize=(12, 10))

# Subplot 1: Histogram of Age
ax1 = fig.add_subplot(2, 2, 1)
sns.histplot(titanic['age'].dropna(), bins=15, ax=ax1, color='skyblue', kde=True)
ax1.set_title("Age Distribution")

# Subplot 2: Bar Chart of Embarkation
ax2 = fig.add_subplot(2, 2, 2)
counts = titanic['embark_town'].value_counts()
sns.barplot(x=counts.index, y=counts.values, ax=ax2, palette="pastel")
ax2.set_title("Embarkation Points")

# Subplot 3: Box Plot of Age
ax3 = fig.add_subplot(2, 2, 3)
sns.boxplot(x=titanic['age'].dropna(), ax=ax3, color='lightgreen')
ax3.set_title("Age Box Plot")

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

Output: subplotsgraphs

Using tight_layout() adjusts the spacing so that titles and labels don't overlap. This grid gives a comprehensive overview of the passengers' demographics at a glance.

Styling in Matplotlib

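You can specify the color, markers, and line style for your subplots by passing the corresponding keyword arguments (or a compact format string) to plot(). In the snippets below, subplot stands for an Axes object, such as the one returned by fig.add_subplot(). Here are a few examples: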

Given x and y values, to get a dashed green line plot, you could use one of the methods below:

subplot.plot(x, y, 'g--')

subplot.plot(x, y, linestyle='--', color='g')
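To check that the two spellings are equivalent, here is a self-contained sketch that draws both side by side. The x and y values are arbitrary sample data, and the names left and right are chosen here for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 10)
y = x * 2

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Format string: 'g' = green, '--' = dashed
(line_a,) = left.plot(x, y, 'g--')
left.set_title("Format string 'g--'")

# The same style spelled out with keyword arguments
(line_b,) = right.plot(x, y, linestyle='--', color='g')
right.set_title("linestyle='--', color='g'")

plt.show()
```

The keyword form is longer but tends to be easier to read once you combine several style options.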

Here is another example to change the marker to 'x' and the line color to blue with a regular line:

import pandas as pd
import numpy as np
plt.plot(np.random.randn(30).cumsum(), 'bx-')
plt.show()

The plot above can also be drawn by setting the individual attributes for marker, linestyle, color, etc., as shown below:

plt.plot(np.random.randn(30).cumsum(), color='b', linestyle='solid', marker='x')
plt.show()

The randn(30) function returns 30 sample random numbers from a standard normal distribution, which contains both positive and negative floating-point numbers. The cumsum() function derives the cumulative sum of these random numbers. You can also drop cumsum() and just plot the random numbers, in which case the y-axis will have the random value and the x-axis will have the index position of the number.

To apply grid lines for both the 'x' and 'y' axes, you can use the function:

plt.grid()

To change the font size of the axis labels, you can set the fontsize keyword argument when invoking the xlabel() and ylabel() functions.
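A minimal sketch combining grid lines with larger axis labels might look like this. The data and the size of 14 points are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 10)

plt.figure(figsize=(6, 4))
plt.plot(x, x * 2, marker='o')

# Grid lines on both axes
plt.grid()

# Larger axis labels via the fontsize keyword argument
plt.xlabel('Input Value (x)', fontsize=14)
plt.ylabel('Doubled Value (2x)', fontsize=14)

# Keep a handle on the current Axes for later inspection
ax = plt.gca()
plt.show()
```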

Simple Visualizations With Pandas

Pandas DataFrame and Series objects come with a built-in .plot() method that makes creating charts very convenient, using Matplotlib under the hood.

Using value_counts() on a Series object

You can directly apply the value_counts() function to a Series containing categorical data to get an aggregation of individual categories. Then, simply call .plot():

# 'rot=0' keeps the x-axis labels horizontal
titanic['embark_town'].value_counts().plot(
    kind='bar', 
    color='#9b59b6', 
    title='Frequency Distribution of Embark Town',
    rot=0
)
plt.ylabel('Frequency')
plt.show()

Using groupby result

You can also plot the result of a groupby operation. For example, let's see the survival rate across different age groups (created via binning):

# Create meaningful age groups
survived_by_age_group = titanic.groupby(
    pd.cut(titanic['age'], bins=[0, 18, 35, 60, 80], labels=['Child', 'Young Adult', 'Adult', 'Senior']),
    observed=False  # keep every age group, even if one turns out to be empty
)['survived'].mean()

# Plot
ax = survived_by_age_group.plot(kind='bar', color='#34495e', rot=0, width=0.7)
plt.title('Survival Rate by Age Group')
plt.ylabel('Survival Rate')
plt.ylim(0, 1) # Set y-axis limit to 100%
plt.grid(axis='y', linestyle='--', alpha=0.5)

# Add value labels
for p in ax.patches:
    ax.annotate(f"{p.get_height():.2f}", (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', xytext=(0, 5), textcoords='offset points', fontweight='bold')

plt.show()

Output: survival chart

This clearly shows that children had a higher survival probability than seniors. Note that we can pass Matplotlib styling arguments like color and rot directly to the Pandas function.
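To see how the binning step assigns ages to groups, here is a small sketch using a handful of hand-picked ages (hypothetical values chosen to sit on and around the bin edges). With the default settings of pd.cut, each bin includes its right edge, so an age of exactly 18 falls in 'Child'.

```python
import pandas as pd

# Hand-picked ages on and around the bin edges
ages = pd.Series([5, 18, 19, 35, 36, 60, 61, 80])

groups = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 80],
    labels=['Child', 'Young Adult', 'Adult', 'Senior'],
)

print(pd.DataFrame({'age': ages, 'group': groups}))
```

Ages of 0 or above 80 would fall outside every bin and come back as NaN, so widen the edges if your data needs it.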

Bar chart with Pareto diagram

A Pareto chart combines bars for frequency and a line for cumulative percentage, helping you identify the "vital few" categories:

s1 = titanic.embark_town.value_counts(normalize = True) * 100
fig, ax = plt.subplots(figsize=(10, 6))

# Bar plot for relative frequency
s1.plot(kind='bar', ax=ax, color='#f1c40f', edgecolor='white', alpha=0.8, rot=0)
ax.set_ylabel('Relative Frequency (%)')
ax.grid(False)

# Line plot for cumulative percentage
s2 = s1.cumsum()
ax2 = ax.twinx()
ax2.plot(s2.index, s2.values, color='#c0392b', marker='D', linewidth=2, markersize=8)
ax2.set_ylabel('Cumulative Relative Frequency (%)')
ax2.set_ylim(0, 115)
ax2.grid(False)

# Annotate values on the line
for index, value in enumerate(s2):
    ax2.text(index, value + 3, f"{value:.0f}%", color='#c0392b', ha='center', fontweight='bold',
             bbox=dict(facecolor='white', edgecolor='none', alpha=0.7))

plt.title('Pareto Analysis of Embarkation', pad=20)
plt.show()

In the example above, you see that we have created a dual-axis chart. The cumsum() function computes the cumulative sum of the frequency values, which is the hallmark of a Pareto chart.

Note: A Pareto chart (named after Vilfredo Pareto) highlights the 80-20 rule, which asserts that 80% of outcomes often result from 20% of the causes. In our dataset, we can quickly see that Southampton accounts for over 70% of the travelers.
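The arithmetic behind the two Pareto series can be sketched without any plotting. The counts below are illustrative values assumed for this example rather than read from the dataset.

```python
import pandas as pd

# Illustrative counts per embarkation town (assumed values)
counts = pd.Series({'Southampton': 644, 'Cherbourg': 168, 'Queenstown': 77})

# Relative frequency in percent, largest category first (the bars)
share = counts.sort_values(ascending=False) / counts.sum() * 100

# Running total of those percentages (the line)
cumulative = share.cumsum()

print(pd.DataFrame({
    'share_pct': share.round(1),
    'cumulative_pct': cumulative.round(1),
}))
```

Sorting before cumsum() matters: the cumulative line only reads as a Pareto curve when the categories are in descending order.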

Stacked bar

To show the total number of survivors vs. non-survivors across each of the embark towns, you can do a groupby on both the embark_town and survived columns.

# Unstack creates a DataFrame where 'survived' values become columns
unstacked = titanic.groupby(['embark_town', 'survived']).size().unstack()
unstacked.columns = ['Died', 'Survived']

# Plot
unstacked.plot(kind='bar', stacked=True, color=['#e74c3c', '#2ecc71'], rot=0, edgecolor='white')
plt.title('Survival Count by Embarkation Town')
plt.ylabel('Count')
plt.show()

To show the same graph horizontally, you would change the kind to barh. This is often better for long labels.

unstacked.plot(kind='barh', stacked=True, color=['#e74c3c', '#2ecc71'])
plt.title('Survival Count by Embarkation Town (Horizontal)')
plt.xlabel('Count')
plt.show()
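To make the reshaping step concrete, here is a sketch of groupby followed by unstack on a tiny hypothetical DataFrame. The fill_value=0 argument is an addition made here so that missing combinations show up as 0 instead of NaN; the example above does not use it.

```python
import pandas as pd

# Hypothetical passengers
df = pd.DataFrame({
    'embark_town': ['Southampton', 'Southampton', 'Cherbourg',
                    'Cherbourg', 'Southampton', 'Queenstown'],
    'survived':    [0, 1, 1, 1, 0, 0],
})

# Long form: one row per (town, survived) pair that occurs
long_counts = df.groupby(['embark_town', 'survived']).size()

# Wide form: towns as rows, survived values (0 and 1) as columns
wide_counts = long_counts.unstack(fill_value=0)

print(long_counts)
print(wide_counts)
```

The wide form is what a stacked bar chart needs: one bar per row, one stacked segment per column.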

Apply plot function on a numerical column

You can directly apply the plot function to any of the numerical columns to get a distribution. Here is an example with pclass:

# Bins are set to center the bars on the integer values
titanic["pclass"].plot(
    kind='hist', 
    bins=[0.5, 1.5, 2.5, 3.5], 
    rwidth=0.8, 
    color='#8e44ad', 
    edgecolor='white'
)
plt.title('Distribution of Passenger Class')
plt.xlabel('Class')
plt.xticks([1, 2, 3])
plt.show()

Output: passenger class histogram

Question: Is this truly a numerical value that should be shown as a histogram? Not really! Even though the column holds numbers, pclass is an ordinal categorical variable, so a bar chart is usually a better representation. With custom bins, as above, a histogram can still be made to work.

Another numerical example that is indeed a candidate for a 'histogram' is age:

titanic["age"].plot(kind='hist', bins=20, color='#1abc9c', edgecolor='white')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.show()

Output: age histogram

Note that Pandas automatically handles the NaN values by excluding them, unlike standard Matplotlib which would require explicit dropping.

Missing values are dropped, left out, or filled depending on the plot type. The way Pandas handles missing values is as follows:

Plot Type	NaN Handling
Line	Leave gaps at NaNs
Line (stacked)	Fill 0’s
Bar	Fill 0’s
Scatter	Drop NaNs
Histogram	Drop NaNs (column-wise)
Box	Drop NaNs (column-wise)
Area	Fill 0’s
KDE	Drop NaNs (column-wise)
Hexbin	Drop NaNs
Pie	Fill 0’s

If any of these defaults are not what you want, or if you want to be explicit about how missing values are handled, consider using fillna() or dropna() before plotting.
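Being explicit about missing values before plotting could look like the sketch below. The Series is a made-up example, and the side-by-side line plots are only one way to compare the two choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data with two missing values
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()   # remove the missing entries
filled = s.fillna(0)   # replace the missing entries with 0

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
dropped.plot(kind='line', marker='o', ax=ax1, title='After dropna()')
filled.plot(kind='line', marker='o', ax=ax2, title='After fillna(0)')
plt.show()
```

The two charts tell different stories from the same data, which is why it is worth deciding on the handling deliberately rather than relying on the default.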

Official reference

* https://pandas.pydata.org/pandas-docs/stable/visualization.html

It is always good practice to have a title, x-axis, and y-axis labels in your chart. You can set all these and also change many other visual aids. Please refer to the full documentation on plots: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html

Apply box plot function on a numerical column

The box plot (also known as a box and whisker diagram) is a standardized way of displaying the distribution of data. It shows the median, quartiles, and outliers.

titanic.boxplot(
    column='age', 
    vert=False, 
    grid=False,
    patch_artist=True,
    boxprops=dict(facecolor='#f39c12', color='#d35400'),
    medianprops=dict(color='black')
)
plt.title('Age Spread')
plt.yticks([]) # Hide y-axis as we are plotting a single variable
plt.show()

Output: age box plot

Apply pie chart function on categorical values

If you have 2 to 4 unique categories, you can also create a pie chart. Let's visualize the survival proportion:

separate = [0, 0.05]
colors = ['#ff9999','#66b3ff']

titanic['alive'].value_counts().plot.pie(
    explode=separate, 
    colors=colors,
    autopct=lambda p: '{:.1f}%'.format(p),
    shadow=True, 
    startangle=90,
    textprops={'fontsize': 14},
    ylabel='',
    figsize=(6, 6)
)
plt.title('Survival Proportion', fontsize=16)
plt.show()

Output: survival pie chart

Note that the autopct setting enables you to display the percent value using a Python string formatter applied through a lambda function. We also added a shadow for a 3D effect.

To move the label coordinate

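Sometimes the wedge labels overlap the axis label (the text that Pandas places on the y-axis of the pie). In that case, you can move the label. In this example, pie is the Axes object returned by .plot.pie():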

pie.yaxis.set_label_coords(-0.25, 0.5)

The coordinates are interpreted in axes coordinates by default, where (0, 0) is the bottom left of the axes and (1, 1) is the top right. So (-0.25, 0.5) places the label to the left of the axes, vertically centered.
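A self-contained sketch of moving the label might look like the following. The counts are illustrative values assumed for this example, and the label text is set explicitly with set_ylabel() so the sketch does not depend on how a given Pandas version labels pie charts.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative counts (assumed values)
shares = pd.Series({'no': 549, 'yes': 342}, name='alive')

# plot.pie() returns the Axes object the pie was drawn on
pie = shares.plot.pie(autopct='%1.1f%%', startangle=90, figsize=(6, 6))

# Set the y-axis label explicitly, then move it left of the axes
pie.set_ylabel('alive')
pie.yaxis.set_label_coords(-0.25, 0.5)

plt.show()
```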

Making a Donut Chart

You can add a white circle of the required dimension in the middle of the pie to make it a donut chart, which is often considered more modern and readable:

import matplotlib.pyplot as plt

separate = [0, 0.05]
colors = ['#ff9999','#66b3ff']

plt.figure(figsize=(7, 7))

ax = titanic['alive'].value_counts().plot.pie(
            explode=separate,
            colors=colors,
            autopct=lambda p: '{:.1f}%'.format(p),
            pctdistance=0.80,
            startangle=90,
            textprops={'fontsize': 14, 'weight': 'bold'},
            labels=None, # Hide labels to use legend
            ylabel=''
)

# Create the "donut hole"
centre_circle = plt.Circle((0, 0), 0.60, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
  
plt.title('Alive vs Dead (Donut)', fontsize=18)
plt.legend(labels=titanic['alive'].value_counts().index, loc="center", fontsize=12)
plt.show()

Note that we move the pctdistance to push the percentages towards the edge of the wedges and add a legend in the center for a clean look.

Variations

You can also pass kind='pie' or kind='bar' as a keyword argument to the plot() function to get a pie or bar chart, respectively.

Labels and label orientations

You can change the x-ticks, y-ticks, rotate the ticks, etc., using the subplot object. When you use Pandas to create diagrams, the subplot object is returned by default. You can save this returned object to then tweak the labels.

Change the xtick labels and rotate them

# Sort index to ensure order is 1, 2, 3
ax = titanic['pclass'].value_counts().sort_index().plot(
    kind='bar', color='#2980b9', rot=0
)

# Customize tick labels
ax.set_xticklabels(['1st Class', '2nd Class', '3rd Class'], rotation=30, ha='right')
plt.title('Passenger Class Count')
plt.ylabel('Count')
plt.show()

In this example, we rotated the labels by 30 degrees and aligned them to the right for better readability.

Change the xticks

ax = titanic['age'].plot(
    kind='hist', bins=20, color='#27ae60', edgecolor='white'
)
# Manually set the x-ticks
ax.set_xticks([0, 20, 40, 60, 80])
plt.title('Age Distribution (Custom X-Ticks)')
plt.show()

Change the bin size and the xticks

ax = titanic['age'].plot(
    kind='hist', bins=5, color='#c0392b', edgecolor='white'
)
ax.set_xticks([0, 20, 40, 60, 80])
plt.title('Age Distribution (5 Bins)')
plt.show()

Change the x-label and y-label

While you can set the 'title', 'x-label', and 'y-label' directly from the Pandas API, you can also set them directly on the subplots themselves. Here is an example:

ax = titanic['age'].plot(kind='hist')
ax.set(xlabel='Age', ylabel='Count', title='Age Distribution')
plt.show()

Note that there are two ways of setting the x and y labels on the Axes object: you can call set_xlabel() and set_ylabel() individually, or you can pass them as keyword arguments to set(), as in the example above.
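The two approaches can be compared side by side in a short sketch. The ages are randomly generated stand-ins for the age column, and the names ax_a and ax_b are chosen here for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Random ages standing in for titanic['age']
rng = np.random.default_rng(seed=1)
ages = rng.uniform(0, 80, size=100)

fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: one setter call per property
ax_a.hist(ages)
ax_a.set_xlabel('Age')
ax_a.set_ylabel('Count')
ax_a.set_title('Age Distribution')

# Option 2: a single set() call with keyword arguments
ax_b.hist(ages)
ax_b.set(xlabel='Age', ylabel='Count', title='Age Distribution')

plt.show()
```

Individual setters accept extra styling arguments such as fontsize, while set() is more compact when you only need the text.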

References:

Here is an article on a few more tips and tricks on Matplotlib: https://towardsdatascience.com/all-your-matplotlib-questions-answered-420dd95cb4ff