Titanic Workshop

Welcome to the Titanic Workshop! This hands-on session is designed to help you apply the concepts you've learned so far to a real-world dataset.

The Titanic dataset is one of the most famous datasets in data science. It contains information about the passengers on the ill-fated voyage, including their age, gender, passenger class, and whether they survived. In this workshop, you will use Pandas, Seaborn, and Matplotlib to explore this data and uncover interesting patterns through analysis and visualization.

Quick Recap: What is a module?

A module is a file of Python code that you can import and reuse in your own programs. By importing libraries such as pandas and seaborn, you gain access to powerful tools for data manipulation and visualization without having to write everything from scratch.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
titanic_df = sns.load_dataset("titanic")
titanic_df.head(5)

In the code block above, we import three modules under short aliases: Pandas (as pd), Seaborn (as sns), and Matplotlib's pyplot (as plt). Together they provide many convenient functions for data analysis and visualization.
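The import-and-alias pattern works the same way for any module. As a minimal sketch, here it is with the standard-library statistics module; the alias stats and the sample numbers are illustrative choices, not part of the workshop dataset.

```python
# Import a standard-library module under a shorter alias,
# the same pattern as "import pandas as pd".
import statistics as stats

# Made-up sample values, used only to show the call syntax.
sample_ages = [22, 38, 26, 35]

# Functions are reached through the alias: alias.function(...)
average_age = stats.mean(sample_ages)
print(average_age)
```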

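We call Seaborn's load_dataset function to load the Titanic dataset. Seaborn downloads the file the first time you run it, so this step needs an internet connection. The function returns a pandas DataFrame, which we save in the variable titanic_df. A DataFrame is a table-like structure with labeled rows and columns.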
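To get a feel for the table-like structure, the sketch below builds a tiny DataFrame by hand and inspects it. The three rows and the name toy_df are made up for illustration; the real Titanic data has many more rows and columns.

```python
import pandas as pd

# Three made-up rows, used only to show the table-like structure.
toy_df = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "age": [22.0, 38.0, 26.0],
        "survived": [0, 1, 1],
    }
)

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column names
print(toy_df.dtypes)         # data type of each column
```

The same three attributes (shape, columns, dtypes) should also work on titanic_df once it is loaded.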

Exploratory Functions

One useful function is head(), which takes a number n and returns the first n rows. If you leave n out, it returns the first 5.

Can you print the first 10 records of this data?

The tail() function

Complementary to head() is the tail() function.

Can you apply the tail() function in a new code block and see what it returns?
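If you want to check your understanding of head() and tail() on something smaller first, here is a sketch on a made-up table whose rows are simply numbered 0 to 7. The name toy_df and its row_id column are hypothetical and exist only for this illustration.

```python
import pandas as pd

# A made-up table with 8 numbered rows so the selected rows are easy to spot.
toy_df = pd.DataFrame({"row_id": range(8)})

first_three = toy_df.head(3)   # rows 0, 1, 2
last_three = toy_df.tail(3)    # rows 5, 6, 7
default_head = toy_df.head()   # n defaults to 5

print(first_three)
print(last_three)
```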

Accessing Columns

To get the values for a specific column, enclose the column name in quotes inside square brackets:

titanic_df["age"]

Note: NaN stands for "Not a Number." Pandas uses it to mark missing values, so a NaN in the age column means that passenger's age is missing from the dataset.
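One way to see how many values are missing is to combine isna() with sum(). The sketch below uses a made-up three-value Series rather than the real age column; on the workshop data, the same calls should work on titanic_df["age"].

```python
import pandas as pd

# Made-up ages with one missing entry.
ages = pd.Series([22.0, float("nan"), 38.0], name="age")

missing_mask = ages.isna()           # True where the value is missing
n_missing = int(missing_mask.sum())  # True counts as 1, so this counts the NaNs
known_ages = ages.dropna()           # a copy without the missing entries

print(n_missing)
print(known_ages.tolist())
```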

Can you get the values for the embark_town column or any other column?

Calculating the Mean

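To find the mean age of the passengers, use the mean() function. By default it skips missing (NaN) values, so the result is the mean over the passengers whose age is known: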

mean_age = titanic_df["age"].mean()
print(f"The mean age is: {mean_age:.2f}")

Other functions include median(), max(), and min().
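These functions all follow the same calling pattern. The sketch below applies them to a made-up Series that contains one NaN, to show that each of them ignores missing values by default; the numbers are illustrative, not taken from the Titanic data.

```python
import pandas as pd

# Made-up ages; the NaN is skipped by every function below.
ages = pd.Series([22.0, float("nan"), 38.0, 30.0], name="age")

summary = {
    "mean": ages.mean(),      # (22 + 38 + 30) / 3
    "median": ages.median(),  # middle of 22, 30, 38
    "max": ages.max(),
    "min": ages.min(),
}
print(summary)
```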

Can you use these functions to find the ages of the oldest and youngest passengers on the ship?

The nlargest() function

To find the top n largest values, use nlargest():

titanic_df["age"].nlargest(4)

This returns the specified number of values from the age column, sorted from largest to smallest. The complementary function is nsmallest().
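Called on a single column, nlargest() returns only the values from that column. If you also want to see the rest of each row, a DataFrame has its own nlargest() and nsmallest() that take the column name as a second argument. The sketch below compares the two on a made-up four-row table; toy_df and its contents are hypothetical.

```python
import pandas as pd

# Four made-up passengers.
toy_df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "age": [22.0, 71.0, 4.0, 38.0],
    }
)

# On a single column (a Series): returns only the ages.
oldest_ages = toy_df["age"].nlargest(2)

# On the whole DataFrame: keeps the other columns for the selected rows.
oldest_rows = toy_df.nlargest(2, "age")
youngest_rows = toy_df.nsmallest(2, "age")

print(oldest_ages)
print(oldest_rows)
print(youngest_rows)
```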

Can you find the 10 oldest and 10 youngest passengers on the Titanic?

The value_counts() function

The value_counts() function returns how many times each unique value appears in a column, sorted from most to least frequent. Missing values are excluded by default.

titanic_df["embark_town"].value_counts()

This function can also be applied to numeric columns like age, where it counts each exact age separately. Apply it to age to see which age was the most common.
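To count age groups rather than exact ages, value_counts() accepts a bins argument that splits a numeric column into equal-width ranges; a normalize argument returns proportions instead of counts. The sketch below shows both on made-up values, so the bin edges and shares are illustrative only.

```python
import pandas as pd

# Made-up ages, used only to demonstrate binning.
ages = pd.Series([2, 8, 19, 22, 25, 31, 47, 63], name="age")

# Three equal-width bins instead of one row per exact age.
age_groups = ages.value_counts(bins=3)

# Made-up embarkation towns; normalize=True returns shares that sum to 1.
towns = pd.Series(["Southampton", "Southampton", "Cherbourg", "Queenstown"])
town_share = towns.value_counts(normalize=True)

print(age_groups)
print(town_share)
```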

The plot() function

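The plot() function draws a chart from a column or a DataFrame, and its kind argument selects the chart type. Two common charts are the histogram, for the distribution of numeric values, and the bar chart, for counts per category.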

For the age column, we can apply a histogram:

titanic_df["age"].plot(kind="hist", title="Age Distribution")
plt.show()

For the value_counts() of categorical values, we can apply a bar chart:

titanic_df["embark_town"].value_counts().plot(kind="bar", title="Embarkation Points")
plt.show()
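plot() returns a Matplotlib Axes object, which you can use to label the chart. The sketch below adds axis labels and a bin count to a histogram of made-up ages. The "Agg" backend line is an assumption for running outside a notebook; inside a notebook you can drop it and call plt.show() as in the examples above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages, used only to demonstrate labelling.
ages = pd.Series([2, 8, 19, 22, 25, 31, 47, 63], name="age")

# bins controls how many bars the histogram has.
ax = ages.plot(kind="hist", bins=4, title="Age Distribution (toy data)")

# The returned Axes object lets us set the axis labels.
ax.set_xlabel("Age (years)")
ax.set_ylabel("Number of passengers")

# In a notebook, call plt.show() here to display the figure.
```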

Advanced Visualizations

1. Survival Proportion (Pie Chart)

A pie chart is a great way to show the proportion of passengers who survived versus those who did not.

titanic_df["alive"].value_counts().plot.pie(
    autopct="%1.1f%%",
    shadow=True,
    colors=["#ff9999", "#66b3ff"],
    title="Survival Proportion",
)
plt.show()

2. Survival by Class (Stacked Bar Chart)

This visualization shows how many passengers in each class survived and how many did not, which makes it easy to compare survival across passenger classes.

# Grouping by class and survival status
class_survival = titanic_df.groupby(["pclass", "alive"]).size().unstack()
class_survival.plot(
    kind="bar", stacked=True, color=["#e74c3c", "#2ecc71"], title="Survival by Class"
)
plt.ylabel("Number of Passengers")
plt.show()
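The stacked bars above show counts. To compare survival rates, one option is to divide each row of the counts table by its row total. The sketch below does this on a made-up nine-row table that reuses the pclass and alive column names; the resulting rates are illustrative and say nothing about the real data.

```python
import pandas as pd

# Nine made-up passengers.
toy_df = pd.DataFrame(
    {
        "pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3],
        "alive": ["yes", "yes", "no", "yes", "no", "no", "no", "no", "yes"],
    }
)

# Same grouping as above: one row per class, one column per outcome.
counts = toy_df.groupby(["pclass", "alive"]).size().unstack(fill_value=0)

# Divide each row by its total so every row sums to 1.
rates = counts.div(counts.sum(axis=1), axis=0)

print(counts)
print(rates)
```

Passing rates instead of counts to plot(kind="bar", stacked=True) should then give bars of equal height, so the survived share can be compared directly between classes.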

3. Fare vs. Age (Scatter Plot)

Scatter plots help identify patterns or outliers, such as very high fares paid by certain age groups.

titanic_df.plot.scatter(
    x="age", y="fare", alpha=0.5, color="purple", title="Fare vs Age"
)
plt.show()