Queries
So far, you have learned some fundamental basic operations on DataFrames. A DataFrame is indeed very powerful due to its ability to run complex queries with a syntax similar to SQL. In this lesson, you will learn some of these queries.
In the previous lesson, you loaded the Titanic dataset and derived summary statistics. You will use the same data to derive more insight.
Simple Queries
Get Unique Values
Are you curious about where all the passengers embarked on the sinking ship? Here is the query to find out:
import pandas as pd
titanic_df = pd.read_csv(
"https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/titanic.csv"
)
print(titanic_df["embark_town"].unique())
Output:
['Southampton' 'Cherbourg' 'Queenstown' nan]
To run this query, we invoked the unique() function on the embark_town column.
Note that titanic_df['embark_town'] is a Series object.
Invoking unique() on the Series returned only the unique values in the embark_town column. Notice that there is also a nan value, which stands for
Not a Number.
There is another way of achieving the same result. You can use the dot operator on the DataFrame, providing the column name instead of enclosing the column name within square brackets. Here is the revised code:
print(titanic_df.embark_town.unique())
Output:
['Southampton' 'Cherbourg' 'Queenstown' nan]
This also gives us the same result.
Get Unique Count
Using nunique() gives you the unique value count, ignoring nan values. So, the statement below gives you a value of 3:
titanic_df.embark_town.nunique()
Get Record Using Min/Max of a column
If you want to get the complete record of the passenger who paid the maximum fare, you could run a query like the one below:
titanic_df.loc[titanic_df.fare.idxmax()]
idxmax() is the equivalent of argmax() in NumPy and gives you the index number of the row that contains the maximum value of the given column.
In the example above, you could have also used iloc instead of loc since this DataFrame has a numerical index.
If you want to get the complete records of both the maximum and minimum fare passengers, you could do so like this:
titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()]]
Notice the double square brackets. The inner bracket references the row indexes. By leaving the column indexes empty, you get all the columns.
But if you only want to see certain columns, you could add the column information as shown below:
titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()], "fare"]
This gets you two rows of only the fare column. You can see multiple columns by adding the column names as a list, as shown below:
titanic_df.loc[
[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()], ["pclass", "who", "fare"]
]
Filter Queries
You can filter columns, rows, or values in a column using a number of techniques. Here, you will see a few.
Filter Columns
Select all columns that start with 's' and get a new DataFrame of the selected columns.
titanic_df.filter(regex="^s", axis=1)
Changing axis=0 applies the same regex to the row labels.
You can also use like to match a partial string, as shown below:
titanic_df.filter(like="s", axis=1)
The query above returns a DataFrame with all columns that contain 's' in their name.
Select all columns that are of a specific dtype:
titanic_df.select_dtypes(include="float64")
Filter Numerical Values - Range If you want to get only the passengers who are aged between 22 and 24, you could run a query like the one below:
titanic_df[titanic_df.age.between(22, 24)]
Filter String Values
If you want to filter based on a specific value contained in a column, you can use the expressions as shown below:
titanic_df[
titanic_df.embark_town == "Cherbourg"
] # gets all records that match a value equal to 'Cherbourg'
titanic_df[
titanic_df.embark_town != "Cherbourg"
] # gets all records that do not match a value equal to 'Cherbourg'
You can use the contains() function on the str accessor to match full or partial strings of a given pattern.
titanic_df[
titanic_df.embark_town.str.contains("amp", na=False)
] # gets all records that contain 'amp' anywhere in the string value
titanic_df[
~titanic_df.embark_town.str.contains("amp", na=False)
] # gets all records that do not contain 'amp' anywhere in the string value
Note: The expression above will fail for NaN values. Ensure all NaN values are imputed before running the statement above.
If you want to match a regex pattern, you can still use contains() on str by passing in the regex pattern:
titanic_df[titanic_df.embark_town.str.contains(r"ampt[h-p]n", na=False)]
Filter String Values using a List
If you want to filter values in a categorical column, you could use the isin() function on the column, as shown below:
titanic_df[titanic_df.embark_town.isin(["Southampton", "Cherbourg"])]
The statement above returns a DataFrame that contains records of only those passengers who embarked from 'Southampton' or 'Cherbourg'.
Filter String Values by Exclusion using a List
If you want to exclude the rows that belong to a particular set of values, you could apply the ~ operator to the expression above, as shown below:
titanic_df[~titanic_df.embark_town.isin(["Southampton", "Cherbourg"])]
The statement above will get you all records that do not belong to Southampton or Cherbourg.
Using Logical AND Operators
In the last lesson, you learned that the youngest traveler was 0.42 years old. Are you curious to find out how many children below the age of 1 survived the tragedy? Let's build a query to find out:
df = titanic_df[(titanic_df.age < 1) & (titanic_df.survived == 1)]
print(df)
Output:
survived pclass sex age sibsp parch fare embarked class \
78 1 2 male 0.83 0 2 29.0000 S Second
305 1 1 male 0.92 1 2 151.5500 S First
469 1 3 female 0.75 2 1 19.2583 C Third
644 1 3 female 0.75 2 1 19.2583 C Third
755 1 2 male 0.67 1 1 14.5000 S Second
803 1 3 male 0.42 0 1 8.5167 C Third
831 1 2 male 0.83 1 1 18.7500 S Second
who adult_male deck embark_town alive alone
78 child False NaN Southampton yes False
305 child False C8 Southampton yes False
469 child False NaN Cherbourg yes False
644 child False NaN Cherbourg yes False
755 child False NaN Southampton yes False
803 child False NaN Cherbourg yes False
831 child False NaN Southampton yes False
The query above is quite intuitive to understand. You chain the necessary conditions with the & operator. Since we want all children below one year, we start with the query titanic_df.age < 1 and then add the next query to find out how many survived: titanic_df.survived == 1. We combine both of these queries using the AND operator &. The results of this query show us that there were a total of 7 children who survived the tragedy in this sample.
Also, note that the result of this query is also a DataFrame, so you can run describe() or info() on it to find more information.
You can chain as many such queries as you want; there is no limit.
Complex Logical OR/AND Operations
Now, let's find out how many of the travelers were children under the age of 1 and seniors over the age of 60. Let's build a query to do that. But this time, we do not want to see the full results, but instead, only the total count of the number of rows returned.
total_rows = titanic_df[
((titanic_df.age < 1) | (titanic_df.age > 60)) & (titanic_df.survived == 1)
].shape[0]
print(total_rows)
Output:
12
In this query, we joined the two separate age queries with the OR operator using the | symbol. Then, we joined the combined age query with the survived query from the previous example with an AND operator.
groupby function
Sometimes you want to group rows based on a certain value in a specific column. For example, let's say you want to "find the number of passengers from each embark town."
That is when the groupby() function comes in handy. Although the count() function could get us the same result for this specific example, you will soon see in the advanced groupby lesson that you can extend this concept to solve more complex problems.
The groupby() function groups rows that have the same values, and then we apply the size() function to find the total size of each group, as shown below:
groups = titanic_df.groupby(["embark_town"])
print(type(groups))
print(groups.size())
Output:
<class 'pandas.core.groupby.generic.DataFrameGroupBy'> embark_town Cherbourg 168 Queenstown 77 Southampton 644 dtype: int64
The groupby() function will collect all the records in each embark town and create a new DataFrameGroupBy object.
If you are now curious to find the count of the number of 'males' vs. 'females' who came from each town, you would add the 'sex' column to your groupby. Here is the modified query:
groups = titanic_df.groupby(["embark_town", "sex"])
print(groups.size())
Output:
embark_town sex
Cherbourg female 73
male 95
Queenstown female 36
male 41
Southampton female 203
male 441
dtype: int64
To find the percentage of each group compared to the total of the sample, you could run the command below:
100 * groups.size() / len(titanic_df)
Output:
embark_town sex
Cherbourg female 8.193042
male 10.662177
Queenstown female 4.040404
male 4.601571
Southampton female 22.783389
male 49.494949
dtype: float64
More advanced groupby functions are in another chapter.
Pandas Reference: http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.describe.html