Queries

So far, you have learned some fundamental basic operations on DataFrames. A DataFrame is indeed very powerful due to its ability to run complex queries with a syntax similar to SQL. In this lesson, you will learn some of these queries.

In the previous lesson, you loaded the Titanic dataset and derived summary statistics. You will use the same data to derive more insight.

Simple Queries

Get Unique Values

Are you curious about where all the passengers embarked on the sinking ship? Here is the query to find out:

import pandas as pd

titanic_df = pd.read_csv(
    "https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/titanic.csv"
)
print(titanic_df["embark_town"].unique())

Output:

['Southampton' 'Cherbourg' 'Queenstown' nan]

To run this query, we invoked the unique() function on the embark_town column. Note that titanic_df['embark_town'] is a Series object. Invoking unique() on the Series returned only the unique values in the embark_town column. Notice that there is also a nan value, which stands for Not a Number.

There is another way of achieving the same result. You can use the dot operator on the DataFrame, providing the column name instead of enclosing the column name within square brackets. Here is the revised code:

print(titanic_df.embark_town.unique())

Output:

['Southampton' 'Cherbourg' 'Queenstown' nan]

This also gives us the same result.

Get Unique Count

Using nunique() gives you the unique value count, ignoring nan values. So, the statement below gives you a value of 3:

titanic_df.embark_town.nunique()

Get Record Using Min/Max of a column

If you want to get the complete record of the passenger who paid the maximum fare, you could run a query like the one below:

titanic_df.loc[titanic_df.fare.idxmax()]

idxmax() is the equivalent of argmax() in NumPy and gives you the index number of the row that contains the maximum value of the given column.

In the example above, you could have also used iloc instead of loc since this DataFrame has a numerical index.

If you want to get the complete records of both the maximum and minimum fare passengers, you could do so like this:

titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()]]

Notice the double square brackets. The inner bracket references the row indexes. By leaving the column indexes empty, you get all the columns.

But if you only want to see certain columns, you could add the column information as shown below:

titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()], "fare"]

This gets you two rows of only the fare column. You can see multiple columns by adding the column names as a list, as shown below:

titanic_df.loc[
    [titanic_df.fare.idxmax(), titanic_df.fare.idxmin()], ["pclass", "who", "fare"]
]

Filter Queries

You can filter columns, rows, or values in a column using a number of techniques. Here, you will see a few.

Filter Columns

Select all columns that start with 's' and get a new DataFrame of the selected columns.

titanic_df.filter(regex="^s", axis=1)

Changing axis=0 applies the same regex to the row labels.

You can also use like to match a partial string, as shown below:

titanic_df.filter(like="s", axis=1)

The query above returns a DataFrame with all columns that contain 's' in their name.

Select all columns that are of a specific dtype:

titanic_df.select_dtypes(include="float64")

Filter Numerical Values - Range If you want to get only the passengers who are aged between 22 and 24, you could run a query like the one below:

titanic_df[titanic_df.age.between(22, 24)]

Filter String Values

If you want to filter based on a specific value contained in a column, you can use the expressions as shown below:

titanic_df[
    titanic_df.embark_town == "Cherbourg"
]  # gets all records that match a value equal to 'Cherbourg'
titanic_df[
    titanic_df.embark_town != "Cherbourg"
]  # gets all records that do not match a value equal to 'Cherbourg'

You can use the contains() function on the str accessor to match full or partial strings of a given pattern.

titanic_df[
    titanic_df.embark_town.str.contains("amp", na=False)
]  # gets all records that contain 'amp' anywhere in the string value
titanic_df[
    ~titanic_df.embark_town.str.contains("amp", na=False)
]  # gets all records that do not contain 'amp' anywhere in the string value

Note: The expression above will fail for NaN values. Ensure all NaN values are imputed before running the statement above.

If you want to match a regex pattern, you can still use contains() on str by passing in the regex pattern:

titanic_df[titanic_df.embark_town.str.contains(r"ampt[h-p]n", na=False)]

Filter String Values using a List

If you want to filter values in a categorical column, you could use the isin() function on the column, as shown below:

titanic_df[titanic_df.embark_town.isin(["Southampton", "Cherbourg"])]

The statement above returns a DataFrame that contains records of only those passengers who embarked from 'Southampton' or 'Cherbourg'.

Filter String Values by Exclusion using a List

If you want to exclude the rows that belong to a particular set of values, you could apply the ~ operator to the expression above, as shown below:

titanic_df[~titanic_df.embark_town.isin(["Southampton", "Cherbourg"])]

The statement above will get you all records that do not belong to Southampton or Cherbourg.

Using Logical AND Operators

In the last lesson, you learned that the youngest traveler was 0.42 years old. Are you curious to find out how many children below the age of 1 survived the tragedy? Let's build a query to find out:

df = titanic_df[(titanic_df.age < 1) & (titanic_df.survived == 1)]
print(df)

Output:

     survived  pclass     sex   age  sibsp  parch      fare embarked   class  \
78          1       2    male  0.83      0      2   29.0000        S  Second   
305         1       1    male  0.92      1      2  151.5500        S   First   
469         1       3  female  0.75      2      1   19.2583        C   Third   
644         1       3  female  0.75      2      1   19.2583        C   Third   
755         1       2    male  0.67      1      1   14.5000        S  Second   
803         1       3    male  0.42      0      1    8.5167        C   Third   
831         1       2    male  0.83      1      1   18.7500        S  Second   

       who  adult_male deck  embark_town alive  alone  
78   child       False  NaN  Southampton   yes  False  
305  child       False   C8  Southampton   yes  False  
469  child       False  NaN    Cherbourg   yes  False  
644  child       False  NaN    Cherbourg   yes  False  
755  child       False  NaN  Southampton   yes  False  
803  child       False  NaN    Cherbourg   yes  False  
831  child       False  NaN  Southampton   yes  False

The query above is quite intuitive to understand. You chain the necessary conditions with the & operator. Since we want all children below one year, we start with the query titanic_df.age < 1 and then add the next query to find out how many survived: titanic_df.survived == 1. We combine both of these queries using the AND operator &. The results of this query show us that there were a total of 7 children who survived the tragedy in this sample. Also, note that the result of this query is also a DataFrame, so you can run describe() or info() on it to find more information.

You can chain as many such queries as you want; there is no limit.

Complex Logical OR/AND Operations

Now, let's find out how many of the travelers were children under the age of 1 and seniors over the age of 60. Let's build a query to do that. But this time, we do not want to see the full results, but instead, only the total count of the number of rows returned.

total_rows = titanic_df[
    ((titanic_df.age < 1) | (titanic_df.age > 60)) & (titanic_df.survived == 1)
].shape[0]
print(total_rows)

Output:

12

In this query, we joined the two separate age queries with the OR operator using the | symbol. Then, we joined the combined age query with the survived query from the previous example with an AND operator.

groupby function

Sometimes you want to group rows based on a certain value in a specific column. For example, let's say you want to "find the number of passengers from each embark town." That is when the groupby() function comes in handy. Although the count() function could get us the same result for this specific example, you will soon see in the advanced groupby lesson that you can extend this concept to solve more complex problems.

The groupby() function groups rows that have the same values, and then we apply the size() function to find the total size of each group, as shown below:

groups = titanic_df.groupby(["embark_town"])
print(type(groups))
print(groups.size())

Output:

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
embark_town
Cherbourg      168
Queenstown      77
Southampton    644
dtype: int64

The groupby() function will collect all the records in each embark town and create a new DataFrameGroupBy object.

If you are now curious to find the count of the number of 'males' vs. 'females' who came from each town, you would add the 'sex' column to your groupby. Here is the modified query:

groups = titanic_df.groupby(["embark_town", "sex"])
print(groups.size())

Output:

embark_town  sex   
Cherbourg    female     73
             male       95
Queenstown   female     36
             male       41
Southampton  female    203
             male      441
dtype: int64

To find the percentage of each group compared to the total of the sample, you could run the command below:

100 * groups.size() / len(titanic_df)

Output:

embark_town  sex   
Cherbourg    female     8.193042
             male      10.662177
Queenstown   female     4.040404
             male       4.601571
Southampton  female    22.783389
             male      49.494949
dtype: float64

More advanced groupby functions are in another chapter.

Pandas Reference: http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.describe.html