Data Transformations

There are a few other functions and techniques that you should know that come in handy when transforming data from one form to another to derive better insights. In this lesson, you will learn a few of these techniques.

Segmentation

Segmentation is a widely used technique in data analysis. For categorical data, the data is already segmented, but for continuous numerical data, you will have to come up with your own formula. Using the segmentation (or binning) technique, you can categorize or discretize a given continuous variable for analysis.

As an example, let's look at the age column of the Titanic data. We have a wide range of travelers, from 0.42-year-old babies to 80-year-old senior citizens. One possible segmentation is to bucket the travelers into 'children' (below the age of 13), 'teenagers' (travelers between 13 and 19 years old), 'adults' (from 20 to 60 years old), and beyond 60, we will label them as 'seniors'. Pandas provides a built-in function, cut, to accomplish this task easily. This function returns a new Series object that will have the categorical values based on our segmentation. We can then attach this Series as another column to our original DataFrame, thereby making it part of the data. Let's do that:

import numpy as np
import pandas as pd

titanic_df = pd.read_csv(
    "https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/titanic.csv"
)
segments = [-1, 12, 19, 60, np.inf]
labels = ["child", "teenager", "adult", "senior"]
age_group = pd.cut(titanic_df.age, bins=segments, labels=labels)
titanic_df["age_group"] = age_group
titanic_df.head(3)

Output:

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   

     who  adult_male deck  embark_town alive  alone age_group  
0    man        True  NaN  Southampton    no  False     adult  
1  woman       False    C    Cherbourg   yes  False     adult  
2  woman       False  NaN  Southampton   yes   True     adult

In the code above, the first line declares the segments with the boundaries. The first array element will be taken as the min value for the first segment, and the second element will be considered the max value. The second segment will consider the second element as the min value and the third element as its max value, and so on for the third and last segments. Notice that the last segment has an upper max value of infinity, which is declared by using np.inf.

Note: The reason we start from -1 and not 0 is to include all values that are '0', as the range starts from the value above the declared min value.

The second line initializes the list of labels for our segments. The third line is where we use the cut function, which takes in the column values that need to be categorized, along with the two arrays for specifying the segments and labels. This function returns a Series object. Then, in the last line, we add the new column to our DataFrame by just defining the new column name and attaching the Series returned by cut. That's it!

When you see the first three rows of your new DataFrame, you will notice that a new column, 'age_group', has been added to the original, thereby successfully converting continuous numerical data to categorical data, ready for deeper analysis. By default, new columns are inserted at the end. Note that the number of values for the new column should match the number of rows, or it can be a single scalar value, in which case the value is broadcast to all rows.

You can get the frequency of the bin labels by using the value_counts() function, as shown below:

pd.value_counts(age_group)

Another way of applying value_counts is shown below:

titanic_df["embark_town"].value_counts()

value_counts returns the frequency counts in descending order.

There are many other arguments you can add and set for the cut function. Refer to the documentation: http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.cut.html

** qcut function **

You can also segment your data using quantiles as boundaries by using the qcut function. Let's see what the segmented counts are if we do this at the 10, 50, and 90 quantiles. Here is an example:

quantiles = pd.qcut(titanic_df["age"], [0, 0.1, 0.5, 0.9, 1.0])
pd.value_counts(quantiles)

You can also just provide the number of buckets instead of specifying the quantiles. You can also specify labels.

Official reference on qcut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

Using the Map Function

You can apply a map function to individual field values and create a new column. Suppose you want to map the various cities to the countries from where people embarked on the titanic_df. You would create a dictionary with city-to-country entries and then apply this mapping to the individual values of the city column using the map function. Here is the code:

city_to_country = {
    "Southampton": "Great Britain",
    "Cherbourg": "France",
    "Queenstown": "Great Britain",
}

titanic_df["country"] = titanic_df["embark_town"].map(city_to_country)
titanic_df

From the results, you can see that a new column, 'country', has been added, which is derived from applying the map to the individual column values of the 'embark_town' column. Note that while this example uses a dictionary, you could also use a Series object, with the index being the key values of the dictionary shown above.

** Using the replace function**

You can also use replace in place of map above. Here are the same results obtained using replace:

city_to_country = {
    "Southampton": "Great Britain",
    "Cherbourg": "France",
    "Queenstown": "Great Britain",
}

titanic_df["country"] = titanic_df["embark_town"].replace(city_to_country)
titanic_df

However, note the difference: if the city_to_country map has a missing value, the map function adds a NaN value, but in the case of replace, it fills in the original column value as-is. Here is an example in which one map value is missing:

city_to_country = {"Southampton": "Great Britain", "Queenstown": "Great Britain"}

titanic_df["country"] = titanic_df["embark_town"].replace(city_to_country)
titanic_df

city_to_country = {"Southampton": "Great Britain", "Queenstown": "Great Britain"}

titanic_df["country"] = titanic_df["embark_town"].map(city_to_country)
titanic_df

Note that you can also just replace some values in the existing column by providing one or more map values.

Using the shift() function

You can use the shift() function to move a row or column by the desired number of index positions. This function takes a scalar parameter called periods, which represents the number of shifts for the desired axis.

Let's take an example of data showing the total number of infections of a disease tabulated for a few days.

df = pd.DataFrame(
    {
        "date": ["01/01/2020", "01/02/2020", "01/03/2020", "01/04/2020"],
        "total_infections": [49, 55, 77, 99],
    }
)

df.shift(1)

	date	total_infections
0	NaN	            NaN
1	01/01/2020          49.0
2	01/02/2020	    55.0
3	01/03/2020	    77.0

You can fill a default value for the shifted row by adding the fill_value argument.

If you want to have another column that shows the previous day's count next to the present day's count, you could use shift() on the column instead of the entire DataFrame. Here is an example:

df["prev_day"] = df["total_infections"].shift(1, fill_value=0)

df


            date    total_infections	prev_day
0	    01/01/2020	    49	            0
1	    01/02/2020	    55	            49
2	    01/03/2020	    77	            55
3	    01/04/2020	    99	            77

Now you can add another column that computes the average rate of change in infection for three days, as shown below.

Note: The date column should have sorted values; otherwise, your calculations will be incorrect. If the dates are not sorted, you can sort them using the expression below:

df.sort_values(by=['date'], inplace=True)

You can also make the index a time series by setting the date column to be the index. Then you can sort the index using the df.sort_index() method. At that point, you can also apply a lot of time series-related functions that you will explore in Chapter 5.

df["3_day_average"] = (
    df["total_infections"]
    + df["total_infections"].shift(1, fill_value=0)
    + df["total_infections"].shift(2, fill_value=0)
) / 3

df

    date	    total_infections	3_day_average
0	01/01/2020	    49	            16.333333
1	01/02/2020	    55	            34.666667
2	01/03/2020	    77	            60.333333
3	01/04/2020	    99	            77.000000

mask() method

The mask() method can be used in place of complex conditional statements to achieve the same end result. Here is an example of flipping all values less than 60 to "Fail".

df = pd.DataFrame(
    {
        "marks": [70, 55, 100, 88],
        "age": [29, 32, 31, 28],
        "sex": ["F", "M", "F", "F"],
        "name": ["Jane", "John", "Sally", "Sandy"],
        "ssn": ["1234", "3456", "4567", "5678"],
    }
)
df["marks"].mask(df["marks"] < 60, "Fail", inplace=True)
df

      marks	age	sex	name	ssn
0	70	29	F	Jane	1234
1	Fail	32	M	John	3456
2	100	31	F	Sally	4567
3	88	28	F	Sandy	5678

You can apply this function at the DataFrame level as well to apply it to all rows and columns, although the example shown above is for checking the condition on one column.