Data Cleaning Commands for Pandas DataFrame

Depending on the source of the data, a significant amount of time may be spent just cleaning it. Any data analysis without thoroughly cleaning the data will lead to erroneous conclusions.

Sometimes, cleaning and rearranging columns and column names is required so that you can write queries that are more intuitive and less error-prone, thereby improving productivity. In this lesson, you will see some of the common dirty data scenarios and the techniques used for cleaning.

Column name has spaces in between words

Column names having spaces is sometimes seen in datasets. The column names might be meaningful but have spaces between two or more words. This arbitrary number of spaces will stump you while writing queries and waste your time calculating the number of spaces you missed, which causes the query to error out with a message that such a column does not exist, even though you know it does!

In this case, it is better to replace all the spaces with popular replacement characters like an underscore (_) or a hyphen (-). Here is the command you would use to fix this with an underscore:

df.columns = df.columns.str.replace(" ", "_")

To replace multiple characters, you could use a regular expression instead. Here is an example that replaces a space, a comma, and an underscore all in one statement:

df.columns = df.columns.str.replace(r"[(|)| |,]", "_")

Dropping extra columns you do not need

Sometimes your data may have columns that are of no value for your analysis, or you may have columns with duplicate data. In such cases, you would want to delete such columns and keep a smaller dataset. This not only saves space in your memory but also makes it easier to focus on the important data. You can drop such columns with the command below:

df.drop(["col1", "col2", "col3"], axis=1, inplace=True)
del df["col1"]  # another variation
df.pop("col1")  # another variation
df = df.drop(columns="col1")  # another variation

* The same `drop` method can be used to drop both columns and rows, although dropping columns is more popular.

For dropping columns, you set axis=1, and for dropping rows, you set axis=0.
Using inplace=True will ensure that the existing DataFrame df will be used to drop a column instead of creating a new DataFrame to be returned by the method.

Using filter

Instead of providing separate column names as above, you could also provide a filter with a regular expression to define the column names. Here is an example:

df.filter(regex="^C", axis=1)

If you run the query above, you will see that all the columns starting with 'C' are filtered out and returned as a new DataFrame.

Caution

A common error that students make is not using inplace=True, in which case the existing df is not modified, and instead, a new DataFrame with the dropped column is returned. If you do not assign the returned DataFrame to a variable, then the DataFrame that is returned after dropping the column is not reflected in the original DataFrame.

Official reference

Note

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

Inserting a new column

If you want to insert a new column at a specific index position in the columns, you can use the insert method as shown below:

df.insert(4, "fare-in-dollars", df["fare"] / 0.74)

The method above inserts a new column with the label 'fare-in-dollars' at index position 4 in the columns array by calculating the dollar value using the 'fare' column.

Selecting a subset of columns

Instead of dropping columns, you could also select only a subset of columns. Here is how you do that:

df2 = df[["col1", "col2", "col3"]]

In the example above, a new DataFrame df2 is created using only the col1, col2, and col3 columns of the df DataFrame.

Exercise

import pandas as pd

detroit_demolitions = pd.read_csv(
    "https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/datasets/Detroit_Demolitions_withColumns.csv"
)
detroit_demolitions.drop(["Price"], axis=1)
detroit_demolitions.drop([0])

detroit_demolitions

Run the code above, and you will notice that both the 'Price' column and the first row are not dropped from the DataFrame. How will you fix this?

There are two ways of fixing it; try both ways.

Converting column names to lower case

Many a time, column names will have interspersed upper and lower case letters. This will again stump you while writing queries, as you have to be mindful of which letter is in which case. It is better to convert all letters to either upper or lower case. Making it upper case is not very convenient either, as you have to press another key on your keyboard to use upper case. Instead, lower case is a better option, and it is also easier on the eyes. Here is the command to convert all column names to lowercase:

df.columns = df.columns.str.lower()  # preferred way
df.columns = map(str.lower, df.columns)  # using a map function
df.columns = [x.lower() for x in df.columns]  # using a for loop

The code block above shows three ways of getting the same result. The first one is preferred as it is neater, although the performance advantage was negligible for this small number of column names.

To time a statement's execution, you can use the time() function of the time module. Here are the revised statements that print the execution time difference after every statement:

import pandas as pd
import time as time

df = pd.read_csv(
    "https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/datasets/Detroit_Demolitions_withColumns.csv"
)

a = time.time()
df.columns = df.columns.str.lower()  # preferred way
b = time.time()
print(b - a)
df.columns = map(str.lower, df.columns)  # using a map function

c = time.time()
print(c - b)
df.columns = [x.lower() for x in df.columns]  # using a for loop

print(time.time() - c)

Note: The str accessor is applied to the Series object. You can apply this to any Series. As all columns in a DataFrame are Series, you can apply this to a column as well. Any string function can be applied using this accessor.

df['col1'].str.lower()

The statement above will convert all values in 'col1' of the DataFrame to lower case.

Points to note

While we use the datetime module of Python for the most part, you can use the time module to clock statement executions as shown above. time provides Unix timestamps, expressed as a floating-point number representing seconds from the Unix epoch time.
Using str is safe with null values. Null values are ignored when any of the functions are applied, like string manipulations, etc. You can also use list functions like slice, get, etc., on str.

Using the %%time directive to time a block of code

If you want to time an entire code snippet, you can use %%time. Here is an example:

%%time

a=0
while(a<10):
    10*a
    a += 1

Output:

CPU times: user 2.31 ms, sys: 854 µs, total: 3.16 ms Wall time: 2.72 ms

The output above shows the time it took for the entire code cell to execute. This block is not doing anything other than looping a few times so we can measure the time.

Renaming a column label

Sometimes, very long column labels are unwieldy to use in queries. You might want to change the names of such columns. Here is an example to do just that:

df = df.rename(columns={"data_value": "data"})
df.rename(columns={"column1": "col1"}, inplace=True)

Get a subset of columns using difference

To get columns without a subset of columns, you could use difference:

df.columns.difference(["col1", "col2"])

The statement above returns an Index object of all column names excluding col1 and col2.

Deduping

Deduplication, or deduping for short, is a data cleaning step in which you identify and drop duplicate rows. There may be instances when duplicate rows are real observations and are therefore legitimate. For example, if the timestamp is not noted and only the date is recorded for a patient's visit to a hospital, in certain cases, a patient may visit the hospital multiple times a day. However, in most scenarios, duplicate records are erroneous information. The error could be human data entry or a system anomaly. In any case, you have to ask the question and think through the possible scenarios for duplication. Once you are satisfied that the duplicate data should be removed, Pandas provides handy methods on DataFrames to help identify and remove duplicates.

Here are some examples. First, we create a DataFrame with duplicate rows and then query to see if we can indeed identify duplicates by calling the duplicated() method on the DataFrame. This method returns a Series with True/False values for rows that are duplicates or not, respectively.

import pandas as pd

d = {"col1": [1, 1], "col2": [2, 2]}
df = pd.DataFrame(data=d)

print(df.duplicated())

Output:
0 False
1 True
dtype: bool

In the output, the first row is identified as False for duplicate, but the second row is identified as a duplicate. To make it more intuitive, you can add the returned Series as a column, as shown below:

import pandas as pd

d = {"col1": [1, 1], "col2": [2, 2]}
df = pd.DataFrame(data=d)

df["is_duplicate"] = df.duplicated()
df

Output:

	col1	col2	is_duplicate
0	1	    2	    False
1	1	    2	    True

In the next example, we drop duplicates by calling the drop_duplicates() method.

df.drop_duplicates(inplace=True)
df

Output:
col1 col2
0 1 2

Note: inplace=True will drop the duplicates in the existing DataFrame, which is a recommended way to conserve memory by not creating a new DataFrame with the dropped rows.

Replace invalid characters in values

Many a time, you may have some business domain-related characters that get in the way of writing your queries. You might want to replace such characters so that you can convert them to numerical values for your analysis. For example:

having a ` symbol in the column values showing price.
having a comma for prices.
having a hyphen (-) for empty prices.

You can replace such values using the replace method:

df = df.replace("-", 0)  # replace all '-' with 0 across all values of the DataFrame

Another example to replace the $ symbol prepending the values in the 'Price' column:

df["Price"] = df["Price"].str.lstrip("$")

Convert Data From One Type to Another

astype method

You will have instances when you have to convert data from one type to another. In the example above, the Price column is of string type, so to convert that to float (so that you can run your numerical analysis), you would invoke the astype() method, providing the float type as the argument, as shown below:

df["Price"] = df["Price"].astype(float)

In fact, you can do all of this in one statement, as shown below:

df["Price"] = df["Price"].str.lstrip("$").astype(float)

Note

Only a number can be converted to a float. You have to first strip out all the non-numeric characters before applying the astype(float) function. If the number has any non-numeric characters, you will get a ValueError: could not convert string to float: ...

to_numeric method

If a column or DataFrame has numbers for all valid values and non-numeric values for invalid values, you could use the to_numeric() method instead. To ensure that it does not error out for non-numeric values, you can add the errors attribute as shown below. All non-numeric values will get a NaN value, and the rest of the values will be converted to one of the numeric types based on the value:

pd.to_numeric(df["Price"], errors="coerce")

While the function above can be applied to individual columns, the function below can be applied to all columns of a DataFrame that need to be converted to numeric values:

df = df.apply(pd.to_numeric, errors="ignore")

Official reference

Note

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html

converters method in read_csv

You can also clean up your data while reading the data file into a Pandas DataFrame. Here is an example of replacing the $ symbol present in the Price column and converting the resulting stripped value to a float while reading the file into a DataFrame:

import pandas as pd

detroit_demolitions = pd.read_csv(
    "https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/datasets/Detroit_Demolitions_withColumns.csv",
    converters={"Price": lambda s: float(s.lstrip("$"))},
)
print(detroit_demolitions.head(1))

Handling Missing Values

Missing values in datasets are very common. Missing entries are given the value NaN, short for "Not a Number," and are given the float64 dtype for performance reasons. Before you start any analysis, you might want to know which values are missing in the DataFrame. Here is the command to do just that:

df.isnull().values.any()

If there are NaN values, it returns True; otherwise, False.

Another function that does exactly the same is isna(), which works similarly to isnull().

The complement to isnull() or isna() is notnull().

Another way of finding the sum of all null values is:

df.isnull().sum()

This provides you with the count of null values across each column.

To get the total sum across all columns, you could use:

df.isnull().sum().values.sum()

Although using the info() method on the DataFrame will list the non-null values and the total number of rows across all columns.

Get all rows with missing values in a specific column

You can also see the entire rows of columns that have missing values. Here is a query that returns rows that have NaN values in the embark_town column:

titanic_df[titanic_df["embark_town"].isna()].head(10)

Once you identify the missing values, you can either drop them if they are few in number or impute them with other values. Here are some examples below:

Note: In the Pandas world, missing values, na, null, and None all mean the same thing, and they are represented by NaN.

Drop all missing values.

You use the dropna() method to drop all rows that have any field with a null or missing value.

df.dropna(inplace=True)

With the command above, all rows that have any column value that is a missing value are dropped.

However, if you only want to drop rows that have missing values in a specific column, you can still use the same function but apply it to the column of interest. Here is an example of both:

df.dropna(subset=["col1"], inplace=True)

With the command above, you drop only those rows that have missing values in col1.

Variations: If you want to drop those rows that have all missing values, use the keyword argument how='all' and axis=0. Refer to: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

Fill with the most common value across all columns

Sometimes it makes sense to simply fill in the missing value with the most commonly occurring value. Here is the code snippet that helps you do just that:

df.apply(lambda x: x.fillna(x.value_counts().index[0]))

Note that the x variable in the lambda function gets one column at a time.

Fill with a default for all missing values

You use the fillna() method to fill all missing values with a default value. In the example below, you replace all missing values with 0.

df.fillna(0, inplace=True)

You can also selectively fill missing values of a specific column:

df['col1'].fillna('empty', inplace=True)

However, if 'empty' is not one of the values in the column, you will get a:

ValueError: fill value must be in categories

To overcome this, you have to add the 'empty' category using the expression below:

df['col1'].cat.add_categories('empty', inplace=True)

and then run fillna() again.

In the example above, you fill the missing values with the 'empty' string for only the column with the label 'col1'.

You can also impute with the mean, median, etc. Here is an example:

df['age'].fillna(df['age'].mean(), inplace=True)

To impute with the previous row's value, use the statement below:

df['price'].ffill(inplace=True)

The statement above is the same as:

df['price'].fillna(method='ffill', inplace=True)

The complement to ffill is bfill.

Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

Imputing with linear regression

If you see a linear regression use case for imputing missing values, you can apply the function below:

df["gdp"].interpolate(method="linear", inplace=True)

The gdp column presumably contains the country's GDP value across many years in ascending order, and using linear regression, you can impute the missing values.

The method parameter can be changed to other values. Refer to: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate

Imputing missing values is a science in itself. There are many other scenarios of dirty data and ways in which you can wrangle data for analysis. Please refer to the various methods as described here: https://ebooks.mobibootcamp.com/eda/datacleaning.html

Replacing one value with another

To replace a specific column value with another, you can use the replace function:

df["embark_town"].replace("Queenstown", "Kingstown", inplace=True)

All the values with 'Queenstown' in the embark_town column will be replaced with 'Kingstown'. Of course, you can also replace it with a value of any other data type.

In the replace method, you can also use a regex by adding the keyword argument regex=True. Here is an example:

import numpy as np

df.replace(r"\s+", np.nan, inplace=True, regex=True)

The regex expression above matches and replaces one or more empty spaces, newline, and tab characters with a NaN value.

Using regex to extract values into columns

You can also extract a regex group using str.extract. More here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

Replacing a subset of values

Let's say you want to impute one or more values based on row and column index or label values. You could use the loc, iloc, or at accessors. Here are some examples:

df.loc[2, "deck"] = "A"

The statement above selects the row with index 2 and the column with label deck and updates the value in this cell to 'A'.

df.iloc[3, 11] = "A"

The statement above selects the row with index 3 and the column with index 11 (which is the 'deck') and sets the cell value to 'A'. Column indexes begin at 0.

df.loc[4:6, "deck"] = "A"

The statement above selects all the rows from 4 to 6 (inclusive) and changes the deck column value to 'A'.

df.loc[[2, 4], "deck"] = "A"

The statement above selects the rows with indexes 2 and 4 and changes the deck column value to 'A'.

import numpy as np

df.loc[[2, 4], :] = np.nan

The statement above selects the rows with indexes 2 and 4 and changes all the column values to NaN.

at

You can replace loc with at to get the same desired results in all the examples above. It is preferable to use loc instead of at because in some edge cases, at will silently convert data types behind the scenes without warning. Here is an example:

df = pd.DataFrame(
    [["a", "b", 3], ["c", "c", 1], ["e", "f", 30]], columns=["A", "B", "C"]
)
df.info()

Output:

 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       3 non-null      object
 1   B       3 non-null      object
 2   C       3 non-null      int64

As you can see, the 'C' column is assigned an int64 type. Now, if you try to update the value of the first row, column 'C' to a decimal value with at, you will notice that the decimal value is truncated to form an int64.


df.at[0, 'C'] = 10.70

However, the same is not so when you use loc, as it converts the entire column to float when a new float value is set in one of the rows.

Note:

If the given row index and column label or index does not exist, then both loc and at will create a new row and/or column to satisfy the set value. But iloc cannot enlarge the dataset.
You can also pass in a query for the row or column list values as long as the query expression returns an Index or a list. For example, df.loc[df[df.marks > 70].index, 'marks'] = 'A'. This expression can be applied to the DataFrame example shown in the earlier chapter to see it working.