Data Cleaning Commands for Pandas DataFrame
Depending on the source of the data, a significant amount of time may be spent just cleaning it. Any data analysis without thoroughly cleaning the data will lead to erroneous conclusions.
Sometimes, cleaning and rearranging columns and column names is required so that you can write queries that are more intuitive and less error-prone, thereby improving productivity. In this lesson, you will see some of the common dirty data scenarios and the techniques used for cleaning.
Column name has spaces in between words
Column names having spaces is sometimes seen in datasets. The column names might be meaningful but have spaces between two or more words. This arbitrary number of spaces will stump you while writing queries and waste your time calculating the number of spaces you missed, which causes the query to error out with a message that such a column does not exist, even though you know it does!
In this case, it is better to replace all the spaces with popular replacement characters like an underscore (_) or a hyphen (-). Here is the command you would use to fix this with an underscore:
df.columns = df.columns.str.replace(" ", "_")
To replace multiple characters, you could use a regular expression instead. Here is an example that replaces a space, a comma, and an underscore all in one statement:
df.columns = df.columns.str.replace(r"[(|)| |,]", "_")
Dropping extra columns you do not need
Sometimes your data may have columns that are of no value for your analysis, or you may have columns with duplicate data. In such cases, you would want to delete such columns and keep a smaller dataset. This not only saves space in your memory but also makes it easier to focus on the important data. You can drop such columns with the command below:
df.drop(["col1", "col2", "col3"], axis=1, inplace=True)
del df["col1"] # another variation
df.pop("col1") # another variation
df = df.drop(columns="col1") # another variation
- For dropping columns, you set
axis=1, and for dropping rows, you setaxis=0. - Using
inplace=Truewill ensure that the existing DataFramedfwill be used to drop a column instead of creating a new DataFrame to be returned by the method.
Using filter
Instead of providing separate column names as above, you could also provide a filter with a regular expression to define the column names. Here is an example:
df.filter(regex="^C", axis=1)
If you run the query above, you will see that all the columns starting with 'C' are filtered out and returned as a new DataFrame.
A common error that students make is not using inplace=True, in which case the existing df is not modified, and instead, a new DataFrame with the dropped column is returned. If you do not assign the returned DataFrame to a variable, then the DataFrame that is returned after dropping the column is not reflected in the original DataFrame.
Official reference
Inserting a new column
If you want to insert a new column at a specific index position in the columns, you can use the insert method as shown below:
df.insert(4, "fare-in-dollars", df["fare"] / 0.74)
The method above inserts a new column with the label 'fare-in-dollars' at index position 4 in the columns array by calculating the dollar value using the 'fare' column.
Selecting a subset of columns
Instead of dropping columns, you could also select only a subset of columns. Here is how you do that:
df2 = df[["col1", "col2", "col3"]]
In the example above, a new DataFrame df2 is created using only the col1, col2, and col3 columns of the df DataFrame.
Exercise
import pandas as pd
detroit_demolitions = pd.read_csv(
"https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/datasets/Detroit_Demolitions_withColumns.csv"
)
detroit_demolitions.drop(["Price"], axis=1)
detroit_demolitions.drop([0])
detroit_demolitions
Run the code above, and you will notice that both the 'Price' column and the first row are not dropped from the DataFrame. How will you fix this?
There are two ways of fixing it; try both ways.
Converting column names to lower case
Many a time, column names will have interspersed upper and lower case letters. This will again stump you while writing queries, as you have to be mindful of which letter is in which case. It is better to convert all letters to either upper or lower case. Making it upper case is not very convenient either, as you have to press another key on your keyboard to use upper case. Instead, lower case is a better option, and it is also easier on the eyes. Here is the command to convert all column names to lowercase:
df.columns = df.columns.str.lower() # preferred way
df.columns = map(str.lower, df.columns) # using a map function
df.columns = [x.lower() for x in df.columns] # using a for loop
The code block above shows three ways of getting the same result. The first one is preferred as it is neater, although the performance advantage was negligible for this small number of column names.
To time a statement's execution, you can use the time() function of the time module. Here are the revised statements that print the execution time difference after every statement:
import pandas as pd
import time as time
df = pd.read_csv(
"https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/datasets/Detroit_Demolitions_withColumns.csv"
)
a = time.time()
df.columns = df.columns.str.lower() # preferred way
b = time.time()
print(b - a)
df.columns = map(str.lower, df.columns) # using a map function
c = time.time()
print(c - b)
df.columns = [x.lower() for x in df.columns] # using a for loop
print(time.time() - c)
Note: The str accessor is applied to the Series object. You can apply this to any Series. As all columns in a DataFrame are Series, you can apply this to a column as well. Any string function can be applied using this accessor.
df['col1'].str.lower()
The statement above will convert all values in 'col1' of the DataFrame to lower case.
- While we use the
datetimemodule of Python for the most part, you can use thetimemodule to clock statement executions as shown above.timeprovides Unix timestamps, expressed as a floating-point number representing seconds from the Unix epoch time. - Using
stris safe with null values. Null values are ignored when any of the functions are applied, like string manipulations, etc. You can also use list functions likeslice,get, etc., onstr.
Using the %%time directive to time a block of code
If you want to time an entire code snippet, you can use %%time. Here is an example:
%%time
a=0
while(a<10):
10*a
a += 1
Output:
The output above shows the time it took for the entire code cell to execute. This block is not doing anything other than looping a few times so we can measure the time.
Renaming a column label
Sometimes, very long column labels are unwieldy to use in queries. You might want to change the names of such columns. Here is an example to do just that:
df = df.rename(columns={"data_value": "data"})
df.rename(columns={"column1": "col1"}, inplace=True)
Get a subset of columns using difference
To get columns without a subset of columns, you could use difference:
df.columns.difference(["col1", "col2"])
The statement above returns an Index object of all column names excluding col1 and col2.
Deduping
Deduplication, or deduping for short, is a data cleaning step in which you identify and drop duplicate rows. There may be instances when duplicate rows are real observations and are therefore legitimate. For example, if the timestamp is not noted and only the date is recorded for a patient's visit to a hospital, in certain cases, a patient may visit the hospital multiple times a day. However, in most scenarios, duplicate records are erroneous information. The error could be human data entry or a system anomaly. In any case, you have to ask the question and think through the possible scenarios for duplication. Once you are satisfied that the duplicate data should be removed, Pandas provides handy methods on DataFrames to help identify and remove duplicates.
Here are some examples. First, we create a DataFrame with duplicate rows and then query to see if we can indeed identify duplicates by calling the duplicated() method on the DataFrame. This method returns a Series with True/False values for rows that are duplicates or not, respectively.
import pandas as pd
d = {"col1": [1, 1], "col2": [2, 2]}
df = pd.DataFrame(data=d)
print(df.duplicated())
Output:
0 False
1 True
dtype: bool
In the output, the first row is identified as False for duplicate, but the second row is identified as a duplicate. To make it more intuitive, you can add the returned Series as a column, as shown below:
import pandas as pd
d = {"col1": [1, 1], "col2": [2, 2]}
df = pd.DataFrame(data=d)
df["is_duplicate"] = df.duplicated()
df
Output:
col1 col2 is_duplicate 0 1 2 False 1 1 2 True
In the next example, we drop duplicates by calling the drop_duplicates() method.
df.drop_duplicates(inplace=True)
df
Output:
col1 col2
0 1 2
Note: inplace=True will drop the duplicates in the existing DataFrame, which is a recommended way to conserve memory by not creating a new DataFrame with the dropped rows.
Replace invalid characters in values
Many a time, you may have some business domain-related characters that get in the way of writing your queries. You might want to replace such characters so that you can convert them to numerical values for your analysis. For example:
- having a ` symbol in the column values showing price.
- having a comma for prices.
- having a hyphen (
-) for empty prices.
You can replace such values using the replace method:
df = df.replace("-", 0) # replace all '-' with 0 across all values of the DataFrame
Another example to replace the $ symbol prepending the values in the 'Price' column:
df["Price"] = df["Price"].str.lstrip("$")
Convert Data From One Type to Another
astype method
You will have instances when you have to convert data from one type to another. In the example above, the Price column is of string type, so to convert that to float (so that you can run your numerical analysis), you would invoke the astype() method, providing the float type as the argument, as shown below:
df["Price"] = df["Price"].astype(float)
In fact, you can do all of this in one statement, as shown below:
df["Price"] = df["Price"].str.lstrip("$").astype(float)
Only a number can be converted to a float. You have to first strip out all the non-numeric characters before applying the astype(float) function. If the number has any non-numeric characters, you will get a ValueError: could not convert string to float: ...
to_numeric method
If a column or DataFrame has numbers for all valid values and non-numeric values for invalid values, you could use the to_numeric() method instead. To ensure that it does not error out for non-numeric values, you can add the errors attribute as shown below. All non-numeric values will get a NaN value, and the rest of the values will be converted to one of the numeric types based on the value:
pd.to_numeric(df["Price"], errors="coerce")
While the function above can be applied to individual columns, the function below can be applied to all columns of a DataFrame that need to be converted to numeric values:
df = df.apply(pd.to_numeric, errors="ignore")
Official reference
converters method in read_csv
You can also clean up your data while reading the data file into a Pandas DataFrame. Here is an example of replacing the $ symbol present in the Price column and converting the resulting stripped value to a float while reading the file into a DataFrame:
import pandas as pd
detroit_demolitions = pd.read_csv(
"https://raw.githubusercontent.com/jravi123/datasets/refs/heads/main/datasets/Detroit_Demolitions_withColumns.csv",
converters={"Price": lambda s: float(s.lstrip("$"))},
)
print(detroit_demolitions.head(1))
Handling Missing Values
Missing values in datasets are very common. Missing entries are given the value NaN, short for "Not a Number," and are given the float64 dtype for performance reasons.
Before you start any analysis, you might want to know which values are missing in the DataFrame. Here is the command to do just that:
df.isnull().values.any()
If there are NaN values, it returns True; otherwise, False.
Another function that does exactly the same is isna(), which works similarly to isnull().
The complement to isnull() or isna() is notnull().
Another way of finding the sum of all null values is:
df.isnull().sum()
This provides you with the count of null values across each column.
To get the total sum across all columns, you could use:
df.isnull().sum().values.sum()
Although using the info() method on the DataFrame will list the non-null values and the total number of rows across all columns.
Get all rows with missing values in a specific column
You can also see the entire rows of columns that have missing values. Here is a query that returns rows that have NaN values in the embark_town column:
titanic_df[titanic_df["embark_town"].isna()].head(10)
Once you identify the missing values, you can either drop them if they are few in number or impute them with other values. Here are some examples below:
Note: In the Pandas world, missing values, na, null, and None all mean the same thing, and they are represented by NaN.
Drop all missing values.
You use the dropna() method to drop all rows that have any field with a null or missing value.
df.dropna(inplace=True)
With the command above, all rows that have any column value that is a missing value are dropped.
However, if you only want to drop rows that have missing values in a specific column, you can still use the same function but apply it to the column of interest. Here is an example of both:
df.dropna(subset=["col1"], inplace=True)
With the command above, you drop only those rows that have missing values in col1.
Variations: If you want to drop those rows that have all missing values, use the keyword argument how='all' and axis=0. Refer to: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
Fill with the most common value across all columns
Sometimes it makes sense to simply fill in the missing value with the most commonly occurring value. Here is the code snippet that helps you do just that:
df.apply(lambda x: x.fillna(x.value_counts().index[0]))
Note that the x variable in the lambda function gets one column at a time.
Fill with a default for all missing values
You use the fillna() method to fill all missing values with a default value. In the example below, you replace all missing values with 0.
df.fillna(0, inplace=True)
You can also selectively fill missing values of a specific column:
df['col1'].fillna('empty', inplace=True)
However, if 'empty' is not one of the values in the column, you will get a:
ValueError: fill value must be in categories
To overcome this, you have to add the 'empty' category using the expression below:
df['col1'].cat.add_categories('empty', inplace=True)
and then run fillna() again.
In the example above, you fill the missing values with the 'empty' string for only the column with the label 'col1'.
You can also impute with the mean, median, etc. Here is an example:
df['age'].fillna(df['age'].mean(), inplace=True)
To impute with the previous row's value, use the statement below:
df['price'].ffill(inplace=True)
The statement above is the same as:
df['price'].fillna(method='ffill', inplace=True)
The complement to ffill is bfill.
Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
Imputing with linear regression
If you see a linear regression use case for imputing missing values, you can apply the function below:
df["gdp"].interpolate(method="linear", inplace=True)
The gdp column presumably contains the country's GDP value across many years in ascending order, and using linear regression, you can impute the missing values.
The method parameter can be changed to other values. Refer to: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate
Imputing missing values is a science in itself. There are many other scenarios of dirty data and ways in which you can wrangle data for analysis. Please refer to the various methods as described here: https://ebooks.mobibootcamp.com/eda/datacleaning.html
Replacing one value with another
To replace a specific column value with another, you can use the replace function:
df["embark_town"].replace("Queenstown", "Kingstown", inplace=True)
All the values with 'Queenstown' in the embark_town column will be replaced with 'Kingstown'. Of course, you can also replace it with a value of any other data type.
In the replace method, you can also use a regex by adding the keyword argument regex=True. Here is an example:
import numpy as np
df.replace(r"\s+", np.nan, inplace=True, regex=True)
The regex expression above matches and replaces one or more empty spaces, newline, and tab characters with a NaN value.
Using regex to extract values into columns
You can also extract a regex group using str.extract. More here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
Replacing a subset of values
Let's say you want to impute one or more values based on row and column index or label values. You could use the loc, iloc, or at accessors. Here are some examples:
df.loc[2, "deck"] = "A"
The statement above selects the row with index 2 and the column with label deck and updates the value in this cell to 'A'.
df.iloc[3, 11] = "A"
The statement above selects the row with index 3 and the column with index 11 (which is the 'deck') and sets the cell value to 'A'. Column indexes begin at 0.
df.loc[4:6, "deck"] = "A"
The statement above selects all the rows from 4 to 6 (inclusive) and changes the deck column value to 'A'.
df.loc[[2, 4], "deck"] = "A"
The statement above selects the rows with indexes 2 and 4 and changes the deck column value to 'A'.
import numpy as np
df.loc[[2, 4], :] = np.nan
The statement above selects the rows with indexes 2 and 4 and changes all the column values to NaN.
at
You can replace loc with at to get the same desired results in all the examples above. It is preferable to use loc instead of at because in some edge cases, at will silently convert data types behind the scenes without warning. Here is an example:
df = pd.DataFrame(
[["a", "b", 3], ["c", "c", 1], ["e", "f", 30]], columns=["A", "B", "C"]
)
df.info()
Output:
# Column Non-Null Count Dtype --- ------ -------------- ----- 0 A 3 non-null object 1 B 3 non-null object 2 C 3 non-null int64
As you can see, the 'C' column is assigned an int64 type. Now, if you try to update the value of the first row, column 'C' to a decimal value with at, you will notice that the decimal value is truncated to form an int64.
df.at[0, 'C'] = 10.70
However, the same is not so when you use loc, as it converts the entire column to float when a new float value is set in one of the rows.
Note:
- If the given row index and column label or index does not exist, then both
locandatwill create a new row and/or column to satisfy the set value. Butiloccannot enlarge the dataset. - You can also pass in a query for the row or column list values as long as the query expression returns an
Indexor a list. For example,df.loc[df[df.marks > 70].index, 'marks'] = 'A'. This expression can be applied to the DataFrame example shown in the earlier chapter to see it working.