Joining Datasets

There are times when you need to combine two or more datasets before performing your analysis. Broadly, you can classify joining into two types:

Joining horizontally: This adds columns from separate datasets together based on a common key, similar to a SQL database JOIN. The total number of rows generally does not increase significantly.
Joining vertically: This stacks rows from one dataset on top of another. The number of columns typically does not increase significantly unless there are mismatches.

Pandas provides operations for both techniques. Here are some examples:

The merge Operation

In the example below, we will create two separate DataFrames and then merge them.

import pandas as pd

raw_data1 = {
    'fruit_id': ['1', '2', '3'],
    'name': ['Apple', 'Banana', 'Orange']
}
df1 = pd.DataFrame(raw_data1, columns=['fruit_id', 'name'])

print(df1)

raw_data2 = {
    'fruit_id': ['1', '2', '3', '4'],
    'calories': ['55', '50', '45', '60']
}

df2 = pd.DataFrame(raw_data2, columns=["fruit_id", "calories"])

print(df2)

df3 = pd.merge(df1, df2)
print(df3)

Note that the column to join on is not explicitly specified. If that information is missing, merge() automatically uses overlapping column names as the keys. While this works, it is best practice to specify the key explicitly:

pd.merge(df1, df2, on='fruit_id')

If on is not specified and there is more than one overlapping column, the merge might fail or produce unexpected results.

Also, did you notice something? The second DataFrame (df2) has an extra row (fruit_id 4) that is missing from the merged result. This is because there is no corresponding row in df1, and the default join method applied is an 'inner' join.

To change this behavior, you can specify the how attribute. The possible values are 'inner' (default), 'left', 'right', and 'outer'. The example below uses an 'outer' join:

pd.merge(df1, df2, how="outer")

Try this: Test the 'left' and 'right' joins. If you specify a 'left' join, all rows in the left DataFrame are kept, even if there is no matching key in the right DataFrame. For a 'right' join, the opposite is true. An 'outer' join keeps all rows from both sides, inserting NaN (null) values for missing data.

If the key columns have different names in the two DataFrames, you can specify them separately using left_on and right_on:

# Create DataFrames with different key column names
data_a = pd.DataFrame({'df1_id': ['1', '2'], 'name': ['Apple', 'Banana']})
data_b = pd.DataFrame({'df2_id': ['1', '2'], 'calories': ['55', '50']})

# Merge specifying the left and right keys
merged_data = pd.merge(data_a, data_b, left_on='df1_id', right_on='df2_id')
print(merged_data)

Handling Duplicate Keys If you have duplicate values in your key columns, merge() will generate a Cartesian product for those specific keys, inserting rows for all matching combinations. Here is an example to make this clear:

raw_data1 = {
    'fruit_id': ['1', '1', '2', '3'],
    'name': ['Apple', 'Apple', 'Banana', 'Orange']
}
df1 = pd.DataFrame(raw_data1, columns=['fruit_id', 'name'])
print(df1)

raw_data2 = {
    'fruit_id': ['1', '2', '3', '4'],
    'calories': ['55', '50', '45', '60']
}
df2 = pd.DataFrame(raw_data2, columns=["fruit_id", "calories"])
df3 = pd.merge(df1, df2)
print(df3)

Lastly, if the join columns are specified but there are other overlapping column names in the two DataFrames, Pandas automatically appends _x and _y to distinguish them. You can customize these suffixes:

raw_data1 = {
    'fruit_id': ['1', '1', '2', '3'],
    'name': ['Apple', 'Apple', 'Banana', 'Orange']
}
df1 = pd.DataFrame(raw_data1, columns=['fruit_id', 'name'])

raw_data2 = {
    'fruit_id': ['1', '2', '3', '4'],
    'name': ['A', 'Ap', 'B', 'O'],
    'calories': ['55', '50', '45', '60']
}

df2 = pd.DataFrame(raw_data2, columns=["fruit_id", "name", "calories"])
df3 = pd.merge(df1, df2, on="fruit_id", suffixes=("_left", "_right"))

print(df2)
print(df3)

If you want to merge on the index itself rather than a column name, you can specify that using left_index=True and/or right_index=True:

pd.merge(df1, df2, left_index=True, right_index=True)

The concat Operation

If one set of DataFrame rows needs to be appended vertically with another, you use the pd.concat() operation.

(Note: In older versions of Pandas, there was an .append() method, but it was permanently removed in Pandas 2.0 due to performance inefficiencies. You must use concat() instead).

Here is an example:

raw_data1 = {
    'fruit_id': ['1', '2', '3'],
    'name': ['Apple', 'Banana', 'Orange']
}
df1 = pd.DataFrame(raw_data1, columns=['fruit_id', 'name'])

raw_data2 = {
    'fruit_id': ['4', '5', '6'],
    'name': ['Mango', 'Pineapple', 'Kiwi']
}
df2 = pd.DataFrame(raw_data2, columns=['fruit_id', 'name'])

# Pass the DataFrames as a list to concat
df3 = pd.concat([df1, df2])
print(df3)

Do you notice that the indexes are duplicated (0, 1, 2, 0, 1, 2)? To ensure that the original indexes are ignored and a clean, contiguous index is created, use the ignore_index=True attribute:

pd.concat([df1, df2], ignore_index=True)

If the columns are not perfectly aligned for appending, Pandas will add the extra columns to the resulting DataFrame and insert NaN values where data is missing. Here is an example:

raw_data1 = {
    'fruit_id': ['1', '2', '3'],
    'name': ['Apple', 'Banana', 'Orange']
}
df1 = pd.DataFrame(raw_data1, columns=['fruit_id', 'name'])

raw_data2 = {
    "fruit_id": ["1", "2", "3", "4"],
    "name": ["A", "Ap", "B", "O"],
    "calories": ["55", "50", "45", "60"],
}
df2 = pd.DataFrame(raw_data2, columns=["fruit_id", "name", "calories"])

df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)

Comparing DataFrames

When you want to compare two DataFrames, use the equals() method instead of the == operator.

This is crucial because, with the == operator, one NaN is not considered equal to another NaN, which will cause the comparison to incorrectly return False. The equals() method correctly handles NaN comparisons. Also, note that the DataFrame indexes and columns must be in the exact same order for equals() to return True.

df1.equals(df2)