With the pandas function DataFrame.merge(), you can combine DataFrames using keys. This allows you to efficiently combine data from various sources, enabling you to carry out more comprehensive analyses.

Web Hosting
Fast, scalable hosting for any website
  • 99.9% uptime
  • PHP 8.3 with JIT compiler
  • SSL, DDoS protection, and backups

What is the syntax for pandas merge()?

The Python pandas DataFrame merge() method can accept a range of different parameters, allowing developers to specify how DataFrames should be combined. The general syntax of the merge() function is as follows:

DataFrame.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
python
Note

The pandas merge() function is similar to the JOIN operation in SQL. Having experience with relational databases like SQL can make it easier to understand how the pandas DataFrame merge() method words. Keep in mind, though, that there are some differences. In pandas, if both key columns have values that evaluate to null, those values will also be merged.

What parameters can be used with pandas merge?

The various parameters accepted by merge() allow you to not only specify which pandas DataFrames to combine but also which type of join to use as well as other details.

Parameter Description Default Value
left The first DataFrame to be merged
right The second DataFrame to be merged
how The type of join operation to perform (inner, outer, left, right) inner
on The column(s) or index level(s) to use as keys; must be present in both DataFrames
left_on The column(s) or index level(s) of the left DataFrame that should be used as the key(s)
right_on The column(s) or index level(s) of the right DataFrame that should be used as the key(s)
left_index If True, the index of the left DataFrame is used as the key False
right_index If True, the index of the right DataFrame is used as the key False
sort If True, the resulting DataFrame’s keys are sorted lexicographically False
suffixes Suffixes used to distinguish columns with the same name ("_x", "_y")
copy If False, a copy is avoided True

How to use pandas merge()

Below are some examples to help illustrate how to use pandas merge() effectively.

INNER JOIN

An INNER JOIN combines two pandas DataFrames and only returns the rows where the keys match in both DataFrames. To get a better idea of how this works, let’s create two DataFrames:

import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
    'Key': ['A', 'B', 'C'],
    'Value1': [1, 2, 3]
})
df2 = pd.DataFrame({
    'Key': ['B', 'C', 'D'],
    'Value2': [4, 5, 6]
})
print(df1)
print(df2)
python

The two resulting DataFrames look like this:

Key    Value1
0     A            1
1     B            2
2     C            3
    Key    Value2
0     B            4
1     C            5
2     D            6

Now, we can perform an INNER JOIN using the merge() function:

# INNER JOIN
result = pd.merge(df1, df2, how='inner', on='Key')
print(result)
python

The output shows that only the rows that have keys B and C are included in the DataFrame. This is because these two keys exist in both of the original DataFrames.

Key    Value1    Value2
0     B            2            4
1     C            3            5

OUTER JOIN

An OUTER JOIN also merges two DataFrames, but unlike INNER JOIN, it returns all the rows and fills in missing values with NaN.

# OUTER JOIN
result = pd.merge(df1, df2, how='outer', on='Key')
print(result)
python

The DataFrame below includes all the rows from both DataFrames. NaN is used for the missing values in key A, which is only in df1 and key D, which is only in df2.

Key    Value1    Value2
0     A        1.0        NaN
1     B        2.0        4.0
2     C        3.0        5.0
3     D        NaN        6.0
Note

The other standard variants of JOIN work in a similar manner.

Using left_on and right_on

Sometimes, two DataFrames have different column names for their keys. In this case, you can use the left_on and right_on parameters to specify which columns to use. First, let’s create two new DataFrames:

df3 = pd.DataFrame({
    'Key': ['A', 'B', 'C'],
    'Value1': [1, 2, 3]
})
df4 = pd.DataFrame({
    'Key2': ['B', 'C', 'D'],
    'Value2': [4, 5, 6]
})
print(df3)
print(df4)
python

The two DataFrames look like this:

Key    Value1
0     A            1
1     B            2
2     C            3
    Key2    Value2
0        B            4
1        C            5
2        D            6

We can use the left_on and right_on parameters to perform the JOIN operation using different keys:

# Join with different key column names
result = pd.merge(df3, df4, how='inner', left_on='Key', right_on='Key2')
print(result)
python

By explicitly using left_on='Key' and right_on='Key2', the corresponding key columns are utilized for the merge.

Key    Value1 Key2    Value2
0     B            2        B            4
1     C            3        C            5

Using indices as keys

You can also use the indices of DataFrames as keys by setting the left_index and right_index parameters to True. First, let’s create two new DataFrames with indices:

df5 = pd.DataFrame({
    'Value1': [1, 2, 3]
}, index=['A', 'B', 'C'])
df6 = pd.DataFrame({
    'Value2': [4, 5, 6]
}, index=['B', 'C', 'D'])
print(df5)
print(df6)
python

Here are the DataFrames:

Value1
A        1
B        2
C        3
    Value2
B        4
C        5
D        6

Now, we can perform a JOIN operation using the indices:

# JOIN with indices
result = pd.merge(df5, df6, how='inner', left_index=True, right_index=True)
print(result)
python

The result is a JOIN that uses the indices from the DataFrames:

Value1  Value2
B        2        4
C        3        5
Was this article helpful?
Go to Main Menu