How to merge DataFrames with pandas merge()

Contents

With the pandas function DataFrame.merge(), you can combine DataFrames using keys. This allows you to efficiently combine data from various sources, enabling you to carry out more comprehensive analyses.

Web Hosting

Hosting that scales with your ambitions

Stay online with 99.99% uptime and robust security
Add performance with a click as traffic grows
Includes free domain, SSL, email, and 24/7 support

What is the syntax for pandas `merge()`?

The Python pandas DataFrame merge() method can accept a range of different parameters, allowing developers to specify how DataFrames should be combined. The general syntax of the merge() function is as follows:

DataFrame.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

python

Note

The pandas merge() function is similar to the JOIN operation in SQL. Having experience with relational databases like SQL can make it easier to understand how the pandas DataFrame merge() method words. Keep in mind, though, that there are some differences. In pandas, if both key columns have values that evaluate to null, those values will also be merged.

What parameters can be used with pandas `merge`?

The various parameters accepted by merge() allow you to not only specify which pandas DataFrames to combine but also which type of join to use as well as other details.

Parameter	Description	Default Value
left	The first DataFrame to be merged
right	The second DataFrame to be merged
how	The type of join operation to perform (`inner`, `outer`, `left`, `right`)	`inner`
on	The column(s) or index level(s) to use as keys; must be present in both DataFrames
left_on	The column(s) or index level(s) of the left DataFrame that should be used as the key(s)
right_on	The column(s) or index level(s) of the right DataFrame that should be used as the key(s)
left_index	If `True`, the index of the left DataFrame is used as the key	`False`
right_index	If `True`, the index of the right DataFrame is used as the key	`False`
sort	If `True`, the resulting DataFrame’s keys are sorted lexicographically	`False`
suffixes	Suffixes used to distinguish columns with the same name	`("_x", "_y")`
copy	If `False`, a copy is avoided	`True`

How to use pandas `merge()`

Below are some examples to help illustrate how to use pandas merge() effectively.

`INNER JOIN`

An INNER JOIN combines two pandas DataFrames and only returns the rows where the keys match in both DataFrames. To get a better idea of how this works, let’s create two DataFrames:

import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
    'Key': ['A', 'B', 'C'],
    'Value1': [1, 2, 3]
})
df2 = pd.DataFrame({
    'Key': ['B', 'C', 'D'],
    'Value2': [4, 5, 6]
})
print(df1)
print(df2)

python

The two resulting DataFrames look like this:

Key    Value1
0     A            1
1     B            2
2     C            3
    Key    Value2
0     B            4
1     C            5
2     D            6

Now, we can perform an INNER JOIN using the merge() function:

# INNER JOIN
result = pd.merge(df1, df2, how='inner', on='Key')
print(result)

python

The output shows that only the rows that have keys B and C are included in the DataFrame. This is because these two keys exist in both of the original DataFrames.

Key    Value1    Value2
0     B            2            4
1     C            3            5

`OUTER JOIN`

An OUTER JOIN also merges two DataFrames, but unlike INNER JOIN, it returns all the rows and fills in missing values with NaN.

# OUTER JOIN
result = pd.merge(df1, df2, how='outer', on='Key')
print(result)

python

The DataFrame below includes all the rows from both DataFrames. NaN is used for the missing values in key A, which is only in df1 and key D, which is only in df2.

Key    Value1    Value2
0     A        1.0        NaN
1     B        2.0        4.0
2     C        3.0        5.0
3     D        NaN        6.0

Note

The other standard variants of JOIN work in a similar manner.

Using `left_on` and `right_on`

Sometimes, two DataFrames have different column names for their keys. In this case, you can use the left_on and right_on parameters to specify which columns to use. First, let’s create two new DataFrames:

df3 = pd.DataFrame({
    'Key': ['A', 'B', 'C'],
    'Value1': [1, 2, 3]
})
df4 = pd.DataFrame({
    'Key2': ['B', 'C', 'D'],
    'Value2': [4, 5, 6]
})
print(df3)
print(df4)

python

The two DataFrames look like this:

Key    Value1
0     A            1
1     B            2
2     C            3
    Key2    Value2
0        B            4
1        C            5
2        D            6

We can use the left_on and right_on parameters to perform the JOIN operation using different keys:

# Join with different key column names
result = pd.merge(df3, df4, how='inner', left_on='Key', right_on='Key2')
print(result)

python

By explicitly using left_on='Key' and right_on='Key2', the corresponding key columns are utilized for the merge.

Key    Value1 Key2    Value2
0     B            2        B            4
1     C            3        C            5

Using indices as keys

You can also use the indices of DataFrames as keys by setting the left_index and right_index parameters to True. First, let’s create two new DataFrames with indices:

df5 = pd.DataFrame({
    'Value1': [1, 2, 3]
}, index=['A', 'B', 'C'])
df6 = pd.DataFrame({
    'Value2': [4, 5, 6]
}, index=['B', 'C', 'D'])
print(df5)
print(df6)

python

Here are the DataFrames:

Now, we can perform a JOIN operation using the indices:

# JOIN with indices
result = pd.merge(df5, df6, how='inner', left_index=True, right_index=True)
print(result)

python

The result is a JOIN that uses the indices from the DataFrames:

Value1  Value2
B        2        4
C        3        5

Was this article helpful?

What is Pandas fillna() and how to use it

The Pandas fillna() method is a function used to handle missing values. Various parameters can be used with the function, offering flexibility when replacing NaN values. In this article, we’ll take a look at this function, its syntax and parameters and how to customize…

Python Pandas

How to identify missing values with the pandas isna() function

The pandas isna() function is a useful tool for identifying missing data in a DataFrame. With its simple syntax, it quickly gives you a clear overview of missing values, helping you take action when data needs to be cleaned. In this article, you’ll learn what pandas isna() is and…

Python Pandas

BEST-BACKGROUNDSShutterstock

How to search DataFrames using pandas isin()

Pandas isin() is a helpful function for data analysis. With its straightforward syntax and versatile applications, it allows you to efficiently check for values in a DataFrame. Whether you’re verifying single columns, filtering DataFrames or conducting more complex analyses with…

Python Pandas

NDAB Creativityshutterstock

What is pandas groupby() and how to use it

The pandas DataFrame.groupby() function is a powerful tool for organizing data. It allows you to group data according to specific criteria, making it easier to perform complex aggregations and transformations. By using this method effectively, you can streamline your analysis…

Python Pandas

How to filter for distinct values with pandas DataFrame[].unique()

With pandas DataFrame[].unique(), you can identify distinct values in a column of a DataFrame. It returns a numpy array, helping you to handle large datasets more efficiently. The method is especially helpful if you want to have an overview of the information in a column without…

Python Pandas

Gorodenkoffshutterstock

How to apply conditions in pandas DataFrames with where()

With pandas DataFrame.where(), you can modify data in your DataFrame using conditions. By creating conditions to determine which values to keep and which ones to replace, you can efficiently clean, extract or transform data in a DataFrame. In this article, we’ll take a look at…

Python Pandas

How to merge DataFrames with pandas merge()

What is the syntax for pandas merge()?

What parameters can be used with pandas merge?

How to use pandas merge()

INNER JOIN

OUTER JOIN

Using left_on and right_on

Using indices as keys

What is the syntax for pandas `merge()`?

What parameters can be used with pandas `merge`?

How to use pandas `merge()`

`INNER JOIN`

`OUTER JOIN`

Using `left_on` and `right_on`