With the Python pandas DataFrame.groupby() function, you can group data based on specific criteria and perform various aggregations and transformations to the data.

Web Hosting
Fast, scalable hosting for any website
  • 99.9% uptime
  • PHP 8.3 with JIT compiler
  • SSL, DDoS protection, and backups

What is the syntax for pandas DataFrame.groupby()?

Pandas groupby() accepts up to four parameters. The basic syntax is as follows:

DataFrame.groupby(by=None, level=None, as_index=True, sort=True, group_keys=True, dropna=True)
python

Important parameters for groupby

Parameter Description Default Value
by Key or Python list of keys to group by; not to be combined with level None
level Used for MultiIndex to specify one or more levels for grouping None
as_index If True, the group keys are set as the index of the resulting DataFrame True
group_keys If True, the group keys are included in the index of the groups True
dropna Specifies whether to exclude groups with NaN values True

How to use pandas DataFrame.groupby()

The pandas groupby() function is particularly useful for analyzing and summarizing large datasets, helping to identify patterns or anomalies.

Grouping and aggregating

Below is an example sales dataset containing information about the sale date, product sold and quantity sold:

import pandas as pd
# Sample sales dataset
data = {
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Quantity': [10, 20, 15, 25, 10]
}
df = pd.DataFrame(data)
print(df)
python

The resulting DataFrame looks like this:

Date Product  Quantity
0  2021-01-01       A       10
1  2021-01-01       B       20
2  2021-01-02       A       15
3  2021-01-02       B       25
4  2021-01-03       A       10

Next, we’ll group the dataset by product using pandas groupby(). Then, we’ll calculate the total quantity sold for each product using the sum() function:

# Group by product and calculate the sum of the quantity sold
summe = df.groupby('Product')['Quantity'].sum()
print(total)

The result shows the total number of units sold for each product:

Product
A    35
B    45
Name: Quantity, dtype: int64

Multiple aggregations

In the following example, we’re going to use an extended dataset that also includes revenue:

data = {
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Quantity': [10, 20, 15, 25, 10],
    'Revenue': [100, 200, 150, 250, 100]
}
df = pd.DataFrame(data)
print(df)
python

The DataFrame looks like this:

Date Product  Quantity  Revenue
0  2021-01-01       A       10      100
1  2021-01-01       B       20      200
2  2021-01-02       A       15      150
3  2021-01-02       B       25      250
4  2021-01-03       A       10      100

Using pandas DataFrame.groupby(), we’re going to group the data by product and then use the agg() function to calculate the total quantity and revenue, as well as the average revenue per product.

# Group by product and apply multiple aggregations
groups = df.groupby('Product').agg({
    'Quantity': 'sum',
    'Revenue': ['sum', 'mean']
})
print(groups)

Here’s the result:

Quantity Revenue        
          sum    sum    mean
Product                  
A          35    350  116.666667
B          45    450  225.000000
Was this article helpful?
Go to Main Menu