What is pandas groupby() and how to use it
With the Python pandas DataFrame.groupby()
function, you can group data based on specific criteria and perform various aggregations and transformations to the data.
- 99.9% uptime
- PHP 8.3 with JIT compiler
- SSL, DDoS protection, and backups
What is the syntax for pandas DataFrame.groupby()
?
Pandas groupby()
accepts up to four parameters. The basic syntax is as follows:
DataFrame.groupby(by=None, level=None, as_index=True, sort=True, group_keys=True, dropna=True)
pythonImportant parameters for groupby
Parameter | Description | Default Value |
---|---|---|
by
|
Key or Python list of keys to group by; not to be combined with level
|
None
|
level
|
Used for MultiIndex to specify one or more levels for grouping | None
|
as_index
|
If True , the group keys are set as the index of the resulting DataFrame
|
True
|
group_keys
|
If True , the group keys are included in the index of the groups
|
True
|
dropna
|
Specifies whether to exclude groups with NaN values | True
|
How to use pandas DataFrame.groupby()
The pandas groupby()
function is particularly useful for analyzing and summarizing large datasets, helping to identify patterns or anomalies.
Grouping and aggregating
Below is an example sales dataset containing information about the sale date, product sold and quantity sold:
import pandas as pd
# Sample sales dataset
data = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03'],
'Product': ['A', 'B', 'A', 'B', 'A'],
'Quantity': [10, 20, 15, 25, 10]
}
df = pd.DataFrame(data)
print(df)
pythonThe resulting DataFrame looks like this:
Date Product Quantity
0 2021-01-01 A 10
1 2021-01-01 B 20
2 2021-01-02 A 15
3 2021-01-02 B 25
4 2021-01-03 A 10
Next, we’ll group the dataset by product using pandas groupby()
. Then, we’ll calculate the total quantity sold for each product using the sum()
function:
# Group by product and calculate the sum of the quantity sold
summe = df.groupby('Product')['Quantity'].sum()
print(total)
The result shows the total number of units sold for each product:
Product
A 35
B 45
Name: Quantity, dtype: int64
Multiple aggregations
In the following example, we’re going to use an extended dataset that also includes revenue:
data = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03'],
'Product': ['A', 'B', 'A', 'B', 'A'],
'Quantity': [10, 20, 15, 25, 10],
'Revenue': [100, 200, 150, 250, 100]
}
df = pd.DataFrame(data)
print(df)
pythonThe DataFrame looks like this:
Date Product Quantity Revenue
0 2021-01-01 A 10 100
1 2021-01-01 B 20 200
2 2021-01-02 A 15 150
3 2021-01-02 B 25 250
4 2021-01-03 A 10 100
Using pandas DataFrame.groupby()
, we’re going to group the data by product and then use the agg()
function to calculate the total quantity and revenue, as well as the average revenue per product.
# Group by product and apply multiple aggregations
groups = df.groupby('Product').agg({
'Quantity': 'sum',
'Revenue': ['sum', 'mean']
})
print(groups)
Here’s the result:
Quantity Revenue
sum sum mean
Product
A 35 350 116.666667
B 45 450 225.000000