Python Pandas is an open-source library specifically designed for analyzing and manipulating data. It provides programmers with data structures and functions that simplify the handling of numerical tables and time series.

What is Python Pandas used for?

The Pandas library is widely used in various areas of data processing, thanks to its extensive functions that support a range of applications:

  • Exploratory Data Analysis (EDA): Python Pandas facilitates the exploration and general understanding of data sets. With functions such as describe(), head() or info(), developers can quickly gain insights into the structure of a data set and its summary statistics.
  • Data cleansing and preprocessing: Data from diverse sources often needs to be cleansed and brought into a consistent format before it can be analyzed. Here too, Pandas offers a variety of functions for filtering or transforming data.
  • Data manipulation and transformation: The main task of Pandas is the manipulation, analysis, and transformation of data sets. Functions such as merge() or groupby() enable complex data operations.
  • Data visualization: Another practical field of application arises in combination with libraries such as Matplotlib or Seaborn, which can plot Pandas DataFrames directly as meaningful diagrams.
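
The cleansing and transformation tasks from the list above can be sketched roughly as follows. The two small tables and their column names are invented for illustration only; real data will usually need more careful handling of missing values.

```python
import pandas as pd

# Hypothetical order data with one duplicated row and one missing quantity
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "product": ["A", "B", "B", "A"],
    "quantity": [10, 5, 5, None],
})

# Hypothetical price list used as a lookup table
prices = pd.DataFrame({"product": ["A", "B"], "price": [20.0, 30.0]})

# Cleansing: drop the duplicated order, then replace the missing quantity with 0
clean = orders.drop_duplicates().fillna({"quantity": 0})

# Transformation: attach prices with merge(), then total the revenue per product
merged = clean.merge(prices, on="product", how="left")
merged["revenue"] = merged["quantity"] * merged["price"]
revenue_per_product = merged.groupby("product")["revenue"].sum()

print(revenue_per_product)
```

Filling the missing quantity with 0 is just one possible choice; dropping the row with dropna() would be equally valid, depending on what the gap means.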

Advantages of Python Pandas

Python Pandas offers numerous advantages that make it an indispensable tool for data analysts and researchers. The intuitive, easy-to-understand API ensures a high level of user-friendliness. Since the central data structures of Python Pandas, DataFrame and Series, are similar to spreadsheets, getting started is not too difficult either.
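
To make the spreadsheet analogy concrete, here is a minimal sketch of the two data structures. The values and labels are made up for illustration.

```python
import pandas as pd

# A Series is a single labelled column of values
quantities = pd.Series(
    [10, 5, 7],
    index=["Product A", "Product B", "Product C"],
    name="Quantity",
)

# A DataFrame is a table whose columns are Series sharing one index
table = pd.DataFrame({
    "Quantity": quantities,
    "Price": [20.0, 30.0, 25.0],
})

print(table)

# Selecting one column returns a Series
print(type(table["Price"]))

# Look up a single cell by row label and column name, much like a spreadsheet cell
print(table.loc["Product B", "Price"])
```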

Another key advantage of Python Pandas is its performance. Although Python is regarded as a rather slow programming language, Pandas can process even large data sets efficiently. This is because the library builds on NumPy and implements its performance-critical routines in compiled code (Cython and C), so operations on whole columns run outside the Python interpreter.
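
The following sketch contrasts a column-wise (vectorized) calculation with an equivalent row-by-row loop. The data is randomly generated for this illustration, and the snippet only checks that both approaches agree; how much faster the vectorized form is depends on the machine and data size.

```python
import numpy as np
import pandas as pd

# Synthetic data, generated only for this illustration
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "Quantity": rng.integers(1, 10, size=10_000),
    "Price": rng.uniform(5.0, 50.0, size=10_000),
})

# Vectorized: one expression, evaluated in compiled code over whole columns
revenue_vectorized = df["Quantity"] * df["Price"]

# Row by row in plain Python: same result, but every row passes through the interpreter
revenue_loop = pd.Series(
    [q * p for q, p in zip(df["Quantity"], df["Price"])],
    index=df.index,
)

print(np.allclose(revenue_vectorized, revenue_loop))
```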

Pandas supports various data formats, including CSV, Excel, and SQL databases, allowing for easy import and export from diverse sources. Its integration with other libraries in the Python ecosystem, such as NumPy or Matplotlib, further enhances its versatility and enables comprehensive data analysis and modeling.
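
As a rough illustration of moving data between formats, the sketch below reads CSV text, writes it to a SQL table, queries it back, and exports it to CSV again. The in-memory SQLite database and the table name "sales" are assumptions chosen so the snippet runs without any external files or servers.

```python
import io
import sqlite3

import pandas as pd

# Two rows of sales data as CSV text (a stand-in for a real file)
csv_text = (
    "Date,Product,Quantity,Price\n"
    "2024-01-01,Product A,10,20.00\n"
    "2024-01-02,Product B,5,30.00\n"
)

# Import: read_csv() accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Export to a SQL table in an in-memory SQLite database, then query it back
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)
from_sql = pd.read_sql("SELECT * FROM sales WHERE Quantity > 5", conn)
conn.close()

# Export to CSV text; to_csv() returns a string when no path is given
csv_out = df.to_csv(index=False)

print(from_sql)
print(csv_out)
```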

Note

If you’re experienced with other programming languages like R or database languages such as SQL, you’ll find many familiar concepts when working with Pandas.
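
For readers coming from SQL, the correspondence might look like the sketch below. The table "sales" and its contents are hypothetical, and the SQL statements in the comments are approximate equivalents rather than exact translations.

```python
import pandas as pd

# Hypothetical table, invented for this comparison
sales = pd.DataFrame({
    "Product": ["Product A", "Product B", "Product A", "Product C"],
    "Quantity": [10, 5, 3, 7],
})

# SELECT Product, Quantity FROM sales WHERE Quantity > 4
filtered = sales.loc[sales["Quantity"] > 4, ["Product", "Quantity"]]

# SELECT Product, SUM(Quantity) AS Total FROM sales
# GROUP BY Product ORDER BY Total DESC
totals = (
    sales.groupby("Product", as_index=False)
    .agg(Total=("Quantity", "sum"))
    .sort_values("Total", ascending=False)
)

print(filtered)
print(totals)
```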

A practical example of the Pandas syntax

To illustrate the basic syntax of Pandas, let’s look at a simple example. Suppose we have a CSV data set that contains information about sales. We’ll load this data set, examine it, and perform some basic data manipulation. The data set is structured as follows:

Date,Product,Quantity,Price
2024-01-01,Product A,10,20.00
2024-01-02,Product B,5,30.00
2024-01-03,Product C,7,25.00
2024-01-04,Product A,3,20.00
2024-01-05,Product B,6,30.00
2024-01-06,Product C,2,25.00
2024-01-07,Product A,8,20.00
2024-01-08,Product B,4,30.00
2024-01-09,Product C,10,25.00

Step 1: Importing pandas and loading the data set

Once Python Pandas has been imported, you can create a DataFrame from the CSV data using read_csv().

import pandas as pd
# Load the data set from a CSV file named sales_data.csv
df = pd.read_csv('sales_data.csv')
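
If no sales_data.csv file is at hand, the same DataFrame can be built directly from the text of the data set. This is a minimal sketch that assumes the sample rows are pasted into a string; io.StringIO then stands in for the file.

```python
import io

import pandas as pd

# The sample data set from this article, embedded as a string
csv_text = """Date,Product,Quantity,Price
2024-01-01,Product A,10,20.00
2024-01-02,Product B,5,30.00
2024-01-03,Product C,7,25.00
2024-01-04,Product A,3,20.00
2024-01-05,Product B,6,30.00
2024-01-06,Product C,2,25.00
2024-01-07,Product A,8,20.00
2024-01-08,Product B,4,30.00
2024-01-09,Product C,10,25.00
"""

# read_csv() accepts file-like objects as well as file paths
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # column types inferred by Pandas
```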

Step 2: Examining the data set

An initial overview of the data can be obtained by displaying the first rows and a statistical summary of the data set. The functions head() and describe() are used for this purpose. The latter summarizes the numeric columns with key statistics such as the minimum and maximum value, the standard deviation, and the mean.

# Display the first five rows of the DataFrame
print(df.head())
# Display a statistical summary
print(df.describe())
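
Because describe() itself returns a DataFrame, individual statistics can be picked out of the summary. The sketch below rebuilds the Product, Quantity and Price columns of the sample data inline so that it runs on its own.

```python
import pandas as pd

# Product, Quantity and Price columns of the sample data, rebuilt inline
df = pd.DataFrame({
    "Product": ["Product A", "Product B", "Product C"] * 3,
    "Quantity": [10, 5, 7, 3, 6, 2, 8, 4, 10],
    "Price": [20.0, 30.0, 25.0] * 3,
})

# By default, describe() covers only the numeric columns
summary = df.describe()

# Read single statistics from the summary with .loc
print(summary.loc["mean", "Quantity"])
print(summary.loc[["min", "max"], "Price"])

# Non-numeric columns can be summarized with value_counts()
print(df["Product"].value_counts())
```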

Step 3: Manipulating the data

Pandas can also be used to manipulate data. The following code snippet aggregates the sales data by product and month:

# Convert the “Date” column into a datetime object so that the dates are recognized as such
df['Date'] = pd.to_datetime(df['Date'])
# Extract the month from the “Date” column and save it in a new column called “Month”
df['Month'] = df['Date'].dt.month
# Calculate the revenue (Quantity * Price) and save it in the column called “Revenue”
df['Revenue'] = df['Quantity'] * df['Price']
# Aggregate sales data by product and month
sales_summary = df.groupby(['Product', 'Month'])['Revenue'].sum().reset_index()
# Display aggregated data
print(sales_summary)
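
As a variation on the aggregation above, several columns can be summarized in one step using named aggregation. This sketch rebuilds the sample data inline; the output column name "Units" is an assumption introduced here for illustration.

```python
import pandas as pd

# The sample data, rebuilt inline so the snippet runs on its own
df = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=9, freq="D"),
    "Product": ["Product A", "Product B", "Product C"] * 3,
    "Quantity": [10, 5, 7, 3, 6, 2, 8, 4, 10],
    "Price": [20.0, 30.0, 25.0] * 3,
})
df["Month"] = df["Date"].dt.month
df["Revenue"] = df["Quantity"] * df["Price"]

# Named aggregation: each keyword defines an output column as (input column, function)
summary = df.groupby(["Product", "Month"], as_index=False).agg(
    Units=("Quantity", "sum"),
    Revenue=("Revenue", "sum"),
)

print(summary)
```

With as_index=False the grouping keys stay ordinary columns, so no separate reset_index() call is needed.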

Step 4: Visualizing the data

Finally, you can visualize the monthly revenue of a product using the additional Python library Matplotlib.

import matplotlib.pyplot as plt
# Filter data for a specific product
product_sales = sales_summary[sales_summary['Product'] == 'Product A']
# Create a line diagram 
plt.plot(product_sales['Month'], product_sales['Revenue'], marker='o')
plt.xlabel('Month')
plt.gca().set_xticks(product_sales['Month'])
plt.ylabel('Revenue')
plt.title('Monthly revenue for Product A')
plt.grid(True)
plt.show()

With the sample data above, the diagram shows that product A generated $420 in revenue in the first month of the year:

Image: Plot Python Pandas data
Python Pandas data can be easily plotted in combination with other libraries.
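
Since the sample data covers only January, a comparison across products may be more informative than a line over months. The sketch below uses the plot() method that Pandas objects provide on top of Matplotlib. The per-product totals are entered by hand from the sample rows (Quantity * Price summed per product), and the non-interactive "Agg" backend and in-memory buffer are assumptions made so the snippet runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is required
import pandas as pd

# January revenue per product, calculated from the sample data set
totals = pd.Series(
    {"Product A": 420.0, "Product B": 450.0, "Product C": 475.0},
    name="Revenue",
)

# Series.plot() delegates to Matplotlib and returns the Axes object
ax = totals.plot(kind="bar", rot=0)
ax.set_xlabel("Product")
ax.set_ylabel("Revenue")
ax.set_title("Revenue per product, January 2024")

# Render the chart to an in-memory PNG instead of opening a window
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
```

In an interactive session, plt.show() from matplotlib.pyplot would display the chart instead of saving it.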