Aggregation and Groupby Operations

May 04, 2023

Aggregation and Groupby Operations are important aspects of data analysis with Pandas.

Aggregation functions allow us to perform mathematical operations such as sum, mean, median, and others on our data. Groupby operations allow us to group our data based on a specific column and perform operations on each group. Pivot tables allow us to summarize our data and perform calculations on subsets of our data.

Aggregation functions:

Pandas provides a set of built-in functions for aggregation such as sum, mean, median, min, max, etc. These functions can be applied to Series or DataFrame objects.

Example:

python
import pandas as pd 
 df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})
print(df) 
# Sum of all columns 
print(df.sum()) 
# Mean of all columns 
print(df.mean()) 
# Maximum value of each column 
print(df.max())

Output:

less
  A  B  C 
0 1  5  9 
1  2   6   10 
2  3  7   11 
3 4   8  12 
A    10 
B    26 
C    42 
dtype: int64 
A    2.5 
B    6.5 
C    10.5 
dtype: float64 
A     4 
B     8 
C    12 
dtype: int64

Groupby operations:

Groupby is a powerful method in Pandas that allows us to group data based on one or more columns and apply an aggregation function.

Example:

scss
import pandas as pd 
 data = {'Name': ['John', 'Sam', 'John', 'Adam', 'Sam', 'Adam'],
        'Subject': ['Maths', 'Physics', 'Maths', 'Chemistry', 'Physics', 'Chemistry'],
        'Marks': [80, 70, 90, 85, 75, 95]} 
 df = pd.DataFrame(data) 
print(df) 
 # Group by 'Name' column and find mean of 'Marks' 
grouped_df = df.groupby('Name')['Marks'].mean() 
print(grouped_df) 
 # Group by 'Subject' column and find maximum 'Marks' 
grouped_df = df.groupby('Subject')['Marks'].max() 
print(grouped_df)

Output:

yaml
  Name    Subject  Marks 
0   John     Maths    80 
1  Sam   Physics    70 
2   John     Maths    90 
3   Adam  Chemistry    85 
4  Sam   Physics    75 
5   Adam  Chemistry    95 
Name 
Adam    90.0 
John    85.0 
Sam     72.5 
Name: Marks, dtype: float64 
  Subject 
Chemistry    95 
    Maths        90 
  Physics      75 
Name: Marks, dtype: int64

Pivot tables:

Pivot tables allow us to summarize our data and perform calculations on subsets of our data.

Here's an example of using pivot tables in Pandas:

python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Chicago'],
        'Sales': [500, 200, 300, 100, 400, 600]} 
 df = pd.DataFrame(data) 
# Create pivot table to calculate total sales by city 
pivot_table = df.pivot_table(values='Sales', index='City', aggfunc='sum')

print(pivot_table)

Output:

markdown
            Sales  
  City 
 Chicago      900 
 Houston      100 
Los Angeles   200 
New York      900

In this example, we have a DataFrame with sales data for different cities. We use the pivot_table() function to create a summary of total sales by city, using the values, index, and aggfunc parameters. The resulting pivot table shows the total sales for each city.

Aggregation and Groupby on DataFrame:

Here are some examples of aggregation and groupby operations in Pandas using a sample DataFrame

python
import pandas as pd

# create a sample dataframe
data = {'Country': ['USA', 'Canada', 'USA', 'Canada', 'Mexico', 'Mexico'],
        'Year': [2018, 2018, 2019, 2019, 2018, 2019],
        'Sales': [100, 200, 150, 250, 300, 200]} 
df = pd.DataFrame(data) 
# aggregation function to find total sales per country 
total_sales = df.groupby('Country')['Sales'].sum()
print(total_sales) 
# groupby operation to find average sales per year for each country 
avg_sales = df.groupby(['Country', 'Year'])['Sales'].mean()
print(avg_sales) 
# pivot table to summarize sales by country and year 
pivot_sales = df.pivot_table(index='Country', columns='Year', values='Sales', aggfunc='sum') 
print(pivot_sales)

Output:

yaml
Country 
Canada     450 
Mexico     500 
   USA        250 
Name: Sales, dtype: int64 
Country  Year 
 Canada  2018    200
         2019    250 
 Mexico  2018    300
         2019    200 
    USA  2018    100
         2019    150 
Name: Sales, dtype: float64 
  Year  2018  2019 
Country 
 Canada   200  250 
Mexico    300  200 
   USA    100  150

Search This Blog

Module Title: From VBA to Pandas: A Comprehensive Guide for Data Analysts