Aggregation and Groupby Operations

Aggregation and Groupby Operations are important aspects of data analysis with Pandas.

Aggregation functions allow us to perform mathematical operations such as sum, mean, median, and others on our data. Groupby operations allow us to group our data based on a specific column and perform operations on each group. Pivot tables allow us to summarize our data and perform calculations on subsets of our data.

  1. Aggregation functions:

Pandas provides a set of built-in functions for aggregation such as sum, mean, median, min, max, etc. These functions can be applied to Series or DataFrame objects.

Example:

python
import pandas as pd 
 df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}) print(df) 
# Sum of all columns 
print(df.sum()) 
# Mean of all columns 
print(df.mean()) 
# Maximum value of each column 
print(df.max())

Output:

less
  A  B  C 
0 1  5  9 
1 10 
2 3   11 
3  8  12 
A 10 
B 26 
C 42 
dtype: int64 
A 2.5 
B 6.5 
C 10.5 
dtype: float64 
A 4 
B 8 
C 12 
dtype: int64

  1. Groupby operations:

Groupby is a powerful method in Pandas that allows us to group data based on one or more columns and apply an aggregation function.

Example:

scss
import pandas as pd 
 data = {'Name': ['John', 'Sam', 'John', 'Adam', 'Sam', 'Adam'], 'Subject': ['Maths', 'Physics', 'Maths', 'Chemistry', 'Physics', 'Chemistry'], 'Marks': [80, 70, 90, 85, 75, 95]} 
 df = pd.DataFrame(data) 
print(df) 
 # Group by 'Name' column and find mean of 'Marks
grouped_df = df.groupby('Name')['Marks'].mean() 
print(grouped_df) 
 # Group by 'Subject' column and find maximum 'Marks
grouped_df = df.groupby('Subject')['Marks'].max() 
print(grouped_df)

Output:

yaml
  Name    Subject  Marks 
0 John     Maths    80 
1  Sam   Physics    70 
2 John     Maths    90 
3 Adam Chemistry    85 
4  Sam   Physics    75 
5 Adam Chemistry    95 
Name 
Adam 90.0 
John 85.0 
Sam 72.5 
Name: Marks, dtype: float64 
  Subject 
Chemistry 95 
    Maths 90 
  Physics 75 
Name: Marks, dtype: int64

  1. Pivot tables:

Pivot tables allow us to summarize our data and perform calculations on subsets of our data.

Here's an example of using pivot tables in Pandas:

python
import pandas as pd data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'], 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Chicago'], 'Sales': [500, 200, 300, 100, 400, 600]} 
 df = pd.DataFrame(data) 
# Create pivot table to calculate total sales by city 
pivot_table = df.pivot_table(values='Sales', index='City', aggfunc='sum') print(pivot_table)

Output:

markdown
            Sales  
  City 
 Chicago      900 
 Houston      100 
Los Angeles   200 
New York      900

In this example, we have a DataFrame with sales data for different cities. We use the pivot_table() function to create a summary of total sales by city, using the values, index, and aggfunc parameters. The resulting pivot table shows the total sales for each city.

Aggregation and Groupby on DataFrame:

Here are some examples of aggregation and groupby operations in Pandas using a sample DataFrame

python
import pandas as pd # create a sample dataframe data = {'Country': ['USA', 'Canada', 'USA', 'Canada', 'Mexico', 'Mexico'], 'Year': [2018, 2018, 2019, 2019, 2018, 2019], 'Sales': [100, 200, 150, 250, 300, 200]} 
df = pd.DataFrame(data) 
# aggregation function to find total sales per country 
total_sales = df.groupby('Country')['Sales'].sum() print(total_sales) 
# groupby operation to find average sales per year for each country 
avg_sales = df.groupby(['Country', 'Year'])['Sales'].mean() print(avg_sales) 
# pivot table to summarize sales by country and year 
pivot_sales = df.pivot_table(index='Country', columns='Year', values='Sales', aggfunc='sum'
print(pivot_sales)

Output:

yaml
Country 
Canada 450 
Mexico 500 
   USA 250 
Name: Sales, dtype: int64 
Country  Year 
 Canada  2018 200 2019 250 
 Mexico  2018 300 2019 200 
    USA  2018 100 2019 150 
Name: Sales, dtype: float64 
  Year  2018  2019 
Country 
 Canada   200  250 
Mexico    300  200 
   USA    100  150

Comments

Popular posts from this blog

How to use the statsmodels library in Python to calculate Exponential Smoothing

K-means Clustering 3D Plot Swiss roll Dataset

How to detect Credit Card Fraud Using Python Pandas