Aggregation and Groupby Operations
Aggregation and Groupby Operations are important aspects of data analysis with Pandas.
Aggregation functions allow us to perform mathematical operations such as sum, mean, median, and others on our data. Groupby operations allow us to group our data based on a specific column and perform operations on each group. Pivot tables allow us to summarize our data and perform calculations on subsets of our data.
- Aggregation functions:
Pandas provides a set of built-in functions for aggregation such as sum, mean, median, min, max, etc. These functions can be applied to Series or DataFrame objects.Example:
pythonimport pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})
print(df)
# Sum of all columns
print(df.sum())
# Mean of all columns
print(df.mean())
# Maximum value of each column
print(df.max())
Output:
less A B C
0 1 5 9
1 2 6 10
2 3 7 11
3
4 8 12 A 10
B 26
C 42
dtype: int64
A 2.5
B 6.5
C 10.5
dtype: float64
A 4
B 8
C 12
dtype: int64
- Groupby operations:
Groupby is a powerful method in Pandas that allows us to group data based on one or more columns and apply an aggregation function.Example:
scssimport pandas as pd
data = {'Name': ['John', 'Sam', 'John', 'Adam', 'Sam', 'Adam'],
'Subject': ['Maths', 'Physics', 'Maths', 'Chemistry', 'Physics', 'Chemistry'],
'Marks': [80, 70, 90, 85, 75, 95]}
df = pd.DataFrame(data)
print(df)
# Group by 'Name' column and find mean of 'Marks'
grouped_df = df.groupby('Name')['Marks'].mean()
print(grouped_df)
# Group by 'Subject' column and find maximum 'Marks'
grouped_df = df.groupby('Subject')['Marks'].max()
print(grouped_df)
Output:
yaml Name Subject Marks
0 John Maths 80
1 Sam Physics 70
2 John Maths 90
3 Adam Chemistry 85
4 Sam Physics 75
5 Adam Chemistry 95
Name
Adam 90.0
John 85.0
Sam 72.5
Name: Marks,
dtype: float64 Subject
Chemistry 95
Maths 90
Physics 75
Name: Marks,
dtype: int64
- Pivot tables:
Pivot tables allow us to summarize our data and perform calculations on subsets of our data.Here's an example of using pivot tables in Pandas:
pythonimport pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Chicago'],
'Sales': [500, 200, 300, 100, 400, 600]}
df = pd.DataFrame(data)
# Create pivot table to calculate total sales by city
pivot_table = df.pivot_table(values='Sales', index='City', aggfunc='sum')
print(pivot_table)
Output:
markdown Sales
City
Chicago 900
Houston 100
Los Angeles 200
New York 900
In this example, we have a DataFrame with sales data for different cities. We use the pivot_table()
function to create a summary of total sales by city, using the values
, index
, and aggfunc
parameters. The resulting pivot table shows the total sales for each city.
Aggregation and Groupby on DataFrame:
Here are some examples of aggregation and groupby operations in Pandas using a sample DataFrame
pythonimport pandas as pd
# create a sample dataframe
data = {'Country': ['USA', 'Canada', 'USA', 'Canada', 'Mexico', 'Mexico'],
'Year': [2018, 2018, 2019, 2019, 2018, 2019],
'Sales': [100, 200, 150, 250, 300, 200]}
df = pd.DataFrame(data)
# aggregation function to find total sales per country
total_sales = df.groupby('Country')['Sales'].sum()
print(total_sales)
# groupby operation to find average sales per year for each country
avg_sales = df.groupby(['Country', 'Year'])['Sales'].mean()
print(avg_sales)
# pivot table to summarize sales by country and year
pivot_sales = df.pivot_table(index='Country', columns='Year', values='Sales', aggfunc='sum')
print(pivot_sales)
Output:
yamlCountry
Canada 450
Mexico 500
USA 250
Name: Sales, dtype: int64
Country Year
Canada 2018 200
2019 250
Mexico 2018 300
2019 200
USA 2018 100
2019 150
Name: Sales, dtype: float64
Year 2018 2019
Country
Canada 200 250
Mexico 300 200
USA 100 150
Comments
Post a Comment