Aggregation and Groupby Operations
Aggregation and Groupby Operations are important aspects of data analysis with Pandas.
Aggregation functions allow us to perform mathematical operations such as sum, mean, median, and others on our data. Groupby operations allow us to group our data based on a specific column and perform operations on each group. Pivot tables allow us to summarize our data and perform calculations on subsets of our data.
- Aggregation functions:
Pandas provides a set of built-in functions for aggregation such as sum, mean, median, min, max, etc. These functions can be applied to Series or DataFrame objects.Example:
pythonimport pandas as pd df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})
print(df) # Sum of all columns print(df.sum()) # Mean of all columns print(df.mean()) # Maximum value of each column print(df.max())
Output:
less A B C 0 1 5 9 1 2 6 10 2 3 7 11 3 4 8 12 A 10 B 26 C 42 dtype: int64 A 2.5 B 6.5 C 10.5 dtype: float64 A 4 B 8 C 12 dtype: int64
- Groupby operations:
Groupby is a powerful method in Pandas that allows us to group data based on one or more columns and apply an aggregation function.Example:
scssimport pandas as pd data = {'Name': ['John', 'Sam', 'John', 'Adam', 'Sam', 'Adam'],
'Subject': ['Maths', 'Physics', 'Maths', 'Chemistry', 'Physics', 'Chemistry'],
'Marks': [80, 70, 90, 85, 75, 95]} df = pd.DataFrame(data) print(df) # Group by 'Name' column and find mean of 'Marks' grouped_df = df.groupby('Name')['Marks'].mean() print(grouped_df) # Group by 'Subject' column and find maximum 'Marks' grouped_df = df.groupby('Subject')['Marks'].max() print(grouped_df)
Output:
yaml Name Subject Marks 0 John Maths 80 1 Sam Physics 70 2 John Maths 90 3 Adam Chemistry 85 4 Sam Physics 75 5 Adam Chemistry 95 Name Adam 90.0 John 85.0 Sam 72.5 Name: Marks, dtype: float64 Subject Chemistry 95 Maths 90 Physics 75 Name: Marks, dtype: int64- Pivot tables:
Pivot tables allow us to summarize our data and perform calculations on subsets of our data.Here's an example of using pivot tables in Pandas:
pythonimport pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Chicago'],
'Sales': [500, 200, 300, 100, 400, 600]} df = pd.DataFrame(data) # Create pivot table to calculate total sales by city pivot_table = df.pivot_table(values='Sales', index='City', aggfunc='sum')
print(pivot_table)
Output:
markdown Sales City Chicago 900 Houston 100 Los Angeles 200 New York 900
In this example, we have a DataFrame with sales data for different cities. We use the pivot_table() function to create a summary of total sales by city, using the values, index, and aggfunc parameters. The resulting pivot table shows the total sales for each city.
Aggregation and Groupby on DataFrame:
Here are some examples of aggregation and groupby operations in Pandas using a sample DataFrame
pythonimport pandas as pd
# create a sample dataframe
data = {'Country': ['USA', 'Canada', 'USA', 'Canada', 'Mexico', 'Mexico'],
'Year': [2018, 2018, 2019, 2019, 2018, 2019],
'Sales': [100, 200, 150, 250, 300, 200]} df = pd.DataFrame(data) # aggregation function to find total sales per country total_sales = df.groupby('Country')['Sales'].sum()
print(total_sales) # groupby operation to find average sales per year for each country avg_sales = df.groupby(['Country', 'Year'])['Sales'].mean()
print(avg_sales) # pivot table to summarize sales by country and year pivot_sales = df.pivot_table(index='Country', columns='Year', values='Sales', aggfunc='sum') print(pivot_sales)
Output:
yamlCountry Canada 450 Mexico 500 USA 250 Name: Sales, dtype: int64 Country Year Canada 2018 200
2019 250 Mexico 2018 300
2019 200 USA 2018 100
2019 150 Name: Sales, dtype: float64 Year 2018 2019 Country Canada 200 250 Mexico 300 200 USA 100 150
Comments
Post a Comment