Data Cleaning and Manipulation in Pandas:

May 04, 2023

Handling missing values:

Missing values are common in real-world datasets. Pandas provides several functions for handling them: isna() detects missing values, fillna() replaces them, and dropna() removes the rows or columns that contain them.

For example:

python
import pandas as pd 
data = {'Name': ['John', 'Alice', 'Bob', 'Mary'], 'Age': [25, 30, None, 35]} 
df = pd.DataFrame(data) 
df.fillna(0, inplace=True) 
print(df)

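This fills every missing value in the DataFrame with 0. Because pandas stores the missing age as NaN, the 'Age' column is floating-point, so the filled value is displayed as 0.0.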
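
The other two functions mentioned above, isna() and dropna(), can be sketched on the same data. This is a minimal illustration rather than a recommended cleaning recipe; filling with the column mean is an assumption made for the example, and the variable names are illustrative.

```python
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Mary'], 'Age': [25, 30, None, 35]}
df = pd.DataFrame(data)

# isna() returns a boolean DataFrame; sum() then counts missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# dropna() returns a new DataFrame without the rows that contain a missing value
dropped = df.dropna()
print(dropped)

# fillna() also accepts a dict, so only 'Age' is filled here.
# Using the column mean is an illustrative choice, not a general rule.
filled = df.fillna({'Age': df['Age'].mean()})
print(filled)
```

None of these calls modifies df, since each returns a new object. Whether to drop or fill depends on what the missing value means in the dataset.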

Data filtering:

Data filtering is the process of selecting rows or columns that satisfy a condition. Pandas offers several ways to filter data, including boolean indexing, the query() method, and the loc[] and iloc[] indexers.

For example:

python
import pandas as pd 
data = {'Name': ['John', 'Alice', 'Bob', 'Mary'], 'Age': [25, 30, 35, 40]} 
df = pd.DataFrame(data) 
filtered_data = df[df['Age'] > 30] 
print(filtered_data)

This keeps only the rows where 'Age' is greater than 30 (Bob and Mary) and discards the rows where it is less than or equal to 30.
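
The same kind of filter can also be written with query(), loc[], and iloc[]. The sketch below reuses the example data; the variable names are illustrative.

```python
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Mary'], 'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)

# query() takes the condition as a string
by_query = df.query('Age > 30')

# loc[] selects by label or boolean mask and can pick columns at the same time
names_over_30 = df.loc[df['Age'] > 30, 'Name']

# iloc[] selects by integer position: the first two rows, all columns
first_two = df.iloc[:2]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
between = df[(df['Age'] >= 30) & (df['Age'] < 40)]

print(by_query)
print(names_over_30)
print(first_two)
print(between)
```

Boolean indexing and query() give the same rows here; query() tends to read better when the condition is long.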

Data transformation:

Data transformation is the process of converting data from one form to another. Pandas provides various functions to transform data, such as apply(), map(), and replace().

For example:

python
import pandas as pd 
data = {'Name': ['John', 'Alice', 'Bob', 'Mary'], 'Age': [25, 30, 35, 40]} 
df = pd.DataFrame(data) 
df['Gender'] = ['Male', 'Female', 'Male', 'Female'] 
df['Age'] = df['Age'].apply(lambda x: x + 10) 
print(df)

This will add a new column 'Gender' to the DataFrame and increase the age of all individuals by 10.
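
map() and replace(), also listed above, might be used as follows. This is a small sketch on the same data; the 'GenderCode' and 'AgePlus10' columns and the 'Bob' to 'Robert' substitution are made up for illustration.

```python
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Mary'], 'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)
df['Gender'] = ['Male', 'Female', 'Male', 'Female']

# map() translates each value of a Series through a dict (or a function)
df['GenderCode'] = df['Gender'].map({'Male': 'M', 'Female': 'F'})

# replace() swaps the listed values and leaves all others unchanged
df['Name'] = df['Name'].replace({'Bob': 'Robert'})

# For simple arithmetic, a vectorized expression does the same job as apply()
df['AgePlus10'] = df['Age'] + 10

print(df)
```

One difference worth knowing: map() with a dict turns any value missing from the dict into NaN, while replace() leaves unmatched values as they are.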

Data merging and joining:

Data merging and joining are the processes of combining data from different sources into a single DataFrame. Pandas provides various functions to merge and join data, such as merge(), concat(), and join().

For example:

python
import pandas as pd 
data1 = {'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 35]}
data2 = {'Name': ['Alice', 'Bob', 'Mary'], 'Salary': [50000, 60000, 70000]} 
df1 = pd.DataFrame(data1) 
df2 = pd.DataFrame(data2) 
merged_data = pd.merge(df1, df2, on='Name') 
print(merged_data)

This merges the two DataFrames on the 'Name' column. merge() performs an inner join by default, so the result contains the columns 'Name', 'Age', and 'Salary' but only the rows for Alice and Bob, the names present in both DataFrames.
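
To keep rows that appear in only one DataFrame, the how parameter of merge() can be changed, and concat() and join() cover the other cases named above. The sketch below assumes the same two DataFrames; more_people is a hypothetical extra table added only for the concat() example.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Mary'], 'Salary': [50000, 60000, 70000]})

# how='outer' keeps every name from both sides; missing cells become NaN
outer = pd.merge(df1, df2, on='Name', how='outer')

# how='left' keeps every row of df1, with NaN where df2 has no match
left = pd.merge(df1, df2, on='Name', how='left')

# concat() stacks DataFrames that share the same columns
more_people = pd.DataFrame({'Name': ['Mary'], 'Age': [40]})  # hypothetical data
stacked = pd.concat([df1, more_people], ignore_index=True)

# join() combines on the index, so 'Name' is moved into the index first
joined = df1.set_index('Name').join(df2.set_index('Name'))

print(outer)
print(left)
print(stacked)
print(joined)
```

join() defaults to a left join on the index, so joined has the same three people as left.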

Module Title: From VBA to Pandas: A Comprehensive Guide for Data Analysts