Data Cleaning Strategies

Issue #24 | Aug 23

Aug 23, 2024

Hello!!
Welcome to the new edition of Business Analytics Review!

We hope you enjoyed our previous newsletter on Exploratory Data Analysis! It was all about summarizing the main characteristics of a dataset with visual methods and discovering patterns.

In this edition, we’re diving into the world of Data Cleaning Strategies. Data cleaning strategies refer to the methods and techniques used to identify, correct, or remove inaccuracies, inconsistencies, and errors in data. Data cleaning allows you to improve the quality of data, making it accurate, complete, and ready for analysis or processing.

Data Cleaning

Handling Missing Values: Tackling empty cells.
Outlier Detection: Identifying and dealing with unusual data points.
Data Imputation: Filling in missing values with estimated values.
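
To make these three ideas concrete, here is a minimal sketch in pandas. The sample ages are made up for illustration, and the 1.5 * IQR rule is just one common convention for flagging outliers; your data may call for a different threshold or method.

```python
import pandas as pd

# Hypothetical sample data: one empty cell and one implausibly large value
ages = pd.Series([25, 27, 29, None, 31, 33, 120], name="Age")

# Handling missing values: count the empty cells first
missing_count = int(ages.isna().sum())

# Outlier detection: flag points more than 1.5 * IQR outside the quartiles
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (ages < lower) | (ages > upper)

# Data imputation: fill the gap with the median of the non-outlier values
typical_median = ages[~is_outlier].median()
imputed = ages.fillna(typical_median)

print(f"Missing values: {missing_count}")
print(f"Outliers: {ages[is_outlier].tolist()}")
print(imputed)
```

Whether a flagged outlier should be removed, capped, or kept depends on the business context, so treat the flag as a prompt to investigate rather than an automatic delete.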

Recommended Reads on Data Cleaning

A comprehensive guide on using Power Query in Excel to clean and organize a messy dataset, specifically one derived from FIFA 21
Read More

A comprehensive guide to best practices for data cleaning in R, working with data on Behavioral Risk Factors such as Nutrition, Physical Activity, and Obesity
Read More

Python's extensive library ecosystem offers numerous tools and utilities for data cleaning and preprocessing, allowing data scientists to streamline their data analysis workflow and prepare datasets for machine learning tasks efficiently
Read More

The data cleaning tools market is expected to witness significant growth in the coming years, driven by the increasing need for accurate and reliable data across various industries
Read More

Python code for data cleaning

import pandas as pd

# Sample data
data = {
    'Name': ['John Doe', 'Jane Smith', None, 'John Doe'],
    'Age': [28, None, 35, 28],
    'Salary': [50000, 54000, None, 50000]
}

# Create a DataFrame
df = pd.DataFrame(data)

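# 1. Fill missing 'Age' with the median value
# (assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update df)
df['Age'] = df['Age'].fillna(df['Age'].median())

# 2. Fill missing 'Name' with 'Unknown'
df['Name'] = df['Name'].fillna('Unknown')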

# 3. Remove rows where 'Salary' is missing
df.dropna(subset=['Salary'], inplace=True)

# 4. Remove duplicate rows
df.drop_duplicates(inplace=True)

# Print the cleaned DataFrame
print(df)
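
It can also help to check data quality before and after cleaning, so you know what each step changed. The sketch below reuses the same sample data; `quality_report` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

def quality_report(df):
    """Return missing-value counts per column and the number of duplicate rows."""
    return {
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }

# Same sample data as in the example above
raw = pd.DataFrame({
    'Name': ['John Doe', 'Jane Smith', None, 'John Doe'],
    'Age': [28, None, 35, 28],
    'Salary': [50000, 54000, None, 50000],
})

before = quality_report(raw)

# Same four cleaning steps, written as one chain that leaves `raw` untouched
cleaned = (
    raw.assign(
        Age=raw['Age'].fillna(raw['Age'].median()),
        Name=raw['Name'].fillna('Unknown'),
    )
    .dropna(subset=['Salary'])
    .drop_duplicates()
)

after = quality_report(cleaned)

print("Before:", before)
print("After:", after)
```

Keeping the raw DataFrame unchanged makes it easy to compare the two reports and to rerun the cleaning with different choices.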

In our last email we talked about Exploratory Data Analysis. Please read here
Or search ‘businessanalytics@substack.com’ in your mailbox.

Latest Insights on Business Analytics

Wealthtech startup InvestorAI has raised INR 80 crore in Series A funding to enhance its scaling efforts and expand operations
Read More

Cloudera has achieved PCI DSS 4.0 compliance, enhancing security for financial institutions while leveraging AI for business value
Read More

WaveFX and Lazarus AI are transforming data challenges with innovative AI technology, enhancing efficiency and decision-making processes
Read More

EarthDaily Analytics has announced a contract with Malaysian geospatial mapping specialist MySpatial, expanding its reach in the Asia-Pacific region
Read More

Tool of the Day: Tableau

Tableau is a powerful data visualization and business intelligence (BI) tool used to analyze and present data in an interactive and visual format. It allows users to create a wide variety of charts, graphs, maps, dashboards, and stories to visualize data and gain insights from it. It is widely used in various industries for its ability to handle large datasets and present complex data in a way that is easy to understand and share.

Learn more

If you found this edition valuable, consider gracing us with a like.
We'd love to hear your two cents (or maybe a whole dollar if you really loved it!) in the comments below.

Stay tuned for our next edition on Data Integration!

STUDY DATA VISUALIZATION.
105 Python codes on Data Visualization. Learn More
