Photo by Markus Spiske on Unsplash
Unraveling the Art of Data Cleaning: From Raw to Refined
Mastering Techniques to Improve Data Quality, Identify Trends, and Provide Useful Information for Well-Informed Decision-Making
A crucial step in every data analysis procedure is data cleaning. This phase involves removing errors, filling in any missing data, and ensuring your data is in a format that you can use. Any analysis done afterward may be biased or inaccurate if the dataset has not been thoroughly cleaned.
This post introduces several essential Python data-cleaning techniques that make use of powerful libraries such as Pandas and Scikit-learn.
The Importance of Data Cleaning
Before delving into the mechanics, let's first examine why data cleaning matters. Real-world data is frequently messy: duplicate entries, inconsistent or incorrect data types, missing values, extraneous features, and outliers may all be present. Any of these can lead to inaccurate conclusions when the data is analyzed, which is why data cleaning is a crucial step in the data science process.
We’ll cover the following data-cleaning tasks:
Importing the library
Loading the data
Dropping unneeded columns
Handling missing values
Handling duplicates
Standardizing text columns
Encoding categorical columns
Python Data Cleaning Setup
For this task, we'll use the Titanic dataset from Kaggle. First, we need to import the Pandas library, which is used for data manipulation.
import pandas as pd
Now, we also need to load our data. Here we are going to load a CSV file using the Pandas library.
data = pd.read_csv('titanic.csv')
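Before doing any cleaning, it's worth a quick look at what was loaded. The snippet below is a small, optional sanity check (assuming the file is the standard Kaggle training CSV saved in your working directory):
# Preview the first few rows, then check column types and non-null counts
print(data.head())
data.info()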
Drop columns that are not required for your analysis.
data = data.drop(['Ticket', 'Cabin', 'Name', 'PassengerId'], axis=1)
Now, we move on to check for missing values.
data.isnull().sum()
Now we see that there are 177 missing values in Age and 2 in Embarked. Generally, there are two ways to handle missing values: one is to drop the rows that contain them, and the other is imputation (filling them in with the mean, median, or mode). Both options are shown below; in practice you would pick one.
# Option 1: drop rows with missing values in 'Age' or 'Embarked'
data = data.dropna(subset=['Age', 'Embarked'])
# Option 2: impute instead. Fill missing values in the 'Age' column with the mean
mean = data.Age.mean()
data['Age'] = data['Age'].fillna(mean)
# Alternatively, fill missing values in the 'Age' column with the median
median = data.Age.median()
data['Age'] = data['Age'].fillna(median)
# Fill missing values in the 'Embarked' column with the mode
# .mode() returns a Series, so take its first value (the most frequent category)
mode = data.Embarked.mode()[0]
data['Embarked'] = data['Embarked'].fillna(mode)
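After imputation, it's a good idea to re-run the missing-value check to confirm nothing was left behind:
# Confirm that no missing values remain
data.isnull().sum()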
After handling missing values, we have to take note of duplicates, which can affect the analysis or model training.
# Remove duplicate rows
data = data.drop_duplicates()
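To verify the removal, you can count duplicate rows again; the result should now be zero (an optional check, not part of the original steps):
# Verify that no duplicate rows remain
print(data.duplicated().sum())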
Next, we standardize the text columns, which in this case means the 'Sex' column.
# Convert 'Sex' column to lowercase
data['Sex'] = data['Sex'].str.lower()
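Lowercasing fixes inconsistent capitalization. If you suspect stray spaces in a text column as well, stripping whitespace is a common companion step (optional here; the Kaggle file doesn't strictly need it):
# Optionally strip leading/trailing whitespace as well
data['Sex'] = data['Sex'].str.strip()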
After this has been done, the data is already quite clean, but there is one more step: encoding the text columns as numbers, since machine learning algorithms work with numerical values rather than text. If your analysis doesn't involve machine learning, you can stop here. Otherwise, we import an encoder from Scikit-learn to handle this operation.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
cols = ['Sex', 'Embarked']
for col in cols:
    data[col] = le.fit_transform(data[col])
    print(le.classes_)  # original categories for this column, in encoded order
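If you want an explicit category-to-code mapping, you can rebuild it from the fitted encoder. Note that after the loop, le holds the fit for the last column only ('Embarked' here), so this is just an illustrative sketch:
# Map each original category to its numeric code (for the last column fitted)
mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(mapping)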
Now that all these steps are complete, you can explore the data, derive actionable insights from it, and use it to make predictions (I'll cover that in another article).