Not your typical data cleaning Python article.

Fix Data Quality Issues Using Python, NumPy, and Pandas

Code on laptop screen.

Install The Tools

Download Dataset

Create a Jupyter Notebook

jupyter notebook
Jupyter notebook file directory with an arrow pointing to “Python data cleaning practice.ipynb”
Jupyter file directory.
# Import the data libraries
import numpy as np
import pandas as pd
# Import and read dataset
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
# Print the first 5 records of the data frame
df.head(5)
# Print a data frame shape
df.shape
# Check the index
df.index
RangeIndex(start=0, stop=299, step=1)
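To try these inspection calls without downloading the CSV, here is a minimal sketch on a small made-up data frame. The column names mirror the dataset, but the `sample` frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented, not taken from the dataset
sample = pd.DataFrame({
    'age': [75.0, 55.0, 65.0],
    'anaemia': [0, 0, 1],
    'DEATH_EVENT': [1, 1, 0],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape

# A frame built without an explicit index gets a RangeIndex
index = sample.index

print(n_rows, n_cols)
print(index)
```

Note that `df.index` returns the index object itself, while `df.index.values` returns the underlying NumPy array of labels.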

Define Data Shape and Data Type

Python Pandas and NumPy data type mapping.
# Print information, shape, and data type for the data frame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 299 non-null float64
1 anaemia 299 non-null int64
2 creatinine_phosphokinase 299 non-null int64
3 diabetes 299 non-null int64
4 ejection_fraction 299 non-null int64
5 high_blood_pressure 299 non-null int64
6 platelets 299 non-null float64
7 serum_creatinine 299 non-null float64
8 serum_sodium 299 non-null int64
9 sex 299 non-null int64
10 smoking 299 non-null int64
11 time 299 non-null int64
12 DEATH_EVENT 299 non-null int64
dtypes: float64(3), int64(10)
memory usage: 30.5 KB
# Convert a column's data type to integer (any decimal part is truncated)
df['age'] = df['age'].astype('int64')
# Print first 5 records of the data frame
df.head(5)
# Change the data types for multiple fields to boolean (astype returns a new data frame, so assign the result)
df = df.astype({'anaemia': 'bool', 'diabetes': 'bool', 'high_blood_pressure': 'bool', 'smoking': 'bool'})
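The conversions above can be checked on a small example. This is a sketch under assumptions: the `sample` frame and its values are made up, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration
sample = pd.DataFrame({
    'age': [75.0, 55.0, 60.667],
    'anaemia': [0, 1, 0],
    'smoking': [1, 0, 0],
})

# astype returns a new frame, so keep the result in a variable
converted = sample.astype({'age': 'int64', 'anaemia': 'bool', 'smoking': 'bool'})

# Inspect the resulting data types
print(converted.dtypes)
print(converted)
```

Casting a float column to `int64` drops the fractional part rather than rounding, so call `.round()` first if rounding is what you want. Also note that a key in the `astype` mapping that is not a column name raises a `KeyError`, so the spelling must match the data frame exactly.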

Fix Missing — Null — NaN Values

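# Drop columns ('col1' and 'col2' are placeholder names)
drop_columns = ['col1', 'col2']
df = df.drop(drop_columns, axis=1)
# Drop any rows that contain a NaN (dropna returns a new data frame, so assign the result)
df = df.dropna(axis=0)
# Keep only the columns in which at least 70% of the values are non-NaN
df = df.dropna(thresh=int(df.shape[0] * .7), axis=1)
# Drop rows with a NaN value in a specific column
df = df.dropna(axis=0, subset=['colname'])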
# Determine count of unique values for each column in the dataframe
df.nunique()
# Count the missing values in each column
df.isnull().sum()
# Fill NaN with a blank space
df['col'] = df['col'].fillna(' ')
# Fill NaN values with a mean value
df['col'] = df['col'].fillna(df['col'].mean())
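The heart failure dataset has no missing values (every column reports 299 non-null entries), so the calls above have nothing to act on there. A minimal sketch on a made-up frame shows what they do. The `sample` frame, its `notes` column, and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps; the values are invented for illustration
sample = pd.DataFrame({
    'age': [60.0, np.nan, 50.0, 70.0],
    'notes': [np.nan, np.nan, np.nan, 'ok'],
})

# Count the missing values in each column
missing_per_column = sample.isnull().sum()

# thresh is the minimum number of non-NaN values a column needs to be kept
min_non_nan = int(sample.shape[0] * 0.7)
kept = sample.dropna(thresh=min_non_nan, axis=1)

# Fill the remaining gap in 'age' with the mean of the observed values
filled_age = kept['age'].fillna(kept['age'].mean())

print(missing_per_column)
print(list(kept.columns))
print(filled_age.tolist())
```

Here `notes` is dropped because only one of its four values is present, while `age` survives and has its single gap filled with the column mean.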
Consistency

Handle Inconsistent Data

# Rename a column
df = df.rename(columns={'sex': 'gender'})
# Replace numerical values with strings
df['gender'] = df['gender'].replace({0: 'Female', 1: 'Male'})
# Replace every occurrence of the listed values with a new value
df = df.replace(to_replace=['current', 'alsoCurrent'], value='newvalue')
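The rename and replace steps can be tried end to end on a small example. This is a sketch, not the only way to do it: the `sample` frame, the `status` column, and the value `'former'` are hypothetical and exist only to show the calls working.

```python
import pandas as pd

# Hypothetical sample; 'status' and its values are invented for illustration
sample = pd.DataFrame({
    'sex': [0, 1, 1],
    'status': ['current', 'alsoCurrent', 'former'],
})

# Rename the column, then map the numeric codes to labels
cleaned = sample.rename(columns={'sex': 'gender'})
cleaned['gender'] = cleaned['gender'].replace({0: 'Female', 1: 'Male'})

# Collapse several spellings into a single value
cleaned['status'] = cleaned['status'].replace(['current', 'alsoCurrent'], 'newvalue')

print(cleaned)
```

Assigning the result back to the column, instead of calling `replace` with `inplace=True` on `df['gender']`, avoids modifying a temporary copy, which newer pandas versions warn about.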

Summary

Resources to check out:

I’m a Data Science advocate, developer, and technical writer. I write mostly data science and dev stuff. Follow me on Twitter for more @ElizabethDGroot