Not your typical data cleaning Python article.

Fix Data Quality Issues Using Python, NumPy, and Pandas

Elizabeth DeGroot
10 min read · Feb 25, 2021

This is not your typical “data cleaning with Python” article. By the end of it, you’ll have a Jupyter notebook demonstrating data cleaning techniques for your portfolio. You’ll build on techniques you’ve already learned and explore when to apply each data cleaning technique to prepare data for analysis.

Data sets are often riddled with missing or inaccurate data. It’s just a fact of data science life. Some datasets even contain mixed data types that aren’t ideal for the kind of analysis you’re probably looking to do. With all of these data issues, it’s not surprising that data analysts and scientists spend a large portion of their time (usually 60%–80%) cleaning data before ever getting to the analysis.

The challenge is that every data set is different. (Not surprising, right?) Not every data set will have the same data quality issues, so you won’t always follow the same steps.

This article uncovers simple techniques to clean messy data in the most common situations.

We’ll go through these three areas:

  • Define Data Shape and Data Type
  • Fix Missing — Null — NaN Values
  • Handle Inconsistent Data

We’ll walk through each step and explore why and when it’s helpful to apply a technique.

To get started, first install the tools and download the project.

Install The Tools

For this article, you need the following tools:

  • Python
  • Pandas
  • NumPy
  • Jupyter notebook
  • VS Code

I recommend following Real Python’s guide to installing and setting up Python. Go ahead and install Pandas and NumPy if you haven’t already. Real Python is one of the primary sources people go to for Python information (myself included).

NOTE: If you’ve installed Anaconda, skip to step 5. Anaconda automatically installs Python, Pandas, NumPy, and Jupyter.

  1. Install Python by reading the Python installation guide.
  2. Install Pandas by reading the Pandas installation guide.
  3. Install NumPy by reading the NumPy installation guide.
  4. Once you have Python, Pandas, and NumPy installed, install Jupyter Notebook. Read through the Real Python guide to installing and launching Jupyter Notebooks.
  5. The code editor we’ll be using is Visual Studio Code (VS Code for short). You can install VS Code here.

Next, download the dataset.

Download Dataset

I’ve created a GitHub repo for this project. In the repo, I’ve included the Heart Failure Clinical Data dataset I found on Kaggle, and a Jupyter notebook with the code in this article.

I chose this data set because it presents multiple data challenges. It is ideal for practicing analysis, visualizations, and building machine learning models.

Here is what the dataset owner has to say about the data.

“Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths world-wide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.”

  1. Clone or download the repository.
  2. In VS Code, open the project folder: File → Open → select the project folder heart_failure.
  3. Double click or unzip the compressed folder. The CSV file should now appear in the project. (The dataset is too large to commit directly to GitHub, so it must be compressed.)

Create a Jupyter Notebook

With all of the tools installed, it’s time to create a Jupyter Notebook. You add and modify the code in the notebook throughout this article.

  1. In VS Code, select Terminal in the navigation menu, then New Terminal.
  2. In the terminal window, run the Jupyter notebook command to start a new Jupyter notebook.
jupyter notebook

A new browser window should open. In the window, you’ll see the project directory with the dataset.

3. To create a new notebook, click New. To see my code in a completed notebook, open the Python data cleaning practice.ipynb.

Jupyter file directory.

Before changing or modifying any columns, let’s look at the data.

Note, you should already have libraries and the dataset imported and read.

# Libraries for working with the data
import numpy as np
import pandas as pd

# Import and read the dataset
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
df.head(5)

REVIEW FIRST 5 ROWS IN DATAFRAME

Use the .head() method to return the first five rows of the dataframe. This method gives you an idea about the data. You can quickly see the fields (columns) and type of data returned.

# Print the first 5 records of the data frame 
df.head(5)

SHAPE OF THE DATASET

Before you begin transforming and cleaning up the dataset, it’s a good idea to understand the shape of the dataset (e.g., how many rows and columns exist).

Use .shape to return how many records and columns exist in a dataset.

# Print a data frame shape
df.shape

This should return:

(299, 13)

It doesn’t tell you the structure, keys, index, missing values, or anything else. This just tells you the shape of the dataset.

CHECK INDEX

The index labels the rows of the dataframe and is what you use when selecting or slicing specific rows. Knowing the index helps when you want to add or modify columns later.

NOTE: A dataframe always has an index. If you see an error like AttributeError: ‘function’ object has no attribute ‘index’, you’ve most likely called .index on a function object rather than on the dataframe itself.

# Check the index
df.index

If the dataframe has the default index, you’ll see a RangeIndex like this:

RangeIndex(start=0, stop=299, step=1)

It’s time to clean some data!


Define Data Shape and Data Type

The header is an important component of a dataset. In the header, you have multiple columns (or fields). You use these columns to slice, categorize, and analyze the data.

IDENTIFY DATA TYPES

No matter how well labeled the columns are, at some point you’re going to need to convert data from one type to another. A data type is a construct that identifies how data is stored and read by the system. The data type also affects how you can aggregate and use the data for analysis, so it’s important to understand how data types are defined in Pandas and NumPy versus plain Python.

Python Pandas and NumPy data type mapping.
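As a quick sketch of that mapping (the example values here are mine, not from the dataset), you can see how plain Python values end up as NumPy-backed Pandas dtypes:

# Python ints are stored as the NumPy-backed dtype 'int64'
print(pd.Series([1, 2, 3]).dtype)        # int64

# Python floats become 'float64'
print(pd.Series([1.5, 2.0]).dtype)       # float64

# Python strings are stored as the generic 'object' dtype by default
print(pd.Series(['a', 'b']).dtype)       # object

# Python bools become 'bool'
print(pd.Series([True, False]).dtype)    # bool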

Use the .info() method to look at the data type of each column. Using .dtypes also returns this information, but .info() returns a bit more, so I find it far more helpful.

The .info() method helps you analyze the dataframe in its entirety.

# Print information, shape, and data type for the data frame
df.info()

The .info() function returns the following…

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 299 non-null float64
1 anaemia 299 non-null int64
2 creatinine_phosphokinase 299 non-null int64
3 diabetes 299 non-null int64
4 ejection_fraction 299 non-null int64
5 high_blood_pressure 299 non-null int64
6 platelets 299 non-null float64
7 serum_creatinine 299 non-null float64
8 serum_sodium 299 non-null int64
9 sex 299 non-null int64
10 smoking 299 non-null int64
11 time 299 non-null int64
12 DEATH_EVENT 299 non-null int64
dtypes: float64(3), int64(10)
memory usage: 30.5 KB

CHANGE DATA TYPE

Based on the information returned, 10 columns are the integer (int64) data type and 3 are the float (float64) data type. int64 stores whole numbers, and float64 stores decimal numbers.

Notice how the ‘age’ column is a float64 (decimal) data type. Rather than the whole number ‘75’, the data has ‘75.0’. Weird! The decimal point isn’t needed, since an age like 75.5 doesn’t really make sense. Without that decimal, data analysis and plotting are more understandable and readable.

Use astype() to convert a column data type in Pandas to a different data type. Go ahead and change the data type of the age column to int64.

# Convert the column data type to integer
df['age'] = df['age'].astype('int64')

CHANGE MORE DATA TYPES

Go ahead and print the first 5 records of the dataframe using

# Print first 5 records of the data frame
df.head(5)

You should see that a few of the fields have results in 1’s and 0’s. These 1’s and 0’s represent True/False (True when the patient is diagnosed with that condition). Numbers are fine, but what if we wanted to change them to a boolean data type?

Change the data types of anaemia, diabetes, high blood pressure, and smoking to the boolean data type.

# Change the data types of multiple fields to boolean
df = df.astype({'anaemia': 'bool', 'diabetes': 'bool', 'high_blood_pressure': 'bool', 'smoking': 'bool'})

Now that our data types and column headers are ready, you’ll learn how to address missing, null, and NaN values in the dataframe.

Fix Missing — Null — NaN Values

Now that you have a dataframe, it’s time to review the columns and values to address missing, null, NaN (not a number), or incomplete data.

DROP COLUMNS

We don’t need to drop any columns for this project, but if you wanted to, or wanted to create a new dataframe with fewer columns, use the drop function.

NOTE: axis=1 means you’re dropping columns, whereas axis=0 means you’re dropping rows.

# Drop specific columns ('col' is a placeholder column name)
drop_columns = ['col', 'col']
df = df.drop(drop_columns, axis=1)

# Drop any rows which have any NaNs
df = df.dropna(axis=0)

# Drop columns with fewer than 70% non-NaN values
df = df.dropna(thresh=int(df.shape[0] * .7), axis=1)

# Drop rows with a NaN value in a specific column
df = df.dropna(axis=0, subset=['colname'])

COUNT UNIQUE VALUES

Next, I like to count how many unique values exist for each field. If a field should capture two possible values but three are returned, you may want to explore that field a bit more. Often you’ll find that two values are really the same, but one is spelled slightly differently.

# Determine count of unique values for each column in the dataframe
df.nunique()
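If a count looks suspicious, a quick follow-up (a minimal sketch, using a made-up column name) is to list the actual values and how often each appears, which makes near-duplicate spellings easy to spot:

# 'some_column' is a hypothetical column name used for illustration
df['some_column'].value_counts()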

COUNT NULL VALUES

Before I modify any missing data, I like to calculate how many records have null values. To do this, I use the isnull() and sum() functions. This returns the total count of null records for each column.

# Checking if any rows are missing any data.
df.isnull().sum()

FILL NaN or NA VALUES

You have to use your judgement about when to fill NaN or null values. We don’t have any values we want to fill in for this dataset, but if you want to practice, you can use fillna and define what you want to fill the value with.

The code snippet below replaces the NaN value in a specific column with a space.

# Fill NaN with a blank space
df['col'] = df['col'].fillna(' ')

Or you can fill with a mean value if you feel you need to have something there.

# Fill NaN values with a mean value
df['col'] = df['col'].fillna(df['col'].mean())

Using this information, you’ll be able to determine how you should fill it (if at all). Should you replace with a value placeholder, or should you calculate a mean value to fill it? The answer depends on the data you need.

Anytime you’re working on data quality, here are a few thought pointers to keep in mind:

  • Examine for missing data points — What should be there (if anything)?
  • Investigate why the data is missing — Should it be?
  • Drop rows or columns — Do you need that column? Does the row add value to the analysis?
  • Fill-in values — It’s okay to fill in a value with a string, number, or mean if it makes sense for the analysis (a short sketch follows this list).
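As a minimal sketch of how those pointers can translate into code (the 30% threshold and the mean-fill policy are assumptions for illustration, not requirements of this dataset):

# Share of missing values per column, to decide what (if anything) to do
missing_share = df.isnull().mean()
print(missing_share)

# Hypothetical policy: drop columns that are more than 30% missing...
cols_to_drop = missing_share[missing_share > 0.3].index
df = df.drop(columns=cols_to_drop)

# ...and fill the remaining numeric gaps with each column's mean
df = df.fillna(df.mean(numeric_only=True))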

The key to data quality is consistency!


Handle Inconsistent Data

Often when cleaning up a dataset, you’ll need to replace or rename inconsistent data. This could be typos, duplicates that are just slightly off, or numerical values that should be strings. Your analysis will dictate how (or if) you should replace or rename values.
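This heart failure dataset has no text columns, but here is a hedged sketch of what fixing slightly-off duplicates can look like (the ‘city’ column and its values are invented for illustration):

# Hypothetical text column: trim whitespace and standardize case first
df['city'] = df['city'].str.strip().str.lower()

# Then map near-duplicate spellings onto one canonical value
df['city'] = df['city'].replace({'nyc': 'new york', 'new york city': 'new york'})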

RENAME COLUMN

It’s good practice for column names to be consistent, lowercase, and free of spaces (use underscores instead). Following this practice avoids surprises when you reference columns in code and helps you remember the column names later.
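If a dataset does arrive with messy headers, a minimal sketch for normalizing every column name at once looks like this (the line below is generic; it simply lowercases whatever columns exist):

# Lowercase all column names, trim whitespace, and swap spaces for underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')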

This particular dataset doesn’t really have column naming issues, but let’s say we wanted to rename a column for readability. Use the .rename(columns={'current_column_name': 'new_column_name'}) method to rename the column. This gives you the ability to make your dataset columns readable and consistent.

# Rename a column
df = df.rename(columns={'sex': 'gender'})

REPLACE VALUES

Much like the rename function that works on column names, replace handles the values in the column.

The gender column in our dataset doesn’t hold boolean data even though the results are 1’s and 0’s. Here, the 1’s and 0’s represent Male and Female. Generically identifying the two values as 1’s and 0’s instead of Male/Female (or even M/F) is a bad idea: you can’t tell whether 1 = Female or 1 = Male.

To fix this, replace the data in the gender field: 0 = Female, and 1 = Male.

# Replace a numerical value with a string
df['gender'].replace(0, 'Female', inplace=True)
df['gender'].replace(1, 'Male', inplace=True)

# Replace all values that equal a specific variable
df = df.replace(to_replace=["current", "alsoCurrent"], value="newvalue")

When inplace=True, the data is modified in place. Otherwise, when inplace=False, a modified copy of the object is returned.
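If you prefer to avoid inplace, here is a sketch of the equivalent assignment-based approach using a mapping dict:

# Equivalent without inplace: map both codes in one call and assign back
df['gender'] = df['gender'].replace({0: 'Female', 1: 'Male'})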

And just like that, the quality of your data is improved!

I know, it wasn’t easy and this article contained a lot of steps. The good news is you don't have to use each step with every dataset. Use your best judgement and challenge yourself!

Wanna see how I applied these techniques in a notebook? Check out these two notebooks that use a combination of the techniques. Feel free to comment, clone, and go beyond!

Summary

Before creating visualizations or doing any data analysis, you always first prepare the data. Your data visualization and machine learning models are only as good as the quality of your data! So, take the time to clean and prepare your data. You’ll be glad you did.

Thank you for reading and following along!

Follow me for more Python tutorials and tips ElizabethDGroot.


Elizabeth DeGroot

I’m a Data Science advocate, developer, and technical writer. I write mostly data science and dev stuff. Follow me on Twitter for more @ElizabethDGroot