You have just started a machine learning project, loaded the data into your workspace, and begun exploring it. After some initial visualization, you found that some rows have missing values in some of the features. You are probably already familiar with the saying "garbage in, garbage out": if the data is garbage, you can expect the results to be garbage too. What would you do now? You could remove those rows, or you could try some other approaches. In this blog we will discuss the different techniques below for handling missing data.
Let’s get started!
Dataset having missing values
Before introducing the different strategies for handling missing values, we first need datasets that contain them. We have chosen two datasets to explore different types of scenarios:
- Class grades dataset. Download link: https://openmv.net/info/class-grades
- Travel time dataset. Download link: https://openmv.net/info/travel-times
Let us read the datasets and check whether they contain missing values.
Import the required Python libraries
import pandas as pd
import numpy as np
Checking for null values in Class grade dataset:
# reading and displaying data
cgdata = pd.read_csv('class-grades.csv',error_bad_lines=False)
cgdata.head()
RN | Prefix | Assignment | Tutorial | Midterm | TakeHome | Final |
---|---|---|---|---|---|---|
0 | 5 | 57.14 | 34.09 | 64.38 | 51.48 | 52.50 |
1 | 8 | 95.05 | 105.49 | 67.50 | 99.07 | 68.33 |
2 | 8 | 83.70 | 83.17 | 30.00 | 63.15 | 48.89 |
3 | 7 | 81.22 | 96.06 | 49.38 | 105.93 | 80.56 |
4 | 8 | 91.32 | 93.64 | 95.00 | 107.41 | 73.89 |
# get a brief overview of the data
cgdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Prefix      96 non-null     int64
 1   Assignment  96 non-null     float64
 2   Tutorial    96 non-null     float64
 3   Midterm     96 non-null     float64
 4   TakeHome    95 non-null     float64
 5   Final       96 non-null     float64
dtypes: float64(5), int64(1)
memory usage: 4.6 KB
# to check whether dataset has missing values or not
cgdata.isnull().any()
Prefix        False
Assignment    False
Tutorial      False
Midterm       False
TakeHome       True
Final         False
dtype: bool
From the output above, we can see that the feature "TakeHome" has some missing values. To find the rows that contain them, let us move on to the next code snippet.
# find the rows that have missing values
rows,columns=np.where(cgdata.isnull())
cgdata.loc[rows]
RN | Prefix | Assignment | Tutorial | Midterm | TakeHome | Final |
---|---|---|---|---|---|---|
7 | 7 | 72.85 | 86.85 | 60.0 | NaN | 56.11 |
The class grade dataset has only one row with a missing value, which is in the feature "TakeHome".
These were a few simple steps to check whether the data has missing values and where they are located. Let us do the same with the Travel time dataset to get a clear picture before we explore other scenarios with it.
Checking null values for Travel Time dataset:
# reading and displaying data
ttdata = pd.read_csv('travel-times.csv')
ttdata.head()
RN | Date | StartTime | DayOfWeek | GoingTo | Distance | MaxSpeed | AvgSpeed | AvgMovingSpeed | FuelEconomy | TotalTime | MovingTime | Take407All | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1/6/2012 | 16:37 | Friday | Home | 51.29 | 127.4 | 78.3 | 84.8 | NaN | 39.3 | 36.3 | No | NaN |
1 | 1/6/2012 | 08:20 | Friday | GSK | 51.63 | 130.3 | 81.8 | 88.9 | NaN | 37.9 | 34.9 | No | NaN |
2 | 1/4/2012 | 16:17 | Wednesday | Home | 51.27 | 127.4 | 82.0 | 85.8 | NaN | 37.5 | 35.9 | No | NaN |
3 | 1/4/2012 | 07:53 | Wednesday | GSK | 49.17 | 132.3 | 74.2 | 82.9 | NaN | 39.8 | 35.6 | No | NaN |
4 | 1/3/2012 | 18:57 | Tuesday | Home | 51.15 | 136.2 | 83.4 | 88.1 | NaN | 36.8 | 34.8 | No | NaN |
# get a brief overview of the data
ttdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Date            205 non-null    object
 1   StartTime       205 non-null    object
 2   DayOfWeek       205 non-null    object
 3   GoingTo         205 non-null    object
 4   Distance        205 non-null    float64
 5   MaxSpeed        205 non-null    float64
 6   AvgSpeed        205 non-null    float64
 7   AvgMovingSpeed  205 non-null    float64
 8   FuelEconomy     188 non-null    object
 9   TotalTime       205 non-null    float64
 10  MovingTime      205 non-null    float64
 11  Take407All      205 non-null    object
 12  Comments        24 non-null     object
dtypes: float64(6), object(7)
memory usage: 20.9+ KB
# to check whether dataset has missing values or not
ttdata.isnull().any()
Date              False
StartTime         False
DayOfWeek         False
GoingTo           False
Distance          False
MaxSpeed          False
AvgSpeed          False
AvgMovingSpeed    False
FuelEconomy        True
TotalTime         False
MovingTime        False
Take407All        False
Comments           True
dtype: bool
In this dataset, two features, "FuelEconomy" and "Comments", have missing values. If we look at the output of the "data info" snippet, we find that out of 205 rows, "FuelEconomy" holds values for 188 rows, while 17 rows have no information for this feature. Similarly, only 24 rows have values for the "Comments" feature.
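The per-column counts quoted above can also be read off directly with `isnull().sum()`. A minimal sketch on a hypothetical toy frame standing in for the travel time data:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for ttdata
df = pd.DataFrame({
    "Distance": [51.29, 51.63, 51.27, 49.17],
    "FuelEconomy": [np.nan, 8.5, np.nan, 9.3],
    "Comments": [np.nan, np.nan, np.nan, "rainy"],
})

# number of missing entries per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

This gives an exact count per column rather than the True/False summary of `isnull().any()`.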
# find the rows that have missing values
rows,columns=np.where(ttdata.isnull())
ttdata.loc[np.unique(rows)].head()
RN | Date | StartTime | DayOfWeek | GoingTo | Distance | MaxSpeed | AvgSpeed | AvgMovingSpeed | FuelEconomy | TotalTime | MovingTime | Take407All |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1/6/2012 | 16:37 | Friday | Home | 51.29 | 127.4 | 78.3 | 84.8 | NaN | 39.3 | 36.3 | No |
1 | 1/6/2012 | 08:20 | Friday | GSK | 51.63 | 130.3 | 81.8 | 88.9 | NaN | 37.9 | 34.9 | No |
2 | 1/4/2012 | 16:17 | Wednesday | Home | 51.27 | 127.4 | 82.0 | 85.8 | NaN | 37.5 | 35.9 | No |
3 | 1/4/2012 | 07:53 | Wednesday | GSK | 49.17 | 132.3 | 74.2 | 82.9 | NaN | 39.8 | 35.6 | No |
4 | 1/3/2012 | 18:57 | Tuesday | Home | 51.15 | 136.2 | 83.4 | 88.1 | NaN | 36.8 | 34.8 | No |
# replace "-" present in the feature "FuelEconomy" with "NaN"
ttdata['FuelEconomy']=ttdata['FuelEconomy'].replace('-',np.nan)
How to handle missing data present in the dataset
Let us now work on handling the missing values present in the dataset. There are multiple ways to handle missing data, depending on:
- The number of rows having missing values.
- Distribution, datatype, and relevance of the feature having missing values.
Let us understand different techniques for handling missing data.
Dropping rows and columns
Dropping rows and columns means removing the specific rows and columns of the dataset that contain null values. For example, we have already seen that the class grade dataset has only one row with a null value, so dropping that row can be a reasonable solution.
#dropping the row having null values
cgdata = cgdata.dropna()
#checking the presence of null values after dropping the rows
cgdata.isnull().any()
Prefix        False
Assignment    False
Tutorial      False
Midterm       False
TakeHome      False
Final         False
dtype: bool
Similarly, for the Travel Time dataset, only 24 rows out of 205 have a legitimate value for the feature "Comments"; the rest are null. Here, one can drop the "Comments" column because of the lack of valid data.
#dropping the column Comments due to lack of valid data.
ttdata=ttdata.drop('Comments',axis=1)
#checking whether the column is dropped successfully or not
ttdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Date            205 non-null    object
 1   StartTime       205 non-null    object
 2   DayOfWeek       205 non-null    object
 3   GoingTo         205 non-null    object
 4   Distance        205 non-null    float64
 5   MaxSpeed        205 non-null    float64
 6   AvgSpeed        205 non-null    float64
 7   AvgMovingSpeed  205 non-null    float64
 8   FuelEconomy     188 non-null    object
 9   TotalTime       205 non-null    float64
 10  MovingTime      205 non-null    float64
 11  Take407All      205 non-null    object
dtypes: float64(6), object(6)
memory usage: 19.3+ KB
Indeed, the column "Comments" no longer appears in the info() output, so it has been dropped successfully.
Dropping may be a simple and straightforward way of getting rid of missing data, but along the way you are also losing data that might be valuable for your machine learning algorithm.
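Dropping also need not be all-or-nothing: `dropna` accepts a `thresh` argument (the minimum number of non-null values required to keep a row or column), so only the worst offenders are removed. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# keep only the rows that have at least 2 non-null values
kept_rows = df.dropna(thresh=2)

# keep only the columns that are at least 70% non-null
kept_cols = df.dropna(axis=1, thresh=int(0.7 * len(df)))
```

Here the middle row (one valid value) and the mostly-empty column "b" are removed, while partially filled data survives.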
Let us move on to the next approach.
Imputation
The idea of imputation is simple: fill the missing data with a value, using one of the options below:
- Using any constant value
- Getting value randomly from the not null values of the column
- Using mean/median/mode of the column
- Using a value predicted by a model whose target variable is the one containing the missing values.
If we look at the "TakeHome" feature of the class grade dataset and the "FuelEconomy" feature of the travel time dataset, both have only a few rows with NULL data. Therefore, replacing those missing values with a value derived from one of the approaches above is a good option.
#Imputation using a constant value
cgdata.fillna(5)
#Imputation by picking randomly from the non-null values of the column
import random
cgdata.fillna(random.choice(cgdata['TakeHome'].dropna().values),inplace=True)
#Imputation using the mean of the column
cgdata.fillna(cgdata.mean(),inplace=True)
#Imputation using the median of the column
cgdata.fillna(cgdata.median(),inplace=True)
#Imputation using the mode of the column (mode() returns a DataFrame, so take its first row)
cgdata.fillna(cgdata.mode().iloc[0],inplace=True)
All the methods shown above can also be performed with the help of the SimpleImputer class provided by sklearn.
class sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean/median/most_frequent/constant', fill_value=None, verbose=0, copy=True, add_indicator=False)
#Imputation using sklearn SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(cgdata)
# transform returns a NumPy array, so wrap it back into a DataFrame
cgdata = pd.DataFrame(imputer.transform(cgdata), columns=cgdata.columns)
However, filling null values with random or approximate values disturbs the bias and variance of the feature. Therefore, it is suggested to try out several methods and select the most suitable one.
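The variance disturbance is easy to demonstrate: mean imputation inserts points exactly at the mean, so the measured spread of the feature can only shrink. A toy illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
var_before = s.var()                  # variance computed over the observed values only
var_after = s.fillna(s.mean()).var()  # mean imputation pulls the variance down
```

Any downstream model then sees the feature as less spread out than it really is.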
Using predictive algorithm to predict missing values
We can use various predictive algorithms to predict the missing values and experiment to find the one providing the best result. The process is to split the dataset into two parts: one containing the rows with valid values and another containing the rows with nulls. We train the predictive model on the first part, where y is the feature whose missing values we want to predict and the remaining features are x. Then we predict the missing values of the second part using the trained model.
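As a self-contained illustration of this split-train-predict pattern (using a hypothetical toy frame in place of the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy frame where "y" depends roughly linearly on "x" and has two gaps
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.1, 3.9, np.nan, 8.2, np.nan]})

known = df[df["y"].notna()]    # part 1: rows with valid values
unknown = df[df["y"].isna()]   # part 2: rows with nulls

# train on the valid rows, then fill the gaps with the model's predictions
model = LinearRegression().fit(known[["x"]], known["y"])
df.loc[df["y"].isna(), "y"] = model.predict(unknown[["x"]])
```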
We have used LinearRegression from the scikit-learn library to predict the missing values in the feature "FuelEconomy" of the travel-times dataset.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# split into rows with and without "FuelEconomy" values
tt_nulldata = ttdata.loc[ttdata['FuelEconomy'].isnull()]
tt_notnulldata = ttdata.dropna()
train_y = tt_notnulldata['FuelEconomy'].values
train_x = tt_notnulldata.drop(['FuelEconomy','Date','StartTime'],axis=1)
test_x = tt_nulldata.drop(['FuelEconomy','Date','StartTime'],axis=1)
# one hot encoding the categorical variables
ct = ColumnTransformer([("ohe", OneHotEncoder(sparse=False), ['DayOfWeek','GoingTo','Take407All'])], remainder='passthrough')
train_x = ct.fit_transform(train_x)
# reuse the transformer fitted on the training rows; do not refit on the test rows
test_x = ct.transform(test_x)
# training the model and predicting the missing values
lr = LinearRegression()
lr.fit(train_x,train_y)
test_y = lr.predict(test_x)
Scikit-learn also provides some algorithms like IterativeImputer and KNNImputer to impute the missing value using estimation.
#using IterativeImputer to fill missing data
# don't drop the "FuelEconomy" feature from tt_notnulldata and tt_nulldata derived in the previous snippet;
# IterativeImputer expects numeric input, so restrict to the numeric columns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(tt_notnulldata.select_dtypes(include=np.number))
imputed_data = imputer.transform(tt_nulldata.select_dtypes(include=np.number))
#using KNNImputer to fill missing data (again on the numeric columns only)
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed_data = imputer.fit_transform(ttdata.select_dtypes(include=np.number))
Using Algorithms which remain unaffected by missing values
Algorithms like K-Nearest Neighbors and Naive Bayes are not inherently affected by missing values. When a certain feature of a row is missing, KNN can use the most frequent value of that feature among the neighbors, or simply ignore the missing feature in the distance measure. However, the KNN and Naive Bayes implementations provided by scikit-learn are not robust to missing values.
Using Python Libraries
- Scikit-learn: As discussed above, sklearn can be used for handling missing data through the SimpleImputer, KNNImputer and IterativeImputer classes it provides.
- fancyimpute: This is another Python library that uses machine learning algorithms for handling missing data. It primarily uses KNN and MICE (Multiple Imputation by Chained Equations) for imputation.
- Datawig: It uses deep learning feature extractors, with hyperparameters tuned automatically, to impute missing values.
In-depth Reading
- Handbook of Missing Data Methodology: 5 (Chapman & Hall/CRC Handbooks of Modern Statistical Methods)
- Flexible Imputation of Missing Data, Second Edition (Chapman & Hall/CRC Interdisciplinary Statistics)
- Feature Engineering and Selection: A Practical Approach for Predictive Models (Chapman & Hall/CRC Data Science)
- Data Mining: Practical Machine Learning Tools and Techniques