You have just started a machine learning project, loaded the data into your workspace, and begun exploring it. After some initial visualization, you found that some rows have missing values in some of the features. You are probably already familiar with the saying "garbage in, garbage out": if the data is garbage, you can expect the results to be garbage too. What would you do now? You could remove those rows, or you could try some other approaches. In this blog we will discuss the different techniques below for handling missing data.
Let’s get started!
Dataset having missing values
Before introducing the different strategies for handling missing values, we first need datasets that contain them. We have chosen two datasets to explore different types of scenarios:
- Class grades dataset. Download link: https://openmv.net/info/class-grades
- Travel time dataset. Download link: https://openmv.net/info/travel-times
Let us read the datasets and check whether they contain missing values.
Import the required Python libraries
import pandas as pd
import numpy as np
Checking for null values in Class grade dataset:
# reading and displaying data
cgdata = pd.read_csv('class-grades.csv',error_bad_lines=False)
cgdata.head()
RN | Prefix | Assignment | Tutorial | Midterm | TakeHome | Final |
---|---|---|---|---|---|---|
0 | 5 | 57.14 | 34.09 | 64.38 | 51.48 | 52.50 |
1 | 8 | 95.05 | 105.49 | 67.50 | 99.07 | 68.33 |
2 | 8 | 83.70 | 83.17 | 30.00 | 63.15 | 48.89 |
3 | 7 | 81.22 | 96.06 | 49.38 | 105.93 | 80.56 |
4 | 8 | 91.32 | 93.64 | 95.00 | 107.41 | 73.89 |
# get a brief overview of the data
cgdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Prefix      96 non-null     int64
 1   Assignment  96 non-null     float64
 2   Tutorial    96 non-null     float64
 3   Midterm     96 non-null     float64
 4   TakeHome    95 non-null     float64
 5   Final       96 non-null     float64
dtypes: float64(5), int64(1)
memory usage: 4.6 KB
# to check whether dataset has missing values or not
cgdata.isnull().any()
Prefix        False
Assignment    False
Tutorial      False
Midterm       False
TakeHome       True
Final         False
dtype: bool
From the output above, we can see that the feature "TakeHome" has some missing values. To find the rows that contain them, let us move on to the next code snippet.
# find the rows that have missing values
rows,columns=np.where(cgdata.isnull())
cgdata.loc[rows]
RN | Prefix | Assignment | Tutorial | Midterm | TakeHome | Final |
---|---|---|---|---|---|---|
7 | 7 | 72.85 | 86.85 | 60.0 | NaN | 56.11 |
The class grade dataset has only one row with a missing value, which is in the feature "TakeHome".
These were a few simple steps to check whether the data has missing values and where they are located. Let us do the same with the Travel time dataset to get a clear picture before we explore other scenarios with it.
Checking null values for Travel Time dataset:
# reading and displaying data
ttdata = pd.read_csv('travel-times.csv')
ttdata.head()
RN | Date | StartTime | DayOfWeek | GoingTo | Distance | MaxSpeed | AvgSpeed | AvgMovingSpeed | FuelEconomy | TotalTime | MovingTime | Take407All | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1/6/2012 | 16:37 | Friday | Home | 51.29 | 127.4 | 78.3 | 84.8 | NaN | 39.3 | 36.3 | No | NaN |
1 | 1/6/2012 | 08:20 | Friday | GSK | 51.63 | 130.3 | 81.8 | 88.9 | NaN | 37.9 | 34.9 | No | NaN |
2 | 1/4/2012 | 16:17 | Wednesday | Home | 51.27 | 127.4 | 82.0 | 85.8 | NaN | 37.5 | 35.9 | No | NaN |
3 | 1/4/2012 | 07:53 | Wednesday | GSK | 49.17 | 132.3 | 74.2 | 82.9 | NaN | 39.8 | 35.6 | No | NaN |
4 | 1/3/2012 | 18:57 | Tuesday | Home | 51.15 | 136.2 | 83.4 | 88.1 | NaN | 36.8 | 34.8 | No | NaN |
# get a brief overview of the data
ttdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Date            205 non-null    object
 1   StartTime       205 non-null    object
 2   DayOfWeek       205 non-null    object
 3   GoingTo         205 non-null    object
 4   Distance        205 non-null    float64
 5   MaxSpeed        205 non-null    float64
 6   AvgSpeed        205 non-null    float64
 7   AvgMovingSpeed  205 non-null    float64
 8   FuelEconomy     188 non-null    object
 9   TotalTime       205 non-null    float64
 10  MovingTime      205 non-null    float64
 11  Take407All      205 non-null    object
 12  Comments        24 non-null     object
dtypes: float64(6), object(7)
memory usage: 20.9+ KB
# to check whether dataset has missing values or not
ttdata.isnull().any()
Date              False
StartTime         False
DayOfWeek         False
GoingTo           False
Distance          False
MaxSpeed          False
AvgSpeed          False
AvgMovingSpeed    False
FuelEconomy        True
TotalTime         False
MovingTime        False
Take407All        False
Comments           True
dtype: bool
In this dataset, two features, "FuelEconomy" and "Comments", have missing values. If we look at the output of the "data info" snippet, we find that out of 205 rows, "FuelEconomy" holds values for 188 rows, while 17 rows have no information for this feature. Similarly, only 24 rows have values for the "Comments" feature.
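The per-column counts quoted above can also be read off directly with `isnull().sum()`. A minimal sketch on a hypothetical toy frame standing in for the travel time data:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for ttdata
df = pd.DataFrame({
    "Distance": [51.29, 51.63, 51.27, 49.17],
    "FuelEconomy": [np.nan, 8.5, np.nan, 9.3],
    "Comments": [np.nan, np.nan, np.nan, "rainy"],
})

# number of missing entries per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

This gives an exact count per column rather than the True/False summary of `isnull().any()`.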
# find the rows that have missing values
rows,columns=np.where(ttdata.isnull())
ttdata.loc[np.unique(rows)].head()
RN | Date | StartTime | DayOfWeek | GoingTo | Distance | MaxSpeed | AvgSpeed | AvgMovingSpeed | FuelEconomy | TotalTime | MovingTime | Take407All |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1/6/2012 | 16:37 | Friday | Home | 51.29 | 127.4 | 78.3 | 84.8 | NaN | 39.3 | 36.3 | No |
1 | 1/6/2012 | 08:20 | Friday | GSK | 51.63 | 130.3 | 81.8 | 88.9 | NaN | 37.9 | 34.9 | No |
2 | 1/4/2012 | 16:17 | Wednesday | Home | 51.27 | 127.4 | 82.0 | 85.8 | NaN | 37.5 | 35.9 | No |
3 | 1/4/2012 | 07:53 | Wednesday | GSK | 49.17 | 132.3 | 74.2 | 82.9 | NaN | 39.8 | 35.6 | No |
4 | 1/3/2012 | 18:57 | Tuesday | Home | 51.15 | 136.2 | 83.4 | 88.1 | NaN | 36.8 | 34.8 | No |
# replace "-" present in the feature "FuelEconomy" with "NaN"
ttdata['FuelEconomy']=ttdata['FuelEconomy'].replace('-',np.nan)
How to handle missing data present in the dataset
Let us now work on handling the missing values present in the dataset. There are multiple ways to handle missing data, depending on:
- The number of rows having missing values.
- Distribution, datatype, and relevance of the feature having missing values.
Let us understand different techniques for handling missing data.
Dropping rows and columns
Dropping rows and columns means removing the specific rows and columns of the dataset that contain null values. For example, we have already seen that the class grade dataset has only one row with a null value, so dropping that row can be a reasonable solution.
#dropping the row having null values
cgdata = cgdata.dropna()
#checking the presence of null values after dropping the rows
cgdata.isnull().any()
Prefix        False
Assignment    False
Tutorial      False
Midterm       False
TakeHome      False
Final         False
dtype: bool
Similarly, for the Travel Time dataset, only 24 rows out of 205 have a legitimate value for the feature "Comments"; the rest are null. Here, one can drop the "Comments" column because of the lack of valid data.
#dropping the column Comments due to lack of valid data.
ttdata=ttdata.drop('Comments',axis=1)
#checking whether the column is dropped successfully or not
ttdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Date            205 non-null    object
 1   StartTime       205 non-null    object
 2   DayOfWeek       205 non-null    object
 3   GoingTo         205 non-null    object
 4   Distance        205 non-null    float64
 5   MaxSpeed        205 non-null    float64
 6   AvgSpeed        205 non-null    float64
 7   AvgMovingSpeed  205 non-null    float64
 8   FuelEconomy     188 non-null    object
 9   TotalTime       205 non-null    float64
 10  MovingTime      205 non-null    float64
 11  Take407All      205 non-null    object
dtypes: float64(6), object(6)
memory usage: 19.3+ KB
Indeed, the column "Comments" no longer appears in the info() output, so it has been dropped successfully.
Dropping may be a simple and straightforward way of getting rid of missing data, but along the way you are also losing data that might be valuable for your machine learning algorithm.
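Dropping also need not be all-or-nothing: `dropna` accepts a `thresh` argument (the minimum number of non-null values required to keep a row or column), so only the worst offenders are removed. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# keep only the rows that have at least 2 non-null values
kept_rows = df.dropna(thresh=2)

# keep only the columns that are at least 70% non-null
kept_cols = df.dropna(axis=1, thresh=int(0.7 * len(df)))
```

Here the middle row (one valid value) and the mostly-empty column "b" are removed, while partially filled data survives.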
Let us move on to the next approach.
Imputation
The idea of imputation is simple: fill the missing data with a value, using one of the options below:
- Using any constant value
- Getting value randomly from the not null values of the column
- Using mean/median/mode of the column
- Using a value predicted by a model whose target variable is the one containing the missing values.
If we look at the "TakeHome" feature of the class grade dataset and the "FuelEconomy" feature of the travel time dataset, both have only a few rows with NULL data. Therefore, replacing those missing values with a value derived from one of the approaches above is a good option.
#Imputation using a constant value
cgdata.fillna(5)
#Imputation by picking randomly from the non-null values of the column
import random
cgdata.fillna(random.choice(cgdata['TakeHome'].dropna().values),inplace=True)
#Imputation using the mean of the column
cgdata.fillna(cgdata.mean(),inplace=True)
#Imputation using the median of the column
cgdata.fillna(cgdata.median(),inplace=True)
#Imputation using the mode of the column (mode() returns a DataFrame, so take its first row)
cgdata.fillna(cgdata.mode().iloc[0],inplace=True)
All the methods shown above can also be performed with the help of the SimpleImputer class provided by sklearn.
class sklearn.impute.SimpleImputer(*, missing_values=nan, strategy='mean/median/most_frequent/constant', fill_value=None, verbose=0, copy=True, add_indicator=False)
#Imputation using sklearn SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(cgdata)
# transform returns a NumPy array, so wrap it back into a DataFrame
cgdata = pd.DataFrame(imputer.transform(cgdata), columns=cgdata.columns)
However, filling null values with random or approximate values disturbs the bias and variance of the feature. Therefore, it is suggested to try out several methods and select the most suitable one.
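The variance disturbance is easy to demonstrate: mean imputation inserts points exactly at the mean, so the measured spread of the feature can only shrink. A toy illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
var_before = s.var()                  # variance computed over the observed values only
var_after = s.fillna(s.mean()).var()  # mean imputation pulls the variance down
```

Any downstream model then sees the feature as less spread out than it really is.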
Using predictive algorithm to predict missing values
We can use various predictive algorithms to predict the missing values and experiment to find the one providing the best result. The process is to split the dataset into two parts: one containing the rows with valid values and another containing the rows with nulls. We train the predictive model on the first part, where y is the feature whose missing values we want to predict and the remaining features are x. Then we predict the missing values of the second part using the trained model.
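As a self-contained illustration of this split-train-predict pattern (using a hypothetical toy frame in place of the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy frame where "y" depends roughly linearly on "x" and has two gaps
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.1, 3.9, np.nan, 8.2, np.nan]})

known = df[df["y"].notna()]    # part 1: rows with valid values
unknown = df[df["y"].isna()]   # part 2: rows with nulls

# train on the valid rows, then fill the gaps with the model's predictions
model = LinearRegression().fit(known[["x"]], known["y"])
df.loc[df["y"].isna(), "y"] = model.predict(unknown[["x"]])
```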
We have used LinearRegression from the scikit-learn library to predict the missing values in the feature "FuelEconomy" of the travel-times dataset.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# split into rows with and without "FuelEconomy" values
tt_nulldata = ttdata.loc[ttdata['FuelEconomy'].isnull()]
tt_notnulldata = ttdata.dropna()
train_y = tt_notnulldata['FuelEconomy'].values
train_x = tt_notnulldata.drop(['FuelEconomy','Date','StartTime'],axis=1)
test_x = tt_nulldata.drop(['FuelEconomy','Date','StartTime'],axis=1)
# one hot encoding the categorical variables
ct = ColumnTransformer([("ohe", OneHotEncoder(sparse=False), ['DayOfWeek','GoingTo','Take407All'])], remainder='passthrough')
train_x = ct.fit_transform(train_x)
# reuse the transformer fitted on the training rows; do not refit on the test rows
test_x = ct.transform(test_x)
# training the model and predicting the missing values
lr = LinearRegression()
lr.fit(train_x,train_y)
test_y = lr.predict(test_x)
Scikit-learn also provides some algorithms like IterativeImputer and KNNImputer to impute the missing value using estimation.
#using IterativeImputer to fill missing data
# don't drop the "FuelEconomy" feature from tt_notnulldata and tt_nulldata derived in the previous snippet;
# IterativeImputer expects numeric input, so restrict to the numeric columns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(tt_notnulldata.select_dtypes(include=np.number))
imputed_data = imputer.transform(tt_nulldata.select_dtypes(include=np.number))
#using KNNImputer to fill missing data (again on the numeric columns only)
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed_data = imputer.fit_transform(ttdata.select_dtypes(include=np.number))
Using Algorithms which remain unaffected by missing values
Algorithms like K-Nearest Neighbors and Naive Bayes are not inherently affected by missing values. When a certain feature of a row is missing, KNN can use the most frequent value of that feature among the neighbors, or simply ignore the missing feature in the distance measure. However, the KNN and Naive Bayes implementations provided by scikit-learn are not robust to missing values.
Using Python Libraries
- Scikit-learn: As discussed above, sklearn can be used for handling missing data through the SimpleImputer, KNNImputer and IterativeImputer classes it provides.
- fancyimpute: This is another Python library that uses machine learning algorithms for handling missing data. It primarily uses KNN and MICE (Multiple Imputation by Chained Equations) for imputation.
- Datawig: It uses deep learning feature extractors, with hyperparameters tuned automatically, to impute missing values.
In-depth Reading
- Handbook of Missing Data Methodology: 5 (Chapman & Hall/CRC Handbooks of Modern Statistical Methods)
- Flexible Imputation of Missing Data, Second Edition (Chapman & Hall/CRC Interdisciplinary Statistics)
- Feature Engineering and Selection: A Practical Approach for Predictive Models (Chapman & Hall/CRC Data Science)
- Data Mining: Practical Machine Learning Tools and Techniques