Let us start with an intuitive description before going into exploratory data analysis in detail.
Deep in a forest stands a temple that has been abandoned and unknown to the world for centuries. While reading an ancient book of mythology, you come across a paragraph that says, “there is a huge amount of wealth hidden in the backyard of the temple”.
There is also a map showing the location of the forest, and you are now very excited about a thrilling treasure hunt. But put a brake on the adrenaline rush and do some analysis first, because things are not as easy as you think. You need to plan extensively before marching into the forest, and a few of the points to check are listed below:
Whether the forest really exists. ✔
The exact location of the forest. ✔
How to get there. ✔
The area of the forest. ✔
Whether the forest is safe. ✔
The tools required for the treasure hunt. ✔
How much money and how many people are needed. ✔
These are only some of the points to consider; there are more, and you have to think about them too. By now you should have a clear idea of where we are heading. Obscurity is always dreadful, so before any kind of exploration you should clearly understand what you are doing, what you expect, and whether that expectation has any grounds, in order to reduce temper (time, effort, mess, pain, errors, rework).
Exploratory Data Analysis
It is one of the first steps in any machine learning or data science project. Often, exploratory data analysis alone is enough to get the job done without any machine learning. For example, business statistics such as the market trend, maximum profit, and minimum profit can be obtained using EDA alone.
It is the process of extracting high-level insights from data. These can be anything: summary statistics, the dimensions of the data, the datatypes of the features, the presence of null values or outliers, feature importance, feature correlation, feature distributions, feature attributes, an initial guess at which ML model suits the dataset, and so on.
There are no fixed rules and no sequential strategy for performing exploratory data analysis. You have to ask questions and use the right tools to get the answers. Python provides libraries such as numpy, pandas, scipy, matplotlib, and seaborn for all of this analysis.
Showing by doing brings better clarity, so let us perform exploratory data analysis on the Student’s academic performance dataset. Some features have been removed to reduce the dimensionality; the dataset, along with a description of its features, is shown below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

stdata = pd.read_csv('Student_performance.csv')
stdata[4:10]
- Gender: Student’s gender (nominal: ‘Male’ or ‘Female’)
- Educational Stages: Educational level the student belongs to (nominal: ‘lowerlevel’, ‘MiddleSchool’, ‘HighSchool’)
- Topic: Course topic (nominal: ‘English’, ‘Spanish’, ‘French’, ‘Arabic’, ‘IT’, ‘Math’, ‘Chemistry’, ‘Biology’, ‘Science’, ‘History’, ‘Quran’, ‘Geology’)
- Raised hand: Number of times the student raised a hand in the classroom (numeric: 0–100)
- Visited resources: Number of times the student visited course content (numeric: 0–100)
- Viewing announcements: Number of times the student checked new announcements (numeric: 0–100)
- Discussion groups: Number of times the student participated in discussion groups (numeric: 0–100)
- Student Absence Days: Number of days of absence for each student (nominal: above-7, under-7)
- Class: Range in which student’s mark lies (Low-Level: interval includes values from 0 to 69, Middle-Level: interval includes values from 70 to 89, High-Level: interval includes values from 90–100)
- Dataset has 480 rows and 9 columns.
- The feature names are [‘gender’, ‘StageID’, ‘Topic’, ‘raisedhands’, ‘VisITedResources’, ‘AnnouncementsView’, ‘Discussion’, ‘StudentAbsenceDays’, ‘Class’], of which five are categorical and four are quantitative.
- Dataset does not have any null values.
These three pieces of information can be obtained with a single line of code.
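The original one-liner is not reproduced here, so the following is only a sketch of one way to get the same three facts. It assumes pandas' `DataFrame.info()` is the kind of call meant, and it uses a tiny made-up frame in place of `Student_performance.csv` so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for the student dataset; the values are invented for illustration.
stdata = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "Topic": ["IT", "Math", "IT", "Arabic"],
    "raisedhands": [15, 70, 50, 10],
    "Class": ["L", "H", "M", "L"],
})

# One call prints the row count, column names, dtypes and non-null counts.
stdata.info()

# The same facts, one at a time.
n_rows, n_cols = stdata.shape
column_names = stdata.columns.tolist()
null_counts = stdata.isnull().sum()
```

On the real dataset, the same calls should report 480 rows, 9 columns, and no null values, as listed above.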
Statistical information about numerical features helps in getting the basic interpretation or summary of data.
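As a minimal sketch, such a summary can be produced with pandas' `describe()`. The frame below is a made-up stand-in for the real dataset; only the column names come from the feature list above.

```python
import pandas as pd

# Toy stand-in for the student dataset; the values are invented for illustration.
stdata = pd.DataFrame({
    "raisedhands": [0, 20, 50, 80, 100],
    "Discussion": [10, 30, 50, 70, 90],
    "Topic": ["IT", "Math", "IT", "Arabic", "IT"],
})

# By default describe() keeps only the numeric columns and reports
# count, mean, std, min, quartiles and max for each of them.
summary = stdata.describe()
print(summary)

# Categorical columns are usually summarised with value counts instead.
topic_counts = stdata["Topic"].value_counts()
```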
There are three types of analysis we will perform on the dataset:
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis
In univariate analysis, we examine each variable on its own, since there are various things we want to know about the features individually. The dataset above has five categorical and four quantitative features; we will analyze only a few of them.
- The above plots, a bar chart and a pie chart of the categorical feature “Topic”, show the number of students in each subject.
- It can be observed that the number of students studying “IT” is maximum and for “History” it is minimum.
- IT, Arabic, Science, English, and French are some of the most chosen subjects.
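A bar chart and pie chart like the ones discussed above could be drawn roughly as follows. This is only a sketch using pandas' plotting helpers on a handful of invented “Topic” values, not the code behind the original figures.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample of the "Topic" column, just enough to draw the charts.
topics = pd.Series(["IT", "IT", "IT", "Arabic", "Arabic", "History"], name="Topic")

# Number of students per topic, sorted from most to least frequent.
counts = topics.value_counts()

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))
counts.plot.bar(ax=ax_bar)
ax_bar.set_xlabel("Topic")
ax_bar.set_ylabel("Number of students")

counts.plot.pie(ax=ax_pie, autopct="%1.1f%%")
ax_pie.set_ylabel("")  # drop the default axis label next to the pie
```

Call `plt.show()` (or `fig.savefig(...)`) afterwards to see the figure.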
A boxplot shows the minimum, first quartile, median, third quartile, maximum, and interquartile range (IQR) of a quantitative feature in a single representation. It also helps in detecting outliers in the feature.
The above boxplot of the feature “raisedhands” clearly shows the following values:
- minimum = 0
- 1st quartile = 15.75
- median = 50
- 3rd quartile = 75
- max = 100
- IQR = 59.25
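The values a boxplot displays can also be computed directly. The sketch below uses seven invented “raisedhands” values (so its numbers differ from the ones listed above) and assumes the common 1.5 × IQR rule for flagging outliers, which is also matplotlib's default whisker setting.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for illustration only.
raised = pd.Series([0, 10, 20, 50, 70, 80, 100], name="raisedhands")

# Quartiles and interquartile range.
q1, median, q3 = raised.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = raised[(raised < lower_fence) | (raised > upper_fence)]

fig, ax = plt.subplots()
ax.boxplot(raised.to_numpy())
ax.set_ylabel("raisedhands")
```

Here the IQR is simply the third quartile minus the first, which is the same relationship behind the listed values: 75 − 15.75 = 59.25.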
A violin plot shows nearly everything a boxplot shows, along with the distribution of the feature’s data, which is mirrored so that the plot looks like a violin.
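A minimal sketch of a violin plot with seaborn follows. The data are randomly generated to have two clusters, purely as an assumption for illustration, so the shape will not match the real “raisedhands” column.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic two-cluster data standing in for "raisedhands" (range 0 to 100).
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(20, 8, 100), rng.normal(75, 10, 100)])
stdata = pd.DataFrame({"raisedhands": np.clip(values, 0, 100)})

fig, ax = plt.subplots()
# The mirrored outline is a kernel density estimate; the inner marks
# carry the same summary a boxplot would.
sns.violinplot(x=stdata["raisedhands"], ax=ax)
```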
A distribution plot shows the histogram and kernel density estimate (KDE) of a feature.
The above plot shows the histogram and kernel density estimate of the feature “raisedhands”, which looks somewhat like an overlap of two normal distributions.
In bivariate analysis, two features are considered together. Here we focus on pairwise analysis, that is, whether a change in one feature is associated with a change in another.
The above plot shows different boxplots for feature “raisedhands” based on different classes of students.
The above plot shows different violin plots for feature “raisedhands” based on different classes of students.
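Boxplots and violin plots split by class can be sketched with seaborn by putting the categorical feature on the x-axis. The values below are invented; only the column names and the class codes L, M, H come from this article.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented values for illustration only.
stdata = pd.DataFrame({
    "Class": ["L"] * 4 + ["M"] * 4 + ["H"] * 4,
    "raisedhands": [5, 10, 15, 20, 40, 50, 60, 70, 70, 80, 90, 95],
})
class_order = ["L", "M", "H"]

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
sns.boxplot(data=stdata, x="Class", y="raisedhands", order=class_order, ax=ax_box)
sns.violinplot(data=stdata, x="Class", y="raisedhands", order=class_order, ax=ax_violin)

# The per-class medians are the centre lines of the boxes.
medians = stdata.groupby("Class")["raisedhands"].median()
```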
The above plot shows different distribution plots for feature “raisedhands” based on different classes of students.
- The KDE of “raisedhands” peaks between 0 and 20 for students in class “L”.
- The KDE of “raisedhands” peaks between 70 and 90 for students in class “H”.
- The KDE of “raisedhands” for students in class “M” looks similar to the KDE of “raisedhands” without splitting by “Class”.
- The above plot shows that most students who were absent for fewer than 7 days visited resources more often.
- Students who were absent for more than 7 days visited resources less often.
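The type of plot used for this comparison is not stated, so the sketch below assumes a grouped summary plus a boxplot per absence group. The values are invented, and the labels ‘under-7’ and ‘above-7’ follow the feature description earlier.

```python
import pandas as pd

# Invented values for illustration only.
stdata = pd.DataFrame({
    "StudentAbsenceDays": ["under-7"] * 3 + ["above-7"] * 3,
    "VisITedResources": [80, 90, 70, 10, 25, 15],
})

# Summary of resource visits for each absence group.
by_absence = (
    stdata.groupby("StudentAbsenceDays")["VisITedResources"]
    .agg(["count", "median", "mean"])
)
print(by_absence)

# One box per absence group, drawn with pandas' built-in helper.
ax = stdata.boxplot(column="VisITedResources", by="StudentAbsenceDays")
```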
The above plot shows a roughly linear relationship between the features “raisedhands” and “AnnouncementsView”, i.e., the number of times a student views announcements increases as the number of times the student raises a hand in the classroom increases.
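A scatter plot with a fitted line is one common way to look at such a relationship. The sketch below uses seaborn's `regplot` on six invented points and also computes the Pearson correlation as a numerical check; it is an illustration, not the original figure's code.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented values for illustration only.
stdata = pd.DataFrame({
    "raisedhands": [10, 20, 30, 50, 70, 90],
    "AnnouncementsView": [12, 18, 33, 45, 72, 85],
})

fig, ax = plt.subplots()
# Scatter points plus a least-squares regression line.
sns.regplot(data=stdata, x="raisedhands", y="AnnouncementsView", ax=ax)

# Pearson correlation coefficient between the two features.
r = stdata["raisedhands"].corr(stdata["AnnouncementsView"])
```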
In multivariate analysis, more than two features are analyzed together, up to about four, because beyond that visualization in a 2-D representation becomes very difficult.
The above plot shows different boxplots for feature “raisedhands” based on different classes of students separately for each gender.
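Adding a third feature to a boxplot is usually done with seaborn's `hue` argument. The sketch below uses invented values, with ‘Male’ and ‘Female’ as in the feature description.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented values for illustration only.
stdata = pd.DataFrame({
    "gender": ["Male", "Female"] * 6,
    "Class": ["L"] * 4 + ["M"] * 4 + ["H"] * 4,
    "raisedhands": [5, 15, 10, 20, 40, 60, 50, 70, 70, 90, 80, 95],
})

fig, ax = plt.subplots()
# One group of boxes per class, one box per gender inside each group.
sns.boxplot(data=stdata, x="Class", y="raisedhands", hue="gender",
            order=["L", "M", "H"], ax=ax)

medians = stdata.groupby(["Class", "gender"])["raisedhands"].median()
```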
- The above plot shows that most students who were absent for fewer than 7 days visited resources more often and mostly belong to class “H”, i.e., they score higher marks.
- Students who were absent for more than 7 days visited resources less often and mostly belong to class “L”, i.e., they score lower marks.
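The same three-way relationship can also be checked numerically. This is a sketch on invented values using `pd.crosstab` and `pivot_table`; it is a tabular alternative to the plot discussed above, not a reproduction of it.

```python
import pandas as pd

# Invented values for illustration only.
stdata = pd.DataFrame({
    "StudentAbsenceDays": ["under-7"] * 3 + ["above-7"] * 3,
    "Class": ["H", "H", "M", "L", "L", "M"],
    "VisITedResources": [90, 80, 60, 10, 20, 40],
})

# How many students of each class fall into each absence group.
class_by_absence = pd.crosstab(stdata["StudentAbsenceDays"], stdata["Class"])

# Mean resource visits for every (absence group, class) combination;
# combinations that never occur show up as NaN.
mean_visits = stdata.pivot_table(index="StudentAbsenceDays", columns="Class",
                                 values="VisITedResources", aggfunc="mean")
print(class_by_absence)
print(mean_visits)
```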
- The above plot takes four features (“raisedhands”, “AnnouncementsView”, “gender”, “Class”) at the same time for analysis.
- It can be observed that most of the male students with higher values of “raisedhands” and “AnnouncementsView” lie in the L (lower) class.
- Most of the female students with higher values of “raisedhands” and “AnnouncementsView” lie in the H (higher) class.
- The lower values can be analyzed in the same way as in the two observations above.
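Four features can be shown at once by mapping two of them to the axes, one to color, and one to separate panels. The sketch below does this with seaborn's `relplot`; the layout is an assumption and the data are invented, so it will not reproduce the observations listed above.

```python
import pandas as pd
import seaborn as sns

# Invented values for illustration only.
stdata = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Class": ["L", "M", "H", "L", "M", "H"],
    "raisedhands": [10, 45, 80, 15, 55, 90],
    "AnnouncementsView": [12, 40, 75, 20, 50, 88],
})

# x and y carry the two numeric features, color encodes the class,
# and each gender gets its own panel.
g = sns.relplot(data=stdata, x="raisedhands", y="AnnouncementsView",
                hue="Class", hue_order=["L", "M", "H"], col="gender")
```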
A pair plot shows the relationship of each quantitative feature with every other one in a pairwise manner. It essentially compresses a few univariate, bivariate, and multivariate analyses into a single representation.
- The above representation shows that the features “raisedhands” and “VisITedResources” are somewhat linearly correlated.
- Similarly, “raisedhands” and “AnnouncementsView” are also somewhat linearly correlated.
- The features “Discussion” and “VisITedResources” are poorly correlated or nearly uncorrelated.
The numerical values of the correlation between features help confirm the analysis done with the pair plot.
In a heatmap, the value in each cell is mapped to a color intensity, which makes the data easier to visualize. Here the correlation between features is shown using different colors.
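As a rough sketch, the correlation values can be computed with pandas and passed to seaborn's `heatmap`. The data are synthetic, with relationships assumed purely for illustration, and the `coolwarm` colormap is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic features; the relationships between them are assumptions.
rng = np.random.default_rng(4)
raised = rng.uniform(0, 100, 200)
stdata = pd.DataFrame({
    "raisedhands": raised,
    "VisITedResources": 0.7 * raised + rng.normal(0, 10, 200),
    "AnnouncementsView": 0.6 * raised + rng.normal(0, 10, 200),
    "Discussion": rng.uniform(0, 100, 200),
    "Class": rng.choice(["L", "M", "H"], 200),
})

# Pairwise Pearson correlations between the numeric columns only.
corr = stdata.select_dtypes(include="number").corr()

fig, ax = plt.subplots()
# annot=True writes each value into its cell; vmin and vmax pin the
# color scale to the full correlation range of -1 to 1.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```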