Analyzing the data
So far, I have shared some details regarding:
This was just the tip of the iceberg.
An initial and extremely important step in building ML solutions is the analysis of the dataset, known as Exploratory Data Analysis (EDA).
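A careful EDA usually leads to better model accuracy, because it surfaces data-quality problems (missing values, skewed distributions, redundant columns) before any model is trained.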
Dataset used: Titanic - Machine Learning from Disaster (https://www.kaggle.com/competitions/titanic/data?select=train.csv)
Description of Dataset
Problem statement
Given the passenger data, build a model that predicts which passengers were likely to survive.
Analysis of the dataset:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('train.csv')
#Columns which may impact the survival rate
columns = ['Survived', 'Pclass', 'Age', 'SibSp','Parch','Fare']
#How big is the data
df.shape
#Some sample rows
df.sample(5)
#What is the data type of cols
df.info()
#Are there any missing values?
df.isnull().sum()
How does the data look statistically?
A single line of code summarizes the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column:
df.describe()
Correlation of the dataset
columns = ['Survived', 'Pclass', 'Age', 'SibSp','Parch','Fare']
corrmat = df[columns].corr()
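#Boolean mask: True hides a cell, False shows it
mask = np.zeros_like(corrmat, dtype=bool)
#Hide the cells above the diagonal (they mirror the cells below it); k=1 keeps the diagonal visible
mask[np.triu_indices_from(mask, k=1)] = True
sns.heatmap(corrmat,
            vmax=1, vmin=-1,
            annot=True, annot_kws={'fontsize':7},
            mask=mask,
            cmap=sns.diverging_palette(20,220,as_cmap=True))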
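The output of the above code is a correlation heatmap.
The same columns are listed on the X and Y axes.
Where Age on the X axis meets Age on the Y axis, the value is 1 and the cell is blue, because every column is perfectly correlated with itself.
The greater the positive value, the deeper the blue and the stronger the positive correlation. The more negative the value, the deeper the color at the opposite (reddish) end of the palette and the stronger the negative correlation.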
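To make the numbers in the heatmap less abstract, here is a minimal sketch of the Pearson correlation coefficient, which is what df.corr() computes by default. The helper name pearson and the toy values are assumptions made for illustration; they are not rows from train.csv.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the 1/n factors cancel out in the ratio)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values, NOT taken from the Titanic dataset
age = [22.0, 38.0, 26.0, 35.0, 54.0]
rising = [2 * a + 1 for a in age]   # grows whenever age grows
falling = [100 - a for a in age]    # shrinks whenever age grows

print(pearson(age, age))      # a column compared with itself: the diagonal of the heatmap
print(pearson(age, rising))   # perfectly positive relationship
print(pearson(age, falling))  # perfectly negative relationship
```

The first two calls give a value at (or extremely close to) 1 and the last one a value close to -1, which correspond to the two ends of the color scale. Real columns such as Fare and Pclass land somewhere in between.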
Single Column Analysis
import matplotlib.pyplot as plt
#Age has missing values, so drop them before plotting
plt.hist(df['Age'].dropna(), bins=5)
The output is a histogram of passenger ages split into 5 bins.
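What bins=5 means can be seen by computing the bins directly. The sketch below uses np.histogram, which matplotlib's hist relies on for its binning, on a handful of made-up ages (an assumption for illustration, not values from train.csv).

```python
import numpy as np

# Made-up ages for illustration, NOT taken from the Titanic dataset
ages = np.array([2, 19, 22, 24, 26, 28, 30, 35, 38, 40, 54, 80, np.nan])
ages = ages[~np.isnan(ages)]  # drop missing values first

# Split the range [min, max] into 5 equal-width bins and count values per bin
counts, edges = np.histogram(ages, bins=5)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count} passengers")
```

Five bins always produce six edges, and the bins have equal width. With so few bins the shape of the distribution is only roughly visible, so it can be worth trying a larger value as well.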
Some Facts
If the dataset is unbalanced, or has many null values, duplicated rows, or widely dispersed points (outliers), it needs to be fixed accordingly before training.
The better the dataset, the better the model accuracy.
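The checks mentioned above can be sketched in a few lines of pandas. The toy frame below contains made-up rows (not the real train.csv), and filling Age with the median is just one possible fix chosen for illustration, not the only or necessarily the best one.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows, NOT the real Titanic data
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 0, 1, 0, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, 38.0, np.nan, 22.0],
    "Fare": [7.25, 71.28, 8.05, 53.10, 71.28, 8.46, 7.25],
})

# Is the target balanced? Share of each class in 'Survived'
class_share = toy["Survived"].value_counts(normalize=True)

# Fraction of missing values per column
missing_share = toy.isnull().mean()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(class_share)
print(missing_share)
print("duplicate rows:", duplicate_rows)

# One possible clean-up: drop duplicates, then fill missing ages with the median
cleaned = toy.drop_duplicates().copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
print(cleaned)
```

On the real dataset the same three checks can be run on df directly. Which fix is appropriate depends on the column and on how much data is affected.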