Analyzing the data

Abhishek Roy

Published Jun 2, 2024

So far, I have shared some details regarding:

How a machine learning algorithm identifies patterns within a dataset.
The process of deducing a polynomial equation.
Python code for creating a model using a dataset.

This was just the tip of the iceberg.

An initial and extremely important step in building ML solutions is the analysis of the dataset, known as Exploratory Data Analysis (EDA).

Good EDA tends to improve model accuracy, because problems in the data are found and fixed before the model is trained.

Dataset used : Titanic - Machine Learning from Disaster (https://www.kaggle.com/competitions/titanic/data?select=train.csv)

Description of Dataset

PassengerId : Unique identifier of passenger
Survived : Whether the passenger survived (Yes = 1, No = 0)
Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Sex : Gender
Age : Age of passenger
SibSp : Number of siblings or spouses aboard the ship
Parch : Number of parents or children aboard the ship
Ticket : Ticket number
Fare : Passenger fare
Cabin : Cabin number
Embarked : Port where the passenger boarded the ship (C = Cherbourg, Q = Queenstown, S = Southampton)

Problem statement

Given the data, build a model that predicts which kinds of passengers were likely to survive.

Analysis of dataset :

The Survived column is our Y (the dependent variable).
The remaining columns are the independent variables, that is x1, x2, ..., xn.
From the column descriptions, it is reasonable to assume as a first pass that PassengerId, Cabin and Ticket have little effect on a passenger's survival, so these 3 columns can be removed.
Let's write some code to gain insight into the data.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')
#Columns which may impact the survival rate
columns = ['Survived', 'Pclass', 'Age', 'SibSp','Parch','Fare']

#How big is the data
df.shape

#Some sample rows
df.sample(5)

#What is the data type of cols
df.info()

#Are there any missing values?
df.isnull().sum()
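
The column choices described above can be sketched as a small helper. This is only a minimal sketch: the function name split_features_target is made up for illustration, and the tiny sample frame stands in for train.csv (it only borrows the Kaggle column names).

```python
import pandas as pd

def split_features_target(df, target="Survived",
                          drop_cols=("PassengerId", "Ticket", "Cabin")):
    """Return (X, y): X excludes the target and the assumed-uninformative columns."""
    # errors="ignore" keeps the helper usable if a column was already removed
    X = df.drop(columns=[target, *drop_cols], errors="ignore")
    y = df[target]
    return X, y

# Tiny made-up sample with the same column names as train.csv
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, None],
    "Ticket": ["T-001", "T-002", "T-003"],
    "Cabin": [None, "C85", None],
})

X, y = split_features_target(sample)
print(list(X.columns))
print(y.tolist())
```

With the real data, the same call would be split_features_target(df). Whether Cabin and Ticket are really uninformative is an assumption worth re-checking later.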

How does the data look mathematically

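A single line of code produces summary statistics (count, mean, standard deviation, min, quartiles and max) for the numeric columns: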

df.describe()
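
By default describe() summarises only the numeric columns, so text columns such as Sex and Embarked are left out. Here is a hedged sketch of how they could be summarised too. The small sample frame is made up and stands in for train.csv.

```python
import numpy as np
import pandas as pd

# Made-up rows using a few of the Titanic column names
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", None],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Excluding numeric columns makes describe() report count / unique / top / freq
text_summary = sample.describe(exclude=[np.number])
print(text_summary)

# Share of each category, keeping missing values visible
embarked_share = sample["Embarked"].value_counts(normalize=True, dropna=False)
print(embarked_share)
```

On the real data the same two calls can be run on df directly.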

Correlation of the dataset

columns = ['Survived', 'Pclass', 'Age', 'SibSp','Parch','Fare']
corrmat = df[columns].corr()
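# Boolean mask: True hides a cell. Hide only the cells above the diagonal (k=1)
# so each pair appears once and the diagonal stays visible.
mask = np.zeros_like(corrmat, dtype=bool)
mask[np.triu_indices_from(mask, k=1)] = True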
sns.heatmap(corrmat,
            vmax=1, vmin=-1,
            annot=True, annot_kws={'fontsize':7},
            mask=mask,
            cmap=sns.diverging_palette(20,220,as_cmap=True))

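The output of the above code is a correlation heatmap.

The same columns are listed on the X and Y axes.

Where Age on the X axis meets Age on the Y axis, the value is 1 and the color is blue, because every column is perfectly correlated with itself.

The larger the positive value, the deeper the blue, meaning a stronger positive correlation. The more negative the value, the deeper the color at the other end of the palette, meaning a stronger negative correlation.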
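
Reading the heatmap by eye works for a handful of columns. As a complement, the correlations with Survived can be listed and sorted by strength. This is a minimal sketch on made-up numbers, not on the real train.csv values.

```python
import pandas as pd

# Made-up rows; only the column names follow the Titanic dataset
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 1, 3, 2, 3],
    "Fare":     [7.25, 71.28, 53.10, 8.05, 13.00, 7.90],
})

# Take the Survived column of the correlation matrix, drop the self-correlation,
# and sort by absolute value so the strongest relationships come first
corr_with_target = (
    sample.corr()["Survived"]
    .drop("Survived")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

On the real data, sample.corr() would be replaced by df[columns].corr().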

Single Column Analysis

import matplotlib.pyplot as plt
# Age has missing values, so drop them before plotting
plt.hist(df['Age'].dropna(), bins=5)

The output is a histogram of Age split into 5 bins.

Some Facts

If the dataset is unbalanced, or has many null values, duplicated rows or widely dispersed points, the data needs to be fixed accordingly.

The better the dataset, the better the model accuracy.
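
The checks listed above can be gathered in one place. The helper below is a rough sketch under my own assumptions: the name data_quality_report and the sample rows are made up, and it covers only missing values, duplicated rows and class balance.

```python
import pandas as pd

def data_quality_report(df, target):
    """Collect a few basic quality indicators for a DataFrame."""
    return {
        # Fraction of missing values per column
        "missing_share": df.isnull().mean().to_dict(),
        # Number of rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
        # Fraction of rows in each target class (shows imbalance)
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }

# Made-up rows: one missing Age, one duplicated row, mostly class 0
sample = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 0],
    "Age": [22.0, None, 30.0, 41.0, 22.0],
    "Fare": [7.25, 8.05, 13.0, 71.28, 7.25],
})

report = data_quality_report(sample, target="Survived")
print(report)
```

What counts as too many nulls or too unbalanced depends on the problem, so the report only surfaces the numbers and leaves the decision to you.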
