Analyzing the data

So far, I have shared some details regarding:

  • How a machine learning algorithm identifies patterns within a dataset.
  • The process of deducing a polynomial equation.
  • Python code for creating a model using a dataset.

This was just the tip of the iceberg.

An initial and extremely important step in building ML solutions is the analysis of the dataset, known as Exploratory Data Analysis (EDA).

EDA surfaces problems such as missing values, duplicates, outliers, and imbalanced classes early; the cleaning and feature decisions that follow directly improve model accuracy.

Dataset used: Titanic - Machine Learning from Disaster (https://www.kaggle.com/competitions/titanic/data?select=train.csv)

Description of Dataset

  • PassengerId : Unique identifier of the passenger
  • Survived : Whether the passenger survived (Yes = 1, No = 0)
  • Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Sex : Gender
  • Age : Age of the passenger
  • SibSp : Number of siblings or spouses aboard the ship
  • Parch : Number of parents or children aboard the ship
  • Ticket : Ticket number
  • Fare : Passenger fare
  • Cabin : Cabin number
  • Embarked : Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Problem statement

Given the data, build a model that predicts which passengers are likely to survive.

Analysis of dataset:

  • The Survived column is our dependent variable Y.
  • The remaining columns are the independent variables x1, x2, …, xn.
  • From the given columns, it can reasonably be assumed that PassengerId, Cabin, and Ticket have no impact on a passenger's survival, so these three columns can be safely removed.
  • Let's write some code to gain insight into the data.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')
# Columns which may impact the survival rate
columns = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

#How big is the data
df.shape

#Some sample rows
df.sample(5)

#What is the data type of cols
df.info()

#Are there any missing values?
df.isnull().sum()
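Once `isnull().sum()` reveals which columns have gaps, the uninformative columns can be dropped and missing ages imputed. A minimal sketch on a toy frame that mimics the Titanic schema (the values are illustrative, and median imputation is just one common strategy, not the only option):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the Titanic schema (values are illustrative only)
df = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Ticket': ['A/5 21171', 'PC 17599', '113803'],
    'Cabin': [None, 'C85', None],
    'Age': [22.0, np.nan, 26.0],
    'Survived': [0, 1, 1],
})

# Drop the columns assumed not to influence survival
df = df.drop(columns=['PassengerId', 'Ticket', 'Cabin'])

# Fill missing ages with the column median (one common imputation choice)
df['Age'] = df['Age'].fillna(df['Age'].median())

print(df.isnull().sum().sum())  # 0 — no missing values remain
```

On the real train.csv, the same two lines (`drop` and `fillna`) apply unchanged, since the column names match.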
        

How does the data look mathematically

A single line of code can summarize it:

df.describe()        
[Figure: summary statistics table produced by df.describe()]
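For each numeric column, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum. A tiny self-contained illustration of what those fields contain:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
desc = s.describe()

# count, mean, min, median (50th percentile), max
print(desc['count'], desc['mean'], desc['min'], desc['50%'], desc['max'])
# 5.0 3.0 1.0 3.0 5.0
```

On a DataFrame, the same statistics are computed per column, which is exactly the table shown above.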

Correlation of the dataset

columns = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
corrmat = df[columns].corr()
# Mask the upper triangle (above the diagonal) so each pair appears only once
mask = np.zeros_like(corrmat, dtype=bool)
mask[np.triu_indices_from(mask, k=1)] = True
sns.heatmap(corrmat,
            vmax=1, vmin=-1,
            annot=True, annot_kws={'fontsize': 7},
            mask=mask,
            cmap=sns.diverging_palette(20, 220, as_cmap=True))

The output of the above code is:

[Figure: correlation heatmap of the selected columns]

The same columns are listed on both the X and Y axes.

Where a column meets itself (for example, Age against Age), the value is 1 and the color is deep blue: every variable is perfectly correlated with itself.

In general, the larger the positive value, the deeper the blue, indicating a strong positive correlation; the more negative the value, the deeper the opposite color, indicating a strong negative correlation.
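The numbers in the heatmap are Pearson correlation coefficients, which always fall in [-1, 1]. A minimal check on toy data (the arrays here are invented for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = 2 * x + 1   # perfectly linearly related to x
z = -x          # perfectly anti-correlated with x

# Off-diagonal entry of the 2x2 correlation matrix
print(np.corrcoef(x, y)[0, 1])  # 1.0
print(np.corrcoef(x, z)[0, 1])  # -1.0
```

`df[columns].corr()` computes exactly these pairwise coefficients for every pair of columns, and the heatmap simply colors them.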

Single Column Analysis

import matplotlib.pyplot as plt
plt.hist(df['Age'],bins=5)        

The output is:

[Figure: histogram of the Age column with 5 bins]
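The same binning can also be inspected numerically, without a plot, using `pd.cut` — useful when you want the counts rather than a picture. A sketch on invented ages (the bin count mirrors `bins=5` above):

```python
import pandas as pd

ages = pd.Series([4, 15, 22, 29, 35, 41, 58, 63, 70, 80])

# 5 equal-width bins over the observed range, like plt.hist(..., bins=5)
binned = pd.cut(ages, bins=5)
print(binned.value_counts().sort_index())
```

Each row of the output is one interval and the number of passengers whose age falls inside it, i.e. the height of the corresponding histogram bar.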

Some Facts

If the dataset is unbalanced, has many null values, contains duplicated rows, or has widely dispersed (outlier) points, the data needs to be fixed accordingly before training.
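Each of these conditions can be checked with one line of pandas. A sketch on a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({'Survived': [0, 0, 0, 1, 1, 0],
                   'Age': [22, 38, 26, 35, 35, 22]})

# Class balance: share of each label (a large skew means an unbalanced dataset)
print(df['Survived'].value_counts(normalize=True))

# Total null values and duplicated rows
print(df.isnull().sum().sum())   # 0
print(df.duplicated().sum())     # 2
```

Here the classes are skewed (about two thirds of the labels are 0) and two rows are exact duplicates, so a real pipeline would rebalance and deduplicate before fitting a model.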

The better the dataset, the better the model accuracy.

