Implementation of Data Preprocessing

Why do we need Preprocessing?

For machine learning algorithms to work, the raw data must first be converted into a clean dataset, and that dataset must be numeric. All categorical labels have to be encoded, for example as column vectors with binary values. Missing values (NaNs) are another common problem: you have to either drop the rows that contain them or fill them in with a mean or interpolated value.
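As a minimal sketch of the two steps described above, the snippet below fills a missing value with the column mean and one-hot encodes a categorical column. The tiny DataFrame is made up for illustration, and `pd.get_dummies` is only one common way to do the encoding; the article does not say which method it used.

```python
import numpy as np
import pandas as pd

# Made-up toy data: one categorical column, one numeric column with a NaN.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 30.0],
})

# Fill the missing age with the mean of the known ages.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# One-hot encode the categorical column into binary indicator columns.
encoded = pd.get_dummies(df, columns=["Sex"], dtype=int)
print(encoded)
```

After this, every column in `encoded` is numeric and contains no missing values.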


Loading data in pandas

To work on the data, you can load the CSV either in Excel or in pandas. Let's load the CSV data in pandas.

Data Ingestion - Import the Titanic Dataset
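The original screenshot for this step is not available, so here is a minimal sketch of how the import could look with `pd.read_csv`. The helper name `load_titanic` is hypothetical, and the sample rows are made up; the header is assumed to follow the Kaggle `train.csv` layout.

```python
import io
import pandas as pd

def load_titanic(source):
    """Read a Titanic-style CSV from a file path or a file-like object."""
    return pd.read_csv(source)

# Made-up rows in the assumed column layout, so the sketch runs without the file.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Doe, Mr. John",male,22,1,0,T 0001,7.25,,S\n'
    '2,1,1,"Roe, Mrs. Jane",female,38,1,0,T 0002,71.28,C85,C\n'
    '3,1,3,"Poe, Miss. Ann",female,,0,0,T 0003,7.92,,S\n'
)

df = load_titanic(sample_csv)
print(df.shape)  # (number of rows, number of columns)
```

With the real dataset on disk, the call would be something like `load_titanic("train.csv")`, where the file name is an assumption.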

Data Description

PassengerId       - Unique ID of the passenger

Survived          - Whether the passenger survived: 0 = did not survive, 1 = survived

Pclass            - Ticket class: 1, 2 or 3

Name              - Name of the passenger

Sex               - Gender of the passenger

Age               - Age of the passenger

SibSp             - Number of siblings or spouses travelling with the passenger

Parch             - Number of parents or children travelling with the passenger

Ticket            - Ticket number

Fare              - Ticket price

Cabin             - Cabin number

Embarked          - Port at which the passenger embarked

Data Exploration - Profile the Data

Print the first 10 and last 10 rows of the dataframe
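A minimal sketch of this step uses `DataFrame.head` and `DataFrame.tail`. The 25-row frame below is a made-up stand-in so the snippet runs on its own; in practice `df` would be the loaded Titanic frame.

```python
import pandas as pd

# Made-up stand-in frame with 25 rows.
df = pd.DataFrame({
    "PassengerId": range(1, 26),
    "Fare": [10.0 + i for i in range(25)],
})

top10 = df.head(10)     # first 10 rows
bottom10 = df.tail(10)  # last 10 rows

print(top10)
print(bottom10)
```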

Set the index of the dataframe to be the first column
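One way to do this is `DataFrame.set_index`, sketched below on a made-up three-row frame. The column names are assumed to follow the Kaggle Titanic layout, where the first column is `PassengerId`.

```python
import pandas as pd

# Made-up stand-in for the first few Titanic columns.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

# Use the first column (PassengerId here) as the row index.
df = df.set_index(df.columns[0])
print(df)
```

The same result can be obtained at load time by passing `index_col=0` to `pd.read_csv`.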

Problem with dropping rows having missing values

After dropping rows with missing values, the dataset shrinks from 891 rows to 712, which means we are throwing data away. Machine learning models need data to train well, so we should preserve the data and use as much of it as we can.
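The trade-off can be sketched on a small made-up frame: `dropna` discards every row that has any missing value, while filling keeps all rows. Filling `Age` with the mean follows the option mentioned earlier in the article; filling `Embarked` with its most frequent value is an assumption made here for illustration, not something the article prescribes.

```python
import numpy as np
import pandas as pd

# Made-up toy data with missing values in two columns.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 36.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Count missing values per column before deciding what to do.
print(df.isnull().sum())

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()
print(len(df), "->", len(dropped))

# Option 2: keep all rows and fill the gaps instead.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
print(len(df), "->", len(filled))
```

Here dropping removes three of the five rows, while filling keeps all five.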

THANK YOU
