Implementation of Data Preprocessing

Gauransh Singh

Published May 30, 2020

Why do we need to do Preprocessing?

Most machine learning algorithms expect clean, numeric input, so the raw data has to be converted into a clean dataset and every column has to be made numeric. Categorical labels must be encoded as column vectors with binary values (one-hot encoding). Missing values (NaNs) are another common problem: you either drop the rows that contain them or fill them in with the mean or an interpolated value.
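As a rough illustration of these two steps, the sketch below one-hot encodes a categorical column and shows the three ways of handling a missing value mentioned above. The `toy` frame and its values are made up for the example (they are not taken from the Titanic data), and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up toy frame: one categorical column, one numeric column with a NaN.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 36.0],
})

# Encode the categorical column as binary indicator columns (one-hot encoding).
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)

# Option 1: drop every row that contains a missing value.
dropped = encoded.dropna()

# Option 2: fill the missing age with the mean of the known ages.
filled_mean = encoded.assign(Age=encoded["Age"].fillna(encoded["Age"].mean()))

# Option 3: fill the missing age by linear interpolation between its neighbours.
filled_interp = encoded.assign(Age=encoded["Age"].interpolate())

print(encoded)
print(len(dropped), "rows left after dropping")
```

Dropping is the simplest option but shrinks the dataset, while filling keeps every row at the cost of introducing estimated values.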

Loading data in pandas

To work on the data, you can load the CSV either in spreadsheet software such as Excel or in pandas. Let's load the CSV data in pandas.
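A minimal sketch of loading a CSV with pandas is shown here. So that it runs without any file on disk, it reads a few illustrative rows from an in-memory string. The rows are invented and only follow the Titanic column layout. With a real file you would pass its path to `pd.read_csv` instead (the path in the comment is a hypothetical name).

```python
import io
import pandas as pd

# Illustrative rows in the Titanic column layout (made-up values).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,T1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,T2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,T3,7.92,,S
"""

# With a file on disk this would be, for example: pd.read_csv("titanic.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # number of rows and columns
print(df.dtypes)        # inferred type of each column
print(df.isna().sum())  # count of missing values per column
```

Empty fields in the CSV are read as NaN, which is why the missing-value count is a useful first check after loading.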

Data Ingestion - Import the Titanic Dataset

Data Description

PassengerId - Unique identifier of the passenger

Survived - Whether the passenger survived: 0 = did not survive, 1 = survived

Pclass - Ticket class (1, 2 or 3)

Name - Name of the Passenger

Sex - Gender of Passenger

Age - Age of Passenger

SibSp - Number of siblings or spouses travelling with the passenger

Parch - Number of parents or children travelling with the passenger

Ticket - Ticket No

Fare - Ticket Value

Cabin - Cabin number

Embarked - Port at which the passenger embarked

Data Exploration - Profile the Data

Print the first 10 and the last 10 rows of the dataframe
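One way to look at the first and last rows is with `head` and `tail`. The sketch below uses a small stand-in dataframe with 25 made-up rows so that it runs on its own. In practice `df` would be the dataframe loaded from the Titanic CSV.

```python
import pandas as pd

# Stand-in frame with 25 made-up rows (assumption: replaces the real dataset).
df = pd.DataFrame({
    "PassengerId": range(1, 26),
    "Fare": [float(i) for i in range(25)],
})

top10 = df.head(10)     # first 10 rows
bottom10 = df.tail(10)  # last 10 rows

print(top10)
print(bottom10)
```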

Set the first column of the dataframe as its index
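A short sketch of this step, again on a made-up stand-in frame: `set_index` moves the first column out of the data and uses it as the row label. The name `indexed` is only illustrative.

```python
import pandas as pd

# Stand-in frame (made-up values) whose first column is an identifier.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

# Use the first column, whatever it is called, as the index.
first_col = df.columns[0]
indexed = df.set_index(first_col)

print(indexed)
# Rows can now be looked up by that identifier.
print(indexed.loc[2])
```

When reading from a file, passing `index_col=0` to `pd.read_csv` should give the same result in a single step.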

Problem with dropping rows having missing values

After dropping the rows with missing values, the dataset shrinks from 891 rows to 712, which means roughly 20% of the data is thrown away. Machine learning models need enough training data to perform well, so it is better to keep those rows and make use of them as much as we can.
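The trade-off can be sketched on a small made-up frame (the values below are invented, not the real Titanic numbers): dropping removes every row with a gap, while filling keeps all rows. Filling the numeric column with its mean follows the approach described earlier. Filling the categorical column with its most frequent value is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in a numeric and a categorical column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Embarked": ["S", "C", None, "S", "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()

# Dropping: every row with at least one missing value is lost.
dropped = toy.dropna()
rows_lost = len(toy) - len(dropped)

# Filling: keep all rows and estimate the missing entries instead.
imputed = toy.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].mean())
# Assumption for this sketch: use the most frequent port for missing ports.
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])

print(missing_per_column)
print("rows lost by dropping:", rows_lost)
print("rows kept by filling:", len(imputed))
```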

THANK YOU
