Implementation of Data Preprocessing
Why do we need preprocessing?
For machine learning algorithms to work, the raw data must be converted into a clean, numeric dataset. Categorical labels have to be encoded as column vectors with binary values (one-hot encoding). Missing values (NaNs) are a common problem: you either drop the rows that contain them or fill them in with the column mean or with interpolated values.
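The two steps described above can be sketched on a tiny, made-up table. The column names mirror the Titanic data, but the values are placeholders. Using `pd.get_dummies` for the encoding and the column mean for the fill is one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic file).
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 30.0],
})

# One-hot encode the categorical column into binary indicator columns.
encoded = pd.get_dummies(df, columns=["Sex"], dtype=int)

# Fill the missing age with the mean of the observed ages.
encoded["Age"] = encoded["Age"].fillna(encoded["Age"].mean())

print(encoded)
```

After this, every column is numeric and no NaNs remain, which is what most estimators expect as input.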
Loading data in pandas
To work with the data, you can load the CSV either in Excel or in pandas. Let's load the CSV data in pandas.
Data Ingestion - Import the Titanic Dataset
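A minimal sketch of the import step, assuming the data is available as a CSV with the standard Titanic column names. The file name `train.csv` in the comment is an assumption, and the three inline rows are made-up placeholders so the snippet runs without the real file.

```python
import io

import pandas as pd

# With the real file on disk you would typically write:
#   df = pd.read_csv("train.csv")   # hypothetical path
# Here a small in-memory CSV with placeholder rows stands in for it.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,22,1,0,T-001,7.25,,S\n"
    "2,1,1,Passenger B,female,,1,0,T-002,71.28,C85,C\n"
    "3,1,3,Passenger C,female,26,0,0,T-003,7.92,,\n"
)
df = pd.read_csv(sample_csv)

# Empty fields in the CSV are read as NaN.
print(df.shape)
print(df.isna().sum())
```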
Data Description
PassengerId
Survived - Whether the passenger survived: 0 = did not survive, 1 = survived
Pclass - Ticket class (1, 2, or 3)
Name - Name of the Passenger
Sex - Gender of Passenger
Age - Age of Passenger
SibSp - Number of siblings or spouses travelling with the passenger
Parch - Number of parents or children travelling with the passenger
Ticket - Ticket number
Fare - Ticket Value
Cabin - Cabin number
Embarked - Port at which the passenger embarked
Data Exploration - Profile the Data
Print the first 10 and last 10 rows of the DataFrame
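In pandas this is usually done with `head` and `tail`. The sketch below uses a made-up 25-row frame in place of the Titanic DataFrame so that it runs on its own.

```python
import pandas as pd

# Placeholder frame standing in for the loaded Titanic data.
df = pd.DataFrame({
    "PassengerId": range(1, 26),
    "Fare": [float(i) for i in range(25)],
})

top10 = df.head(10)      # first 10 rows
bottom10 = df.tail(10)   # last 10 rows

print(top10)
print(bottom10)
```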
Set the index of the dataframe to be the first column
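One way to do this is `set_index` with the name of the first column, which for the Titanic data is `PassengerId`. The frame below is a small made-up stand-in.

```python
import pandas as pd

# Placeholder frame with the same leading columns as the Titanic data.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

# Use the first column as the row index.
df = df.set_index(df.columns[0])

print(df)
```

The same result can be had at load time by passing `index_col=0` to `pd.read_csv`.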
Problem with dropping rows having missing values
After dropping rows with missing values, the dataset shrinks from 891 rows to 712, so roughly 20% of the data is thrown away. Machine learning models generally perform better with more training data, so it is preferable to keep these rows and fill in the missing values instead.
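The trade-off can be seen on a small, made-up example. Filling `Age` with the mean follows the option mentioned earlier. Filling the categorical `Embarked` column with its most frequent value is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical values).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill the gaps instead.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(len(df), len(dropped), len(filled))
```

Dropping keeps only the complete rows, while filling preserves every row for training.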