DATA PREPROCESSING
Data preprocessing is the first step in a machine learning workflow: we transform raw data obtained from various sources into a usable format so that accurate machine learning models can be built on it.
Preprocessing phase:
GET THE DATASET
The first thing we require is a dataset, because a machine learning model works entirely on data. The data collected for a particular problem, in a proper format, is known as the dataset. To use the dataset, we usually put it into a CSV (Comma-Separated Values) file, which stores tabular data such as spreadsheets. Sometimes, however, we may also need to use an HTML or xlsx file.
We can also create our own dataset by gathering data through various APIs with Python and putting that data into a .csv file.
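As a rough sketch of that last idea, the snippet below writes a few records to a .csv file with pandas and reads them back. The records, their field names, and the temporary file location are all hypothetical stand-ins for whatever a real API call would return; no API is actually called here.

```python
import os
import tempfile
import pandas as pd

# Hypothetical records, standing in for the parsed response of an API call
records = [
    {"Country": "India", "Age": 38, "Salary": 68000},
    {"Country": "France", "Age": 43, "Salary": 45000},
]

# Write the records to a CSV file in a temporary directory (assumed location)
csv_path = os.path.join(tempfile.mkdtemp(), "Dataset.csv")
pd.DataFrame(records).to_csv(csv_path, index=False)

# Read the file back to confirm that it holds the same table
round_trip = pd.read_csv(csv_path)
print(round_trip)
```

Passing index=False keeps the pandas row index out of the file, so the CSV contains only the collected columns.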
IMPORTING LIBRARIES
Next, we need to import some predefined Python libraries, each of which performs a specific job. We will use three libraries for data preprocessing: NumPy, Matplotlib, and pandas.
IMPORTING DATASETS
Now we need to import the dataset that we have collected for our machine learning project. Before importing it, we need to set the directory that contains the dataset as the working directory. To set a working directory in the Spyder IDE, we need to follow the steps below:
read_csv() function:
To import the dataset, we will use the read_csv() function of the pandas library, which reads a csv file and returns its contents as a DataFrame. Using this function, we can read a csv file stored locally as well as one available through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the file name of our dataset.
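A minimal sketch of read_csv() in action is shown below. Since 'Dataset.csv' itself is not available here, the example reads from an in-memory string instead; the column names and values are assumptions made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of 'Dataset.csv'
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
data_set = pd.read_csv(io.StringIO(csv_text))

print(data_set.shape)          # (number of rows, number of columns)
print(list(data_set.columns))  # column names taken from the header row
```

With a real file, io.StringIO(csv_text) would simply be replaced by the path or URL of the csv file.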
Extracting dependent and independent variables:
It is important to distinguish the matrix of features (independent variables) from the dependent variable in the dataset.
Extracting independent variable:
To extract the independent variables, we will use the iloc[ ] indexer of the pandas library. It selects the required rows and columns from the dataset by their integer positions.
x= data_set.iloc[:,:-1].values
Inside iloc[ ], the part before the comma selects rows and the part after it selects columns. The first colon (:) takes all the rows, and :-1 takes every column except the last one, because the last column contains the dependent variable. By doing this, we get the matrix of features.
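The slicing can be checked on a small table. The sketch below uses a hypothetical four-column DataFrame (the column names and values are assumptions) and extracts both the matrix of features and the dependent variable, taking the last column with index -1.

```python
import pandas as pd

# Hypothetical dataset: three feature columns followed by the target column
data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany"],
    "Age": [38, 43, 30],
    "Salary": [68000, 45000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last: the matrix of features
x = data_set.iloc[:, :-1].values

# All rows, only the last column: the dependent variable
y = data_set.iloc[:, -1].values

print(x.shape)  # (3, 3)
print(y.shape)  # (3,)
```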
HANDLING MISSING DATA
If our dataset contains missing data, it can cause serious problems for our machine learning model, since many algorithms cannot work with missing values. Hence it is necessary to handle the missing values present in the dataset.
There are two ways of handling missing data: by deleting the particular row that contains the missing value, or by calculating the mean of the column and using it in place of the missing value.
To handle missing values, we will use the Scikit-learn library in our code, which contains various modules for building machine learning models.
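Both approaches can be sketched on a small table. The example below is only an illustration with made-up values: it uses pandas dropna() for deleting rows and SimpleImputer from sklearn.impute for mean replacement.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with one missing value each
df = pd.DataFrame({
    "Age": [38.0, np.nan, 30.0, 40.0],
    "Salary": [68000.0, 45000.0, np.nan, 61000.0],
})

# Way 1: delete every row that contains a missing value
dropped = df.dropna()

# Way 2: replace each missing value with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(df)

print(len(dropped))  # rows that survive deletion
print(filled)        # array with the gaps filled by column means
```

Deleting rows is simple but discards data, which matters on small datasets; mean replacement keeps every row at the cost of introducing estimated values.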
ENCODING CATEGORICAL DATA
Categorical data is data whose values fall into a set of categories. A machine learning model works entirely on mathematics and numbers, so a categorical variable in our dataset can cause trouble while building the model. It is therefore necessary to encode these categorical variables into numbers.
Dummy Variables: Dummy variables are variables that take the value 0 or 1. A value of 1 marks the presence of a category in its column, and the remaining dummy columns for that row are 0. With dummy encoding, we get a number of columns equal to the number of categories.
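A small sketch of dummy encoding with scikit-learn's OneHotEncoder follows. The Country values are hypothetical, and the result is converted to a dense array only to make it easy to print.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct categories
df = pd.DataFrame({"Country": ["India", "France", "Germany", "France"]})

# Each category becomes its own 0/1 column
encoder = OneHotEncoder()
dummies = encoder.fit_transform(df[["Country"]]).toarray()

print(encoder.categories_[0])  # categories in column order
print(dummies)                 # one row per sample, one column per category
```

Each row contains exactly one 1, in the column of that row's category.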
SPLITTING THE DATASET INTO TRAINING AND TEST SETS
Suppose we train our machine learning model on one dataset and then test it on a completely different dataset. The model will struggle, because the correlations it learned from the first dataset may not hold in the second.
If we train our model very well and its training accuracy is very high, but its performance drops when we give it new data, the model is of little use. So we always try to build a machine learning model that performs well on the training set and also on the test set.
Training Set: A subset of the dataset used to train the machine learning model, for which we already know the output.
Test Set: A subset of the dataset used to test the machine learning model. The model predicts the output for the test set, and the predictions are compared with the known values.
For splitting the dataset, we will use the following lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
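The first line imports train_test_split, which splits the arrays of the dataset into random train and test subsets. In the second line, test_size=0.2 puts 20% of the rows into the test set and the remaining 80% into the training set, and random_state=0 fixes the random shuffle so that the split is reproducible.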
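The effect of these parameters can be seen on a toy example. In the sketch below, x and y are made-up arrays with ten rows, so a test_size of 0.2 should leave eight rows for training and two for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and one target per sample
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)  # 8 training rows, 2 test rows
print(y_train.shape, y_test.shape)
```

The rows of x and y are shuffled together, so every feature row stays paired with its own target value.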
FEATURE SCALING
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates the others.
Many machine learning models, such as k-nearest neighbors, are based on Euclidean distance. If we do not scale the variables, a feature measured in the tens of thousands will dominate a feature measured in tens, and the distances the model computes will be distorted.
There are two ways to perform feature scaling in machine learning:
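Standardization : subtract the feature's mean from each value and divide by the feature's standard deviation, x' = (x - mean(x)) / std(x), so that the scaled feature has zero mean and unit variance.
Normalization : rescale each value to the range 0 to 1, x' = (x - min(x)) / (max(x) - min(x)).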
For feature scaling, we will import the StandardScaler class from the sklearn.preprocessing module as:
from sklearn.preprocessing import StandardScaler
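The sketch below applies StandardScaler to a small made-up training and test set, and also shows MinMaxScaler from the same module as one way to perform normalization. The feature values are assumptions chosen only to have very different scales.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales (for example, age and salary)
x_train = np.array([[30.0, 45000.0], [38.0, 54000.0],
                    [43.0, 68000.0], [35.0, 61000.0]])
x_test = np.array([[40.0, 50000.0]])

# Standardization: learn mean and standard deviation from the training set only
st_x = StandardScaler()
x_train_std = st_x.fit_transform(x_train)
x_test_std = st_x.transform(x_test)  # reuse the training statistics

# Normalization: rescale each training column to the range 0 to 1
mm_x = MinMaxScaler()
x_train_mm = mm_x.fit_transform(x_train)

print(x_train_std.mean(axis=0))  # approximately 0 for each column
print(x_train_mm.min(axis=0), x_train_mm.max(axis=0))
```

The scaler is fitted on the training set and only applied to the test set, so no information from the test data leaks into the preprocessing.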
Combining all the steps:
In the end, we can combine all the steps to make the complete code easier to understand.
SAMPLE CODE :
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# importing the dataset
data_set = pd.read_csv('Dataset.csv')
# extracting the independent variables (every column except the last)
x = data_set.iloc[:, :-1].values
# extracting the dependent variable (column index 3, the last column of this dataset)
y = data_set.iloc[:, 3].values
# handling missing data (replacing missing values with the column mean)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fitting the imputer object to the numeric independent variables in x
imputer = imputer.fit(x[:, 1:3])
# replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
# encoding the Country variable (column 0) as dummy variables
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# sparse_threshold=0 makes the transformer return a dense array
column_transformer = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x = np.array(column_transformer.fit_transform(x), dtype=float)
# encoding the Purchased variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# feature scaling: fit on the training set only, then apply to both sets
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
Some of these steps are not necessary for every machine learning model or dataset. For example, a dataset with no missing values needs no imputation. We can leave such steps out and keep the rest of the code as a reusable template.