DATA PREPROCESSING

Data preprocessing steps in machine learning

Data preprocessing is the first step in machine learning, in which we transform raw data obtained from various sources into a usable format so that we can build accurate machine learning models.

Preprocessing phase:

  1. Getting the data set
  2. Importing libraries
  3. Importing datasets
  4. Finding missing data
  5. Encoding categorical data
  6. Splitting dataset into Training and Test set
  7. Feature scaling

GET THE DATASET

The first thing we require is a dataset, since a machine learning model works entirely on data. The data collected for a particular problem, arranged in a proper format, is known as the dataset. To use the dataset, we usually put it into a CSV (Comma-Separated Values) file, which stores tabular data such as spreadsheets. However, sometimes we may also need to use an HTML or xlsx file.

We can also create our own dataset by gathering data through various APIs with Python and saving that data into a .csv file.

IMPORTING LIBRARIES

We need to import some predefined Python libraries, each of which performs a specific job. There are three libraries in particular that we will use for data preprocessing:

  • Numpy : the fundamental package for scientific computation in Python, used for including any type of mathematical operation in the code. ("import numpy as nm")
  • Matplotlib : a Python 2D plotting library; with it we also import the sub-library pyplot. It is used to plot any type of chart in Python. ("import matplotlib.pyplot as mtp")
  • Pandas : one of the most popular Python libraries, used for importing and managing datasets. ("import pandas as pd")

IMPORTING DATASETS

Now we need to import the datasets which we have collected for our machine learning project. But before importing a dataset, we need to set the directory containing it as the working directory. To set a working directory in Spyder IDE, we follow the steps below:

  1. Save your Python file in the directory that contains the dataset.
  2. Go to the File explorer option in Spyder IDE and select the required directory.
  3. Press F5 or click the Run option to execute the file.

read_csv() function:

Now to import the dataset, we will use the read_csv() function of the pandas library, which reads a CSV file and lets us perform various operations on it. Using this function, we can read a CSV file locally as well as through a URL.

data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the name of our dataset file.
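Since the actual 'Dataset.csv' is not included here, a minimal sketch using an in-memory CSV (with made-up values; the Country/Age/Salary/Purchased layout matches the columns referenced in the sample code later in this article) shows how read_csv() loads tabular data:

```python
import io
import pandas as pd

# In-memory stand-in for 'Dataset.csv' (hypothetical values)
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

data_set = pd.read_csv(io.StringIO(csv_text))
print(data_set.shape)  # (3, 4): three rows, four columns
```

With a real file on disk, the io.StringIO wrapper is simply replaced by the file name, exactly as in the article's one-liner.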

Extracting dependent and independent variables:

It is important to distinguish the matrix of features (independent variables) from the dependent variable in the dataset.

Extracting independent variable:

To extract the independent variables, we will use the iloc[ ] method of the Pandas library, which extracts the required rows and columns from the dataset.

      x= data_set.iloc[:,:-1].values          

The first colon (:) takes all the rows, and the second colon (:) takes the columns. Here we have used :-1 because we don't want the last column, which contains the dependent variable. By doing this, we get the matrix of features.
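A small sketch of this extraction on hypothetical data shows how iloc splits the dataset into the feature matrix and the dependent variable vector:

```python
import io
import pandas as pd

# Hypothetical mini-dataset: the last column is the dependent variable
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
"""
data_set = pd.read_csv(io.StringIO(csv_text))

x = data_set.iloc[:, :-1].values  # all rows, all columns except the last
y = data_set.iloc[:, -1].values   # all rows, last column only

print(x.shape)  # (2, 3)
print(y.shape)  # (2,)
```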

HANDLING MISSING DATA

If our dataset contains some missing data, then it may create a huge problem for our machine learning model. Hence it is necessary to handle missing values present in the dataset.

Two ways of handling missing data: by deleting the particular row, or by replacing the missing value with the mean of the column.

To handle missing values, we will use the Scikit-learn library in our code, which provides various classes and functions for building machine learning models.
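A minimal sketch of the mean-replacement approach, assuming a recent scikit-learn where the old Imputer class has been replaced by SimpleImputer in sklearn.impute (the numbers are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Feature matrix with one missing Age and one missing Salary
x = np.array([[44.0, 72000.0],
              [np.nan, 48000.0],
              [30.0, np.nan]])

# Replace each np.nan with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x = imputer.fit_transform(x)

print(x[1, 0], x[2, 1])  # 37.0 60000.0
```

The missing Age becomes (44 + 30) / 2 = 37, and the missing Salary becomes (72000 + 48000) / 2 = 60000.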

ENCODING CATEGORICAL DATA

Categorical data is data that is divided into categories. A machine learning model works entirely on mathematics and numbers, so if our dataset has a categorical variable, it may create trouble while building the model. It is therefore necessary to encode these categorical variables into numbers.

Dummy Variables: Dummy variables are variables that take only the values 0 or 1. A 1 indicates the presence of that category in a particular row, while the rest of the dummy columns are 0. With dummy encoding, we get a number of columns equal to the number of categories.
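As a quick illustration of dummy encoding (the country values are made up; pandas' get_dummies is used here for brevity, while the sample code at the end uses scikit-learn's OneHotEncoder for the same purpose):

```python
import pandas as pd

# Four rows, three distinct categories
countries = pd.Series(['France', 'Spain', 'Germany', 'Spain'])
dummies = pd.get_dummies(countries)

# One column per category; exactly one 1 per row
print(dummies.shape)  # (4, 3)
```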

SPLITTING DATASET INTO TRAINING AND TEST SET

Suppose we train our machine learning model on one dataset and then test it on a completely different dataset. The model will then struggle to understand the correlations between the variables.

If we train our model very well and its training accuracy is very high, its performance may still decrease when we give it a new dataset. So we always try to build a machine learning model that performs well on the training set and also on the test set.


Training Set: A subset of the dataset used to train the machine learning model; we already know the output for it.

Test set: A subset of the dataset used to test the machine learning model; the model predicts the output for the test set.

For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)          

The first line imports the function that splits arrays of the dataset into random train and test subsets.

  • x_train: features for the training data
  • x_test: features for testing data
  • y_train: Dependent variables for training data
  • y_test: Dependent variable for testing data
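A self-contained sketch of the split on toy data shows how test_size=0.2 holds out 20% of the rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)
y = np.arange(10)                 # 10 labels

# test_size=0.2 holds out 20% of rows; random_state fixes the shuffle
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
```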

FEATURE SCALING

This is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range. In feature scaling, we put our variables on the same scale so that no variable dominates the others.

Many machine learning models are based on Euclidean distance, and if we do not scale the variables, features with larger ranges will dominate the distance calculation and cause issues in the model.


There are two ways to perform feature scaling in machine learning:

Standardization : x' = (x - mean(x)) / standard deviation(x)

Normalization : x' = (x - min(x)) / (max(x) - min(x))

For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing module as:

from sklearn.preprocessing import StandardScaler         
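A short sketch of how StandardScaler is then used (the numbers are made up; note how the large salary-like column no longer dominates after scaling):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
x_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])

st_x = StandardScaler()
x_scaled = st_x.fit_transform(x_train)

# Each column now has mean 0 and unit (population) standard deviation
print(x_scaled.mean(axis=0).round(6))  # [0. 0.]
print(x_scaled.std(axis=0).round(6))   # [1. 1.]
```

On the test set we call only transform(), not fit_transform(), so the test data is scaled with the statistics learned from the training data.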

Combining all the steps:

In the end, we can combine all the steps together to make our complete code more understandable.

SAMPLE CODE :
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing datasets
data_set = pd.read_csv('Dataset.csv')

# Extracting Independent Variables
x = data_set.iloc[:, :-1].values

# Extracting Dependent variable
y = data_set.iloc[:, -1].values

# handling missing data (replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')

# Fitting imputer object to the numeric independent variables (columns 1 and 2)
imputer = imputer.fit(x[:, 1:3])

# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])

# Encoding the categorical Country column (column 0) into dummy variables
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
column_transformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = nm.array(column_transformer.fit_transform(x))

# encoding the purchased variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

There are some steps or lines of code which are not necessary for all machine learning models. So we can exclude them from our code to make it reusable for all models.
