Start() with python & R - Machine Learning - Data Pre-processing

This is the very first Python and R write-up from my lab....

The scripts below help with pre-processing data for machine learning. The idea is to import the data and fill missing values in a column with that column's mean, median, or most frequent value. Once that is done, categorical data needs to be encoded as numbers such as "1" or "0" so that a machine learning algorithm can work with it. Finally, I split the data into two sets, train and test. The reason behind the split is to fit the algorithm on the training set and then check that it performs adequately well on the unseen test set; if it doesn't, the algorithm needs more tuning.

Let's look at the data first. It has four columns (Country, Age, Salary and Purchased) and ten rows, excluding the header row at the top.

You can download the CSV file from https://www.superdatascience.com/machine-learning/
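If you don't want to download the file, a small stand-in CSV with the same four-column schema can be generated so the scripts below run end to end. The values here are invented for illustration, not the actual dataset:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for Data.csv: same schema (Country, Age, Salary,
# Purchased), ten rows, with one missing Age and one missing Salary.
# All values are made up for demonstration purposes only.
data = {
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes',
                  'Yes', 'No', 'Yes', 'No', 'Yes'],
}
pd.DataFrame(data).to_csv('Data.csv', index=False)
```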

In Python:

Python numbers columns starting from 0 (0, 1, 2, 3, and so on), which is why some of the column selections in the scripts below may look unusual.

#Data Preprocessing

#Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

#Taking care of missing data
#(the old Imputer class has been removed from scikit-learn; SimpleImputer is its replacement)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
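The strategy parameter is what selects the mean, median, or most-frequent filling mentioned earlier. A quick sketch on a toy column (values invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A single column with one missing value.
col = np.array([[40.0], [np.nan], [30.0], [50.0]])

# Each strategy fills the NaN with a different summary of the observed values.
for strategy in ('mean', 'median', 'most_frequent'):
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imp.fit_transform(col)
    print(strategy, filled.ravel())
```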

#Encoding categorical data
#(OneHotEncoder's categorical_features argument has been removed; ColumnTransformer selects the column now)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
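The effect of the two encoders is easier to see on a toy column. LabelEncoder maps each category to an integer, but integers imply an ordering, so for the country column the one-hot encoder expands it into one 0/1 indicator column per category instead:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array([['France'], ['Spain'], ['Germany'], ['France']])

# LabelEncoder assigns integers in alphabetical order of the categories:
# France -> 0, Germany -> 1, Spain -> 2.
le = LabelEncoder()
codes = le.fit_transform(countries.ravel())
print(codes)

# OneHotEncoder produces one indicator column per category (same
# alphabetical order), so 'France' becomes the row [1, 0, 0].
onehot = OneHotEncoder().fit_transform(countries).toarray()
print(onehot)
```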

#Splitting the dataset into the Training set and Test set
#(train_test_split moved from sklearn.cross_validation to sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
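On ten rows, test_size = 0.2 means eight rows end up in the training set and two in the test set; random_state pins the shuffle so the same split comes back on every run. A sketch on ten toy samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # ten toy samples, two features each
y = np.arange(10)

# 20% of ten rows -> two rows held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```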

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test) #transform only: reuse the mean/std learned from the training set
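In a nutshell, StandardScaler subtracts the mean and divides by the standard deviation learned from the training set; the test set should reuse those training statistics (transform, not fit_transform) so both sets live on the same scale. A toy check with invented numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.0], [10.0]])

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)  # learns mean and std from the training data
X_test_s = sc.transform(X_test)        # reuses the training mean/std

print(X_train_s.ravel())  # standardised training data: mean 0, unit variance
print(X_test_s.ravel())   # test data scaled with the training statistics
```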


In R:

R numbers columns starting from 1 (1, 2, 3, and so on), and in many ways it is easier than Python, as you can see from how the encoding is done.

#Data Preprocessing

#Importing the dataset
dataset = read.csv('Data.csv')

#Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)


#Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))

#Splitting the dataset into the training set and test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)


#Feature scaling
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])

Appendix/Results:

Dataset - Python

Dataset - R

Output of train - Python

Output of test - Python

Output of train - R

Output of test - R


I will share the rest soon...
