Start() with python & R - Machine Learning - Data Pre-processing
This is the first of the Python and R scripts from my lab....
The scripts below will help in pre-processing data for machine-learning practice. The idea is to import the data and fill missing values in a column with that column's mean, median, or most frequent value. Once that is done, categorical data needs to be encoded as numbers such as "1" or "0" so that a machine-learning algorithm can be applied. Finally, I have divided the data into two sets - train and test. The reason behind the split is that the algorithm learns from the training set and is then evaluated on the unseen test set; if it performs adequately well on both, it generalizes, else the algorithm needs more optimization.
Let's see the data first - it has four columns (Country, Age, Salary, and Purchased) and ten rows, excluding the labels on top.
You can download the csv file from - https://www.superdatascience.com/machine-learning/
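If you don't want to download the file, here is a small hypothetical stand-in built in pandas with the same schema (the values below are invented for illustration, not the actual rows of Data.csv):

```python
import pandas as pd
import numpy as np

# Stand-in rows mirroring the schema of Data.csv (values are made up)
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain'],
    'Age': [44.0, 27.0, np.nan, 38.0],       # note the missing Age
    'Salary': [72000.0, 48000.0, 54000.0, np.nan],  # and the missing Salary
    'Purchased': ['No', 'Yes', 'No', 'Yes'],
})
print(dataset.shape)  # (4, 4)
```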
In Python:
Python numbers columns starting from 0 (0, 1, 2, 3, ...and so on). That's why you will see some unusual column selections in the scripts below.
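A quick sketch of what that 0-based indexing means for the `iloc` selections used in the script (the one-row DataFrame here is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['France'], 'Age': [44],
                   'Salary': [72000], 'Purchased': ['No']})
# iloc[:, :-1] keeps columns 0, 1, 2 - everything except the last column
X = df.iloc[:, :-1].values
# iloc[:, 3] selects the fourth column, because counting starts at 0
y = df.iloc[:, 3].values
print(X.shape)  # (1, 3)
```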
#Data Preprocessing
#Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
#Taking care of missing data
#(SimpleImputer replaces the Imputer class, which was removed in newer scikit-learn versions)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
#Encoding categorical data
#(ColumnTransformer replaces the categorical_features argument, which was removed from OneHotEncoder)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
#Splitting the dataset into the training set and test set
#(train_test_split now lives in model_selection; the cross_validation module was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test) #reuse the scaler fitted on the training set; do not refit on test data
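That last line matters: the scaler must be fitted on the training set only, and the test set scaled with those same statistics. A minimal sketch with an invented two-column feature matrix shows the pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy Age/Salary-style feature matrix (numbers are invented)
X_train = np.array([[40.0, 60000.0], [30.0, 50000.0], [50.0, 70000.0]])
X_test = np.array([[35.0, 52000.0]])

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)  # learn mean and std from the training data only
X_test_s = sc.transform(X_test)        # apply those same statistics to the test data
```

If you called fit_transform on the test set instead, its values would be centered around its own mean, and the two sets would no longer be on a comparable scale.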
In R:
R numbers columns starting from 1 (1, 2, 3, ...), and it is easier in many ways compared to Python, as you can see in how the encoding is done.
#data Preprocessing
#Importing the dataset
dataset = read.csv('Data.csv')
dataset$Age = ifelse (is.na(dataset$Age),
ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
dataset$Age)
dataset$Salary = ifelse (is.na(dataset$Salary),
ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
dataset$Salary)
#Encoding categorical data
dataset$Country = factor(dataset$Country,
levels = c('France', 'Spain', 'Germany'),
labels = c(1,2,3))
dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes'),
labels = c(0,1))
#splitting the dataset into the training set and test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
#feature scaling
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])
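R's scale() centers each column and divides by its sample standard deviation. For comparison, the same standardization can be cross-checked in Python with NumPy (the values below are illustrative, and ddof = 1 matches R's sample standard deviation):

```python
import numpy as np

# Illustrative Age/Salary columns (invented values)
cols = np.array([[44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0]])
# Center each column, then divide by its sample standard deviation (ddof = 1, as R does)
scaled = (cols - cols.mean(axis=0)) / cols.std(axis=0, ddof=1)
```

Note that scikit-learn's StandardScaler uses the population standard deviation (ddof = 0) instead, so the Python and R outputs will differ slightly.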
Appendix/Results:
Dataset - Python
Dataset - R
Output of train - Python
Output of test - Python
Output of train - R
Output of test - R
I will share the rest soon...