Start() with python & R - Machine Learning - Data Pre-processing
This is the first of the Python and R scripts from my lab....
The scripts below will help in pre-processing data for machine-learning practice. The idea is to import the data and fill missing values in a column with that column's mean, median, or most frequent value. Once that is done, categorical data needs to be encoded as numbers such as "1" or "0" so that a machine-learning algorithm can be applied. Finally, I have divided the data into two sets - train and test. The reason behind the split is that the algorithm learns from the training set and is then evaluated on the unseen test set; if it performs adequately well on both, it generalizes, else the algorithm needs more optimization.
Let's see the data first - it has four columns (Country, Age, Salary, and Purchased) and ten rows, excluding the labels on top.
You can download the csv file from - https://www.superdatascience.com/machine-learning/
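If you don't want to download the file, here is a small hypothetical stand-in built in pandas with the same schema (the values below are invented for illustration, not the actual rows of Data.csv):

```python
import pandas as pd
import numpy as np

# Stand-in rows mirroring the schema of Data.csv (values are made up)
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain'],
    'Age': [44.0, 27.0, np.nan, 38.0],       # note the missing Age
    'Salary': [72000.0, 48000.0, 54000.0, np.nan],  # and the missing Salary
    'Purchased': ['No', 'Yes', 'No', 'Yes'],
})
print(dataset.shape)  # (4, 4)
```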
In Python:
Python numbers columns starting from 0 (0, 1, 2, 3, ...and so on). That's why you will see some unusual column selections in the scripts below.
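A quick sketch of what that 0-based indexing means for the `iloc` selections used in the script (the one-row DataFrame here is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['France'], 'Age': [44],
                   'Salary': [72000], 'Purchased': ['No']})
# iloc[:, :-1] keeps columns 0, 1, 2 - everything except the last column
X = df.iloc[:, :-1].values
# iloc[:, 3] selects the fourth column, because counting starts at 0
y = df.iloc[:, 3].values
print(X.shape)  # (1, 3)
```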
#Data Preprocessing
#Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
#Taking care of missing data
#(SimpleImputer replaces the Imputer class, which was removed in newer scikit-learn versions)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
#Encoding categorical data
#(ColumnTransformer replaces the categorical_features argument, which was removed from OneHotEncoder)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
#Splitting the dataset into the training set and test set
#(train_test_split now lives in model_selection; the cross_validation module was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test) #reuse the scaler fitted on the training set; do not refit on test data
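That last line matters: the scaler must be fitted on the training set only, and the test set scaled with those same statistics. A minimal sketch with an invented two-column feature matrix shows the pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy Age/Salary-style feature matrix (numbers are invented)
X_train = np.array([[40.0, 60000.0], [30.0, 50000.0], [50.0, 70000.0]])
X_test = np.array([[35.0, 52000.0]])

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)  # learn mean and std from the training data only
X_test_s = sc.transform(X_test)        # apply those same statistics to the test data
```

If you called fit_transform on the test set instead, its values would be centered around its own mean, and the two sets would no longer be on a comparable scale.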
In R:
R numbers columns starting from 1 (1, 2, 3, ...), and it is easier in many ways compared to Python, as you can see in how the encoding is done.
#data Preprocessing
#Importing the dataset
dataset = read.csv('Data.csv')
dataset$Age = ifelse (is.na(dataset$Age),
ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
dataset$Age)
dataset$Salary = ifelse (is.na(dataset$Salary),
ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
dataset$Salary)
#Encoding categorical data
dataset$Country = factor(dataset$Country,
levels = c('France', 'Spain', 'Germany'),
labels = c(1,2,3))
dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes'),
labels = c(0,1))
#splitting the dataset into the training set and test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
#feature scaling
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])
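R's scale() centers each column and divides by its sample standard deviation. For comparison, the same standardization can be cross-checked in Python with NumPy (the values below are illustrative, and ddof = 1 matches R's sample standard deviation):

```python
import numpy as np

# Illustrative Age/Salary columns (invented values)
cols = np.array([[44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0]])
# Center each column, then divide by its sample standard deviation (ddof = 1, as R does)
scaled = (cols - cols.mean(axis=0)) / cols.std(axis=0, ddof=1)
```

Note that scikit-learn's StandardScaler uses the population standard deviation (ddof = 0) instead, so the Python and R outputs will differ slightly.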
Appendix/Results:
Dataset - Python
Dataset - R
Output of train - Python
Output of test - Python
Output of train - R
Output of test - R
I will share the rest soon...