From data preparation and visualization to fitting a simple neural net in R + RStudio

I have been meaning to write some more useful posts from a data wrangling perspective. As you have probably realized already, raw data in real life are usually quite... well... dirty, unfortunately >__<

In our daily work as data scientists, data wrangling takes most of the time (I would dare say about 80-90% of the time before I even consider fitting any model at all).

To give you a head start on dealing with raw data in general, I wrote this post with the R script included, so you can copy and paste it directly and use it, of course, with your own dataset :)

Throughout, I tried to keep a flow and a goal to work toward (i.e. fitting a neural net as the end goal). This should make the steps below a bit easier to follow, at least I hope so :)

Note: if you have not yet installed R + RStudio, you should visit the link below (for Windows) to install them, as well as all the packages used in this post.

Ready? ... Oki, let's get started :)

Step 1: import/load the raw data into your workspace

You can of course use RStudio's native import function to import your dataset, like this

or you can use an R script like this

library(readr) # the library you need to import data from csv
data <- read_delim("C:/Users/zenodia.charpy/R/data/explore.csv", ";", escape_double = FALSE, na = "NA", trim_ws = TRUE)
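
If you want to try the import step before pointing it at your own file, here is a minimal sketch that uses only base R. The tiny sample file and its values are made up for illustration, and `read.table()` with `sep = ";"` stands in for `read_delim()`; it is an alternative, not the call used in this post.

```r
# Write a tiny semicolon-delimited file to a temporary location (made-up sample rows)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "ID;medium;Device;avgTime",
  "1;(none);mobile;519",
  "2;NA;tablet;429",
  "3;Email;desktop;NA"
), tmp)

# Read it back: ';' is the delimiter, the string "NA" marks missing values,
# surrounding whitespace is trimmed, and text columns stay character
sample_data <- read.table(tmp, header = TRUE, sep = ";", na.strings = "NA",
                          strip.white = TRUE, stringsAsFactors = FALSE)
str(sample_data)
```

Swapping `tmp` for the path to your own file should be the only change needed, assuming your file is also semicolon-delimited.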

Step 2 : view the imported data

str(data) # view the data to examine which variables you have in the data frame
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	63479 obs. of  10 variables:
 $ ID             : int  1 2 3 4 5 6 7 8 9 10 ...
 $ medium         : chr  "(none)" "(none)" "(none)" "(none)" ...
 $ DaysLastVisits : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Device         : chr  "mobile" "tablet" "desktop" "mobile" ...
 $ numVisits      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ TotalEvents    : int  21 15 7 8 7 0 88 20 18 6 ...
 $ avgTime        : num  519 429 116 174 825 ...
 $ UniqueEvents   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ UniquePageviews: int  13 5 4 7 3 10 10 7 4 6 ...
 $ Label          : chr  "NonPurchaser" "NonPurchaser" "NonPurchaser" "NonPurchaser" ...
 
head(data)
# A tibble: 6 × 10
     ID medium DaysLastVisits  Device numVisits TotalEvents avgTime UniqueEvents UniquePageviews        Label
  <int>  <chr>          <int>   <chr>     <int>       <int>   <dbl>        <int>           <int>        <chr>
1     1 (none)              0  mobile         1          21     519            1              13 NonPurchaser
2     2 (none)              0  tablet         1          15     429            1               5 NonPurchaser
3     3 (none)              0 desktop         1           7     116            1               4 NonPurchaser
4     4 (none)              0  mobile         1           8     174            1               7 NonPurchaser
5     5 (none)              0 desktop         1           7     825            1               3 NonPurchaser
6     6 (none)              0 desktop         1           0     298            1              10 NonPurchaser

Notice that you have 3 categorical columns (data type = chr; columns = medium, Device and Label). It would be interesting to see the unique values inside each categorical column, starting with medium as an example.

 unique(data$medium) # use unique() to find out which values occur in this column
 [1] "(none)"   NA         "Display"  "Email"    "cpc"      "email"    "organic"  "referral" "banner"   "partner"
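
Besides listing the unique values, it often helps to see how often each one occurs, missing values included. The sketch below uses base R's `table()` on a small made-up vector (not the real column) to show the idea.

```r
# A small made-up vector standing in for data$medium
medium <- c("(none)", "(none)", NA, "Email", "email", "cpc", "cpc", "cpc")

# Count every distinct value; useNA = "ifany" adds a count for NA when there are any
medium_counts <- table(medium, useNA = "ifany")
medium_counts

# Sort so the most frequent values come first
sort(medium_counts, decreasing = TRUE)
```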

Step 3: clean up the column 'medium' a little bit by ...

changing all capitalized Email --> email (or email --> Email); here I choose Email --> email

and (none) --> direct to avoid the brackets, as the R script below shows

Note: data of type chr is actually 'character' data (i.e. strings). You know the column is categorical after you examine it, and the chr data type allows you to run the following R script (tip: run this script before you change the column type to factor)

# replace (none) with direct and Email with email; this works while the column is of class character
data$medium[data$medium == "(none)"] <- "direct"
data$medium[data$medium == "Email"] <- "email"

unique(data$medium) # notice that 'Email' is replaced with 'email' as well as '(none)' --> 'direct'
[1] "direct"   NA         "Display"  "email"    "cpc"      "organic"  "referral" "banner"   "partner" 
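
When there are more than a couple of values to fix, writing one assignment per value gets tedious. As a minimal sketch (the `recode_map` name and the sample vector are my own, for illustration only), a named lookup vector does the same replacements in one go:

```r
# A small made-up vector standing in for data$medium
medium <- c("(none)", NA, "Display", "Email", "email", "cpc")

# Lookup table: names are the old values, elements are the new values
recode_map <- c("(none)" = "direct", "Email" = "email")

# Only touch the entries that appear in the lookup table; NA and other values stay as they are
needs_recode <- medium %in% names(recode_map)
medium[needs_recode] <- unname(recode_map[medium[needs_recode]])
medium
```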


So far so good. Notice, though, that we still have NA(s) in the column 'medium'.

Note: NA means Not Available, i.e. a missing value (what a database would call NULL).

Let's not simply remove the rows with NA(s). Let's try a more interesting approach: replacing each NA with one of the observed values, drawn at random with probability proportional to how often that value occurs.

How do we do that? Look at the R script below (tip: carefully read through the comments inside the script block as well).

# use the ddply function (from the plyr package) to examine the frequency of each value,
# in order to construct the probability for each unique value inside the column medium
library(plyr)
ddply(data, .(medium), nrow)
    medium    V1
1   banner     7
2      cpc 14569
3   direct 10968
4  Display   162
5    email  2610
6  organic 28132
7  partner     3
8 referral  7025
9     <NA>     3
# extract the unique categories of the medium column
medium_cat <- unique(data$medium)
medium_cat
[1] "direct"   NA         "Display"  "email"    "cpc"      "organic"  "referral" "banner"   "partner" 
# we only need the values EXCEPT for NA
medium_cat <- medium_cat[!is.na(medium_cat)]
medium_cat
[1] "direct"   "Display"  "email"    "cpc"      "organic"  "referral" "banner"   "partner"
# we need a divider to construct our probabilities

divider <- nrow(data) # the number of rows in the dataset, used as the divider for the probabilities

# specify the probability for each category in medium based on each category's count
# (in the same order as medium_cat), and draw 3 replacement values, one per NA

replace_medium <- sample(medium_cat, 3, replace=TRUE, prob=c(10968/divider,162/divider,2610/divider,14569/divider,28132/divider,7025/divider,7/divider, 3/divider))

# now replace the NA(s) inside the column medium with the values you just sampled
data$medium <- replace(data$medium,which(is.na(data$medium)),replace_medium) # this is how you do the replacement
# check that the 3 NA(s) were indeed replaced, and by which values
# (sample() is random, so your counts may differ from mine)
ddply(data, .(medium), nrow)
    medium    V1
1   banner     8 # banner was 7, now it is 8: one NA was replaced with 'banner'
2      cpc 14570 # cpc was 14569, now it is 14570: one NA was replaced with 'cpc'
3   direct 10968
4  Display   162
5    email  2610
6  organic 28133 # organic was 28132, now it is 28133: one NA was replaced with 'organic'
7  partner     3
8 referral  7025

# notice that all 3 NA(s) are gone; each was replaced by a value drawn with the probabilities we specified
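
Typing the counts into `prob` by hand is error-prone. As a minimal sketch of the same idea on a small made-up vector (the variable names are my own), the probabilities can be computed from the data with `table()` and `prop.table()`:

```r
set.seed(42) # fix the random seed so the sketch is reproducible

# A small made-up vector standing in for data$medium, with 2 missing values
medium <- c(rep("organic", 6), rep("cpc", 3), "email", NA, NA)

# Observed frequencies; table() leaves NA out by default
observed <- table(medium)
probs <- prop.table(observed) # relative frequency of each observed value

# Draw one replacement per NA, with probability proportional to the observed frequency
missing_idx <- which(is.na(medium))
medium[missing_idx] <- sample(names(probs), length(missing_idx),
                              replace = TRUE, prob = as.numeric(probs))
table(medium)
```

Because the probabilities come straight from the column, the same few lines should work unchanged for any other categorical column with missing values.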

Good! Now we have no NA(s) in the column 'medium', but do we have NA(s) elsewhere in the dataset?

Side note: as a tip for mastering data manipulation in R, I would strongly suggest reading the post below on tidyr and dplyr.

Step 4: find and remove all NA(s) in the dataset

# summary() reports the NA count per column and gives you descriptive statistics for all columns at the same time
> summary(data)
       ID             medium          DaysLastVisits          Device            numVisits         
 Min.   :    1.0   Length:63479       Min.   :  0.000000   Length:63479       Min.   :  1.000000  
 1st Qu.:15870.5   Class :character   1st Qu.:  0.000000   Class :character   1st Qu.:  1.000000  
 Median :31740.0   Mode  :character   Median :  0.000000   Mode  :character   Median :  2.000000  
 Mean   :31740.0                      Mean   :  7.316168                      Mean   :  4.624915  
 3rd Qu.:47609.5                      3rd Qu.:  1.000000                      3rd Qu.:  4.000000  
 Max.   :63479.0                      Max.   :188.000000                      Max.   :323.000000  
                                                                                                  
  TotalEvents           avgTime            UniqueEvents      UniquePageviews       Label          
 Min.   :  0.00000   Min.   :    0.0000   Min.   :1.000000   Min.   : 0.00000   Length:63479      
 1st Qu.:  0.00000   1st Qu.:   29.0000   1st Qu.:1.000000   1st Qu.: 2.00000   Class :character  
 Median :  5.00000   Median :  164.0000   Median :1.000000   Median : 4.00000   Mode  :character  
 Mean   : 12.75505   Mean   :  390.8156   Mean   :1.003686   Mean   : 4.63496                     
 3rd Qu.: 15.00000   3rd Qu.:  465.0000   3rd Qu.:1.000000   3rd Qu.: 6.00000                     
 Max.   :378.00000   Max.   :19265.0000   Max.   :2.000000   Max.   :41.00000                     
                     NA's   :1                               NA's   :1  


# if you scroll to the right you will see that UniquePageviews and avgTime both have 1 NA

# use apply() over the columns to count the NA(s) in each column
apply(data, 2, function(x) length(which(is.na(x)))) # this is how you count all NA(s) in the dataset per column

             ID          medium  DaysLastVisits          Device       numVisits     TotalEvents         avgTime    UniqueEvents UniquePageviews 
              0               0               0               0               0               0               1               0               1 
          Label 
              0 

Now we remove the NA(s) the traditional way: find every row that has an NA in it, then remove the entire row.

# flag the rows that have NA(s) in them
row.has.na <- apply(data, 1, function(x){any(is.na(x))})
# remove those rows entirely
data <- data[!row.has.na,] # you can of course create a new data frame instead, but I am lazy so I just overwrite the original one to remove all NA(s)
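
For comparison, base R also has vectorised helpers for both steps. The sketch below runs on a small made-up data frame and shows `colSums(is.na())` for the per-column count and `complete.cases()` for dropping incomplete rows; it is an alternative to the `apply()` calls above, not a correction of them.

```r
# A small made-up data frame with one NA in each of two columns
df <- data.frame(
  ID = 1:4,
  avgTime = c(519, NA, 116, 174),
  UniquePageviews = c(13L, 5L, NA, 7L)
)

# Number of NA(s) per column
na_per_column <- colSums(is.na(df))
na_per_column

# Keep only the rows without any NA
df_clean <- df[complete.cases(df), ]
df_clean
```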

Step 5: turn all columns of class character (chr) into factors, in preparation for one-hot encoding (OHE)

str(data) # original dataframe with chr as column types instead of factor
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	63479 obs. of  10 variables:
 $ ID             : int  1 2 3 4 5 6 7 8 9 10 ...
 $ medium         : chr  "(none)" "(none)" "(none)" "(none)" ...
 $ DaysLastVisits : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Device         : chr  "mobile" "tablet" "desktop" "mobile" ...
 $ numVisits      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ TotalEvents    : int  21 15 7 8 7 0 88 20 18 6 ...
 $ avgTime        : num  519 429 116 174 825 ...
 $ UniqueEvents   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ UniquePageviews: int  13 5 4 7 3 10 10 7 4 6 ...
 $ Label          : chr  "NonPurchaser" "NonPurchaser" "NonPurchaser" "NonPurchaser" ...
 
 
data <- as.data.frame(unclass(data), stringsAsFactors = TRUE) # this is how you turn chr --> factor (since R 4.0 you must ask for it with stringsAsFactors = TRUE)
> str(data)
'data.frame':	63478 obs. of  10 variables:
 $ ID             : int  1 2 3 4 5 6 7 8 9 10 ...
 $ medium         : Factor w/ 8 levels "banner","cpc",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ DaysLastVisits : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Device         : Factor w/ 3 levels "desktop","mobile",..: 2 3 1 2 1 1 1 3 1 1 ...
 $ numVisits      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ TotalEvents    : int  21 15 7 8 7 0 88 20 18 6 ...
 $ avgTime        : num  519 429 116 174 825 ...
 $ UniqueEvents   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ UniquePageviews: int  13 5 4 7 3 10 10 7 4 6 ...
 $ Label          : Factor w/ 2 levels "NonPurchaser",..: 1 1 1 1 1 1 2 1 1 1 ...
# notice that ALL chr columns have now been turned into factors, which is exactly what we want
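
If you prefer to be explicit about which columns get converted, here is a minimal sketch on a made-up data frame: it picks out the character columns and converts only those, leaving the numeric columns untouched.

```r
# A small made-up data frame with two character columns and one integer column
df <- data.frame(
  Device = c("mobile", "tablet", "desktop", "mobile"),
  numVisits = c(1L, 1L, 5L, 2L),
  Label = c("NonPurchaser", "NonPurchaser", "Purchase", "NonPurchaser"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert just those to factors
chr_cols <- vapply(df, is.character, logical(1))
df[chr_cols] <- lapply(df[chr_cols], factor)
str(df)
```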

Before we do one-hot encoding, let's look at some basic statistical plots to help us explore and understand the data better.

Step 6: basic statistical plots

> summary(data[,2:10]) # first let's look at the summary statistics
    medium          DaysLastVisits       Device            numVisits        TotalEvents        avgTime       
 Length:63479       Min.   :  0.000   Length:63479       Min.   :  1.000   Min.   :  0.00   Min.   :    0.0  
 Class :character   1st Qu.:  0.000   Class :character   1st Qu.:  1.000   1st Qu.:  0.00   1st Qu.:   29.0  
 Mode  :character   Median :  0.000   Mode  :character   Median :  2.000   Median :  5.00   Median :  164.0  
                    Mean   :  7.316                      Mean   :  4.625   Mean   : 12.76   Mean   :  390.8  
                    3rd Qu.:  1.000                      3rd Qu.:  4.000   3rd Qu.: 15.00   3rd Qu.:  465.0  
                    Max.   :188.000                      Max.   :323.000   Max.   :378.00   Max.   :19265.0  
                                                                                            NA's   :1        
  UniqueEvents   UniquePageviews     Label          
 Min.   :1.000   Min.   : 0.000   Length:63479      
 1st Qu.:1.000   1st Qu.: 2.000   Class :character  
 Median :1.000   Median : 4.000   Mode  :character  
 Mean   :1.004   Mean   : 4.635                     
 3rd Qu.:1.000   3rd Qu.: 6.000                     
 Max.   :2.000   Max.   :41.000                     
                 NA's   :1     
# prepare to make boxplots
num <- sapply(data, is.numeric) # flag the numeric columns
num_cols <- colnames(data[,num]) # get the names of the numeric columns only
num_cols <- num_cols[c(2:7)] # skip the first numeric column (ID)
op <- par(mar = c(5, 10, 4, 2) + 0.1) # widen the left margin so the column names fit
boxplot(data[,num_cols],horizontal = TRUE, las = 1,cex.axis = 0.7) # plot horizontally so that the name of each column has enough space
par(op) # restore the previous graphics settings
# boxplot shown below
# import the libraries you need in order to do the plots below
# (load plyr before dplyr so that dplyr's functions are not masked)
library(plyr)
library(dplyr)
library(lattice)
library(latticeExtra)
library(tidyr)
library(ggplot2)
# this is a way to view two categorical columns (Device + Label) vs. one numerical column (DaysLastVisits)

densityplot( ~ DaysLastVisits | Device+Label, data  = data,plot.points = FALSE, ref = TRUE)
# pairwise scatterplot; remember to use ONLY numerical columns
pairs(data[,c(3,5:10)], col=data$Label) # include only the numerical columns + the Label
# create histograms for all numerical columns
library(tidyr)
library(dplyr)
library(lattice)
library(ggplot2)

data[,num_cols] %>%
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

Side note: for graphs in R, you should definitely take a look at ggplot2, lattice and latticeExtra.

In reality, in a situation like this, where the boxplots and histograms show a lot of outliers and extreme values and the numerical data are not exactly normally distributed, we would do something (statistical) about it before fitting any (regression) model. Well, since we are not that concerned with regression models right now, let's move on...
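
To give a flavour of what "doing something about it" could look like, here is a minimal sketch of two common options: flagging outliers with the boxplot rule and compressing a long right tail with a log transform. The numbers are made up, and this is only one possible approach, not a step used later in this post.

```r
# A made-up, heavily right-skewed vector standing in for a column such as avgTime
avg_time <- c(0, 29, 116, 164, 174, 298, 429, 465, 519, 825, 19265)

# boxplot.stats() returns the points a boxplot would draw as outliers
# (beyond 1.5 times the interquartile range from the box)
stats <- boxplot.stats(avg_time)
stats$out

# log1p(x) = log(1 + x) handles zeros and pulls in the extreme values
avg_time_log <- log1p(avg_time)
range(avg_time)
range(avg_time_log)
```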

Step 7: do one-hot encoding (for tree models we usually do not need OHE, since rpart and randomForest handle categorical (factor) columns quite well; for a neuralnet model, unfortunately, yes, we do need OHE).

 # get the one-hot-encoded features
library(ade4) # provides acm.disjonctif()
library(data.table)
 
is.fact <- sapply(data, is.factor) # flag the factor columns
ohe_feats <- colnames(data[, is.fact])
ohe_feats
[1] "medium" "Device" "Label" # Label is the target we want the neuralnet to predict, so we don't do OHE on Label
ohe_feats<-ohe_feats[1:2] # we just want to encode medium and Device
 
for (f in ohe_feats){
   df_all_dummy = acm.disjonctif(data[f])
   data[f] = NULL
   data = cbind(data, df_all_dummy)
 }
 str(data)
'data.frame':	63478 obs. of  19 variables:
 $ ID             : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DaysLastVisits : int  0 0 0 0 0 0 0 0 0 0 ...
 $ numVisits      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ TotalEvents    : int  21 15 7 8 7 0 88 20 18 6 ...
 $ avgTime        : num  519 429 116 174 825 ...
 $ UniqueEvents   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ UniquePageviews: int  13 5 4 7 3 10 10 7 4 6 ...
 $ Label          : Factor w/ 2 levels "NonPurchaser",..: 1 1 1 1 1 1 2 1 1 1 ...
 $ medium.banner  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.cpc     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.direct  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ medium.Display : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.email   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.organic : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.partner : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.referral: num  0 0 0 0 0 0 0 0 0 0 ...
 $ Device.desktop : num  0 0 1 0 1 1 1 0 1 1 ...
 $ Device.mobile  : num  1 0 0 1 0 0 0 0 0 0 ...
 $ Device.tablet  : num  0 1 0 0 0 0 0 1 0 0 ...
# notice that all categorical columns (except for Label) are now one-hot encoded into, for example, Device.tablet, Device.desktop, etc.
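
If you would rather not depend on ade4, base R can produce the same kind of 0/1 columns. The sketch below is an alternative under my own naming, run on a made-up factor: `model.matrix()` with the intercept removed (`- 1`) gives one indicator column per level.

```r
# A small made-up factor standing in for data$Device
device <- factor(c("mobile", "tablet", "desktop", "mobile"))

# One indicator column per level; "- 1" drops the intercept so no level is left out
ohe <- model.matrix(~ device - 1)

# Rename the columns to the Device.<level> pattern
colnames(ohe) <- paste0("Device.", levels(device))
ohe
```

Each row contains exactly one 1, in the column of that row's level.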

# remember to scale the dataset, except for the label and the OHE features, before feeding it into a neural net
data[,2:7] <- scale(data[,2:7]) # columns 2:7 are the numeric features DaysLastVisits ... UniquePageviews
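
In case you wonder what `scale()` does with its default settings: it subtracts the column mean and divides by the column standard deviation. A minimal sketch on a made-up vector confirms that the manual calculation gives the same numbers.

```r
# A made-up numeric vector standing in for one feature column
x <- c(0, 0, 10, 0, 8)

# Standardise by hand: centre on the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it for the comparison
scaled <- as.numeric(scale(x))

all.equal(manual, scaled)
```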

Voilà ! now you have a dataframe with OHE features ready to fit your neural net model

Step 8: prepare the labels. Once again, neuralnet has a specific requirement when you have two classes (Purchase, NonPurchaser): you create one logical column per class by testing Label == 'Purchase' and Label == 'NonPurchaser', as in the R script below.

library(neuralnet) # make sure you install and load the neuralnet library
unique(data$Label)
[1] NonPurchaser Purchase   # we have two classes: Purchase, NonPurchaser
data<- cbind(data, data$Label == 'Purchase')
data<- cbind(data, data$Label == 'NonPurchaser')
str(data)
'data.frame':	63479 obs. of  21 variables:
 $ ID                           : int  16855 23623 36364 57650 12803 57025 59962 41943 39931 3923 ...
 $ DaysLastVisits               : int  0 0 10 0 0 8 0 0 0 0 ...
 $ numVisits                    : int  1 1 5 2 1 3 4 4 8 1 ...
 $ TotalEvents                  : int  49 24 0 3 30 0 9 0 0 0 ...
 $ avgTime                      : num  1085 304 109 120 350 ...
 $ UniqueEvents                 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ UniquePageviews              : int  13 9 2 4 8 2 2 4 1 2 ...
 $ Label                        : Factor w/ 2 levels "NonPurchaser",..: 1 1 1 1 1 1 1 2 1 1 ...
 $ medium.banner                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.cpc                   : num  0 0 1 0 0 0 0 0 0 0 ...
 $ medium.direct                : num  0 0 0 0 0 0 0 0 0 1 ...
 $ medium.Display               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.email                 : num  0 0 0 0 0 0 0 0 1 0 ...
 $ medium.organic               : num  1 1 0 0 1 1 0 1 0 0 ...
 $ medium.partner               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.referral              : num  0 0 0 1 0 0 1 0 0 0 ...
 $ Device.desktop               : num  0 1 0 0 1 1 0 1 0 1 ...
 $ Device.mobile                : num  0 0 1 1 0 0 0 0 0 0 ...
 $ Device.tablet                : num  1 0 0 0 0 0 1 0 1 0 ...
 $ data$Label == "Purchase"    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ data$Label == "NonPurchaser": logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

# change the names of the columns data$Label == "Purchase" and data$Label == "NonPurchaser"
# so they are much easier to read and to use in the model formula
names(data)[20:21] <- c('Purchase', 'NonPurchaser') # this is how you rename the last two columns

# check that the columns look oki and were indeed renamed
str(data)
'data.frame':	63479 obs. of  21 variables:
 $ ID             : int  16855 23623 36364 57650 12803 57025 59962 41943 39931 3923 ...
 $ DaysLastVisits : int  0 0 10 0 0 8 0 0 0 0 ...
 $ numVisits      : int  1 1 5 2 1 3 4 4 8 1 ...
 $ TotalEvents    : int  49 24 0 3 30 0 9 0 0 0 ...
 $ avgTime        : num  1085 304 109 120 350 ...
 $ UniqueEvents   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ UniquePageviews: int  13 9 2 4 8 2 2 4 1 2 ...
 $ Label          : Factor w/ 2 levels "NonPurchaser",..: 1 1 1 1 1 1 1 2 1 1 ...
 $ medium.banner  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.cpc     : num  0 0 1 0 0 0 0 0 0 0 ...
 $ medium.direct  : num  0 0 0 0 0 0 0 0 0 1 ...
 $ medium.Display : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.email   : num  0 0 0 0 0 0 0 0 1 0 ...
 $ medium.organic : num  1 1 0 0 1 1 0 1 0 0 ...
 $ medium.partner : num  0 0 0 0 0 0 0 0 0 0 ...
 $ medium.referral: num  0 0 0 1 0 0 1 0 0 0 ...
 $ Device.desktop : num  0 1 0 0 1 1 0 1 0 1 ...
 $ Device.mobile  : num  0 0 1 1 0 0 0 0 0 0 ...
 $ Device.tablet  : num  1 0 0 0 0 0 1 0 1 0 ...
 $ Purchase       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ NonPurchaser   : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
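
A small variation, sketched on a made-up data frame: assigning the indicator columns by name creates them with readable names straight away, so no renaming by position is needed. The loop over `levels()` is my own shortcut, not part of the script above.

```r
# A small made-up data frame with a two-level label
df <- data.frame(Label = factor(c("NonPurchaser", "Purchase", "NonPurchaser")))

# Create one logical column per class, named after the class itself
for (cls in levels(df$Label)) {
  df[[cls]] <- df$Label == cls
}
str(df)
```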

Note: normally we would divide the dataset into a training set and a testing set in order to check the model performance (using the testing set). However, I don't want to make this post too lengthy, so let's leave that for some other time.
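
For the curious, a random split takes only a few lines. This is a minimal sketch on made-up data, assuming a 70/30 split (the ratio is my own choice, not something prescribed here):

```r
set.seed(123) # fix the random seed so the split is reproducible

# A made-up data frame standing in for the prepared dataset
df <- data.frame(ID = 1:100, x = rnorm(100))

# Randomly pick 70% of the row indexes for training; the rest is for testing
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

nrow(train)
nrow(test)
```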

Step 9: now we are ready to fit the neural net, as the following R script shows

# the model formula; another requirement of neuralnet is that you write it out manually like below
f <- as.formula("Purchase+NonPurchaser~ DaysLastVisits+Device.mobile+Device.tablet+medium.cpc+UniquePageviews")
nn <- neuralnet(
  f,
  data =data, 
  hidden=c(3) # you can of course specify more than 1 hidden layer; for demonstration purposes, we only use one hidden layer with 3 hidden nodes
)
plot(nn) # remember to plot the nn for visual effect
# below is the graph of the neural net we just built
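
Typing a long formula by hand gets tedious once you have many input columns. As a minimal sketch in base R (the `inputs` and `outputs` vectors are my own names), the same formula can be pasted together from column names:

```r
# The input and output column names, as used in the formula above
inputs  <- c("DaysLastVisits", "Device.mobile", "Device.tablet", "medium.cpc", "UniquePageviews")
outputs <- c("Purchase", "NonPurchaser")

# Join the names with "+" on each side of "~" and convert the string to a formula
f <- as.formula(paste(paste(outputs, collapse = " + "), "~",
                      paste(inputs, collapse = " + ")))
f
```

To use different inputs, you would only change the `inputs` vector, for example by taking a subset of `names(data)`.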

This is what it looks like: 1 hidden layer with 3 nodes, the input variables we specified in the formula f in Step 9, and as output the two label columns, Purchase and NonPurchaser (created in Step 8).

Oki, so we went through data manipulation, visualization, how to deal with missing values (NA) and one-hot encoding, and eventually we fit a neuralnet model. As for model tweaking, making predictions and validating the model performance: there are a lot of posts specifically discussing and comparing different neural net packages and their prediction performance. For further reading, please refer to the following link; personally, I find it a good starting point if you are interested in neural networks in general using R :)

I hope that you enjoyed this post and that the script above helps you somehow when you process raw data.

until next time ...

keep on smiling ^__^b

Hi Zenodia. Very nice article. Just what I was looking for , to use in a very different field than yours (genetics). Lasse

Hi zenodia nice article do you know if we use nnet package for creating the neural network how to decide on size or even if we use neural net how to decide how many layers we are going to use?

More articles by Zenodia Charpy
