Dummify Variables in R

For machine learning, we often need to create dummy variables for categorical data.

The caret package in R provides the dummyVars function to dummify variables.

For example, let's say we have a data frame (df) as follows:
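# stringsAsFactors = TRUE keeps gender and favdrink as factors
# (since R 4.0.0 the default is FALSE, which leaves them as character columns)
df <- data.frame(age=c(40,47,34,21,34,67,56,78),
                 gender=c("M","M","F","F","M","F","M","F"),
                 favdrink=c("Coffee","Tea","Cold-drink","Tea","Coffee","Tea","Coffee","Cold-drink"),
                 stringsAsFactors = TRUE
                )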
print(df)
  age gender   favdrink
  40      M     Coffee
  47      M        Tea
  34      F Cold-drink
  21      F        Tea
  34      M     Coffee
  67      F        Tea
  56      M     Coffee
  78      F Cold-drink

library(caret)

# dummify the data
# dummyVars function breaks out unique values from a column into individual columns 
 
# Let's dummify the gender column
# fullRank = TRUE drops the first level (F), leaving one column for a two-level factor
dummy1 <- dummyVars("~ gender", data=df, fullRank = TRUE)
trsf <- data.frame(predict(dummy1, newdata = df))
print(trsf)
  gender.M
        1
        1
        0
        0
        1
        0
        1
        0
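In the dummy variable gender.M, 1 means male and 0 means female. Because fullRank = TRUE, the first level (F) gets no column of its own and serves as the baseline.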
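
For comparison, the same 0/1 encoding can be sketched in base R with model.matrix, which is a different tool from dummyVars and is shown here only to illustrate what the full-rank encoding does. The names df_gender and dummies are made up for this example, and the column comes out as genderM rather than gender.M.

```r
# A small data frame holding only the gender column, stored as a factor
df_gender <- data.frame(gender = factor(c("M","M","F","F","M","F","M","F")))

# model.matrix builds an intercept plus one column per non-baseline level
mm <- model.matrix(~ gender, data = df_gender)

# Drop the intercept column; what remains is the dummy column for level M
dummies <- mm[, -1, drop = FALSE]
print(dummies)
```

As with fullRank = TRUE, the first factor level (F) has no column of its own: a row with genderM equal to 0 is female.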

We can dummify several variables, or all of them, as follows:

dummy2 <- dummyVars("~ gender + favdrink", data=df, fullRank = TRUE)
trsf <- data.frame(predict(dummy2, newdata = df))
print(trsf)
  gender.M favdrink.Cold.drink favdrink.Tea
        1                   0            0
        1                   0            1
        0                   1            0
        0                   0            1
        1                   0            0
        0                   0            1
        1                   0            0
        0                   1            0
 
If favdrink.Cold.drink and favdrink.Tea are both zero, the person's favorite drink is Coffee,
which is the baseline level in this dataset.
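
To dummify every column at once, one option is the "~ ." formula, which picks up all columns of the data frame; numeric columns such as age should pass through unchanged. The sketch below assumes caret is installed and that the categorical columns are factors. The names dummy_all, trsf_all and is_coffee are made up for this example.

```r
library(caret)

# Same data as above, with the categorical columns stored as factors
df <- data.frame(age = c(40, 47, 34, 21, 34, 67, 56, 78),
                 gender = c("M", "M", "F", "F", "M", "F", "M", "F"),
                 favdrink = c("Coffee", "Tea", "Cold-drink", "Tea",
                              "Coffee", "Tea", "Coffee", "Cold-drink"),
                 stringsAsFactors = TRUE)

# "~ ." expands to all columns of df
dummy_all <- dummyVars(~ ., data = df, fullRank = TRUE)
trsf_all <- data.frame(predict(dummy_all, newdata = df))
print(trsf_all)

# Rows where both favdrink dummies are 0 belong to the baseline level (Coffee)
is_coffee <- trsf_all$favdrink.Cold.drink == 0 & trsf_all$favdrink.Tea == 0
print(df$favdrink[is_coffee])
```

The last two lines recover the baseline level from the dummy columns, which is the rule described above: no favdrink column is set, so the drink is Coffee.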
 
  



the last display was only for the fav drink right? very helpful. thanks
