For machine learning, we often need to create dummy variables for categorical data.
The caret package for R provides the dummyVars function to dummify such variables.
For example, let's say we have a data frame (df) as follows:
df <- data.frame(age=c(40,47,34,21,34,67,56,78),
                 gender=c("M","M","F","F","M","F","M","F"),
                 favdrink=c("Coffee","Tea","Cold-drink","Tea","Coffee","Tea","Coffee","Cold-drink"),
                 stringsAsFactors=TRUE  # R >= 4.0 no longer converts strings to factors by default
)
print(df)
  age gender   favdrink
1  40      M     Coffee
2  47      M        Tea
3  34      F Cold-drink
4  21      F        Tea
5  34      M     Coffee
6  67      F        Tea
7  56      M     Coffee
8  78      F Cold-drink
library(caret)
# dummify the data
# dummyVars breaks out the unique values of a column into individual 0/1 columns
# Let's dummify the gender column
dummy1 <- dummyVars("~ gender", data=df, fullRank = TRUE)
trsf <- data.frame(predict(dummy1, newdata = df))
print(trsf)
  gender.M
1        1
2        1
3        0
4        0
5        1
6        0
7        1
8        0
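In the dummy variable gender.M, 1 means male and 0 means female. The level F has no column of its own because fullRank = TRUE drops the first level of each variable.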
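To see what fullRank changes, here is a minimal sketch that repeats the gender example with fullRank = FALSE. The names dummy_all and trsf_all are made up for this illustration, and the data frame is rebuilt with only two columns so the snippet runs on its own.

```r
library(caret)

# Same toy data as above; stringsAsFactors = TRUE makes gender a factor
df <- data.frame(age = c(40, 47, 34, 21, 34, 67, 56, 78),
                 gender = c("M", "M", "F", "F", "M", "F", "M", "F"),
                 stringsAsFactors = TRUE)

# fullRank = FALSE keeps one indicator column for every level of gender
dummy_all <- dummyVars(~ gender, data = df, fullRank = FALSE)
trsf_all <- data.frame(predict(dummy_all, newdata = df))
print(trsf_all)

# The indicator columns are redundant: every row sums to exactly 1
print(rowSums(trsf_all))
```

Because the columns always sum to 1, one of them carries no extra information, which is why fullRank = TRUE is usually preferred for linear models.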
We can dummify several variables, or all of them, as follows:
dummy2 <- dummyVars("~ gender + favdrink", data=df, fullRank = TRUE)
trsf <- data.frame(predict(dummy2, newdata = df))
print(trsf)
  gender.M favdrink.Cold.drink favdrink.Tea
1        1                   0            0
2        1                   0            1
3        0                   1            0
4        0                   0            1
5        1                   0            0
6        0                   0            1
7        1                   0            0
8        0                   1            0
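If favdrink.Cold.drink and favdrink.Tea are both zero, the person's favourite drink is Coffee,
the level that was dropped for this dataset. Note that data.frame() replaced the hyphen in Cold-drink with a dot in the column name.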
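The rule above can be checked by decoding the dummy columns back into drink names. This is only a sketch in base R: the two favdrink columns are typed in by hand from the output shown earlier, and the variable favdrink is a name chosen for this illustration.

```r
# Dummy columns copied by hand from the printed output above
trsf <- data.frame(favdrink.Cold.drink = c(0, 0, 1, 0, 0, 0, 0, 1),
                   favdrink.Tea        = c(0, 1, 0, 1, 0, 1, 0, 0))

# Start every row at the dropped level, then overwrite where an indicator is 1
favdrink <- rep("Coffee", nrow(trsf))
favdrink[trsf$favdrink.Cold.drink == 1] <- "Cold-drink"
favdrink[trsf$favdrink.Tea == 1] <- "Tea"
print(favdrink)
```

The decoded vector matches the favdrink column of the original df, so no information was lost by dropping the Coffee column.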
the last display was only for the fav drink right? very helpful. thanks