Title: | Automatic Creation of Dummies with Support for Predictive Modeling |
---|---|
Description: | Efficiently create dummies of all factors and character vectors in a data frame. Support is included for learning the categories on one data set (e.g., a training set) and deploying them on another (e.g., a test set). |
Authors: | Michel Ballings and Dirk Van den Poel |
Maintainer: | Michel Ballings <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.1.3 |
Built: | 2024-11-16 03:48:31 UTC |
Source: | https://github.com/cran/dummy |
categories
Stores all the categorical values that are present in the factors and character vectors of a data frame. Numeric and integer vectors are ignored. It is a preprocessing step for the dummy function. Use it when dummies should be computed only for the categorical values that were present in another data set. This is especially useful in predictive modeling, when the new (test) data has more or other categories than the training data.
categories(x, p = "all")
x | Data frame containing factors or character vectors that need to be transformed to dummies. Numerics, dates and integers will be ignored. |
p | Select the top p values in terms of frequency. Either "all" (all categories in all variables), an integer scalar (top p categories in all variables), or a vector of integers (number of top categories per variable, in order of appearance). |
Returns a list containing the variable names and the categories.
Authors: Michel Ballings and Dirk Van den Poel, Maintainer: [email protected]
#create toy data
(traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
                         var2=as.factor(c(1,1,2,3)),
                         var3=c("val1","val2","val3","val3"),
                         stringsAsFactors=FALSE))
(newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
                       var2=as.factor(c(1,1,2,3,4,5)),
                       var3=c("val1","val2","val3","val3","val4","val4"),
                       stringsAsFactors=FALSE))

categories(x=traindata,p="all")
categories(x=traindata,p=2)
categories(x=traindata,p=c(2,1,3))
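To make the meaning of "top p values in terms of frequency" concrete, the following base R sketch selects the most frequent values of a single vector. The helper top_categories and the colour vector are hypothetical names introduced for illustration only; this is not the package's implementation, and the way categories resolves ties between equally frequent values may differ.

```r
# Hypothetical helper (not part of the dummy package): return the p most
# frequent values of a character or factor vector, most frequent first.
top_categories <- function(v, p = "all") {
  counts <- sort(table(v), decreasing = TRUE)
  if (identical(p, "all")) return(names(counts))
  names(counts)[seq_len(min(p, length(counts)))]
}

# Toy vector without ties: green occurs 3 times, blue 2 times, red once
colour <- c("red", "blue", "blue", "green", "green", "green")

top_categories(colour)         # all three values, most frequent first
top_categories(colour, p = 2)  # "green" "blue"
```

With p = 2 the least frequent value, red, is dropped, which is the behaviour the p argument describes for each variable of the data frame.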
dummy
Creates dummy variables for all the factors and character vectors in a data frame. It also supports settings in which dummies should be computed only for the categorical values that were present in another data set. This is especially useful in predictive modeling, when the new (test) data has more or other categories than the training data.
dummy(x, p = "all", object = NULL, int = FALSE, verbose = FALSE)
x | A data frame containing at least one factor or character vector. |
p | Only relevant if object is NULL. Select the top p values in terms of frequency. Either "all" (all categories in all variables), an integer scalar (top p categories in all variables), or a vector of integers (number of top categories per variable, in order of appearance). |
object | Output of the categories function. |
int | Should the dummies be integers (TRUE) or factors (FALSE)? |
verbose | Logical. Used to show progress. |
Returns a data frame containing the dummy variables.
Authors: Michel Ballings and Dirk Van den Poel, Maintainer: [email protected]
#create toy data
(traindata <- data.frame(var1=as.factor(c("a","b","b","c")),
                         var2=as.factor(c(1,1,2,3)),
                         var3=c("val1","val2","val3","val3"),
                         stringsAsFactors=FALSE))
(newdata <- data.frame(var1=as.factor(c("a","b","b","c","d","d")),
                       var2=as.factor(c(1,1,2,3,4,5)),
                       var3=c("val1","val2","val3","val3","val4","val4"),
                       stringsAsFactors=FALSE))

#create dummies of training set
(dummies_train <- dummy(x=traindata))
#create dummies of new set
(dummies_new <- dummy(x=newdata))

#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))

#create dummies of new set using categories found in training set
(dummies_new <- dummy(x=newdata,object=categories(traindata,p="all")))

#how many new dummy variables should not have been created?
sum(! colnames(dummies_new) %in% colnames(dummies_train))

#create dummies of training set,
#using the top 2 categories of all variables found in the training data
dummy(x=traindata,p=2)

#create dummies of training set,
#using respectively the top 2,3 and 1 categories of the three
#variables found in training data
dummy(x=traindata,p=c(2,3,1))

#create all dummies of training data
dummy(x=traindata)
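The idea of learning categories on one data set and deploying them on another can be sketched in base R for a single variable. The helper encode_with and the "prefix_category" column naming are assumptions made for this illustration; they are not taken from the package, whose output format may differ.

```r
# Hypothetical helper (not part of the dummy package): build one 0/1 column
# per category in `cats`, regardless of which values occur in `v`.
encode_with <- function(v, cats, prefix) {
  out <- lapply(cats, function(k) as.integer(as.character(v) == k))
  names(out) <- paste(prefix, cats, sep = "_")
  as.data.frame(out, stringsAsFactors = FALSE)
}

train_var1 <- c("a", "b", "b", "c")
new_var1   <- c("a", "b", "b", "c", "d", "d")

# Categories are learned on the training vector only
cats <- sort(unique(train_var1))

encoded <- encode_with(new_var1, cats, "var1")
colnames(encoded)  # "var1_a" "var1_b" "var1_c"
rowSums(encoded)   # rows holding the unseen value "d" sum to 0
```

Because the columns are derived from the training categories, the encoded new data has the same columns as the training data, and the unseen value "d" produces no extra column.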
dummyNews
Shows the NEWS file of the dummy package.
dummyNews()
Authors: Michel Ballings and Dirk Van den Poel, Maintainer: [email protected]
dummyNews()