---
title: "ModelMatrixModel"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{demo_ModelMatrixModel}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{css, echo=FALSE}
pre code {
white-space: pre-wrap;
}
```

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `model.matrix()` function in R is a convenient way to transform a training dataset for modeling. However, it does not save the parameters used in the transformation, so it is hard to apply the same transformation to a test dataset or to new data. The ModelMatrixModel package was created to solve this problem.

## setup

```{r}
#devtools::install_github("xinyongtian/R_ModelMatrixModel")  #install from github
rm(list=ls())
library(ModelMatrixModel)
set.seed(10)
traindf= data.frame(x1 = sample(LETTERS[1:5], replace = TRUE, 20),
                    x2 = rnorm(20, 100, 5),
                    x3 = factor(sample(c("U","L","P"), replace = TRUE, 20)),
                    y = rnorm(20, 10, 2))
set.seed(20)
newdf=data.frame(x1 = sample(LETTERS[1:5], replace = TRUE, 3),
                 x2 = rnorm(3, 100, 5),
                 x3 = sample(c("U","L","P"), replace = TRUE, 3))
head(traindf)
sapply(traindf,class) #input categorical variable can be either character or factor
```

## problem with model.matrix() function

```{r}
f1=formula("~x1+x2")
head(model.matrix(f1, traindf),2)
head(model.matrix(f1, newdf),2)
```

Note that the two outputs have different numbers of columns, which is a problem when applying a fitted model to new data. To avoid this, column `x1` in both datasets must be converted to a factor with exactly the same levels. That is cumbersome when there are many categorical columns. In addition, the parameters of other transformations, such as orthogonal polynomials, also need to be saved.

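
The manual workaround described above can be sketched in base R as follows. This is only an illustration on made-up data: `train_demo`, `new_demo`, and `g_levels` are hypothetical names, not part of the package, and the sketch does not describe how ModelMatrixModel works internally.

```{r}
# Illustrative data only; `train_demo` and `new_demo` are hypothetical names.
train_demo <- data.frame(g = c("A", "B", "C", "A"), v = c(1, 2, 3, 4))
new_demo   <- data.frame(g = c("B", "A"), v = c(5, 6))

# Record the levels seen in training, then reuse them for the new data.
g_levels <- sort(unique(train_demo$g))
train_demo$g <- factor(train_demo$g, levels = g_levels)
new_demo$g   <- factor(new_demo$g, levels = g_levels)

mm_train <- model.matrix(~ g + v, train_demo)
mm_new   <- model.matrix(~ g + v, new_demo)

# With shared levels, both matrices have the same columns.
identical(colnames(mm_train), colnames(mm_new))
```

The level vector has to be stored and reapplied for every categorical column, which is the bookkeeping the package is meant to take over.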

## ModelMatrixModel comes to the rescue

### fit data to create ModelMatrixModel object

```{r}
f2=formula("~ 1+x1+x2") # "1" is needed in order to output the intercept column
mm=ModelMatrixModel(f2,traindf,remove_1st_dummy = TRUE,sparse = FALSE)
```

```{r}
class(mm)
head(mm$x,2) #note "_Intercept_" is the intercept column
```

### transform new data

```{r}
mm_pred=predict(mm,newdf)
head(mm_pred$x,2)
```

## dummy variable

### keep first dummy variable

```{r}
mm=ModelMatrixModel(~x1+x2+x3,traindf,remove_1st_dummy = FALSE)
```

The default is to keep the first dummy variable.

```{r}
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```

### dummy variable with interaction

#### keep 1st dummy variable

```{r}
mm=ModelMatrixModel(~x2+x3+x2:x3,traindf)
data.frame(as.matrix(head(mm$x,2))) # ':' in column name is replaced with '_X_'
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```

#### remove 1st dummy variable

```{r}
mm=ModelMatrixModel(~x2*x3,traindf,remove_1st_dummy = TRUE)
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```

### invalid level in new data

It is common for a categorical column in new data to contain an invalid level, such as a value that does not occur in the training data. This can be handled as follows.

```{r}
mm=ModelMatrixModel(~x2+x3,traindf)
data.frame(as.matrix(head(mm$x,2)))
newdf2=newdf
newdf2[1,'x3']='z' #create invalid level
mm_pred=predict(mm,newdf2,handleInvalid = "keep")
```

The default is to keep the row with the invalid level, i.e. to set all of its dummy variables to 0. With `handleInvalid = "error"`, an error is thrown instead.

```{r}
data.frame(as.matrix(head(mm_pred$x,2)))
```

## poly() in formula

ModelMatrixModel can save the parameters of orthogonal polynomials.

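
For context, base R's `poly()` keeps the coefficients of its orthogonal basis as attributes of the result, and `predict()` can reuse them on new values. The sketch below shows that idea on made-up vectors. `x_train`, `x_new`, `p_train`, and `p_new` are illustrative names, and the sketch is not a description of the package's internals.

```{r}
# Illustrative vectors only; not taken from the vignette data.
x_train <- c(1, 2, 3, 4, 5, 6)
x_new   <- c(2.5, 7)

# Fit the orthogonal basis on the training values.
p_train <- poly(x_train, 2)

# Reuse the stored coefficients so the new values land in the same basis.
p_new <- predict(p_train, x_new)

# Calling poly(x_new, 2) instead would try to build a basis from x_new alone,
# which fails here because degree 2 needs more than 2 unique points.
dim(p_new)
```

ModelMatrixModel offers this through the formula, as the following chunks show.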

```{r}
mm=ModelMatrixModel(~poly(x2,3)+x3,traindf)
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```

Raw polynomial transformation also works.

```{r}
mm=ModelMatrixModel(~poly(x2,3,raw=TRUE)+x3, traindf)
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```

## scale and center

The training dataset can be centered and scaled, and the same centering and scaling parameters can then be applied to a new dataset.

```{r}
mm=ModelMatrixModel(~x2+x3,traindf,scale = TRUE,center = TRUE)
data.frame(as.matrix(head(mm$x,2)))
mm_pred=predict(mm,newdf)
data.frame(as.matrix(head(mm_pred$x,2)))
```
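
As a point of comparison, the same idea can be sketched by hand with base R's `scale()`, which records the center and scale it used as attributes. The data and the names `train_num`, `new_num`, `ctr`, and `scl` are made up for illustration. This shows the general technique only and makes no claim about which columns ModelMatrixModel scales or how it stores the parameters.

```{r}
# Illustrative matrices only; `train_num` and `new_num` are hypothetical names.
train_num <- matrix(c(98, 101, 104, 97, 100), ncol = 1, dimnames = list(NULL, "v"))
new_num   <- matrix(c(99, 110), ncol = 1, dimnames = list(NULL, "v"))

# Center and scale the training column; scale() records what it used.
train_scaled <- scale(train_num, center = TRUE, scale = TRUE)
ctr <- attr(train_scaled, "scaled:center")
scl <- attr(train_scaled, "scaled:scale")

# Apply the training mean and standard deviation to the new rows.
new_scaled <- scale(new_num, center = ctr, scale = scl)
new_scaled[, "v"]
```

Scaling the new rows with their own mean and standard deviation would put them on a different scale from the training data, which is why the training parameters are reused.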