--- title: "mixgb: Multiple Imputation Through XGBoost" author: "Yongshi Deng" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{mixgb: Multiple Imputation Through XGBoost} %\VignetteEngine{knitr::rmarkdown} --- ## Introduction Mixgb offers a scalable solution for imputing large datasets using XGBoost, subsampling and predictive mean matching. Our method utilizes the capabilities of XGBoost, a highly efficient implementation of gradient boosted trees, to capture interactions and non-linear relations automatically. Moreover, we have integrated subsampling and predictive mean matching to minimize bias and reflect appropriate imputation variability. Our package supports various types of variables and offers flexible settings for subsampling and predictive mean matching. We also include diagnostic tools for evaluating the quality of the imputed values. ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Impute missing values with `mixgb` We first load the `mixgb` package and the `nhanes3_newborn` dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). There are 9 variables with missing values. ```{r} library(mixgb) str(nhanes3_newborn) colSums(is.na(nhanes3_newborn)) ``` To impute this dataset, we can use the default settings. The default number of imputed datasets is `m = 5`. Note that we do not need to convert our data into dgCMatrix or one-hot coding format. Our package will automatically convert it for you. Variables should be of the following types: numeric, integer, factor or ordinal factor. ```{r, eval = FALSE} # use mixgb with default settings imputed.data <- mixgb(data = nhanes3_newborn, m = 5) ``` ### Customize imputation settings We can also customize imputation settings: - The number of imputed datasets `m` - The number of imputation iterations `maxit` - XGBoost hyperparameters and verbose settings. `xgb.params`, `nrounds`, `early_stopping_rounds`, `print_every_n` and `verbose`. - Subsampling ratio. By default, `subsample = 0.7`. Users can change this value under the `xgb.params` argument. - Predictive mean matching settings `pmm.type`, `pmm.k` and `pmm.link`. - Whether ordinal factors should be converted to integer (imputation process may be faster) `ordinalAsInteger` - Whether or not to use bootstrapping `bootstrap` - Initial imputation methods for different types of variables `initial.num`, `initial.int` and `initial.fac`. - Whether to save models for imputing newdata `save.models` and `save.vars`. ```{r, eval = FALSE} # Use mixgb with chosen settings params <- list( max_depth = 5, subsample = 0.9, nthread = 2, tree_method = "hist" ) imputed.data <- mixgb( data = nhanes3_newborn, m = 10, maxit = 2, ordinalAsInteger = FALSE, bootstrap = FALSE, pmm.type = "auto", pmm.k = 5, pmm.link = "prob", initial.num = "normal", initial.int = "mode", initial.fac = "mode", save.models = FALSE, save.vars = NULL, xgb.params = params, nrounds = 200, early_stopping_rounds = 10, print_every_n = 10L, verbose = 0 ) ``` ### Tune hyperparameters Imputation performance can be affected by the hyperparameter settings. Although tuning a large set of hyperparameters may appear intimidating, it is often possible to narrowing down the search space because many hyperparameters are correlated. In our package, the function `mixgb_cv()` can be used to tune the number of boosting rounds - `nrounds`. There is no default `nrounds` value in `XGBoost,` so users are required to specify this value themselves. The default `nrounds` in `mixgb()` is 100. However, we recommend using `mixgb_cv()` to find the optimal `nrounds` first. ```{r} params <- list(max_depth = 3, subsample = 0.7, nthread = 2) cv.results <- mixgb_cv(data = nhanes3_newborn, nrounds = 100, xgb.params = params, verbose = FALSE) cv.results$evaluation.log cv.results$response cv.results$best.nrounds ``` By default, `mixgb_cv()` will randomly choose an incomplete variable as the response and build an XGBoost model with other variables as explanatory variables using the complete cases of the dataset. Therefore, each run of `mixgb_cv()` will likely return different results. Users can also specify the response and covariates in the argument `response` and `select_features` respectively. ```{r} cv.results <- mixgb_cv( data = nhanes3_newborn, nfold = 10, nrounds = 100, early_stopping_rounds = 1, response = "BMPHEAD", select_features = c("HSAGEIR", "HSSEX", "DMARETHN", "BMPRECUM", "BMPSB1", "BMPSB2", "BMPTR1", "BMPTR2", "BMPWT"), xgb.params = params, verbose = FALSE ) cv.results$best.nrounds ``` Let us just try setting `nrounds = cv.results$best.nrounds` in `mixgb()` to obtain 5 imputed datasets. ```{r, eval = FALSE} imputed.data <- mixgb(data = nhanes3_newborn, m = 5, nrounds = cv.results$best.nrounds) ``` ## Inspect multiply imputed values The `mixgb` package used to provide a few visual diagnostics functions. However, we have moved these functions to the `vismi` package, which provides a wide range of visualisation tools for multiple imputation. For more details, please check the `vismi` package on GitHub [Visualisation Tools for Multiple Imputation](https://github.com/agnesdeng/vismi).