--- title: "How I can predict events on a new dataset?" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{predict} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) ``` # Load packages ```{r setup} library(recforest) library(dplyr) ``` # Prepare data We use the built-in dataset `bladder1_recforest` for this example. We build two subsamples of initial data for training and testing the model. ```{r} data("bladder1_recforest") id_individuals_bladder1_recforest <- unique(bladder1_recforest$id) train_ids <- sample(id_individuals_bladder1_recforest, size = 100, replace = FALSE) test_ids <- setdiff(id_individuals_bladder1_recforest, train_ids) train_bladder1_recforest <- bladder1_recforest %>% filter(id %in% train_ids) test_bladder1_recforest <- bladder1_recforest %>% filter(id %in% test_ids) ``` # Train a recforest model Hyperparameters are user-fixed (to be optimized in real-world settings). Considering the small number of predictors, `mtry` was set to 2. For further details on hyperparameters, call `?train_forest`. ```{r} set.seed(1234) trained_forest <- train_forest( data = train_bladder1_recforest, id_var = "id", covariates = c("treatment", "number", "size"), time_vars = c("t.start", "t.stop"), death_var = "death", event = "event", n_trees = 3, n_bootstrap = round(2 * length(train_ids) / 3), mtry = 2, minsplit = 3, nodesize = 15, method = "NAa", min_score = 5, max_nodes = 20, seed = 111, parallel = FALSE, verbose = FALSE ) ``` # Predict on new data Predictions from recforest model are the expected mean cumulative number of recurrent events for each individual at the end of follow-up. Evaluations on new data based on the 3 metrics (C-index for recurrent events, Integrated MSE for recurrent events and Integrated Score for recurrent events) will be available soon. ```{r} predictions <- predict( trained_forest, newdata = test_bladder1_recforest, id_var = "id", covariates = c("treatment", "number", "size"), time_vars = c("t.start", "t.stop"), death_var = "death" ) ```