--- title: "Benchmarks: discretize()" output: rmarkdown::html_vignette: mathjax: default fig_caption: true toc: true section_numbering: true css: ggsci.css bibliography: RJreferences.bib vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Benchmarks: discretize()} --- ```{r, echo=FALSE, message=TRUE, results='asis'} is.pkg <- all(c("RTCGA.rnaseq", "microbenchmark", "RWeka", "pkgdown") %in% rownames(installed.packages())) if(!is.pkg) { message("Please install all suggested packages to run benchmark.") } if(is.pkg && !pkgdown::in_pkgdown()) { is.pkg <- FALSE cat("### Benchmark is not available here", "(we don't want to run this code on the CRAN server).", "Please visit the ", paste(sep = "", "https://mi2-warsaw.github.io/FSelectorRcpp/", "articles/benchmarks_discretize.html") ) } ``` ```{r setup, eval=is.pkg, include = FALSE} library(knitr) fig.path <- if(pkgdown::in_pkgdown()) { "benchmarks_discretize_files/figure-html/" } else { "bench-figs/bench-" } opts_chunk$set( comment = "", fig.width = 8, fig.height = 8, message = FALSE, warning = FALSE, tidy.opts = list( keep.blank.line = TRUE, width.cutoff = 150 ), options(width = 150), eval = TRUE, fig.path = fig.path ) ``` ```{r, eval=is.pkg} library(microbenchmark) library(FSelectorRcpp) library(RWeka) ``` # About discretization In the statistical modelling, the **discretization** is the process of transferring continuous explanatory variables into discrete counterparts. Problems caused by categorization of continuous variables are known and widely spread [@HarrellDISC], but in some cases there appear an algorithmic requirement for the discretization. Moreover, there exist few algorithms, like **Decision Trees** [@Salzberg1994], where continuous attributes are discretized during the learning process. Other reason for variable discretization include increasing the speed of induction algorithms [@Catlett1991]. Even though many categorization algorithms have been developed [@Holte1993, @Chan169942], in this article we focus on a recursive entropy minimization heuristic for categorization coupled with a **Minimum Description Length (MDL)** critetion [@rissanen1986] that controls the number of intervals produced over the continuous space. This is motivated by the original usage of this algorithm in **FSelector** package, which is our baseline. [@Dougherty95supervisedand] showed better performance of classification for discretized feature set on real-world datasets and states that the described method was found promising by the authors not only for the local discretization but also for global discretization [@Ting94discretizationof]. # Comparisons on RTCGA data ## About The Cancer Genome Atlas > The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing - [https://cancergenome.nih.gov/](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). [RTCGA team](https://github.com/orgs/RTCGA/people) converted selected datasets from this study into few separate packages that are hosted on Bioconductor. These R packages make selected datasets easier to access and manage. Data sets in RTCGA packages are large and cover complex relations between clinical outcomes and genetic background. To use RTCGA install package with instructions from it's [Bioconductor home page](https://bioconductor.org/packages/release/bioc/html/RTCGA.html) ```{r, eval = FALSE} ## try https:// if https:// URLs are not supported source("https://bioconductor.org/biocLite.R") biocLite("RTCGA") ``` ## Dependent variable Specifies whether a sample came from tumour or healthy tissue. It is denoted by 14th sign in `bcr_patiend_barcode` variable in `BRCA.rnaseq` dataset that containt RNA-Seq genes' expressions for patients suffering from Breast Cancer. ## Independent variables It is mandatory to get rid of special signs from their names. ```{r, eval=is.pkg} library(RTCGA.rnaseq) BRCA.rnaseq <- RTCGA.rnaseq::BRCA.rnaseq BRCA.rnaseq$bcr_patient_barcode <- factor(substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)) names(BRCA.rnaseq) <- gsub(pattern = "?", replacement = "q", x = names(BRCA.rnaseq), fixed = TRUE) names(BRCA.rnaseq) <- gsub(pattern = "[[:punct:]]", replacement = "_", x = names(BRCA.rnaseq)) ``` ## More - [RTCGA Website](https://rtcga.github.io/RTCGA/) # Discretization Benchmarks ```{r, eval=is.pkg} library(ggplot2) library(pbapply) library(tibble) names_by <- function(nameBy, vecBy) { c(paste0(nameBy, sapply(max(nchar(vecBy)) - nchar(vecBy), function(zeros) { paste0(rep(0, zeros), collapse = "") }), vecBy)) } get_times <- function(fun, vecRows, vecCols, nTimes) { forRows <- pblapply(vecRows, function(nRows) { forCols <- pblapply(vecCols + 1, function(nCols) { suppressWarnings(microbenchmark(fun(bcr_patient_barcode ~ ., BRCA.rnaseq[1:nRows, 1:nCols]), times = nTimes)) }) colsNames <- names_by(nameBy = "columns_", vecBy = vecCols) setNames(forCols, colsNames) }) rowsNames <- names_by(nameBy = "rows_", vecBy = vecRows) setNames(forRows, rowsNames) } make_df <- function(bm_data, package) { tibble( rows = rep( x = as.integer( gsub( pattern = "rows_0*", replacement = "", x = names(bm_data) ) ), each = unique(sapply(bm_data, length)) ), columns = as.integer( gsub( pattern = "columns_0*", replacement = "", x = as.vector(sapply(bm_data, names)) ) ), time = as.vector( sapply( bm_data, function(setOfCols) { sapply( setOfCols, function(set) { median(set$time) / 10 ^ 9 } ) } ) ), package = package ) } bm_fsr_discretize <- get_times(fun = discretize, vecRows = seq(from = 200, to = 1000, by = 200), vecCols = c(10, 100, 500, 1000), nTimes = 1L) bm_rwk_discretize <- get_times(fun = Discretize, vecRows = seq(from = 200, to = 1000, by = 200), vecCols = c(10, 100, 500, 1000), nTimes = 1L) df_fsr <- make_df(bm_data = bm_fsr_discretize, package = "FSelectorRcpp") df_rwk <- make_df(bm_data = bm_rwk_discretize, package = "RWeka") bm_df <- rbind(df_fsr, df_rwk) make_plot <- function(y) { yUnit <- gsub(pattern = "time", replacement = "s", x = y) ggplot(data = bm_df, aes_string(x = "rows", y = y, color = "package")) + facet_grid(columns ~ ., labeller = label_both) + geom_point() + xlab("Number of rows [-]") + ylab(paste0("Elapsed time [", yUnit, "]")) + ggtitle(label = "Microbenchmark: discretization function comparison", subtitle = "for different sizes of data sets") + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), legend.position = "top") } bm_plot <- make_plot("time") bm_plot_log <- make_plot("log(time)") ``` ## Plots ```{r plots, eval=is.pkg, echo=FALSE} bm_plot bm_plot_log ``` # References