---
title: "refineR: Reference Interval Estimation using Real-World Data (RWD)"
author: "Tatjana Ammer & Christopher M Rank"
date: "`r Sys.Date()`"

output: 
  html_document:
    theme: default
    toc: true
    toc_depth: 3
    
vignette: >
  %\VignetteIndexEntry{refineR: Reference Interval Estimation using Real-World Data (RWD)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r global_options, echo=FALSE, eval=TRUE}
knitr::opts_chunk$set(fig.width=7, fig.height=5, fig.align='center', echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE)
			  
# increasing the width of the stdout-stream
options(width=200)
```

## Introduction

The R-package **refineR** implements the recently published, state-of-the-art
indirect method, refineR (Ammer et al. 2021, description of refineR v1.0) which
was developed for the estimation of reference intervals using Real-World Data
(RWD) [https://doi.org/10.1038/s41598-021-95301-2]. It takes routine
measurements of diagnostic tests, containing pathological and non-pathological
samples as input and uses sophisticated statistical methods to derive a model
describing the distribution of the non-pathological samples. This distribution
can then be used to derive reference intervals. The R-package offers various
convenience functions to estimate the model, compute reference intervals, as
well as to print and plot the estimated results. Additional guidance on the
usage of the **refineR** algorithm is given in Ammer et al. 2023
[https://doi.org/10.1093/jalm/jfac101].


## Input Data
The R-package **refineR** comes with five simulated datasets that mimick routine
measurements of biomarkers observed in laboratory practice. These datasets
describe a mixed distribution containing non-pathological and pathological
values. The datasets differ in number of samples, pathological fraction and
underlying non-pathological distribution. A detailed description of the
different testcases can be found in the package documentation.

Prefiltering, cleaning, and possible partitioning of the extracted data from the
laboratory information system is not included in the **refineR** package and has
to be carried out in advance. The simulated datasets used here for showcasing
the application reflect already preprocessed datasets. 

For demonstrating the application of the functions provided by the **refineR**
package, *testcase 4* is used as an example dataset:  

```{r load_testcase4, echo=TRUE}
# load refineR package and load data
library(refineR)
head(testcase4)
```

For evaluating your own dataset, you may import the data from a
*.CSV* file:

```{r load_mydata, echo=TRUE}

# open help of read.csv function to get familiar with its parameters
#?read.csv

# set the file path and parameters according to your input file and import dataset
#mydata <- read.csv(file = "file path to mydata.csv", header = TRUE, sep = ",", dec = ".")
#head(mydata)

# extract the column containing the numeric test results
#mydata2 <- mydata[, "column with test results"]
# example how to run refineR estimation
#fit <- findRI(Data = mydata2)
```

## Model Estimation and Presentation of Results

### Default Settings

The main function of **refineR** is *findRI()* that takes an input data set and
estimates the model parameters lambda, mu and sigma of a Box-Cox transformed
normal distribution that best explains the non-pathological distribution. The
optimization is carried out via a multi-level grid search to minimize the cost
function (negative log-likelihood with regularization). For a detailed
description of the method (v 1.0), please refer to Ammer et al., 2021. The
*findRI()* function takes the following main function arguments: 

- *Data* 			... (*numeric*) values specifying data points comprising
  pathological and non-pathological values
- *model*			... (*character*) specifying the applied model (can be
  either *BoxCox* (default, 1-parameter Box-Cox transformation), *modBoxCoxFast*
  or *modBoxCox* (2-parameter Box-Cox transformation), *modBoxCoxFast* offers 
  faster but less accurate results than *modBoxCox*
- *NBootstrap*		... (*integer*) specifying the number of bootstrap
  repetitions		
- *seed*			... (*integer*) specifying the seed used for bootstrapping  

To run the model estimation using the default parameters just the
(pre-processed) input data is required (*Data*):

```{r run_refineR_default, echo=TRUE}
# run refineR estimation and print resulting RWDRI object
fit <- findRI(Data = testcase4)
print(fit)
```

The *print()* method comprises an overview of the model estimation results.
First, it shows the estimated lower and upper reference limit, per default the
2.5% and 97.5% percentiles. Second, it depicts information about the used
refineR version, the applied model, as well as the number of of data points (*N
data*) and whether the input data was rounded or not (*rounded*). Further, it
shows the estimated model parameters (*lambda*, *mu*, *sigma*, *shift*), the
estimated costs and the estimated fraction of non-pathological values (*NP
fraction*). 

To calculate the estimated reference intervals, the function *getRI()* can be
used. Per default, again the 2.5% and 97.5% percentiles are computed using the
estimated model parameters. 

```{r getRI_default, echo=TRUE}
# compute reference intervals using the estimated model parameters
getRI(fit)
```

We can now also take a look at the result by using the *plot()* function. 

```{r plot_default, echo=TRUE}
# plot the estimated model 
plot(fit)
```

The plot shows the raw input data (gray histogram) as well as the estimated
model for the non-pathological distribution (green curve) and the estimated
reference intervals (per default again the 2.5% and 97.5% percentiles) as green
dashed lines.


### Advanced Settings 

#### Computation of Confidence Intervals 
As mentioned, the *findRI()* function can take additional function arguments. To
calculate confidence intervals for the estimation, bootstrapping is required.
The number of bootstrap iterations can be set by the argument *NBootstrap*
(default: *0*). For demonstration purposes and due to the increased computation
time, we used a small number of iterations (*NBootstrap = 20*). However, for use
in real-world analysis, we recommend to use *NBootstrap >= 200*. When
using bootstrapping, the *print()* function then also gives the estimated
confidence intervals for the reference limits (with a default confidence level
of 95%). 

```{r run_refineR_bootstrap, echo=TRUE}
# run refineR estimation with 20 bootstrap iterations
fit.bs <- findRI(Data = testcase4, NBootstrap = 20)
print(fit.bs)
```

#### Estimation using Two-Parameter Box-Cox Transformation 
In addition to the one-parameter Box-Cox transformation, the *refineR* algorithm
also offers the option to use the two-parameter (modified) Box-Cox
transformation. This model takes an additional shift parameter to better model
skewed distributions that are shifted away from zero. 

```{r run_refineR_modBoxCox, echo=TRUE}
# run refineR estimation with alternative model (two-parameter (modified) Box-Cox transformation)
fit.mbc <- findRI(Data = testcase4, model = "modBoxCox")
print(fit.mbc)
```

#### Print and getRI Function Arguments
The *getRI()* as well as the *print()* function can take additional function
arguments as well: 

 - *x*				... (*object*) of class **RWDRI** (estimated model)  
 - *RIperc*			... (*numeric*) value specifying the percentiles, which
   define the reference interval 
- *CIprop*			... (*numeric*) value specifying the confidence levels of
  the confidence intervals
- *pointEst* 		... (*character*) specifying the point estimate
  determination: (1)  using the full dataset (*fullDataEst*), (2) calculating
  the median from all bootstrap samples (*medianBS*), (3) calculating the mean
  from all bootstrap samples ("meanBS"), option (2) and (3) only work if
  NBootstrap>0 

To print and compute certain percentiles of the estimated model, set the
*RIperc* argument, e.g. to `c(0.025, 0.5, 0.975)`. 

```{r print_refineR_param, echo=TRUE}
# compute 2.5%, 50% (median), 97.5% percentiles for the estimated model 
getRI(fit, RIperc = c(0.025, 0.5, 0.975))

# print 2.5%, 50% (median), 97.5% percentiles and estimated model parameters
print(fit, RIperc = c(0.025, 0.5, 0.975))
```

To also compute confidence intervals for the estimated reference limits after
using bootstrapping, you can specify the confidence level by setting *CIprop*
(default: 0.95). Further, you can specify if you want to use the full data
estimate as point estimate or the median or mean from the bootstrap samples by
setting *pointEst* to *fullDataEst* (Default), *medianBS*, or *meanBS*. We would
recommend to use the median of all bootstrap samples (*medianBS*) here.  


```{r print_refineR_param_bs, echo=TRUE}
# compute percentiles for estimated model with bootstrapping using the median as point estimate  
getRI(fit.bs, RIperc = c(0.025, 0.975), pointEst = "medianBS")

# print percentiles for estimated model with bootstrapping using the median as point estimate and estimated model parameters
print(fit.bs, RIperc = c(0.025, 0.975), pointEst = "medianBS")
```


#### Plot Function Arguments
The *plot()* function can take additional function arguments as well: 

 - *x*				... (*object*) of class **RWDRI**, estimated model  
 - *RIperc*			... (*numeric*) value specifying the percentiles, which
   define the reference interval 
- *Nhist*			... (*integer*) number of bins in the histogram (derived
  automatically if not set)
- *showCI*			... (*logical*) specifying if the confidence intervals are
  shown
- *showPathol*		... (*logical*) specifying if the estimated pathological
  distribution shall be shown
- *showBSModels*	... (*logical*) specifying if the estimated bootstrapping
  models shall be shown
- *showValue*		... (*logical*) specifying if the exact value of the 
  estimated reference intervals shall be shown above the plot
- *CIprop*			... (*numeric*) value specifying the central region for
  estimation of confidence intervals
- *pointEst* 		... (*character*) specifying the point estimate
  determination: (1)  using the full dataset (*fullDataEst*), (2) calculating
  the median from all bootstrap samples (*medianBS*),(3) calculating the mean
  from all bootstrap samples ("meanBS"), option (2) and (3) only work if 
  NBootstrap>0 
- *scalePathol*		... (*logical*) specifying if the estimated pathological
  distribution shall be weighted with the ratio of pathol/non-pathol
- *xlim*			... (*numeric*) vector specifying the limits in x-direction
- *ylim*			... (*numeric*) vector specifying the limits in y-direction
- *xlab*			... (*character*) specifying the x-axis label
- *ylab*			... (*character*) specifying the y-axis label
- *title*			... (*character*) specifying plot title



When plotting a model estimated with bootstrapping, the confidence intervals are
shown per default as light green region. Further, you can specify if you want to
show the estimate for the full data set (*fullDataEst*), or the median
(*medianBS*) or mean (*meanBS*) of all bootstrap results by setting the argument
*pointEst*. Additionally, you can adjust for example the x-limit, the x-label
and the title: 

```{r plot_param, echo=TRUE}
# plot estimated model with bootstrapping with adjusted function arguments
plot(fit.bs, RIperc = c(0.025, 0.5, 0.975), pointEst = "medianBS", xlim = c(0, 100), xlab = "Concentration [U/L]", 
		title = "Testcase 4")
```

If you also want to show the estimated "pathological distribution", set the
*showPathol* argument to **TRUE**. This then adds a red curve to the plot,
representing the difference between the raw histogram and the estimated model of
the non-pathological distribution, i.e. interpretation should be taken with
care. 

If you don't want to show the estimated reference limits, you can disable this,
by setting *showValue* to **FALSE**.


```{r plot_showPathol, echo=TRUE}
# plot estimated model with bootstrapping showing the difference between raw input data and estimated model 
# 		(i.e. 'pathological distribution'), wihtout showing the estimated reference limits
plot(fit.bs, showPathol = TRUE, showValue = FALSE, pointEst = "medianBS", 
		title = "Testcase 4 with pathological distribution")

```


## References 
Ammer, T., Schuetzenmeister, A., Prokosch, HU., Rauh, M., Rank, C.M., Zierk, J.
refineR: A Novel Algorithm for Reference Interval Estimation from Real-World
Data. Scientific Reports 11, 16023 (2021).
https://doi.org/10.1038/s41598-021-95301-2.

Ammer, T., Schuetzenmeister, A., Rank, C.M., Doyle, K. Estimation of Reference
Intervals from Routine Data Using the refineR Algorithm — A Practical Guide. The
Journal of Applied Laboratory Medicine, 8(1):84-91 (2023).
https://doi.org/10.1093/jalm/jfac101.