---
title: "Extensions"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{extensions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  tidy = TRUE,
  tidy.opts=list(arrow=TRUE,width.cutoff = 50),
  eval=T
)
```

We present some extensions to the package. These aim to demonstrate additional functionality of the package which has not yet been subject to extensive testing.

```{r setup, echo = F}
library(mlpwr)
set.seed(1)
```

Since the Monte Carlo simulations required to generate the results can be time-consuming, we have precomputed them and included them as a data file in the package. This enables a quick overview of the methods without having to rerun the simulations. Below, we load the precomputed results using the `data()` function. We also provide the code used to generate these results later in this vignette.

```{r}
data(extensions_results)
```

# Beyond Power Analysis

Here, we demonstrate two other application scenarios, sensitivity analysis and compromise analysis.

## Sensitivity Analysis

Sensitivity analysis involves fixing the sample size, desired power, and alpha level, and determining the effect size that can be detected with the desired power. We demonstrate this functionality using a t-test.

The fixed parameters are the sample size N, the alpha level, and the goal power. We want to determine the corresponding effect size:
```{r}
N = 100
alpha = .01
goal_power = .95
```

We define our simulation function:
```{r}
simfun_sensitivity = function(esize) {

  # Generate a data set
  dat <- rnorm(n = N, mean = esize)
  # Test the hypothesis
  res <- t.test(dat)
  res$p.value < alpha
}
```

We can find the corresponding effect size using: 
```{r}
# Effect Size Finding
# res1 <- find.design(simfun = simfun_sensitivity,
#                    boundaries = c(0,1),integer=FALSE, power = goal_power,surrogate = "gpr",evaluations=8000)
res1 = extensions_results[[1]]
summary(res1)
```

This leads us to an effect size of about 0.43, as measured by Cohen's d. To confirm the correctness of this result, we check it analytically using the pwr package.
```{r}
library(pwr)
esize_correct = pwr.t.test(n=N,sig.level=alpha,power=goal_power,type="one.sample")$d
esize_correct
```

The analytical result is very close to the result obtained with mlpwr. With a  small simulation study, we further confirm that the analytical result is correct.
```{r}
a = replicate(10000,simfun_sensitivity(esize_correct))
mean(a)
```


## Compromise Analysis

In compromise analysis, the sample size and effect size are fixed in the DGF, and the goal is to find a decision criterion that optimizes the desired ratio of alpha and beta errors. Again, we demonstrate this functionality using a t-test scenario.

The fixed parameters are the sample size N, the effect size (given by Cohen's d) and the desired ratio of the alpha and beta errors:
```{r}
N = 100
esize = .3
desired_ratio = 1
```

We now define our simulation function. Its argument is a given threshold for the test statistic. This function then generates two data sets - one under the null, one under the alternative hypothesis - and evaluates whether the use of this threshold leads to a Type I error (for the data set generated under the null hypothesis) or a Type II error (for the data set generated under the alternative hypothesis).
```{r}
simfun_compromise = function(crit) {

  # Generate a data set
  dat <- rnorm(n = N, mean = 0)
 
  # Test the hypothesis
  res <- t.test(dat)
  a = res$statistic>crit
  
  # Generate a data set
  dat <- rnorm(n = N, mean = esize)
  
  # Test the hypothesis
  res <- t.test(dat)
  b = res$statistic<crit

  return(c(a,b))
}
```

We run this function one to see how the output looks - TRUE indicates that a Type I (1st entry) or Type II error (2nd entry) was observed:
```{r}
simfun_compromise(.1)
```

However, we are not interested in the Type I and Type II errors themselves, but in their ratio. We therefore need to define a function that organizes how the individual simulations are aggregated: 
```{r}
aggregate_fun = function(x) {y=rowMeans(matrix(x,nrow=2));y[2]/y[1]}
```

We can find the optimal decision criterion for reaching the desired ratio of alpha and beta errors using: 
```{r}
# res2 <- find.design(simfun = simfun_compromise, boundaries = c(1,2), power = desired_ratio,integer = FALSE, aggregate_fun = aggregate_fun,surrogate="svr",use_noise=FALSE,evaluations=8000)
res2 = extensions_results[[2]]
summary(res2)
```
The analysis with mlpwr leads to a desired threshold of about 1.48. The ratio of Type I and Type II errors is given as "power" in this output and close to the desired value of 1.

Note that we used some additional arguments in this case. The 'integer' argument tells the algorithm whether to look for integer design parameters - this is set to 'FALSE' for our case of effect sizes. The 'use_noise=FALSE' indicates that we do not account for the Monte Carlo error at each design parameter. This functionality can be added in the future. Note also that we use 'power' as an argument to the find.design function and in the output of summary(res2) to indicate the desired ratio of alpha and beta error. This is a makeshift solution and  we consider rewording this argument to 'goal' in the future. 

To check the correctness of our result, we compare it with an analytical solution obtained from the t-distribution and an numerical optimizer:
```{r}
fun = \(alpha) {
  beta = alpha * desired_ratio
  abs(
    qt(1-alpha,ncp=0, df=N-1) # crit for alpha
    -qt(beta,ncp=esize*sqrt(N), df=N-1) # crit for beta
  )
}
alpha = optim(0.1,fun,lower=10e-10,upper=1-10e-10,method = "L-BFGS-B")$par
crit = qt(1-alpha,ncp=0, df=N-1)
crit
```

This leads to a threshold of about 1.50, which is very close to the solution of mlpwr. Using simulations, we can show that the analytical solution is indeed correct:
```{r}
a = replicate(10000,simfun_compromise(crit))
aggregate_fun(a)
```


# Inhomogeneous Cost Functions

Some uses cases feature more complex cost functions. We cover an example here. It builds on the "ANOVA" simulation function in the "simulation_functions" vignette.

```{r}
library(tidyr)
simfun_anova <- function(n, n.groups) {

  # Generate a data set
  groupmeans <- rnorm(n.groups, sd = 0.2)  # generate groupmeans using cohen's f=.2
  dat <- sapply(groupmeans, function(x) rnorm(n,
                                              mean = x, sd = 1))  # generate data
  dat <- dat |>
    as.data.frame() |>
    gather()  # format

  # Test the hypothesis
  res <- aov(value ~ key, data = dat)  # perform ANOVA
  summary(res)[[1]][1, 5] < 0.01  # extract significance
}

```

We assume that there are different costs for different clusters. We assume that the costs per cluster is cheaper if there is only a small number of clusters. Adding additional clusters gets increasingly expensive (for example, because assessing a large number of groups is more costly).

Specifically, we assume that the first five clusters cost 10, the next five cost 15, the next five cost 20, and so on.

```{r}
prices = c(rep(10,5),rep(15,5),rep(20,5),rep(25,5),rep(30,5),rep(35,5),rep(40,5))

costfun = function(n, n.groups) {
  5 * n + n.groups * sum(prices[1:n.groups])
}
```

We can now find a good design using:
```{r}
# res3 <- find.design(
#   simfun = simfun_anova,
#   costfun = costfun,
#   boundaries = list(n = c(10, 150), n.groups = c(5, 30)),
#   power = .95
# )
res3 = extensions_results[[3]]
summary(res3)
```

The proposed solution contains 10 groups and 135 participants per group.

# 3-Dimensional Designs

As an example for a study design with more than 2 dimensions, we consider a multilevel design with the three dimensions:

* Number of schools
* Number of students per school
* Number of observations per student


```{r}
library(lme4)
library(lmerTest)
```


```{r}
simfun_3d <- function(n.per.school,n.schools, n.obs) {

  # generate data
  school = rep(1:n.schools,each=n.per.school*n.obs)
  student = rep(1:(n.schools*n.per.school),each=n.obs)
  pred = factor(rep(c("old","new"),n.per.school*n.schools*n.obs,each=n.obs),levels=c("old","new"))
  dat = data.frame(school = school, student = student, pred = pred)
  

  params <- list(theta = c(.5,0,.5,.5), beta = c(0,1),sigma = 1.5)
  names(params$theta) = c("school.(Intercept)","school.prednew.(Intercept)","school.prednew","student.(Intercept)")
  names(params$beta) = c("(Intercept)","prednew")
  dat$y <- simulate.formula(~pred + (1 + pred | school) + (1 | student), newdata = dat, newparams = params)[[1]]
  
  # test hypothesis
  mod <- lmer(y ~ pred + (1 + pred | school) + (1 | student), data = dat)
  pvalue <- summary(mod)[["coefficients"]][2,"Pr(>|t|)"]
  pvalue < .01
}

costfun_3d <- function(n.per.school, n.schools,n.obs) {
  100 * n.per.school + 200 * n.schools + .1 * n.obs * n.per.school * n.schools
  }
```


```{r}
# res4 = find.design(simfun = simfun_3d, costfun = costfun_3d, boundaries = list(n.per.school = c(5, 25), n.schools = c(10, 30), n.obs = c(3,10)), power = .95)
res4 = extensions_results[[4]]
summary(res4)
```
In this case, going with the minimum number of observations per student seems to be the ideal choice.