--- title: "Practical examples (unequal variances; ranked data; ANOVA; DeLong's test)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{v_3_examples} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Below are some `POSSA` sequential design power calculation examples that cannot be done with previous common software. ```{r setup} library('POSSA') ``` ### Unequal variances This is a relatively small modification of the first t-test examples given in the [introduction](https://gasparl.github.io/possa/vignettes/v_1_intro.html): Here it's [Welch's test](https://doi.org/10.5334/irsp.82) expecting unequal variances. The effect sizes and total samples are always the same as in the intro examples, for the easier comparison. First, a function to have some added information along with the Welch's test p value. ```r wTestPlus = function(x, y) { m_diff = mean(x) - mean(y) sdx = sd(x) sdy = sd(y) n1 = length(x) n2 = length(y) sd_p = sqrt(((n1 - 1) * (sdx ** 2) + (n2 - 1) * (sdy ** 2)) / (n1 + n2 - 2)) return( list( pval = stats::t.test(x, y, 'less')$p.value, mean_diff = m_diff, sd1 = sdx, sd2 = sdy, smd = m_diff / sd_p ) ) } ``` Then a function that generates two different sample sizes depending on `SDs` parameter: one with equal SDs, one with unequal ones. ```r sampsUnequal = function(sample1, sample2_h, SDs) { if (SDs == 'EqualSDs') { sd1 = 10 sd2 = 10 } else { sd1 = 5.57 sd2 = 13 } list( sample1 = rnorm(sample1, mean = 0, sd = sd1), sample2_h0 = rnorm(sample2_h, mean = 0, sd = sd2), sample2_h1 = rnorm(sample2_h, mean = 5, sd = sd2) ) } ``` (The effect sizes are `0.5` for both SD variations: `neatStats::t_neat(bayestestR::distribution_normal(5000, 5, 10),bayestestR::distribution_normal(5000, 0, 10))$stats`, `neatStats::t_neat(bayestestR::distribution_normal(5000, 5, 13),bayestestR::distribution_normal(5000, 0, 5.57))$stats`.) Then the testing function with some extra information returned (per H0 and H1: mean difference, the two SDs, and the SMD). ```r wTestForUnequal = function(sample1, sample2_h0, sample2_h1) { t0 = wTestPlus(sample1, sample2_h0) t1 = wTestPlus(sample1, sample2_h1) return( c( p_h0 = t0$pval, H0mDiff = t0$mean_diff, H0sd1 = t0$sd1, H0sd2 = t0$sd2, H0smd = t0$smd, p_h1 = t1$pval, H1mDiff = t1$mean_diff, H1sd1 = t1$sd1, H1sd2 = t1$sd2, H1smd = t1$smd ) ) } ``` (Check via `do.call(wTestForUnequal, sampsUnequal(30, 60, 'EqualSDs'))`.) Now the sampling and testing passed to the `sim` function. ```r dfPvalsWelch = sim( fun_obs = list(sampsUnequal, SDs = c('EqualSDs', 'UnequalSDs')), n_obs = c(27, 54, 81), fun_test = wTestForUnequal ) ``` (The descriptives can be quickly checked via `neatStats::peek_neat(dfPvalsWelch, c("H0sd1", "H0sd2", "H1sd1", "H1sd2"), group_by = c('SDs'))`; `neatStats::peek_neat(dfPvalsWelch, c("H0smd", "H1smd"), group_by = c('SDs'))`; `neatStats::peek_neat(dfPvalsWelch, c("H0mDiff", "H1mDiff"), group_by = c('SDs'))`; all correspond to the intended factors.) Finally, the `pow` function. 
Then the testing function, with some extra information returned (per H0 and H1: the mean difference, the two SDs, and the SMD).

```r
wTestForUnequal = function(sample1, sample2_h0, sample2_h1) {
    t0 = wTestPlus(sample1, sample2_h0)
    t1 = wTestPlus(sample1, sample2_h1)
    return(
        c(
            p_h0 = t0$pval,
            H0mDiff = t0$mean_diff,
            H0sd1 = t0$sd1,
            H0sd2 = t0$sd2,
            H0smd = t0$smd,
            p_h1 = t1$pval,
            H1mDiff = t1$mean_diff,
            H1sd1 = t1$sd1,
            H1sd2 = t1$sd2,
            H1smd = t1$smd
        )
    )
}
```

(Check via `do.call(wTestForUnequal, sampsUnequal(30, 60, 'EqualSDs'))`.)

Now the sampling and testing functions are passed to the `sim` function.

```r
dfPvalsWelch = sim(
    fun_obs = list(sampsUnequal, SDs = c('EqualSDs', 'UnequalSDs')),
    n_obs = c(27, 54, 81),
    fun_test = wTestForUnequal
)
```

(The descriptives can be quickly checked via `neatStats::peek_neat(dfPvalsWelch, c("H0sd1", "H0sd2", "H1sd1", "H1sd2"), group_by = c('SDs'))`; `neatStats::peek_neat(dfPvalsWelch, c("H0smd", "H1smd"), group_by = c('SDs'))`; `neatStats::peek_neat(dfPvalsWelch, c("H0mDiff", "H1mDiff"), group_by = c('SDs'))`; all correspond to the intended factors.)

Finally, the `pow` function.

```r
pow(dfPvalsWelch, alpha_locals = NA)
#> # POSSA pow() results #
#> GROUP: pow_EqualSDs
#> N(average-total) = 158.6 (if H0 true) or 98.8 (if H1 true)
#> (p) Type I error: .05000; Power: .89931
#> Local alphas: (1) .02296; (2) .02296; (3) .02296
#> Likelihoods of significance if H0 true: (1) .02291; (2) .01633; (3) .01076
#> Likelihoods of significance if H1 true: (1) .42313; (2) .32331; (3) .15287
#> GROUP: pow_UnequalSDs
#> N(average-total) = 158.7 (if H0 true) or 100.2 (if H1 true)
#> (p) Type I error: .05000; Power: .89276
#> Local alphas: (1) .02204; (2) .02204; (3) .02204
#> Likelihoods of significance if H0 true: (1) .02329; (2) .01502; (3) .01169
#> Likelihoods of significance if H1 true: (1) .41013; (2) .32371; (3) .15891
```

The results are quite similar to those with `var.equal = TRUE`: almost the same with equal variances ([as expected](https://doi.org/10.5334/irsp.82)), and just a little different with unequal variances. But let's now see unequal sample sizes as well.

```r
dfPvalsUnequal = sim(
    fun_obs = list(sampsUnequal, SDs = c('EqualSDs', 'UnequalSDs')),
    n_obs = list(sample1 = (c(27, 54, 81) - 15),
                 sample2_h = (c(27, 54, 81) + 15)),
    fun_test = wTestForUnequal
)

pow(dfPvalsUnequal, alpha_locals = NA)
#> # POSSA pow() results #
#> GROUP: pow_EqualSDs
#> N(average-total) = 158.7 (if H0 true) or 109.3 (if H1 true)
#> (p) Type I error: .05000; Power: .88098
#> Local alphas: (1) .02092; (2) .02092; (3) .02092
#> Likelihoods of significance if H0 true: (1) .02162; (2) .01736; (3) .01102
#> Likelihoods of significance if H1 true: (1) .27931; (2) .41667; (3) .18500
#> GROUP: pow_UnequalSDs
#> N(average-total) = 158.7 (if H0 true) or 94.1 (if H1 true)
#> (p) Type I error: .04998; Power: .92658
#> Local alphas: (1) .02334; (2) .02334; (3) .02334
#> Likelihoods of significance if H0 true: (1) .02344; (2) .01513; (3) .01140
#> Likelihoods of significance if H1 true: (1) .46102; (2) .33549; (3) .13007
```

Now the results depend notably on SD equality: with equal SDs, the power is lower than with equal samples per group, but with unequal SDs it is larger (again in line with [the expected statistical relations](https://doi.org/10.5334/irsp.82)). Along with power, as is again logical (due to more early stops), the expected average sample size in case of H1 changes substantially as well. (See also e.g. O'Brien-Fleming bounds: `pow(dfPvalsUnequal, alpha_locals = c(0.0015, 0.0181, 0.0437), alpha_global = 0.05, adjust = FALSE)`, etc.)

### Ranked data

To simulate ranked data, such as answers on a single Likert scale, [one elegant way](https://stats.stackexchange.com/a/374500/237231) is to first simulate the normally distributed underlying [latent variable](https://en.wikipedia.org/wiki/Latent_variable), then convert it to the desired interval data. See the function below.

```{r}
createOrdinal = function(s, m) {
    findInterval(rnorm(s, mean = m, sd = 1.3),
                 vec = c(-Inf, 1, 2, 3, 4, Inf))
}
```

The output can be checked e.g. as `neatStats::peek_neat(data.frame(x = createOrdinal(1000, 2)), 'x', f_plot = neatStats::plot_neat)`.

Then the sampling function for two independent samples can be as follows.

```{r}
sampleRank = function(sample_size) {
    list(
        sample1 = createOrdinal(sample_size, 2),
        sample2_h0 = createOrdinal(sample_size, 2),
        sample2_h1 = createOrdinal(sample_size, 3)
    )
}
```

(For paired samples, one could first generate normally distributed continuous values via [faux::rnorm_multi](https://debruine.github.io/faux/articles/rnorm_multi.html), then again convert them to interval data with `findInterval()`; see the sketch below.)
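Here is a rough sketch of that paired approach. Everything specific in it (the latent means of `2` vs. `3`, the SD of `1.3`, and the correlation of `.5`) is an illustrative assumption, not part of the example above.

```r
# paired ordinal samples via correlated latent variables (sketch);
# X1: sample1; X2: sample2 under H0; X3: sample2 under H1
sampleRankPaired = function(sample_size) {
    latent = faux::rnorm_multi(
        n = sample_size,
        vars = 3,
        mu = c(2, 2, 3), # no latent difference under H0; a shift under H1
        sd = 1.3,
        r = 0.5 # assumed correlation between all variable pairs
    )
    cuts = c(-Inf, 1, 2, 3, 4, Inf)
    list(
        sample1 = findInterval(latent$X1, vec = cuts),
        sample2_h0 = findInterval(latent$X2, vec = cuts),
        sample2_h1 = findInterval(latent$X3, vec = cuts)
    )
}
```

The testing function would then use a paired test accordingly (e.g., `wilcox.test(sample1, sample2_h0, paired = TRUE)`).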
Then, continuing the independent-samples example, the testing function with the Wilcoxon rank-sum test.

```{r}
testWilc = function(sample1, sample2_h0, sample2_h1) {
    c(
        p_h0 = wilcox.test(sample1, sample2_h0, alternative = 'less')$p.value,
        p_h1 = wilcox.test(sample1, sample2_h1, alternative = 'less')$p.value
    )
}
```

Then `sim()` and `pow()` as usual.

```r
dfPvalsRank = sim(fun_obs = sampleRank,
                  n_obs = c(15, 30, 45),
                  fun_test = testWilc)

pow(dfPvalsRank)
#> # POSSA pow() results #
#> N(average-total) = 90.0 (if H0 true) or 90.0 (if H1 true)
#> (p) Type I error: .05113; Power: .96142
#> Local alphas: (1) none; (2) none; (3) .05000
#> Likelihoods of significance if H0 true: (1) 0; (2) 0; (3) .05113
#> Likelihoods of significance if H1 true: (1) 0; (2) 0; (3) .96142
```

These fixed-design results can be verified via another power simulation package, `MKpower`, as follows.

```r
f1 = function(n) {
    createOrdinal(n, 2)
}
f2 = function(n) {
    createOrdinal(n, 3)
}
MKpower::sim.ssize.wilcox.test(
    f1,
    f2,
    power = 0.9,
    n.min = 45,
    n.max = 47,
    step.size = 1,
    alternative = 'less'
)
#> Wilcoxon rank sum test
#> n = 45
#> emp.power = 0.9629
```

But of course only `POSSA` allows sequential design; `pow(dfPvalsRank, alpha_locals = NA)`, etc.

### ANOVA

The ANOVA power calculation here follows an example of the [Superpower](https://cran.r-project.org/package=Superpower) R package (_Example #1_ [here](https://github.com/cran/Superpower)), for comparability.

Writing ANOVA code in base R is a fuss due to the peculiar input (and output) structure of the dedicated `aov()` function. There are various extension packages available to make this easier via wrapper functions, but for simulations it is better to use the base function to keep the process as fast as possible. Even so, packages like [faux](https://debruine.github.io/faux/) make simulated factorial designs less burdensome – although in the present case a special complication is that, due to the subsampling process for sequential designs, each variable has to be passed from the sampling function to the testing function separately.

In any case, here is how a 2 by 2 mixed ANOVA can be implemented. From this one example, it should be clear how to implement any other combination of between- and within-subject factors.

The details of the example scenario can be read at Superpower's [example #1 description](https://github.com/cran/Superpower) – here, just the minimal necessary information is highlighted: There are two factors for a continuous _rating_ variable: **Voice** (**robot** vs. **human**) and **Emotion** (**sad** vs. **cheerful**). **Voice** is a between-subject factor. **Emotion** is a within-subject factor with a correlation of `.8` between **sad** and **cheerful** ratings. (The **robot** and **human** groups are of equal size.) The null hypothesis (H0) is that all variable means are equal (here simulated as a rating mean of `1`), while the alternative hypothesis (H1) is that the variable means have the specific differences included in the code below (for example, a rating mean of `1.03` for the "**human cheerful**" voice, but a rating mean of `1.41` for the "**human sad**" voice).
Hence, for H0 as well as for H1, two pairs of correlated ("rating") variables should be generated: one pair for **robot**, and one for **human**. Within each pair, one variable will be **sad**, the other **cheerful** (with a correlation of `.8`). Here is how this looks. (Note the grouping notation: `grp_1` for **human** and `grp_2` for **robot**.)

```r
sampleAOV = function(samp_size) {
    dat_h0_human = faux::rnorm_multi(
        n = samp_size,
        vars = 2,
        mu = 1,
        sd = 1.03,
        r = 0.8
    )
    dat_h0_robot = faux::rnorm_multi(
        n = samp_size,
        vars = 2,
        mu = 1,
        sd = 1.03,
        r = 0.8
    )
    dat_h1_human = faux::rnorm_multi(
        n = samp_size,
        vars = 2,
        mu = c(1.03, 1.41),
        sd = 1.03,
        r = 0.8
    )
    dat_h1_robot = faux::rnorm_multi(
        n = samp_size,
        vars = 2,
        mu = c(0.98, 1.01),
        sd = 1.03,
        r = 0.8
    )
    list(
        # human_cheerful
        grp_1_human_cheerful_h0 = dat_h0_human$X1,
        grp_1_human_cheerful_h1 = dat_h1_human$X1,
        # human_sad
        grp_1_human_sad_h0 = dat_h0_human$X2,
        grp_1_human_sad_h1 = dat_h1_human$X2,
        # robot_cheerful
        grp_2_robot_cheerful_h0 = dat_h0_robot$X1,
        grp_2_robot_cheerful_h1 = dat_h1_robot$X1,
        # robot_sad
        grp_2_robot_sad_h0 = dat_h0_robot$X2,
        grp_2_robot_sad_h1 = dat_h1_robot$X2
    )
}
```

Now the testing function requires a data frame to be constructed from the passed variables, to fit the `aov()` function. (Again, some wrapper function such as `ez::ezANOVA` could be used for readability, but this would make the simulation somewhat slower. However, if speed is not very important, those wrapper functions work just as well.) The way of extracting p values from the output (e.g., `aov_sum[[1]][[1]][['Pr(>F)']][1]`, where `aov_sum = summary(aov(...))`) also looks quite arcane, but this is just how `aov()` [stores](https://stackoverflow.com/questions/3366506/extract-p-value-from-aov) this information. You can also use the `POSSA::get_p()` function to get more readable code (e.g., `aov_ps = get_p(aov(...))`, then ``aov_ps$`Error:id`$voice``); though over many iterations this too adds a little bit of extra processing time. In either case, make sure you are extracting the correct p values (e.g., compare the values with the printed results of `summary(aov(...))`).
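To make the extraction more tangible, here is a minimal self-contained sketch; the toy dataset (and every name in it) is purely illustrative, not part of the actual example.

```r
# toy 2 (between: voice) x 2 (within: emotion) dataset, for illustration only
toy_data = data.frame(
    obs = rnorm(40),
    id = factor(rep(paste0('s', 1:20), 2)),
    voice = factor(rep(rep(c('human', 'robot'), each = 10), 2)),
    emotion = factor(rep(c('cheerful', 'sad'), each = 20))
)
aov_fit = aov(obs ~ voice * emotion + Error(id / emotion), data = toy_data)
aov_sum = summary(aov_fit)
aov_sum[[1]][[1]][['Pr(>F)']][1] # p value of the between-subject factor (voice)
aov_ps = get_p(aov_fit)
names(aov_ps) # check the strata names, then extract, e.g., aov_ps[[1]]$voice
```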
Apart from the ANOVA, the function below also includes some specific pair-wise comparisons, namely, comparing the **robot** and **human** ratings separately in the cases of **sad** and **cheerful** voices.

```r
testAOV = function(grp_1_human_cheerful_h0,
                   grp_1_human_cheerful_h1,
                   grp_1_human_sad_h0,
                   grp_1_human_sad_h1,
                   grp_2_robot_cheerful_h0,
                   grp_2_robot_cheerful_h1,
                   grp_2_robot_sad_h0,
                   grp_2_robot_sad_h1) {
    len_grp1 = length(grp_1_human_cheerful_h0)
    len_grp2 = length(grp_2_robot_cheerful_h0)
    raw_data = data.frame(
        obs = c(
            grp_1_human_cheerful_h0,
            grp_1_human_sad_h0,
            grp_2_robot_cheerful_h0,
            grp_2_robot_sad_h0
        ),
        id = c(paste0('g1_', c(1:len_grp1, 1:len_grp1)),
               paste0('g2_', c(1:len_grp2, 1:len_grp2))),
        voice = c(rep('human', len_grp1 * 2), rep('robot', len_grp2 * 2)),
        emotion = c(
            rep('cheerful', len_grp1),
            rep('sad', len_grp1),
            rep('cheerful', len_grp2),
            rep('sad', len_grp2)
        )
    )
    aov_h0 = summary(aov(obs ~ voice * emotion + Error(id / emotion),
                         data = raw_data))
    raw_data$obs = c(
        grp_1_human_cheerful_h1,
        grp_1_human_sad_h1,
        grp_2_robot_cheerful_h1,
        grp_2_robot_sad_h1
    )
    aov_h1 = summary(aov(obs ~ voice * emotion + Error(id / emotion),
                         data = raw_data))
    return(
        c(
            p_voice_h0 = aov_h0[[1]][[1]][['Pr(>F)']][1],
            p_voice_h1 = aov_h1[[1]][[1]][['Pr(>F)']][1],
            p_emo_h0 = aov_h0[[2]][[1]][['Pr(>F)']][1],
            p_emo_h1 = aov_h1[[2]][[1]][['Pr(>F)']][1],
            p_interact_h0 = aov_h0[[2]][[1]][['Pr(>F)']][2],
            p_interact_h1 = aov_h1[[2]][[1]][['Pr(>F)']][2],
            p_sad_rob_vs_hum_h0 = t.test(grp_1_human_sad_h0,
                                         grp_2_robot_sad_h0,
                                         var.equal = TRUE)$p.value,
            p_sad_rob_vs_hum_h1 = t.test(grp_1_human_sad_h1,
                                         grp_2_robot_sad_h1,
                                         var.equal = TRUE)$p.value,
            p_cheer_rob_vs_hum_h0 = t.test(
                grp_1_human_cheerful_h0,
                grp_2_robot_cheerful_h0,
                var.equal = TRUE
            )$p.value,
            p_cheer_rob_vs_hum_h1 = t.test(
                grp_1_human_cheerful_h1,
                grp_2_robot_cheerful_h1,
                var.equal = TRUE
            )$p.value
        )
    )
}
```

It's really worth verifying this lengthy function pair via `do.call(testAOV, sampleAOV(100))`.
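In addition, before running the full simulation, the generated data itself can be sanity-checked against the intended parameters; a small illustrative sketch (the expected values follow directly from the settings in `sampleAOV()` above):

```r
# illustrative checks on a large generated sample
smp = sampleAOV(10000)
mean(smp$grp_1_human_sad_h1) # should be close to 1.41
sd(smp$grp_1_human_sad_h1) # should be close to 1.03
cor(smp$grp_1_human_cheerful_h1, smp$grp_1_human_sad_h1) # close to .8
```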
The `sim()` and `pow()` functions are then straightforward. (However, the simulation can take a while. For quick testing, lower the number of iterations; e.g. `n_iter = 1000` – the same as `Superpower::ANOVA_power`'s default.)

```r
dfPvalsAOV = sim(fun_obs = sampleAOV,
                 n_obs = 40,
                 fun_test = testAOV)
#> Note: Observation numbers groupped as "grp_1" for grp_1_human_cheerful_h0, grp_1_human_cheerful_h1, grp_1_human_sad_h0, grp_1_human_sad_h1.
#> Note: Observation numbers groupped as "grp_2" for grp_2_robot_cheerful_h0, grp_2_robot_cheerful_h1, grp_2_robot_sad_h0, grp_2_robot_sad_h1.

pow(dfPvalsAOV)
#> # POSSA pow() results #
#> N(average-total) = 80.0 (if H0 true) or 80.0 (if H1 true)
#> (p_voice) Type I error: .05096; Power: .17422
#> Local alphas: (1) .05000
#> (p_emo) Type I error: .04933; Power: .79371
#> Local alphas: (1) .05000
#> (p_interact) Type I error: .05020; Power: .65884
#> Local alphas: (1) .05000
#> (p_sad_rob_vs_hum) Type I error: .05047; Power: .40084
#> Local alphas: (1) .05000
#> (p_cheer_rob_vs_hum) Type I error: .05024; Power: .05398
#> Local alphas: (1) .05000
#> Global ("combined significance") type I error: .16453 (included: p_voice, p_emo, p_interact, p_sad_rob_vs_hum, p_cheer_rob_vs_hum; power for reaching the "combined significance": .94742)
```

The same in `Superpower` below. (Caution: with 45,000 iterations, this takes very long.)

```r
designResult <- Superpower::ANOVA_design(
    design = "2b*2w",
    n = 40,
    mu = c(1.03, 1.41, 0.98, 1.01),
    sd = 1.03,
    r = 0.8,
    labelnames = c("voice", "human", "robot",
                   "emotion", "cheerful", "sad"),
    plot = TRUE
)
superPower <- Superpower::ANOVA_power(
    designResult,
    alpha = 0.05,
    nsims = 45000,
    seed = 1234,
    emm = FALSE
)
#> Power and Effect sizes for ANOVA tests
#>                      power effect_size
#> anova_voice         17.74      0.02567
#> anova_emotion       79.58      0.10074
#> anova_voice:emotion 65.85      0.07811
#>
#> Power and Effect sizes for pairwise comparisons (t-tests)
#>                                                               power effect_size
#> p_voice_human_emotion_cheerful_voice_robot_emotion_cheerful  5.507     -0.05149
#> p_voice_human_emotion_sad_voice_robot_emotion_sad           40.667     -0.39404
```

Importantly, of course, with `POSSA` there is the option of sequential testing. Here, it is particularly important to pay attention to which specific one(s) of the included tests (p values) should have stopping alphas (or futility bounds). For instance, as often happens, let's say that the interaction is the only really important test in the entire ANOVA analysis. In that case, it alone could be given stopping alphas and futility bounds, while the rest may be included with non-stopping alphas to obtain the related power and type I error rates. This can be done e.g. as below.

```r
dfPvalsAovSeq = sim(fun_obs = sampleAOV,
                    n_obs = c(15, 30, 45),
                    fun_test = testAOV)
#> Note: Observation numbers groupped as "grp_1" for grp_1_human_cheerful_h0, grp_1_human_cheerful_h1, grp_1_human_sad_h0, grp_1_human_sad_h1.
#> Note: Observation numbers groupped as "grp_2" for grp_2_robot_cheerful_h0, grp_2_robot_cheerful_h1, grp_2_robot_sad_h0, grp_2_robot_sad_h1.

pow(
    dfPvalsAovSeq,
    alpha_locals = list(p_interact = c(0.002, 0.018, 0.044)),
    fut_locals = list(p_interact = c(0.6, 0.4)),
    alpha_loc_nonstop = list(
        p_voice = 0.5,
        p_emo = 0.5,
        p_sad_rob_vs_hum = 0.5,
        p_cheer_rob_vs_hum = 0.5
    )
)
#> # POSSA pow() results #
#> N(average-total) = 56.4 (if H0 true) or 66.5 (if H1 true)
#> (p_interact) Type I error: .05000; Power: .65662
#> Local alphas: (1) .00226; (2) .02030; (3) .04963
#> Likelihoods of significance if H0 true: (1) .00202; (2) .01920; (3) .02878
#> Likelihoods of significance if H1 true: (1) .04331; (2) .34111; (3) .27220
#> Futility bounds: (1) .60000; (2) .40000
#> Likelihoods of exceeding futility alpha if H0 true: (1) .07787; (2) .23536
#> Likelihoods of exceeding futility alpha if H1 true: (1) .15102; (2) .05380
#> (non-stopper: p_voice) Type I error: .50089; Power: .65498
#> Local alphas (secondary): (1) .50000; (2) .50000; (3) .50000
#> Likelihoods of significance if H0 true: (1) .19976; (2) .15896; (3) .14218
#> Likelihoods of significance if H1 true: (1) .11129; (2) .25760; (3) .28609
#> Futility bounds: none
#> (non-stopper: p_emo) Type I error: .49547; Power: .95351
#> Local alphas (secondary): (1) .50000; (2) .50000; (3) .50000
#> Likelihoods of significance if H0 true: (1) .19758; (2) .15911; (3) .13878
#> Likelihoods of significance if H1 true: (1) .16751; (2) .37978; (3) .40622
#> Futility bounds: none
#> (non-stopper: p_sad_rob_vs_hum) Type I error: .49291; Power: .81924
#> Local alphas (secondary): (1) .50000; (2) .50000; (3) .50000
#> Likelihoods of significance if H0 true: (1) .19140; (2) .15609; (3) .14542
#> Likelihoods of significance if H1 true: (1) .12289; (2) .33691; (3) .35944
#> Futility bounds: none
#> (non-stopper: p_cheer_rob_vs_hum) Type I error: .49513; Power: .51316
#> Local alphas (secondary): (1) .50000; (2) .50000; (3) .50000
#> Likelihoods of significance if H0 true: (1) .19322; (2) .15704; (3) .14487
#> Likelihoods of significance if H1 true: (1) .10598; (2) .19667; (3) .21051
#> Futility bounds: none
```
Again, different sample sizes could also be [specified](https://gasparl.github.io/possa/vignettes/v_1_intro.html) for each of the two groups via `n_obs` (e.g. `n_obs = list(grp_1 = c(10, 15, 30), grp_2 = c(20, 45, 60))`).

### DeLong's test

This example comes from the recent planning of a real experiment, in which diagnostic efficiencies were compared between a polygraph-based (autonomic response, AR) and a response-time-based (RT) version of a lie detection method. What's interesting here is that, while the RT predictor variable is expected to be normally distributed for both liars and truthtellers, the AR predictor variable (which is based on a certain integrated p value) is expected to follow a beta distribution for liars and a uniform distribution for truthtellers.

A typical way to measure diagnostic efficiency (here: distinguishing liars from truthtellers) is the area under the receiver operating characteristic curve (AUC), which in this case can be compared with the nonparametric DeLong's test. However, there is no software to calculate power for DeLong's test for independent samples, let alone for sequential analysis. Below is an example of how this can be done with `POSSA`.

Here, the H0 is that the two AUCs are the same, and the H1 is that the RT version yields a larger AUC. (The expectations for the AUC sizes and the distributions are based on previous research. However, since the AUC size is not straightforward to calculate from the given distributions, in order to have equal AUCs in each condition for the H0 simulation, the specific distributions below were arrived at by trial-and-error adjustment of their key parameters.)

```{r}
sampleAUC = function(sampSize) {
    list(
        # AR
        ARtruth = -runif(n = sampSize, 0, 1),
        ARliar = sort(-rbeta(n = sampSize, shape1 = 0.2, 1)),
        # RT
        RTtruth = rnorm(sampSize, mean = 0, sd = 30),
        RTliar_h0 = rnorm(sampSize, mean = 41, sd = 30),
        RTliar_h1 = rnorm(sampSize, mean = 60, sd = 30)
    )
}
```

Below is the testing function. To denote the predictor conditions, `1` represents lies ("positive" cases), and `0` represents truth ("negative" cases, no lies).

```r
testAUC = function(ARtruth, ARliar, RTtruth, RTliar_h0, RTliar_h1) {
    predictors = data.frame(
        guilt = c(rep(0, length(ARtruth)), rep(1, length(ARliar))),
        pred_ans = c(ARtruth, ARliar),
        pred_RTh0 = c(RTtruth, RTliar_h0),
        pred_RTh1 = c(RTtruth, RTliar_h1)
    )
    AucAR = pROC::roc(
        response = predictors$guilt,
        predictor = predictors$pred_ans,
        levels = c(0, 1),
        direction = "<" # second expected larger
    )
    AucRtH0 = pROC::roc(
        response = predictors$guilt,
        predictor = predictors$pred_RTh0,
        levels = c(0, 1),
        direction = "<"
    )
    AucRtH1 = pROC::roc(
        response = predictors$guilt,
        predictor = predictors$pred_RTh1,
        levels = c(0, 1),
        direction = "<"
    )
    return(
        c(
            p_h0 = pROC::roc.test(AucAR,
                                  AucRtH0,
                                  pair = TRUE,
                                  alternative = "less")$p.value,
            p_h1 = pROC::roc.test(AucAR,
                                  AucRtH1,
                                  pair = TRUE,
                                  alternative = "less")$p.value,
            AucAR = as.numeric(pROC::auc(AucAR)),
            AucRtH0 = as.numeric(pROC::auc(AucRtH0)),
            AucRtH1 = as.numeric(pROC::auc(AucRtH1))
        )
    )
}
```

A quick check via `do.call(testAUC, sampleAUC(100000))` is worthwhile, in particular to see (with this very large sample, which thereby gives fairly accurate estimates) that the AUCs are very close in case of H0; for instance, as sketched below.
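(In this illustrative version of the check, only the object name `chk` and the rounding are added for readability.)

```r
# under H0 the two AUCs should be nearly identical;
# under H1 the RT-based AUC should be clearly larger
chk = do.call(testAUC, sampleAUC(100000))
round(chk[c('AucAR', 'AucRtH0', 'AucRtH1')], 3)
```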
Then just the usual `sim()` and `pow()`.

```r
dfPvalsAUC = sim(fun_obs = sampleAUC,
                 n_obs = c(30, 60, 90),
                 fun_test = testAUC)

print(dfPvalsAUC)
#> POSSA sim() results (p values)
#> Sample:
#>    .iter .look .n_total ARtruth ARliar RTtruth RTliar_h      p_h0         p_h1     AucAR   AucRtH0   AucRtH1
#> 1:     1     1      120      30     30      30       30 0.7724812 1.820265e-01 0.9100000 0.8666667 0.9544444
#> 2:     1     2      240      60     60      60       60 0.3407161 3.844820e-03 0.8291667 0.8527778 0.9513889
#> 3:     1     3      360      90     90      90       90 0.1707108 1.842903e-05 0.8129630 0.8544444 0.9574074
#> Descriptives:
#> AucAR:
#>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 0.5878  0.8083  0.8353  0.8334  0.8606  0.9844
#> AucRtH0:
#>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 0.5533  0.8088  0.8353  0.8332  0.8600  0.9844
#> AucRtH1:
#>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 0.7044  0.9058  0.9233  0.9214  0.9389  1.0000

pow(dfPvalsAUC)
#> # POSSA pow() results #
#> N(average-total) = 360.0 (if H0 true) or 360.0 (if H1 true)
#> (p) Type I error: .04689; Power: .79536
#> Local alphas: (1) none; (2) none; (3) .05000
#> Likelihoods of significance if H0 true: (1) 0; (2) 0; (3) .04689
#> Likelihoods of significance if H1 true: (1) 0; (2) 0; (3) .79536

pow(
    dfPvalsAUC,
    alpha_locals = c(0.002, 0.018, 0.044),
    fut_locals = c(0.8, 0.4)
)
#> # POSSA pow() results #
#> N(average-total) = 259.3 (if H0 true) or 292.2 (if H1 true)
#> (p) Type I error: .05000; Power: .78798
#> Local alphas: (1) .00211; (2) .01900; (3) .04645
#> Likelihoods of significance if H0 true: (1) .00187; (2) .01684; (3) .03129
#> Likelihoods of significance if H1 true: (1) .04847; (2) .41316; (3) .32636
#> Futility bounds: (1) .80000; (2) .40000
#> Likelihoods of exceeding futility alpha if H0 true: (1) .01180; (2) .17262
#> Likelihoods of exceeding futility alpha if H1 true: (1) .01216; (2) .03084
```

Instead of the large fixed-design sample size of `90*4 = 360` for any decent SESOI (here: an AUC improvement of about 9%), with the sequential design the average required sample is reduced to 259.3 under H0 and 292.2 under H1, with only a practically negligible `.007` reduction in power (`.788` instead of `.795`).

---

That's all for now.