Parallel Testing

library(segtest)

We provide a walk-through of how to run tests for segregation distortion at many loci in parallel.

Data sets ufit, ufit2, and ufit3 contain the genotyping output of updog::multidog() using three different models.

You can convert this genotyping output to what multi_lrt() expects using multidog_to_g().

If you did not use the f1pp or f1 models, use ether the all_gl (to run tests using genotype log-likelihoods) or all_g (to run tests assuming genotypes are known) options, and you must specify the ID’s of the parents.

o1 <- multidog_to_g(ufit, type = "all_g", p1 = "indigocrisp", p2 = "sweetcrisp")
o2 <- multidog_to_g(ufit, type = "all_gl", p1 = "indigocrisp", p2 = "sweetcrisp")

If you did use the f1pp or f1 models, use either the off_gl (to run tests using genotype log-likelihoods) or off_g (to run tests assuming genotypes are known) options.

o3 <- multidog_to_g(ufit2, type = "off_g")
o4 <- multidog_to_g(ufit2, type = "off_gl")
o5 <- multidog_to_g(ufit3, type = "off_g")
o6 <- multidog_to_g(ufit3, type = "off_gl")

We would recommend always using genotype log-likelihoods. But the option is there for known genotypes.

Parallelization support is provided through the future package. You specify how to implement parallelization using future::plan(). Then you run multi_lrt(). Then you shut down the parallelization with future::plan(). The most common plan is future::multisession(), where you specify the number of parallel processes with the workers argument. You can get the maximum number of workers via future::availableCores()

future::availableCores()
#> system 
#>     16

So a typically workload looks something like this:

future::plan(future::multisession(workers = 2))
mout <- multi_lrt(g = o2$g, p1 = o2$p1, p2 = o2$p2)
future::plan(future::sequential())

The output is a data frame. The most important parts of this are the snp and p_value columns. As a reminder, please ignore the alpha, xi1, and xi2 estimates. Those are noisy. Please don’t use them for real work.

mout[c("snp", "p_value")]
#>            snp     p_value
#> 1   1_44673475 0.205250847
#> 2  11_28836119 0.200142085
#> 3  12_31029646 0.018008460
#> 4   12_8487773 1.000000000
#> 5    2_1426329 0.453995964
#> 6   2_20070837 0.001131421
#> 7   2_40108108 0.410718596
#> 8   4_37820371 0.212515165
#> 9   6_30619745 0.617023681
#> 10   6_4350249 0.649505305

It looks like SNPs 12_31029646 and 2_20070837 are in possible segregation distortion. Let’s look at the posterior mode genotypes of SNP 2_20070837:

o1$g["2_20070837", ]
#>   0   1   2   3   4 
#>  40 166  32   2   0
o1$p1["2_20070837"]
#> 2_20070837 
#>          0
o1$p2["2_20070837"]
#> 2_20070837 
#>          2

So SNP 2_20070837 likely got flagged because of two individuals that are very likely a genotype of 3, which is impossible if the parents have genotypes 0 and 2.

For SNP 12_31029646, let’s compare the expected frequencies against the observed modes.

offspring_gf_2(alpha = 0.1249, xi1 = 1/3, p1 = 2, p2 = 0)
#> [1] 0.2083 0.5834 0.2083 0.0000 0.0000
o1$g["12_31029646", ] / sum(o1$g["12_31029646", ])
#>         0         1         2         3         4 
#> 0.1541667 0.5833333 0.2625000 0.0000000 0.0000000
o1$p1["12_31029646"]
#> 12_31029646 
#>           2
o1$p2["12_31029646"]
#> 12_31029646 
#>           0

So SNP 12_31029646 likely got flagged because there are too many individuals with a genotype of 2, and not enough with a genotype of 0.