--- title: "Hypothesis test in admixture models" author: "Xavier Milhaud" bibliography: references.bib output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Hypothesis test in admixture models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) ``` ```{r setup} library(admix) ``` We remind that a random variable $X$ following an admixture distribution has cumulative distribution function (cdf) $L$ given by $$L(x) = pF(x) + (1-p)G(x), \qquad x \in \mathbb{R},$$ where $G$ is a mixture component whose distribution is perfectly known, whereas $p$ and $F$ are unknown. In this setting, if no parametric assumption is made on the unknown component distribution $F$, the mixture is considered as a semiparametric mixture. For an overview on semiparametric extensions of finite mixture models, see [@Xiang19]. The goal of this vignette is to introduce the functionalities that enable to perform hypothesis tests on the unknown component distribution $F$. We aim to test whether $F$ belongs to certain parametric family (e.g. the Gaussian one) in a 1-sample case, or if two different decontaminated versions of $F_1$ and $F_2$ (obtained from two observed samples $X_1$ and $X_2$) are similar in the K-sample case ($K \geq 2$). All the specific tests presented hereafter can be performed using one single generic function for testing with appropriate arguments, the so-called $admix\_test$ function. # The one-sample case only available to symmetric unknown density In this setting, the test to be performed is a parametric family testing, i.e. $$H_0: \, F\in \mathcal{F} \qquad \mbox{against} \qquad H_1: \, F\notin \mathcal{F},$$ where $\mathcal{F}=\left\{F_\theta:~\theta\in \Theta \right\}$. The support of the known component density has to be in line with the one of the unknown component density. Such tests have been introduced in [@PommeretVandekerkhove2019], and the idea underlying this hypothesis test follows these steps: 1. decompose the observed density and the known density in an orthonormal polynomial basis, 2. get the expansion coefficients of such densities, 3. reformulate the null hypothesis of the test using these coefficients, 2. adopt a $\chi^2$ test strategy that relies on Central Limit Theorem (CLT) results on estimators of the (unknown) weight related to the unknown component distribution $F$. Because of the use of asymptotically normal estimators, it is not possible to use the estimator provided in [@PatraSen2016] to perform hypothesis testing. On the contrary, @BordesVandekerkhove2010 propose an estimator that can be used if the unknown component density is assumed to be symmetric. More generally, @PommeretVandekerkhove2019 give more details about the distribution of the test statistic under the null (hypothesis $H_0$), and under the alternative $H_1$. Here, the implemented function allows to perform the so-called Gaussianity test, meaning that the parametric family against which the unknown component is tested belongs to Gaussian distributions. Below is an example of hypothesis testing in this 1-sample case: ```{r} ####### Under the null hypothesis H0. ## Simulate mixture data: mixt1 <- twoComp_mixt(n = 300, weight = 0.6, comp.dist = list("norm", "norm"), comp.param = list(c("mean" = 2, "sd" = 0.5), c("mean" = 0, "sd" = 1))) data1 <- getmixtData(mixt1) ## Define the admixture model: admixMod <- admix_model(knownComp_dist = mixt1$comp.dist[[2]], knownComp_param = mixt1$comp.param[[2]]) admix_test(samples = list(data1), admixMod = list(admixMod), conf_level = 0.95, test_method = "poly", ask_poly_param = FALSE, support = "Real") ``` The result of the test is that we cannot reject the null hypothesis $H_0$, which is in line with the specified distribution for the unknown component. Indeed, simulated data is a Gaussian mixture with two components, i.e. $F \sim \mathcal{N}(\mu,\sigma)$ where $\mu=2$ and $\sigma=0.5$. # The two-sample case Let us introduce two random samples $X_1$ and $X_2$ following admixture models, such that \begin{align*} \left\{ \begin{array}{l} L_1(x) = (1-p_1)G_1(x) + p_1F_1(x) \\ L_2(x) = (1-p_2)G_2(x) + p_2F_2(x), \end{array} \right. \end{align*} The goal here is to perform the following hypothesis test: $$H_0: ~ F_1=F_2 \qquad \mbox{against} \qquad H_1: F_1\neq F_2.$$ ## Case of symmetric unknown densities In this framework, we assume that $F_1$ and $F_2$ both have a symmetric density. This way the normally-distributed estimator of $p_1$ and $p_2$, proposed in [@BordesVandekerkhove2010], can be used together with the testing strategy suggested in [@MilhaudPommeretSalhiVandekerkhove2022]. This testing strategy is closely connected to [@PommeretVandekerkhove2019], where the computation of the expansion coefficients is duplicated on each of the two samples under study. In what follows, we simulate two samples under the null and check whether the test provides satisfactory results. ```{r} mixt1 <- twoComp_mixt(n = 600, weight = 0.8, comp.dist = list("norm", "norm"), comp.param = list(list("mean" = 3, "sd" = 0.5), list("mean" = 0, "sd" = 1))) mixt2 <- twoComp_mixt(n = 800, weight = 0.7, comp.dist = list("norm", "norm"), comp.param = list(list("mean" = 3, "sd" = 0.5), list("mean" = 0, "sd" = 1))) data1 <- getmixtData(mixt1) data2 <- getmixtData(mixt2) ## Define the admixture models: admixMod1 <- admix_model(knownComp_dist = mixt1$comp.dist[[2]], knownComp_param = mixt1$comp.param[[2]]) admixMod2 <- admix_model(knownComp_dist = mixt2$comp.dist[[2]], knownComp_param = mixt2$comp.param[[2]]) ## Using expansion coefficients in orthonormal polynomial basis: admix_test(samples = list(data1,data2), admixMod = list(admixMod1,admixMod2), conf_level = 0.95, test_method = "poly", ask_poly_param = FALSE, support = "Real") ``` The hypothesis test concludes that the null hypothesis cannot be rejected, once again in line with what was expected given the specified parameters when simulating the data. Note that the following arguments, involved in subroutines, were set to default values (but that the user can choose them setting parameter 'ask_poly_param' to TRUE): * 'est_method' is set to 'BVdk' to tell the program to estimate the unknown proportions $p_1$ and $p_2$ using the estimator proposed in [@BordesVandekerkhove2010], * 'K' equals 3 to mention that such expansions are computed up to the third order of the decomposition in the polynomial basis, * 's' equals 0.25 as the penalization rate involved in the penalization rule used in [@MilhaudPommeretSalhiVandekerkhove2022]. When the unknown component distributions are not supposed to have symmetric densities, another solution to perform the test is to set 'est_method' to 'PS' following 'ask_poly_param' set to TRUE, but keeping in mind that plugging in such estimators for the test procedure should not be allowed in theory. However, that works quite well in practice. Another solution to perform this test in full generality is to use the IBM method (see below). ## Case of fully unknown densities Estimation of the unknown quantities is made by the Inversion - Best Matching approach, see [@MilhaudPommeretSalhiVandekerkhove2024a]. In this case, one can still use the function with same first arguments except 'method', and 'method' should be set to 'icv'. The user also has to define the number of simulated gaussian processes used to tabulate the test statistic distribution ('n_sim_tab'), and can accelerate computations using parallel computations and choosing an adequate number of cpus. Other arguments such as $support$ are useless. # The K-sample case We introduce hereafter a natural extension of the two-sample case to the K-sample one, see [@MilhaudPommeretSalhiVandekerkhove2024b]. In what follows, the K-sample test is illustrated within the framework of the IBM approach, i.e. using the associated inner convergence property. Of course, in the case when all the unknown component densities are assumed to be symmetric, one could use a pairwise version of the two sample test using the comparison of expansion coefficients in a polynomial orthonormal basis, associated to the estimation method provided by [@BordesVandekerkhove2010]. Consider $K$ samples. For $i=1,...,K$, sample $X^{(i)} = (X_1^{(i)}, ..., X_{n_i}^{(i)})$ follows $$L_i(x) = p_i F_i(x) + (1-p_i) G_i, \qquad x \in \mathbb{R}.$$ The test to perform is given by $$H_0 : \; F_1 = ... = F_K \qquad \mbox{against} \qquad H_1: \; F_i \neq F_j \quad \mbox{for some} \quad i \neq j.$$ We use the IBM approach to do so, where assumptions are (straightforwardly) adapted to deal with the $K$ samples. Basically, we apply the theoretical results of IBM for each pair of populations $(i,j)$, and then build a series of embedded statistics. Consider the set of pair indices: ${\cal S}(K) = \{(i,j)\in \mathbb{N}^2 ; \; 1\leq i