#### Statistical Significance Tests
Distinguishing the strength of an association from its statistical significance is important. Indeed, a strong association can be non-significant (*e.g.* a physical law measured with a small sample) and a weak association can be significant (*e.g.* an epidemiological risk factor estimated from a large sample). Significance can be assessed using p-values estimated from the theoretical null distribution or by resampling techniques [@sheskin_handbook_2007]. Because the theoretical null distribution of local association measures is unknown, the `zebu` package estimates p-values by a permutation test. This can be undertaken using the `permtest` function of the package.
The null hypothesis $H_0$ being tested is that the association measure $L$ is equal to 0, that is, that there is no association. The observed association is denoted $L_{obs}$ and the set of permuted associations $L_{perm}$. Moreover, $\#(\ldots)$ denotes a count and $|\ldots|$ the absolute value. The two-sided p-value can then be estimated as follows.
\[
p = \frac{\#(|L_{obs}| < |L_{perm}|)}{\#(L_{perm})}
\]
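As an illustrative sketch of the permutation scheme (not the internals of `permtest`), the p-value for two variables can be computed as follows, where `statistic` is any function returning a global association value; the `perm_pvalue` and `chisq_stat` names are hypothetical:

```{r}
# Sketch of a permutation p-value (not zebu's permtest implementation):
# shuffle one variable to break the association, then compare |L_obs| to |L_perm|
perm_pvalue <- function(x, y, statistic, nb = 1000) {
  l_obs <- statistic(x, y)
  l_perm <- replicate(nb, statistic(x, sample(y)))
  mean(abs(l_obs) < abs(l_perm))  # two-sided p-value
}

# Example statistic: Pearson's chi-squared statistic of the contingency table
chisq_stat <- function(x, y) suppressWarnings(unname(chisq.test(table(x, y))$statistic))

set.seed(1)
x <- sample(letters[1:3], 200, replace = TRUE)
y <- sample(letters[1:3], 200, replace = TRUE)  # independent of x
perm_pvalue(x, y, chisq_stat, nb = 200)
```

Since `x` and `y` are simulated independently, the resulting p-value is expected to be large.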
With chi-squared residuals in a two-dimensional setting, an alternative analytical method for estimating the p-values can be employed as implemented by the `chisqtest` function. This function calculates two-sided p-values for each local chi-squared residual, assuming that these residuals are distributed according to a standard normal distribution. Thus, the two-sided p-value $p$ can be computed as follows where $\Phi(\ldots)$ is the cumulative distribution function of the standard normal distribution, and $L_{obs}$ denotes the observed chi-squared residuals:
\[
p = 2 \times (1 - \Phi(|L_{obs}|))
\]
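For instance, a residual of about 1.96 corresponds to a two-sided p-value near 0.05 (`residual_pvalue` is an illustrative name, not a `zebu` function):

```{r}
# Analytical two-sided p-value for a chi-squared residual,
# assuming a standard normal null distribution (the approach used by chisqtest)
residual_pvalue <- function(r) 2 * (1 - pnorm(abs(r)))
residual_pvalue(1.96)  # approximately 0.05
```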
As these local association measures involve conducting multiple statistical tests, it is advisable to apply corrections for multiple testing, such as the method proposed by Benjamini and Hochberg.
#### Derivation of Multivariate Forms
Multivariate association measures may help identify complex association relationships that cannot be detected with bivariate association measures alone. For example, in the XOR gate, the output of the gate is not associated with either of the two inputs individually [@jakulin_analyzing_2003]. The association is only revealed when the two inputs and the output are taken together.
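The XOR example can be checked directly from its truth table: each input taken alone gives no information about the output, while both inputs together determine it completely.

```{r}
# XOR truth table: pairwise tables are uniform (no bivariate association),
# but the output is fully determined by the two inputs jointly
xor_df <- expand.grid(in1 = 0:1, in2 = 0:1)
xor_df$out <- as.integer(xor(xor_df$in1 == 1, xor_df$in2 == 1))
table(xor_df$in1, xor_df$out)  # each cell appears once: in1 alone is uninformative
```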
To derive multivariate forms of these local association measures, we assume that events are mutually independent. This means that for $M$ random variables $X_1, \ldots, X_M$, independence is defined by: $p(x_1, \ldots, x_M) = \prod_{i=1}^{M} p(x_i)$. We can thus define the following measures.
\[
\begin{aligned}
D(x_1, \ldots, x_M) & = p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) \\
pmi(x_1, \ldots, x_M) & = - [h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) ]
\end{aligned}
\]
By multiplying $D(x_1, \ldots, x_M)$ by $\sqrt{N}$ and dividing it by the square root of the expected probability, we obtain multivariate chi-squared residuals
\[
r_{\chi}(x_1, \ldots, x_M) = \sqrt{N} \; \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \sqrt{\prod_{i=1}^{M} p(x_i)} }
\]
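As a hypothetical numerical example, a cell with joint probability 0.2, marginal probabilities of 0.5 for each of three variables, and $N = 100$ observations gives (`r_chisq` is an illustrative transcription of the formula above, not `zebu`'s internal code):

```{r}
# Multivariate chi-squared residual for a single cell
r_chisq <- function(p_joint, p_margins, N) {
  sqrt(N) * (p_joint - prod(p_margins)) / sqrt(prod(p_margins))
}
r_chisq(0.2, c(0.5, 0.5, 0.5), N = 100)  # approximately 2.12
```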
To obtain a multivariate measure of Ducher's $Z$, we need to find the bounds of the observed probability $p(x_1, \ldots, x_M)$. We know that the upper bound is the minimal marginal probability. Additionally, we derive a formula for the lower bound (proof at the end of the section). This leads to the following bounds.
\[
\max[0, \sum_{i=1}^M p(x_i) - (M - 1)] \le p(x_1, \ldots, x_M) \le \min[p(x_1), \ldots, p(x_M)]
\]
We thus propose the following multivariate form of Ducher's $Z$.
\[
Z(x_1, \ldots, x_M) =
\begin{cases}
\frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \min[p(x_1), \ldots, p(x_M)] - \prod_{i=1}^{M} p(x_i) }
& D(x_1, \ldots, x_M) > 0 \\
\\
\frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{\prod_{i=1}^{M} p(x_i) - \max[0, \sum_{i=1}^M p(x_i) - (M - 1)]}
& D(x_1, \ldots, x_M) < 0 \\
\\
0
& D(x_1, \ldots, x_M) = 0
\end{cases}
\]
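This case distinction can be transcribed for a single cell as follows (an illustrative sketch with made-up probabilities, not `zebu`'s internal code; `ducher_z` is a hypothetical name):

```{r}
# Multivariate Ducher's Z for one cell, following the case distinction above
ducher_z <- function(p_joint, p_margins) {
  M <- length(p_margins)
  expected <- prod(p_margins)
  d <- p_joint - expected
  if (d > 0) {
    d / (min(p_margins) - expected)              # normalize by upper bound
  } else if (d < 0) {
    d / (expected - max(0, sum(p_margins) - (M - 1)))  # normalize by lower bound
  } else {
    0
  }
}
ducher_z(0.2, c(0.5, 0.5, 0.5))   # positive local association: 0.2
ducher_z(0.05, c(0.5, 0.5, 0.5))  # negative local association: -0.6
```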
For pointwise mutual information, the normalization technique suggested by @bouma_normalized_2009 is not bounded by 1 for more than two variables. To solve this, we suggest the following normalization scheme which we call $npmi_2$.
\[
npmi_2(x_1, \ldots, x_M) =
\begin{cases}
\frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{ \min[h(x_1), \ldots, h(x_M)] - \sum_{i=1}^{M} h(x_i) }
& pmi(x_1, \ldots, x_M) > 0 \\
\\
\frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{h(x_1, \ldots, x_M)}
& pmi(x_1, \ldots, x_M) < 0 \\
\\
0
& pmi(x_1, \ldots, x_M) = 0
\end{cases}
\]
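Analogously, $npmi_2$ for one cell can be sketched with $h(x) = -\log p(x)$ (illustrative values and a hypothetical `npmi2` name, not `zebu`'s internal code):

```{r}
# npmi_2 for one cell, where h(x) = -log(p(x)) is the self-information
npmi2 <- function(p_joint, p_margins) {
  h_joint <- -log(p_joint)
  h_sum <- sum(-log(p_margins))
  pmi <- -(h_joint - h_sum)
  if (pmi > 0) {
    (h_joint - h_sum) / (min(-log(p_margins)) - h_sum)
  } else if (pmi < 0) {
    (h_joint - h_sum) / h_joint
  } else {
    0
  }
}
npmi2(0.2, c(0.5, 0.5, 0.5))  # positive and bounded by 1
```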
#### Proof of the Lower Bound Formula
Using induction and the inclusion-exclusion principle, we give a formula for the lower bound of the observed intersection probability of $M$ events.
\[
\min[ p(x_1, \ldots, x_M) ] = \max[0, \sum_{i=1}^M p(x_i) - (M - 1)]
\]
We first show that this is true for the base case $M = 2$, where the bound reduces to $p(x_1) + p(x_2) - 1 \le p(x_1, x_2)$. We proved this using the inclusion-exclusion principle in the section where we [derive a bivariate form of Ducher's Z](#section-lam1).
We now show that the induction step is true: assume that the formula holds for $M$ variables. For $M+1$ variables, the inclusion-exclusion principle tells us that:
\[
\begin{aligned}
p(x_1, \ldots, x_{M+1})
&= p(\{\cap_{i=1}^M x_i \} \cap x_{M+1}) \\
&= p(\cap_{i=1}^M x_i) + p(x_{M+1}) - p(\{\cap_{i=1}^M x_i \} \cup x_{M+1})
\end{aligned}
\]
By the induction hypothesis, the lower bound of $p(\cap_{i=1}^M x_i)$ is $\sum_{i=1}^M p(x_i) - (M - 1)$, and we know that the upper bound of $p(\{\cap_{i=1}^M x_i \} \cup x_{M+1})$ is one. Substituting these values in the last line leads to
\[
\min[ p(x_1, \ldots, x_{M+1}) ] = \sum_{i=1}^{M+1} p(x_i) - M
\]
However, given that probabilities cannot be lower than zero, we can write the following equation, which completes the proof.
\[
\min[ p(x_1, \ldots, x_{M+1}) ] = \max[0, \sum_{i=1}^{M+1} p(x_i) - M]
\]
## User's Guide - An Example with Simulated Data
Once R is installed, the first step is to install the `zebu` package. You can install the released version from CRAN:
```{R eval=FALSE}
install.packages("zebu")
```
We can then load the `zebu` R package.
```{r}
library(zebu)
```
### Simulating the Dataset
To illustrate local association measures and the `zebu` package, we'll simulate a small culinary dataset.
Each row corresponds to a client of a restaurant. We record the choices made by the client. There are three choices for each of the plates: starters, main dish, and dessert.
The clients choose the starter with equal probability. The choice of each following dish depends only on the previous dish: clients tend to avoid a dish containing an ingredient that was already in the previous one. We define these probabilities hereunder.
```{r}
starter_prob <- c("Tomato Mozzarella Salad" = 1/3, "Rice Tuna Salad" = 1/3, "Lentil Salad" = 1/3)
starter_prob
```
```{r}
main_given_starter_prob <- matrix(c(5/11, 1/11, 5/11,
                                    5/11, 5/11, 1/11,
                                    1/11, 5/11, 5/11),
                                  3, 3, byrow = TRUE)
rownames(main_given_starter_prob) <- names(starter_prob)
colnames(main_given_starter_prob) <- c("Sausage and Lentil Stew", "Pizza Margherita", "Pilaf Rice")
main_given_starter_prob
```
```{r}
dessert_given_main <- matrix(c(2/6, 2/6, 2/6,
7/12, 1/12, 2/6,
1/12, 7/12, 2/6),
3, 3, byrow = TRUE)
rownames(dessert_given_main) <- colnames(main_given_starter_prob)
colnames(dessert_given_main) <- c("Rice Pudding", "Apple Pie", "Fruit Salad")
dessert_given_main
```
We now simulate a dataset of 1000 clients given these probabilities.
```{r}
set.seed(0)
sample_size <- 1000
df <- t(sapply(seq_len(sample_size), function(i) {
starter <- sample(names(starter_prob), size = 1, prob = starter_prob)
main <- sample(colnames(main_given_starter_prob), size = 1, prob = main_given_starter_prob[starter, ])
dessert <- sample(colnames(dessert_given_main), size = 1, prob = dessert_given_main[main, ])
c(Starter = starter, Main = main, Dessert = dessert)
}))
df <- as.data.frame(df)
```
```{r}
head(df)
```
We look at the contingency table.
```{r}
table(df)
```
### Bivariate Association
The local (and global) association between these variables can be estimated using the `lassie` function. This function takes at least one argument: a `data.frame`. Columns are selected using the `select` argument (column names or numbers). Variables are assumed to be categorical; continuous variables have to be specified using the `continuous` argument, and the number of discretization bins with the `breaks` argument (as in the `cut` function). The local association measure that we use here is Ducher's Z, specified by setting the `measure` argument equal to `"z"`. We'll look at the relationship between the main dish and the dessert.
```{r}
las <- lassie(df, select = c("Main", "Dessert"), measure = "z")
```
The `permtest` function assesses the significance of local (and global) association using a permutation test. The number of iterations is specified by `nb` and the adjustment method of p-values for multiple comparisons by `p_adjust` (as in the `p.adjust` function).
```{r}
las <- permtest(las,
nb = 5000,
p_adjust = "BH")
```
The `lassie` function returns a `lassie` S3 object; `permtest` returns a `permtest` object that also inherits from `lassie`. `lassie` objects can be visualized using the `plot` and `print` methods. The `plot` function returns a heatmap with the local association values and p-values displayed between parentheses.
```{r plot-local-association}
print(las)
plot(las)
```
Alternatively, for two dimensional chi-squared analysis, significance can be estimated using the `chisqtest` function.
Results can be saved in CSV format using `write.lassie`. To access the documentation of these functions, please type `help("print.lassie")`, `help("plot.lassie")` and `help(write.lassie)` in the R console.
In this example, we can see that the global association between the two variables is statistically significant. However, there is a local association only for certain combinations of values. More specifically, knowing that a client took the sausage and lentil stew does not convey information about which dessert they will take, and likewise for the fruit salad. *A significant global association can hide non-significant local associations.*
### Multivariate Association
The number of variables that the `zebu` package can handle is not limited. We can use the `lassie` function to assess the association between the three simulated variables.
In this case, we obtain a multidimensional local association `array`. Because of this, results cannot be displayed as a tile plot, and the `plot` method is not available. The `print` method allows visualizing results by melting the `array` into a `data.frame`, sorted by default by decreasing local association.
```{r}
las2 <- lassie(df, measure = "z")
las2 <- permtest(las2, nb = 5000)
print(las2, what_sort = "local_p", decreasing = FALSE)
```
In this case, we see that there is no significant global association. However, we see that for certain combinations of variables, there is a significant local association. *A non-significant global association can hide a significant local association.*
## Future Research and Development
Local association measures arose from empirical research. Although they have proven useful in diverse applications, theoretical studies of their mathematical properties are sparse. A more theoretical approach to these measures could be of interest, for example, determining their theoretical null distribution. Also, we have assumed mutual independence of events for the multivariate association measures. This assumption may be too stringent for certain variables, and usage of other independence models, such as conditional independence, may prove worthwhile.
In `zebu`, discretization is a necessary step for studying continuous variables. We have restricted ourselves to simple discretization methods: equal-width and user-defined. Other discretization algorithms exist [@dash_comparative_2011] and may be better suited to the computation of association measures. Moreover, kernel methods could also be used to handle continuous variables better. Secondly, probabilities are estimated by the frequentist maximum-likelihood procedure, which requires sufficiently large datasets. Unfortunately, in fields such as the health sciences, datasets are often sparse. Bayesian estimation methods are more robust to small sample sizes because they do not rely on asymptotic assumptions and allow the integration of prior knowledge [@wilkinson_bayesian_2007]. Such an implementation may also prove to be of interest.
## References