--- title: "zebu: Local Association Measures" author: - "Olivier M. F. Martin" - "Michel Ducher" date: "`r Sys.Date()`" output: rmarkdown::html_document bibliography: bibliography.bib abstract: |- Association measures can be local or global. Local association measures quantify the association between specific values of random variables (*e.g.* chi-squared residuals). Global association measures yield a single value used to summarize the association for all values taken by random variables (*e.g.* chi-squared). Classical data analysis has focused on global association and overlooked local association. Consequently, software presently available only allows computation of global association measures. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association. \ The `zebu` R package allows estimation of local association measures and implements local association subgroup analysis. It is of interest to a wide range of scientific disciplines such as health and computer sciences and can be used by anyone with a basic knowledge of the R language. It is available in the CRAN and its source code is available at https://github.com/oliviermfmartin/zebu. \ Keywords: measure of association, statistical independence, local association, pointwise mutual information, Lewontin's D, Ducher’s Z, chi-squared residuals. vignette: > %\VignetteIndexEntry{"zebu: Local Association Measures"} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r opts = TRUE, setup = TRUE, include = FALSE} knitr::opts_chunk$set(echo = TRUE, comment = "") ``` ## Summary 1. [Introduction](#section-introdution) 2. [Background on Association and Independence](#section-background) 3. [Local Association Measures](#section-lam) - [Derivation of Bivariate Forms](#section-lam1) - [Global Association](#section-lam2) - [Statistical Significance Tests](#section-lam3) - [Derivation of Multivariate Forms](#section-lam3) 4. [User's Guide - An Example with Simulated Data](#section-ug) - [Simulating the Dataset](#section-ug1) - [Bivariate Association](#section-ug2) - [Multivariate Association](#section-ug3) 5. [Future Research and Development](#section-future) 6. [References](#section-references)
## Introduction Association measures can be local or global [@van_de_cruys_two_2011]. Local association measures quantify the association between specific values of random variables. In the case of a contingency table, they yield one value for each cell. An example is chi-squared residuals that are computed when constructing a chi-squared test. On the other hand, global association measures yield a single value used to summarize the association for all values taken by random variables. An example is the chi-squared statistic, the sum of squared residuals [@sheskin_handbook_2007]. Most often, we are only concerned with the global association and overlook local association. For example, analysis of chi-squared residuals is uncommon practice when compared to the chi-square independence test. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association [@anselin_lisa_1995]. Accordingly, analysis of association should not limit itself to the global perspective. Indeed, the association between two variables can depend on their values. For example, in threshold mechanisms, variables are only associated with each other when one takes values above a certain critical level. In this case, local association measures allow pinpointing values for which variables are associated. Moreover, the existence of an association between two variables may depend on the value of a third variable. For example, the effect of a drug will depend on the patient's sensibility to the drug. The local association between drug intake and recovery will not be the same for patients that are sensitive than for those that are resistant to the drug. The rest of the vignette is organized as follows. We first give the reader the necessary intuition and mathematical background about global and local associations. This leads to the description of chi-square residuals, Lewontin's $D$ [@Lewontin1964], Ducher's $Z$ [@ducher_statistical_1994], and pointwise mutual information [@van_de_cruys_two_2011]. We also introduce a multivariate and normalized measure of local association. Subsequently, we illustrate the usage of local association measures using the `zebu` R package. The vignette ends with a discussion about future development and research.
## Background on Association and Independence Throughout the vignette, we will suppose that all random variables are discrete and write them in capital letters, such as $A$ and $B$. Lower letters, such as $a$ and $b$, will denote possible values taken by these random variables (*i.e.* events). One way to think about a statistical association is as events co-occurring. For example, if event $a$ always occurs with event $b$, then these events are said to be associated. An intuitive measure of association could be the joint probability: $p(a, b)$, the long-term frequency of events showing up together. However, this measure fails if $a$ or $b$ is a rare event. Indeed, joint probabilities are always as small as its individual events are rare: $p(a, b) \leq \min p(a), p(b)$. As a consequence, it is necessary to compare *observed* probabilities $p(a, b)$ to *expected* probabilities in which the variables are considered independent. The expected probability, if events are independent, is the factor of marginalized probabilities of events: $p(a) \, p(b)$. Independence is then defined by the following mathematical relation, $p(a, b) = p(a) \, p(b)$, and local association measures are defined to be equal to zero. Independence implies that knowing one or more variables does not give us any information about the others. This is what we are not interested in. It is, however, possible to define two cases where the former equality does not hold: co-occurrence and mutual exclusivity. Co-occurrence is defined as events showing up more often than expected: $p(a, b) > p(a) \, p(b)$ and local association measures are positive. Mutual exclusivity is defined as events showing up less often than expected: $p(a, b) < p(a) \, p(b)$ and local association measures are negative. Statistical independence is, however, not the only manner to construct an association measure. Other possibilities are based on the proportion of explained variance such as Pearson's r. These former measures are parametric and suppose linear or at least monotone relationships between variables. Although intuitive and convenient, this assumption is not always justified. Measures based on statistical independence provide a non-parametric alternative that can detect non-linear relationships.
## Local Association Measures
#### Derivation of Bivariate Forms For two random variables, $A$ and $B$, we can estimate the local association for each combination of events $A = a$ and $B = b$. This is accomplished by comparing the observed from the expected probability of events $a$ and $b$. If these probabilities are equal, then events $a$ and $b$ are independent. If not, these events are associated; the sign of the measure indicates the orientation of the relationship, and the absolute value indicates its strength. There are different measures to compare observed and expected probabilities, for example, by using subtraction and division. Hereunder, we define the difference or Lewontin's $D$ [@Lewontin1964] and the pointwise mutual information $pmi$ [@van_de_cruys_two_2011]. To simplify notation, and to show similarities between local association measures, we define $h(a) = - \log p(a)$ as the self-information of $a$. \[ \begin{aligned} D(a, b) & = p(a, b) - p(a) \, p(b) \\ pmi(a, b) & = \log \frac{p(a, b)} {p(a) p(b)} = - (h(a, b) - h(a) - h(b)) \end{aligned} \] The bounds of these two measures depend on the frequency of events, which makes it difficult to compare local association values for different combinations of events. For this reason, it is desirable to express association relative to the frequency of events. One common way to do this is using chi-squared residuals $r_{\chi}$ as follows where $N$ is the sample size. \[ r_{\chi}(a,b) = \sqrt{N} \; \frac{p(a, b) - p(a) \, p(b)}{\sqrt{p(a) \, p(b)}} \] We may wish to normalize local association so that values are between -1 and 1 included. This can be done by using dividing the non-normalized values by their minimal or maximal values. To identify the theoretical minimal and maximal values of $D$, we will find the bounds of the observed bivariate probability $p(a, b)$ as a function of the marginal probabilities $p(a)$ and $p(b)$. Using the inclusion-exclusion principle, we know that: \[ p(a, b) = p(a \cap b) = p(a) + p(b) - p(a \cup b) \] The intersection probability $p(a, b)$ will be maximized when the union probability $p(a \cup b)$ is equal to zero and minimized when the union probability will be equal to one. \[ p(a) + p(b) - 1 \le p(a, b) \le p(a) + p(b) \] Given the intersection probability can not be smaller than zero and can not be larger than the smallest marginal probability, we have: \[ \max[0, \, p(a) + p(b) - 1] \le p(a, b) \le min[p(a), \, p(b)] \] Using this result, we can divide $D$ by its theoretical minimal or maximal value which leads to Lewontin's $D'$ [@Lewontin1964]. However, this removes the sign of the association. To preserve the sign, in the case where Lewontin's $D$ is negative, we divide it by the negative theoretical minimal value. We call this measure Ducher's $Z$. It should however be noted that the original definition of Ducher's $Z$ did not probably consider the lower bound of the intersection probability as we do here [@ducher_statistical_1994]. \[ Z(a, b) = \begin{cases} \frac{ p(a, b) - p(a) \, p(b) }{ \min[p(a), \, p(b)] - p(a) \, p(b) } & D(a, b) > 0 \\ \\ \frac{ p(a, b) - p(a) \, p(b) }{p(a) p(b) - \max[0, \, p(a) + p(b) - 1]} & D(a, b) < 0 \\ \\ 0 & D(a, b) = 0 \end{cases} \] Normalization of case of $pmi$ is more subtle because $pmi(a, b)$ tends to $\infty$ when $p(a, b)$ tends to 0. Nonetheless, dividing $pmi(a, b)$ by $- h(a, b)$ solves this problem by making $npmi(a, b)$ tend to -1 when $p(a, b)$ tends to 0 and equal to 1 when $p(a, b) = \min[p(a), p(b)]$ [@bouma_normalized_2009]. \[ npmi(a, b) = \frac{ pmi(a, b) }{- h(a, b) } = \frac{ h(a, b) - h(a) - h(b) }{ h(a, b) } \] The `zebu` package includes a function called `lassie` allowing estimation of Lewontin's $D$, Ducher's $Z$, $pmi$, $npmi$, and $r_{\chi}$.
#### Global Association Global association measures yield a single value used to summarize the association for all values taken by the random variables. For example, mutual information is computed as the sum for all events of their observed probability times their pointwise mutual information. Most global association measures in `zebu` are defined likewise. \[ \begin{aligned} GD(A, B) &= \sum_{a, b} p(a, b) D(a, b) \\ GZ(A, B) &= \sum_{a, b} p(a, b) Z(a, b) \\ MI(A, B) &= \sum_{a, b} p(a, b) pmi(a, b) \\ NMI(A, B) &= \sum_{a, b} p(a, b) npmi(a, b) \\ \end{aligned} \] The global association measure related to chi-squared residuals is the chi-squared $\chi^2$. It is defined as the sum of its squared residuals. \[ \chi^2 = \sum_{a, b} r_{\chi}(a,b)^2 \]
#### Statistical Significance Tests Distinguishing the strength of association from its statistical significance is important. Indeed, a strong association can be non-significant (*e.g.* some physical law with small sample size) and a weak association can be significant (*e.g.* epidemiological risk factor with big sample size). Significance can be accessed using p-values estimated using the theoretical null distribution or by resampling techniques [@sheskin_handbook_2007]. Because the theoretical null distribution of local association measures is unknown, the `zebu` package resorts to estimating p-values by a permutation test. This can be undertaken using the `permtest` function of the package. The null hypothesis $H_0$ being tested is that the association measure $L$ is equal to 0, that is, there is no association. The observed association is $L_{obs}$ and the permuted associations are denoted by the set $L_{perm}$. Moreover, we write $\#(\ldots)$ as the number of times and $|\ldots|$ as the absolute value. The two-sided p-value can then be estimated as follows. \[ p = \frac{\#(|L_{obs}| < |L_{perm}|)}{\#(L_{perm})} \] With chi-squared residuals in a two-dimensional setting, an alternative analytical method for estimating the p-values can be employed as implemented by the `chisqtest` function. This function calculates two-sided p-values for each local chi-squared residual, assuming that these residuals are distributed according to a standard normal distribution. Thus, the two-sided p-value $p$ can be computed as follows where $\Phi(\ldots)$ is the cumulative distribution function of the standard normal distribution, and $L_{obs}$ denotes the observed chi-squared residuals: \[ p = 2 \times (1 - \Phi(|L_{obs}|)) \] As these local association measures involve conducting multiple statistical tests, it is advisable to apply corrections for multiple testing, such as the method proposed by Benjamini and Hochberg.
#### Derivation of Multivariate Forms Multivariate association measures may help identify complex association relationships that cannot be detected only with bivariate association measures. For example, in the XOR gate, the output of the gate is not associated with any of the two inputs individually [@jakulin_analyzing_2003]. The association is only revealed when the two inputs and the output are taken together. To derive multivariate forms of these local association measures, we assume that events are mutually independent. This means that for $M$ random variables $X_1, \ldots, X_M$, independence is defined by: $p(x_1, \ldots, x_M) = \prod_{i=1}^{M} p(x_i)$. We can thus define the following measures. \[ \begin{aligned} D(x_1, \ldots, x_M) & = p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) \\ pmi(x_1, \ldots, x_M) & = - [h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) ] \end{aligned} \] By dividing $D(x_1, \ldots, x_M)$ by the expected probability we obtain multivariate chi-squared residuals \[ r_{\chi}(x_1, \ldots, x_M) = \sqrt{N} \; \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \sqrt{\prod_{i=1}^{M} p(x_i)} } \] To obtain a multivariate measure of Ducher's $Z$, we need to find the bounds of the observed probability $p(x_1, \ldots, x_M)$. We know that the upper bound will be the minimal marginal probability. Additionally, we find a formula to express the lower bound (proof at the end of the section). This leads to the following bounds. \[ \max[0, -M - 1 + \sum_{i=1}^M p(x_i)] \le p(x_1, \ldots, x_M) \le \min[x_1, \ldots, x_M]$. \] We thus propose the following multivariate form of Ducher's $Z$. \[ Z(x_1, \ldots, x_M) = \begin{cases} \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \min[p(x_1), \ldots, p(x_M)] - \prod_{i=1}^{M} p(x_i) } & D(x_1, \ldots, x_M) > 0 \\ \\ \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{\prod_{i=1}^{M} p(x_i)- \max[0, - M - 1 + \sum_{i=1}^M p(x_i)]} & D(x_1, \ldots, x_M) < 0 \\ \\ 0 & D(x_1, \ldots, x_M) = 0 \end{cases} \] For pointwise mutual information, the normalization technique suggested by @bouma_normalized_2009 is not bounded by 1 for more than two variables. To solve this, we suggest the following normalization scheme which we call $npmi_2$. \[ npmi_2(x_1, \ldots, x_M) = \begin{cases} \frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{ \min[h(x_1), \ldots, h(x_M)] - \sum_{i=1}^{M} h(x_i) } & pmi(x_1, \ldots, x_M) > 0 \\ \\ \frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{h(x_1, \ldots, x_M)} & pmi(x_1, \ldots, x_M) < 0 \\ \\ 0 & pmi(x_1, \ldots, x_M) = 0 \end{cases} \] #### Proof of lower bound formula Using induction and the inclusion-exclusion principle we give a formula for the lower bound of the observed intersection probability of $M$ events. \[ \min[ p(x_1, \ldots, x_M) ] = \max[0, -M - 1 + \sum_{i=1}^M p(x_i)] \] We first show that this is true for the base case $M=2$. In this case, $p(x_1) + p(x_2) -1 \le p(x_1, x_2)$. We proved this using the inclusion-exclusion principle in the section where we [derive a bivariate form of Ducher's Z](#section-lam1). We now show that the induction step is true: let's assume that this formula is true for $M$ variables. For $M+1$ variables, the inclusion-exclusion principle tells us that: \[ \begin{align} p(x_1, \ldots, x_{M+1}) &= p(\{\cap_{i=1}^M x_i \} \cap x_{M+1}) \\ &= p(\cap_{i=1}^M x_i) + p(x_{M+1}) - p(\{\cap_{i=1}^M x_M \} \cup x_{M+1}) \\ \end{align} \] We assumed that the lower bound $p(\cap_{i=1}^M x_i)$ is $-M - 1 + \sum_{i=1}^M p(x_i)$ and we know that the upper bound of $p(\{\cap_{i=1}^M x_M \} \cup x_{M+1})$ is one. Replacing these values in the last line leads to \[ \min[ p(x_1, \ldots, x_{M+1}) ] = -(M+1) - 1 + \sum_{i=1}^{M+1} p(x_i) \] However, given that probabilities can not be lower than zero, we can write the following equation which completes the proof. \[ \min[ p(x_1, \ldots, x_{M+1}) ] = \max[0, -(M+1) - 1 + \sum_{i=1}^{M+1} p(x_i)] \]
## User's Guide - An Example with Simulated Data Once R is installed, the first step is to install the `zebu` package. You can install the released version from CRAN ```{R eval=FALSE} install.packages("zebu") ``` We can then load the `zebu` R package. ```{r} library(zebu) ```
### Simulating the Dataset To illustrate local association measures and the `zebu` package, we'll simulate a small culinary dataset. Each row corresponds to a client of a restaurant. We record the choices made by the client. There are three choices for each of the plates: starters, main dish, and dessert. The clients take the starter with equal probability. The choice of the following dish depends only on the previous dish. The clients tend to avoid a dish with an ingredient that had in the previous dish. We define these probabilities hereunder. ```{r} starter_prob <- c("Tomato Mozzarella Salad" = 1/3, "Rice Tuna Salad" = 1/3, "Lentil Salad" = 1/3) starter_prob ``` ```{r} main_given_starter_prob <- matrix(c(5/11, 1/11, 5/11, 5/11, 5/11, 1/10, 1/11, 5/11, 5/11), 3, 3, byrow = TRUE) rownames(main_given_starter_prob) <- names(starter_prob) colnames(main_given_starter_prob) <- c("Sausage and Lentil Stew", "Pizza Margherita", "Pilaf Rice") main_given_starter_prob ``` ```{r} dessert_given_main <- matrix(c(2/6, 2/6, 2/6, 7/12, 1/12, 2/6, 1/12, 7/12, 2/6), 3, 3, byrow = TRUE) rownames(dessert_given_main) <- colnames(main_given_starter_prob) colnames(dessert_given_main) <- c("Rice Pudding", "Apple Pie", "Fruit Salad") dessert_given_main ``` We now simulate a dataset of 1000 clients given these probabilities. ```{r} set.seed(0) sample_size <- 1000 df <- t(sapply(seq_len(sample_size), function(i) { starter <- sample(names(starter_prob), size = 1, prob = starter_prob) main <- sample(colnames(main_given_starter_prob), size = 1, prob = main_given_starter_prob[starter, ]) dessert <- sample(colnames(dessert_given_main), size = 1, prob = dessert_given_main[main, ]) c(Starter = starter, Main = main, Dessert = dessert) })) df <- as.data.frame(df) ``` ```{r} head(df) ``` We look at the contingency table. ```{r} table(df) ```
### Bivariate Association The local (and global) association between these variables can be estimated using the `lassie` function. This function takes at least one argument: a `data.frame`. Columns are selected using the `select` arguments (column names or numbers). Variables are assumed to be categorical; continuous variables have to be specified using the `continuous` argument and the number of discretization bins with the `breaks` argument (as in the `cut` function). The local association measure that we use here is Ducher's Z as specified by setting the `measure` argument equal to `"z"`. We'll look at the relationship between the main dish and the dessert. ```{r} las <- lassie(df, select = c("Main", "Dessert"), measure = "z") ``` The `permtest` function accesses the significance of local (and global) association using a permutation test. The number of iterations is specified by `nb` and the adjustment method of p-values for multiple comparisons by `p_adjust` (as in the `p.adjust` function). ```{r} las <- permtest(las, nb = 5000, p_adjust = "BH") ``` The `lassie` and `permtest` functions return a `lassie` S3 object, as well as `permtest` for `permtest`. `lassie` objects can be visualized using the `plot` and `print` methods. The `plot` function returns a heatmap with the local association and p-values displayed between parenthesis. ```{r plot-local-association} print(las) plot(las) ``` Alternatively, for two dimensional chi-squared analysis, significance can be estimated using the `chisqtest` function. Results can be saved in CSV format using `write.lassie`. To access the documentation of these functions, please type `help("print.lassie")`, `help("plot.lassie")` and `help(write.lassie)` in the R console. In this example, we can see that the global association between the two variables is statistically significant. However, we see that there is only a local association between these two variables only for certain combinations. More specifically, knowing that a client took the lentils and sausages does not convey information about what dessert s/he will take, and likewise for the fruit salad. *A significant global association can hide a non-significant local association.*
### Multivariate Association The number of variables that can be handled in the `zebu` package is not limited. We can use the `lassie` function to access the association between the three simulated variables. In this case, we obtain a multidimensional local association `array`. Because of this, results cannot be plotted as a tile plot; the `plot` method is not available. The `print` method allows visualizing results by melting the `array` into a `data.frame`, by default sorted by decreasing local association. ```{r} las2 <- lassie(df, measure = "z") las2 <- permtest(las2, nb = 5000) print(las2, what_sort = "local_p", decreasing = FALSE) ``` In this case, we see that there is no significant global association. However, we see that for certain combinations of variables, there is a significant local association. *A non-significant global association can hide a significant local association.*
## Future Research and Development Local association measures are issued from empirical research. Although these have proven their interest in diverse applications, theoretical studies of their mathematical properties are sparse. A more theoretical approach to these measures could be of interest. For example, by determining the theoretical null distribution of these measures. Also, we have assumed mutual exclusivity of events for the multivariate association measures. This assumption may be too stringent for certain variables and usage of other independence models such as conditional independence may prove to be worthwhile. In `zebu`, discretization is a necessary step for studying continuous variables. We have restrained ourselves to simple discretization methods: equal-width and user-defined. Other discretization algorithms exist [@dash_comparative_2011] and may be more adapted for the computation of association measures. Moreover, kernel methods could also be used to handle continuous variables better. Secondly, estimation of probabilities is done from the frequentist maximum-likelihood procedure which requires sufficiently large datasets. Unfortunately, in fields such as health sciences, datasets are sparse. Bayesian estimation methods are more robust to small sample sizes by not relying on asymptomatic assumptions and by allowing integration of prior knowledge [@wilkinson_bayesian_2007]. Such an implementation may also prove to be of interest.
## References