---
title: "Joint Entropies and Association Graphs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Joint Entropies and Association Graphs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(out.width = "100%",
  cache = FALSE
)
```
In the section on [univariate, bivariate and trivariate entropies](univariate_bivariate_trivariate.html),
we saw that the bivariate entropy of two variables $X$ and $Y$ is bounded according to 
$$H(X) \leq H(X,Y) \leq H(X)+H(Y) \ .$$
The increment between the upper bound and the bivariate entropy is equal to the joint entropy given by
$$J(X,Y) = H(X)+H(Y)-H(X,Y)$$
and is a non-negative measure of dependence or association between variable $X$ and $Y$. It is equal to 0 if and only if $X$ and $Y$ are stochastically independent  $X\bot Y$ such that the probability of any bivariate outcome is the product of the probabilities of the univariate outcomes, i.e. $p(x,y)=p(x,+)p(+,y)$.

Trivariate entropies for triples of variables $X,Y,Z$ are bounded by
$$ H(X,Y) \leq H(X,Y,Z) \leq H(X,Z) + H(Y,Z) - H(Z) $$
The increment between the upper bound and the trivariate entropy is equal to the expected joint entropy given by
$$EJ(X,Y|Z) = H(X,Z)+H(Y,Z)-H(Z)-H(X,Y,Z)$$
which is non-negative and equal to 0 if and only if we have conditional independence of the form $X\bot Y | Z$. Thus, it can be used to measure the deviation from conditional independence.

The joint entropies $J(X,Y)$ for all pairs of variables $X$ and $Y$ are used to check dependence structure by looking at a sequence of association graphs constructed with nodes representing the variables and links between any pair of nodes corresponding to two variables with joint entropy above a chosen threshold. By successively lowering the threshold from the maximum joint entropy to smaller occurring values, the sequence of graphs get more and more links. Connected components that are cliques represent dependent subsets of variables, and different components represent independent subsets of variables. Conditional independence between subsets of variables can be identified by omitting the subset corresponding to the conditioning variables. A threshold that gives a graph with reasonably many small independent or conditionally independent subsets of variables can be considered to represent a multivariate model for further testing.

________________________________________________________________________________

## Example: joint entropies and association graphs
```{r load_library, eval=TRUE}
library(netropy)
```

We create a dataframe `dyad.var` consisting of dyad variables as described and created in [variable domains and data editing](variable_domains.html). Similar analyses can be performed on observed and/or transformed dataframes with vertex or triad variables.

```{r load_data, eval=TRUE, include=FALSE}
data(lawdata) 
adj.advice <- lawdata[[1]]
adj.friend <- lawdata[[2]]
adj.cowork <-lawdata[[3]]
df.att <- lawdata[[4]]
```
```{r edit_data, eval=TRUE, include=FALSE, results ='hide'}
att.var <-
  data.frame(
    status   = df.att$status-1,
    gender   = df.att$gender,
    office   = df.att$office-1,
    years    = ifelse(df.att$years<=3,0,
                      ifelse(df.att$years<=13,1,2)),
    age      = ifelse(df.att$age<=35,0,
                      ifelse(df.att$age<=45,1,2)),
    practice = df.att$practice,
    lawschool= df.att$lawschool-1
    )
dyad.status    <- get_dyad_var(att.var$status, type = 'att')
dyad.gender    <- get_dyad_var(att.var$gender, type = 'att')
dyad.office    <- get_dyad_var(att.var$office, type = 'att')
dyad.years     <- get_dyad_var(att.var$years, type = 'att')
dyad.age       <- get_dyad_var(att.var$age, type = 'att')
dyad.practice  <- get_dyad_var(att.var$practice, type = 'att')
dyad.lawschool <- get_dyad_var(att.var$lawschool, type = 'att')
dyad.cwk    <- get_dyad_var(adj.cowork, type = 'tie')
dyad.adv    <- get_dyad_var(adj.advice, type = 'tie')
dyad.frn    <- get_dyad_var(adj.friend, type = 'tie')
dyad.var <-
  data.frame(cbind(status    = dyad.status$var,
                  gender    = dyad.gender$var,
                  office    = dyad.office$var,
                  years     = dyad.years$var,
                  age       = dyad.age$var,
                  practice  = dyad.practice$var,
                  lawschool = dyad.lawschool$var,
                  cowork    = dyad.cwk$var,
                  advice    = dyad.adv$var,
                  friend    = dyad.frn$var)
                  )
```
```{r show_data, eval=TRUE, include=TRUE, results ='markup'}
head(dyad.var)
```

The function `joint_entropy()` computes the joint entropies between all pairs of variables in a given dataframe and returns a list consisting of the upper triangular joint entropy matrix (univariate entropies in the diagonal) and a dataframe giving the frequency distributions of unique joint entropy values. A function argument specifies the precision given in number of decimals for which the frequency distribution of unique entropy values is created (default is 3). Applying the function on the  dataframe `dyad.var` with two decimals:
```{r joint_ent, eval=TRUE, include=TRUE, results ='markup'}
J <- joint_entropy(dyad.var, 2)
J$matrix
J$freq
```
As seen, the strongest association is between the variables `status` and `years` with joint entropy values of 0.79. We have independence (joint entropy value of 0) between two pairs of variables: (`status`,`practice`), (`practise`,`gender`), (`cowork`,`gender`),and  (`cowork`,`lawschool`).

These results can be illustrated in a association graph using the function `assoc_graph()` which returns a `ggraph` object in which nodes represent variables and links represent strength of association (thicker links indicate stronger dependence). To use the function we need to load the `ggraph` library and to determine a threshold which the graph drawn is based on. We set it to 0.15 so that we only visualize the strongest associations:
```{r assoc_g, eval=TRUE, fig.align='center',fig.width=9, message =FALSE}
library(ggraph)
assoc_graph(dyad.var, 0.15)
```

Given this threshold, we see isolated and disconnected nodes representing independent variables. 
We note strong dependence between the three dyadic variables `status`,`years` and `age`, but also a somewhat strong dependence among the three variables  `lawschool`, `years` and `age`, and the three variables `status`, `years` and `gender`. The association graph can also be interpreted as a tendency for  relations `cowork` and `friend` to be independent conditionally on relation `advice`, that is, any dependence between dyad variables `cowork` and `friend` is explained by `advice`.


## References
> Frank, O., & Shafie, T. (2016). Multivariate entropy analysis of network data.
*Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique*, 129(1), 45-63. [link](https://doi.org/10.1177%2F0759106315615511)