--- title: "CliquePercolation" author: "Jens Lange" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{CliquePercolation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: bibtex.bib csl: apa_NoIssueNum_NoGivenNameDisamb.csl --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(Matrix) library(qgraph) library(CliquePercolation) library(colorspace) ``` The **CliquePercolation** package entails a number of functions related to the clique percolation community detection algorithms for undirected, unweighted and weighted networks [as described in @Palla+et+al:2005; @Farkas+et+al:2007]. The package entails functions for... * helping to optimize parameters of the algorithms * running the algorithms * plotting the results. This document provides an introduction to this workflow with some random example data sets. ## An Introduction to the Clique Percolation Community Detection Algorithm Interconnected entities can be represented as networks [@Barabasi:2011]. Each network entails two sets, namely *nodes*, which are the entities, and *edges*, which are the connections between entities. Many networks are *undirected* such that edges simply connect two nodes with each other without any directionality in the edges. For instance, nodes could represent people and edges could represent whether they are friends or not. Or nodes could represent cities and edges could represent whether there is a street connecting the cities. Such networks, in which edges are either present or absent, are called *unweighted*. In other cases, nodes could represent people and edges could represent the frequency with which these people meet. Or nodes could represent questionnaire items and edges could represent the correlations of these items. In such a case, the edges are not only present or absent, but may count the frequency or strength of interactions. Such networks are called *weighted*. The edges in the network can further be *directed*, such that edges can point from one node to another. For instance, the internet is a network of webpages in which a directed edge could represent a hyperlink from one webpage to another. Here, only undirected networks are discussed. Analyzing the structure of such networks is a key task across sciences. One structural feature that is often investigated is the identification of strongly connected subgraphs in the network, which is called *community detection* [@Fortunato:2010]. Most community detection algorithms thereby put each node in only one community. However, it is likely that nodes are often shared by a number of communities. This could occur, for instance, when a friend is part of multiple groups. One community detection algorithm that is aimed at identifying overlapping communities is the *clique percolation algorithm*, which has been developed for unweighted [@Palla+et+al:2005] and weighted networks [@Farkas+et+al:2007]. The clique percolation algorithm for unweighted networks proceeds as follows: * First, you identify a network of nodes and edges. I will use an unweighted network with eight nodes. ```{r, echo = FALSE, dpi = 300, fig.cap = "**Unweighted network with eight nodes.**", fig.align = "center", out.extra = 'style = "border:none"', out.width = "60%"} W <- matrix(c(0,1,1,1,0,0,0,0, 0,0,1,1,0,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,1,1,1,0, 0,0,0,0,0,1,1,0, 0,0,0,0,0,0,1,0, 0,0,0,0,0,0,0,1, 0,0,0,0,0,0,0,0), nrow = 8, ncol = 8, byrow = TRUE) W <- forceSymmetric(W) rownames(W) <- letters[seq(from = 1, to = nrow(W))] colnames(W) <- letters[seq(from = 1, to = nrow(W))] qgraph(W, edge.width = 4) ``` * Second, you identify *k-cliques*, which are fully connected networks with $k$ nodes. The smallest possible $k$ would be $k = 3$. Otherwise, the cliques would be only edges. Applying $k = 3$ to the example leads to the identification of six 3-cliques, namely ```{r, echo = FALSE, dpi = 300, fig.cap = "**Six 3-cliques in unweighted network.**", fig.align = "center", out.extra = 'style = "border:none"', out.width = "60%"} color1 <- c("#ff9600","#ff9600","#ff9600","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f", "#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f") color2 <- c("#ff9600","#7f7f7f","#7f7f7f","#ff9600","#ff9600","#7f7f7f","#7f7f7f", "#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f") color3 <- c("#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#ff9600","#ff9600", "#ff9600","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f") color4 <- c("#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#ff9600","#7f7f7f", "#7f7f7f","#ff9600","#ff9600","#7f7f7f","#7f7f7f") color5 <- c("#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#ff9600", "#7f7f7f","#ff9600","#7f7f7f","#ff9600","#7f7f7f") color6 <- c("#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f","#7f7f7f", "#ff9600","#7f7f7f","#ff9600","#ff9600","#7f7f7f") layout(matrix(c(1,2,3, 4,5,6), ncol = 3, nrow = 2, byrow = TRUE)) qgraph(W, edge.color = color1, edge.width = 4) qgraph(W, edge.color = color2, edge.width = 4) qgraph(W, edge.color = color3, edge.width = 4) qgraph(W, edge.color = color4, edge.width = 4) qgraph(W, edge.color = color5, edge.width = 4) qgraph(W, edge.color = color6, edge.width = 4) ``` * Finally, a community is defined as a set of adjacent $k$-cliques, that is, $k$-cliques that share exactly $k-1$ nodes. With $k = 3$, two 3-cliques are adjacent if they share exactly two nodes (equivalent to an edge). In the example, this implies that there are two communities (see below). The green edges entail the 3-cliques $a$--$b$--$c$ and $a$--$b$--$d$. The pink edges entail the 3-cliques $d$--$e$--$f$, $d$--$e$--$g$, $d$--$f$--$g$, and $e$--$f$--$g$. ```{r, echo = FALSE, dpi = 300, fig.cap = "**Two communities in unweighted network.**", fig.align = "center", out.extra = 'style = "border:none"', out.width = "60%"} color7 <- c("#00AD9A","#00AD9A","#00AD9A","#00AD9A","#00AD9A","#E16A86","#E16A86", "#E16A86","#E16A86","#E16A86","#E16A86","#7f7f7f") qgraph(W, edge.color = color7, edge.width = 4) ``` This also showcases that the clique percolation algorithm can lead to nodes that are *shared* by communities. In the current example, node $d$ is part of both the green and the pink community. Nodes $a$, $b$, and $c$ are part of only the green community and nodes $e$, $f$, and $g$ are part of only the pink community. Importantly, clique percolation can also lead to nodes that are part of no community. These are called *isolated nodes*, in the current example, node $h$. One way to plot the community partition on the original network would be to color the nodes according to the communities they belong to. Shared nodes have multiple colors and isolated nodes remain white. ```{r, echo = FALSE, results = FALSE, dpi = 300, fig.cap = "**Community partition by node coloring in unweighted network.**", fig.align = "center", out.extra = 'style = "border:none"', out.width = "60%"} cp <- cpAlgorithm(qgraph(W, DoNotPlot = TRUE), k = 3, method = "unweighted") cpColoredGraph(qgraph(W, DoNotPlot = TRUE), cp$list.of.communities.labels, edge.width = 4) ``` For weighted networks, the algorithm has just one intermediate additional step. Specifically, after identifying the $k$-cliques, they are considered further only if their *Intensity* exceeds a specified threshold $I$. The Intensity of a $k$-clique is defined as the geometric mean of the edge weights, namely $$ I_C = \Bigg(\prod_{i