The SpeakEasy 2: Champagne (SE2) community detection algorithm groups together graph nodes into communities based on their connectivity. The detection algorithm works on weighted or unweighted graphs and directed or undirected graphs.
This speakeasyR package provides a convenient way to
run the SE2 algorithm in R using the cluster()
function.
The cluster()
function accepts graphs in the form of
adjacency matrices (both base R matrices and Matrix
sparse matrices), igraph graphs, or anything that can
be coerced into an adjacency matrix using an as.matrix()
generic function.
Before getting started, we will create a wrapper function that
returns a randomly generated igraph graph with clear
community structure. This function accepts number of nodes, number of
communities (or types), and a mixing parameter. The graph starts as a
regular block adjacency matrix—where each community is fully connected
within the community and has no edges leaving the community. It then
uses the mixing parameter (mu
) to determine what proportion
of intracommunity edges should be rewired to connect nodes between
communities, the larger the mixing parameter, the weaker the community
structure.
igraph_from_mixing_param <- function(n_nodes, n_types, mu) {
pref <- matrix(mu, n_types, n_types)
diag(pref) <- 1 - mu
igraph::sample_pref(n_nodes, types = n_types, pref.matrix = pref)
}
With this function, a small, easy to cluster graph can be created.
if (requireNamespace("igraph")) {
g <- igraph_from_mixing_param(n_nodes = 50, n_types = 5, mu = 0.1)
}
#> Loading required namespace: igraph
The graph can be clustered by calling cluster
:
The membership vector returned contains the community ids for each
vector, such that memb[vid]
is the community id of vertex
vid
. The seed argument is given to ensure reproducible
results.
The graph can be visualized using a heatmap to display the edges. To
view the detected communities, the nodes need to be reordered to group
members of the same community together. For this the
order_nodes()
function is used.
if (requireNamespace("igraph")) {
ordering <- speakeasyR::order_nodes(g, memb)
adj <- as(g[], "matrix")[ordering, ordering]
color <- rainbow(length(unique(memb)))[memb[ordering]]
heatmap(adj, scale = "none", Rowv = NA, Colv = NA, RowSideColors = color)
}
The graph is converted to an adjacency matrix to simplify indexing
and working with heatmap()
. The color bar shows which nodes
each community encompasses.
SE2 can easily recover the community structure of a graph with little crosstalk between communities, but how about in a larger graph with more mixing?
Using the same wrapper function as before, we can generate and cluster a tougher graph:
if (requireNamespace("igraph")) {
g <- igraph_from_mixing_param(n_nodes = 1000, n_types = 10, mu = 0.4)
memb <- speakeasyR::cluster(g, seed = 222, max_threads = 2)
ordering <- speakeasyR::order_nodes(g, memb)
adj <- as(g[], "matrix")[ordering, ordering]
color <- rainbow(length(unique(memb)))[memb[ordering]]
heatmap(adj, scale = "none", Rowv = NA, Colv = NA, RowSideColors = color)
}
Even with 40% of each community’s edges rewired to other communities, the original membership is still closely recovered.
Out of the box, SE2 should work well. But one of the main goals of SE2 is to ensure consistent results. Some more challenging graphs may need extra runs to get past the variance and converge on a stable community structure.
SE2 works by running the same core clustering algorithm multiple
times independently, starting with different initial guesses, and then
looking for stable patterns across runs to come to conclusions about
community structure. By default it performs 10
independent_runs
, in many cases this is more than enough,
but if multiple calls to cluster()
return significantly
different membership vectors, it’s worth increasing the number of runs
by setting the independent_runs
parameter to a larger
value. The target_partitions
and
discard_transient
parameters may also help produce a more
robust result.
Each independent run tracks multiple possible partitions throughout
the run. The run ends when it has found target_partitions
number of partitions. In cases where there isn’t a clear structure to
converge on, increasing this parameter can provide more examples of
where the nodes naturally group.
If the graph is particularly noisy, it may take a longer time to find
useful communities. The discard_transient
allows you to
adjust how many partitions to discard before the run starts tracking
partitions. By discarding a higher number of initial partitions, you
give SE2 more time to generate a sound foundation, giving less influence
to earlier, weaker communities structures in determining the final
membership vector.
SE2 also has a few options that don’t impact how it clusters the
graph. These are seed
, verbose
, and
max_threads
. The seed
parameter was mentioned
above and sets the seed of the RNG used by SE2. While SE2 uses a random
number generator (RNG) different from R’s, the default value of the seed
is generated using R’s RNG. Because of this it’s possible to ensure
reproducible results over multiple SE2 calls by setting the R RNG seed
instead of setting the cluster()
parameter.
Setting verbose
to TRUE
will cause SE2 to
print some information about clustering while it runs. This can be used
to ensure SE2 identifies the graph as expected but is more useful to
give information about the status of clustering during longer runs.
Lastly, max_threads
provides the maximum number of threads
to run in parallel. SE2, can run each independent run in a separate
thread, each of which can be run on separate processing cores. By
default, SE2 will use as many threads as cores available but this can be
reduced if desired.
Using publicly available single-cell RNA sequencing (scRNAseq) data, we can test out SE2 on a real world example and attempt to cluster individual cells into cell types based on the similarity of their gene expression.
We will use the scRNAseq
package to obtain a clean data
set:
if (requireNamespace("scRNAseq") && requireNamespace("SummarizedExperiment")) {
expression <- scRNAseq::FletcherOlfactoryData()
counts <- SummarizedExperiment::assay(expression, "counts")
cell_types <- expression$cluster_id
}
#> Loading required namespace: scRNAseq
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
The cell_types
above are the published cell types
identified through a non-graph based classification method. It’s
important to note that there is no “correct” answer to this
classification problem, so differences in results does not imply one
method is incorrect.
Since single-cell data is noisy due to the small amounts of gene
expression on an individual cell level and hard to correct for
differences in cell size, the counts
must be filtered and
normalized before continuing. We start by removing any genes that are
expressed too little across the data set:
if (requireNamespace("scRNAseq") && requireNamespace("SummarizedExperiment")) {
indices <- rowSums(counts > 0) > 10
counts <- counts[indices, ]
}
Any genes that have recorded counts in less than 10 cells are discarded. We then use a simple normalization technique, shifted logarithm on the counts matrix:
if (requireNamespace("scRNAseq") && requireNamespace("SummarizedExperiment")) {
target <- median(colSums(counts))
size_factors <- colSums(counts) / target
counts_norm <- log(t(t(counts) / size_factors + 1))
}
Lastly, to further reduce noise, we use PCA to create 50 composite features made by combining the gene expression from different sets of genes:
if (requireNamespace("scRNAseq") && requireNamespace("SummarizedExperiment")) {
counts_norm <- t(prcomp(t(counts_norm), scale. = FALSE)$x)[1:50, ]
}
Now that we have preprocessed the data, it’s possible to convert it
to a similarity based graph. We will use the 50 PCA components to
determine how similar the gene expression between each of the cells are
and use this to create a k-nearest neighbors (kNN) graph. That is a
directed graph where each cell has k
unweighted edges
leaving it going to the k
most similar cells. Luckily, the
speakeasyR package provides a function to create this
graph:
if (requireNamespace("scRNAseq") && requireNamespace("SummarizedExperiment")) {
adj <- speakeasyR::knn_graph(counts_norm, 10)
}
This graph can be passed to SE2 and visualized using a heatmap as before:
if (requireNamespace("scRNAseq") && requireNamespace("SummarizedExperiment")) {
memb <- speakeasyR::cluster(adj, verbose = TRUE, seed = 222, max_threads = 2)
ordering <- speakeasyR::order_nodes(adj, memb)
}
#> Approximate edge density is 0.0162602.
#> Input type treated as unweighted.
#>
#> Calling main routine at level 1.
#>
#> Completed generating initial labels.
#> Produced 10 seed labels, while goal was 10.
#>
#> Starting level 1 clustering
#> ; independent runs might not be displayed in order - that is okay...
#>
#> Starting independent run #1 of 10
#>
#> Starting independent run #2 of 10
#>
#> Starting independent run #4 of 10
#>
#> Starting independent run #3 of 10
#>
#> Starting independent run #6 of 10
#>
#> Starting independent run #5 of 10
#>
#> Starting independent run #8 of 10
#>
#> Starting independent run #7 of 10
#>
#> Starting independent run #10 of 10
#>
#> Starting independent run #9 of 10
#>
#>
#> Generated 50 partitions at level 1.
#>
#> Mean of all NMIs is 0.87488.
The kNN graph is initially in a sparse matrix form, this is converted
to a base matrix
type to make it easier to reorder the
nodes.
if (requireNamespace("scRNAseq") && requireNamespace("SummarizedExperiment")) {
adj <- as.matrix(as(adj, "dMatrix"))
row_colors <- rainbow(length(unique(cell_types)))[cell_types[ordering]]
col_colors <- rainbow(length(unique(memb)))[memb[ordering]]
heatmap(
adj[ordering, ordering],
Colv = NA, Rowv = NA,
labRow = NA, labCol = NA,
RowSideColors = row_colors, ColSideColors = col_colors,
xlab = "SE2 Cluster", ylab = "Cell Type",
scale = "none"
)
}
Here the color bar on the x-axis shows the communities based on SE2 while the y-axis color bar displays the communities the publication found. Ordering is based on the SE2 results so there is some mixing of the previous cell types communities but not of SE2 in the visual.
Real networks often have hierarchical structures, within a friend clique there can be smaller groups of closer friendships, or in the publication landscape, collaboration is likely to be common between members of the same lab but there can also be collaboration at a higher scale between labs and between universities or even fields.
SE2 provides subclustering where the individual communities of the
initial clustering will in turn be clustered into smaller communities.
This behavior can be turned on by setting the subclusters
to parameter to a value greater than 1. (The min_clust
parameter determines the smallest community to consider for
subclustering, if a community has fewer than min_clust
nodes, it will not be subclustered further.)
We can use this subclustering feature to cluster the cell types in the previous example into subtypes.
if (requireNamespace("scRNAseq") && requireNamespace("SummarizedExperiment")) {
memb <- speakeasyR::cluster(
adj,
subcluster = 2,
verbose = TRUE,
seed = 222,
max_threads = 2
)
ordering <- speakeasyR::order_nodes(adj, memb)
}
#> Approximate edge density is 0.0162602.
#> Input type treated as unweighted.
#>
#> Calling main routine at level 1.
#>
#> Completed generating initial labels.
#> Produced 37 seed labels, while goal was 10.
#>
#> Starting level 1 clustering
#> ; independent runs might not be displayed in order - that is okay...
#>
#> Starting independent run #1 of 10
#>
#> Starting independent run #2 of 10
#>
#> Starting independent run #3 of 10
#>
#> Starting independent run #4 of 10
#>
#> Starting independent run #5 of 10
#>
#> Starting independent run #6 of 10
#>
#> Starting independent run #7 of 10
#>
#> Starting independent run #8 of 10
#>
#> Starting independent run #9 of 10
#>
#> Starting independent run #10 of 10
#>
#>
#> Generated 50 partitions at level 1.
#>
#> Mean of all NMIs is 0.89693.
#>
#>
#> Subclustering at level 2.
memb
is now a matrix with one row per level of
subclustering instead of a vector. Similarly, there are two sets of
order indices in ordering
, one in row 1 and the second in
row 2. The order_nodes()
function recursively orders nodes
by community. So each community in the previous level is reordered
rather than completely reordering all nodes based only on the second
level of clustering. This leaves some remnants of previous levels’
ordering in the resulting heatmaps.
if (requireNamespace("scRNAseq") && requireNamespace("SummarizedExperiment")) {
level <- 2
level_order <- ordering[level, ]
level_memb <- memb[level, level_order]
row_colors <- rainbow(length(unique(cell_types)))[cell_types[level_order]]
col_colors <- rainbow(length(unique(level_memb)))[level_memb]
heatmap(
adj[level_order, level_order],
Colv = NA, Rowv = NA,
labRow = NA, labCol = NA,
RowSideColors = row_colors, ColSideColors = col_colors,
xlab = "SE2 Cluster", ylab = "Cell Type",
scale = "none", main = paste0("Level ", level, "structure")
)
}
Here the original communities have been further broken down into groups of cells that are particularly similar within cell types. When comparing this heatmap to the previous heatmap, notice some of the communities at level 1 combine multiple of the publication’s communities but at level 2, they are separated out into distinct cell types.