The Mercator package is intended to facilitate the exploratory analysis of data sets. It consists of two main parts, one devoted to tools for binary matrices, and the other focused on visualization. These visualization tools can be used with binary, continuous, categorical, or mixed data, since they only depend on a distance matrix. Each distance matrix can be visualized with multiple techniques, providing a consistent interface to thoroughly explore the data set. In this vignette, we illustrate the visualization of a continuous data set.
First we load the package.
suppressMessages( suppressWarnings( library(Mercator) ) )
Now we load a “fake” set of synthetic continuous data that comes with the Mercator package. We will use this data set to illustrate the visualization methods.
set.seed(36766)
data(fakedata)
ls()
## [1] "fakeclin" "fakedata"
dim(fakedata)
## [1] 776 300
dim(fakeclin)
## [1] 300 4
The Mercator package currently supports visualization of data with methods that include standard techniques (hierarchical clustering) and large-scale visualizations (multidimensional scaling (MDS), t-distributed Stochastic Neighbor Embedding (t-SNE), and iGraph). In order to create a Mercator object, we must provide a distance matrix, the name of the distance metric, the name of an initial visualization method, and the number of groups.
We are going to start with hierarchical clustering, with an arbitrarily assigned number of 4 groups.
mercury <- Mercator(dist(t(fakedata)), "euclid", "hclust", 4)
summary(mercury)
## An object of the 'Mercator' class, using the ' euclid ' metric, of size
## [1] 300 300
## Contains these visualizations: hclust
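The call above transposes the data because dist() computes distances between rows, while fakedata stores its 300 samples in columns. Here is a minimal base-R sketch of that orientation, using a small random matrix; the names toy and d are made up for illustration and are not part of Mercator.

```r
# Toy stand-in for fakedata: 20 features (rows) by 6 samples (columns).
set.seed(1)
toy <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6,
              dimnames = list(NULL, paste0("S", 1:6)))

# dist() works on rows, so transpose to get sample-to-sample distances.
d <- dist(t(toy), method = "euclidean")

# The result covers the 6 samples, not the 20 features.
attr(d, "Size")
dim(as.matrix(d))
```

Forgetting the transpose would silently produce a feature-by-feature distance matrix instead, so checking the size (as summary does above) is a cheap sanity test.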
Here is a “view” of the dendrogram produced by hierarchical clustering. Note that view is an argument to the plot function for Mercator objects. If omitted, the first view in the list is used.
plot(mercury, view = "hclust")
The dendrogram suggests that there might actually be more than 4 subtypes in the data, but we’re going to wait until we see some other views of the data before doing anything about that.
Mercator can use t-distributed Stochastic Neighbor Embedding (t-SNE) plots for visualizing large-scale, high-dimensional data in a 2-dimensional space.
mercury <- addVisualization(mercury, "tsne")
plot(mercury, view = "tsne", main="t-SNE; Euclidean Distance")
The t-SNE plot also suggests more than four subtypes; perhaps as many as seven or eight.
Optional t-SNE parameters, such as perplexity, can be used to fine-tune the plot when the visualization is created. Using addVisualization to create a new, tuned plot of an existing type overwrites the existing plot of that type.
mercury <- addVisualization(mercury, "tsne", perplexity = 15)
## Warning in addVisualization(mercury, "tsne", perplexity = 15): Overwriting an
## existing visualization:tsne
plot(mercury, view = "tsne", main="t-SNE; Euclidean Distance; perplexity = 15")
Mercator allows visualization of multi-dimensional scaling (MDS) plots, as well.
mercury <- addVisualization(mercury, "mds")
plot(mercury, view = "mds", main="MDS; Euclidean Distance")
Interestingly, the MDS plot (which is equivalent to principal components analysis, PCA, when used with Euclidean distances) doesn’t provide clear evidence of more than three or four subtypes. That’s not surprising, since groups separated in high dimensions can easily be flattened by linear projections.
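The equivalence between classical MDS and PCA for Euclidean distances can be checked directly in base R. The sketch below uses random toy data (not fakedata) and the standard cmdscale and prcomp functions; whether Mercator calls cmdscale internally is an assumption we do not rely on here.

```r
# Toy data: 30 samples (rows) by 5 features; names are illustrative only.
set.seed(2)
x <- matrix(rnorm(30 * 5), nrow = 30)

# Classical MDS on Euclidean distances between samples.
mds <- cmdscale(dist(x), k = 2)

# PCA scores for the same samples (prcomp centers the columns by default).
pca <- prcomp(x)$x[, 1:2]

# The two sets of coordinates agree up to the sign of each axis,
# so compare absolute values.
same <- isTRUE(all.equal(abs(mds), abs(pca),
                         check.attributes = FALSE, tolerance = 1e-6))
same
```

Because both methods are linear projections onto the same axes, neither can separate groups that differ only along directions outside the first two components.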
Mercator can visualize complex networks using iGraph. In the next chunk of code, we add an iGraph visualization. We then look at the resulting graph, using three different “layouts”. The Q parameter is a cutoff (qutoff?) on the distance used to include edges; if omitted, it defaults to the 10th percentile. We arrived at the value Q=24 shown here by trial and error, though one could plot a histogram of the distances (via hist(mercury@distance)) to make a more informed choice.
set.seed(73633)
mercury <- addVisualization(mercury, "graph", Q = 24)
## Warning in layout_nicely(myg): Non-positive edge weight found, ignoring all
## weights during graph layout.
plot(mercury, view = "graph", layout = "tsne", main="T-SNE Layout")
plot(mercury, view = "graph", layout = "mds", main = "MDS Layout")
plot(mercury, view = "graph", layout = "nicely", main = "'Nicely' Layout")
The last layout, in this case, is possibly not so nice.
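To make the role of the Q cutoff more concrete, here is a base-R sketch of how one might inspect a distance distribution before picking a threshold. It uses random toy data rather than the Mercator object, and the strict comparison (distance < cutoff) is an assumption about how edges are selected, not something confirmed by the package.

```r
# Toy data: 40 samples (rows) by 10 features; names are illustrative only.
set.seed(3)
x <- matrix(rnorm(40 * 10), nrow = 40)
dv <- as.vector(dist(x))   # all pairwise distances as a plain vector

# 10th percentile of the distances, mirroring the default described above.
q10 <- quantile(dv, probs = 0.10, names = FALSE)

# Number of sample pairs that would be joined by an edge at this cutoff
# (assumes edges are kept when the distance is below the cutoff).
n_edges <- sum(dv < q10)

# Histogram counts without drawing; set plot = TRUE to see the figure.
h <- hist(dv, breaks = 30, plot = FALSE)

c(cutoff = q10, pairs = length(dv), edges = n_edges)
```

A cutoff in the left tail of the histogram keeps the graph sparse; moving it toward the bulk of the distribution quickly adds edges and can make the layouts harder to read.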
We can use the getClusters function to determine the cluster assignments and use these for further manipulation. For example, we can easily determine cluster size.
my.clust <- getClusters(mercury)
table(my.clust)
## my.clust
## 1 2 3 4
## 82 68 74 76
We might also compare the cluster labels to the “true” subtypes in our “fake” data set.
table(my.clust, fakeclin$Type)
##
## my.clust 1 2 3 4 5 6 7 8
## 1 0 40 0 1 0 41 0 0
## 2 30 0 36 0 0 0 2 0
## 3 4 0 0 34 0 0 36 0
## 4 0 0 0 0 41 0 0 35
The barplot method produces a version of the “silhouette width” plot from Kaufman and Rousseeuw (and borrowed from the cluster package).
barplot(mercury)
For each observation in the data set, the silhouette width is a measure of how much we believe that it is placed in the correct cluster. Here we see that about 10% to 20% of the observations in each cluster may be incorrectly classified, since their silhouette widths are negative.
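The fraction of negative silhouette widths per cluster can also be computed numerically. The sketch below works directly with the cluster package on random toy data; it does not use the Mercator object, and the variable names are made up for illustration.

```r
library(cluster)  # recommended package, normally installed with R

set.seed(4)
# Two shifted Gaussian groups, 25 samples each, 4 features.
x <- rbind(matrix(rnorm(25 * 4, mean = 0), ncol = 4),
           matrix(rnorm(25 * 4, mean = 3), ncol = 4))
d <- dist(x)

# PAM on the distance matrix, then silhouette widths for its labels.
fit <- pam(d, k = 2)
sil <- silhouette(fit$clustering, d)

# Fraction of negative silhouette widths within each cluster.
neg_frac <- tapply(sil[, "sil_width"] < 0, sil[, "cluster"], mean)
neg_frac
```

Silhouette widths always lie between -1 and 1; a negative value means the observation is, on average, closer to a neighboring cluster than to its own.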
We can “recluster” by specifying a different number of clusters.
mercury <- recluster(mercury, K = 8)
plot(mercury, view = "tsne")
The silhouette-width barplot changes with the number of clusters. In this case, it suggests that eight clusters may not describe the data as well as four. However, the previous t-SNE plot also shows that the algorithmically derived cluster labels don’t seem to match the visible clusters very well.
barplot(mercury)
The clustering algorithm used within Mercator is partitioning around medoids (PAM). You can run any clustering algorithm of your choice and assign the resulting cluster labels to the Mercator object. As part of our visualizations, we have already performed hierarchical clustering. So, we can assign cluster labels by cutting the branches of the dendrogram. We can use the cutree function after extracting the dendrogram from the view. (Note that we use the remapColors function here to try to keep the same color assignments for the PAM-defined clusters and the hierarchical clusters.)
hclass <- cutree(mercury@view[["hclust"]], k = 8)
neptune <- setClusters(mercury, hclass)
neptune <- remapColors(mercury, neptune)
plot(neptune, view = "tsne")
The assignments by hierarchical clustering appear to be more consistent with the t-SNE plot than the PAM clusters were, though one suspects that the assignments among the pink, red, and orchid groups may be difficult. The silhouette width barplot (below) confirms that hierarchical clustering works better than PAM on this data set. Only the “red” group #4 contains a large number of apparently misclassified samples.
barplot(neptune)
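The dendrogram-cutting step used for neptune can be reproduced outside Mercator with base R alone. This sketch uses random toy data, and the linkage is simply the hclust default, which may differ from the method Mercator uses internally.

```r
# Toy data: 60 samples (rows) by 8 features; names are illustrative only.
set.seed(5)
x <- matrix(rnorm(60 * 8), nrow = 60)

# Hierarchical clustering on Euclidean distances
# (default linkage; an assumption, not necessarily Mercator's choice).
hc <- hclust(dist(x))

# Cut the tree into 8 groups; the result is one integer label per sample.
labels8 <- cutree(hc, k = 8)
table(labels8)
```

A label vector of this form, one integer per sample, is the kind of input that setClusters receives from cutree in the chunk above.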
For our fake data set, since we simulated it, we know the “true” labels. So, we can “recluster” using the true assignments.
venus <- setClusters(neptune, fakeclin$Type)
venus <- remapColors(neptune, venus)
plot(venus, view = "tsne")
barplot(venus)
We can also see how the hierarchical clustering compares to the true cluster assignments.
table(getClusters(neptune), getClusters(venus))
##
## 1 2 3 4 5 6 7 8
## 1 41 0 2 1 0 0 0 0
## 2 0 21 0 0 0 0 0 1
## 3 0 0 38 0 0 0 0 0
## 4 0 4 0 34 0 0 0 0
## 5 0 0 0 0 35 2 0 0
## 6 0 0 0 0 0 39 0 0
## 7 0 0 0 0 0 0 36 0
## 8 0 9 0 0 0 0 2 35
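One rough way to summarize this table is to count, for each hierarchical cluster (row), the samples that fall in its dominant column. The sketch below retypes the counts printed above into a matrix; the summary itself (sometimes called purity) is our own illustration and is not computed by Mercator.

```r
# Counts copied from the cross-tabulation above, one row per hierarchical cluster.
tab <- matrix(c(41,  0,  2,  1,  0,  0,  0,  0,
                 0, 21,  0,  0,  0,  0,  0,  1,
                 0,  0, 38,  0,  0,  0,  0,  0,
                 0,  4,  0, 34,  0,  0,  0,  0,
                 0,  0,  0,  0, 35,  2,  0,  0,
                 0,  0,  0,  0,  0, 39,  0,  0,
                 0,  0,  0,  0,  0,  0, 36,  0,
                 0,  9,  0,  0,  0,  0,  2, 35),
              nrow = 8, byrow = TRUE)

# Samples in the dominant column of each row, and the overall fraction.
dominant <- apply(tab, 1, max)
purity <- sum(dominant) / sum(tab)
purity   # 279 of 300 samples, i.e. 0.93
```

Most of the disagreement comes from the second column, whose samples are split across hierarchical clusters 2, 4, and 8.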
This analysis was performed in the following environment:
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Mercator_1.1.5 Thresher_1.1.4 PCDimension_1.1.13
## [4] ClassDiscovery_3.4.4 oompaBase_3.2.9 cluster_2.1.6
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 xfun_0.46 bslib_0.8.0
## [4] ggplot2_3.5.1 lattice_0.22-6 vctrs_0.6.5
## [7] tools_4.4.1 generics_0.1.3 stats4_4.4.1
## [10] flexmix_2.3-19 Polychrome_1.5.1 tibble_3.2.1
## [13] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3
## [16] Matrix_1.7-0 KernSmooth_2.23-24 scatterplot3d_0.3-44
## [19] lifecycle_1.0.4 kohonen_3.0.12 compiler_4.4.1
## [22] munsell_0.5.1 movMF_0.2-8 htmltools_0.5.8.1
## [25] sass_0.4.9 yaml_2.3.10 pillar_1.9.0
## [28] jquerylib_0.1.4 MASS_7.3-60.2 openssl_2.2.0
## [31] cachem_1.1.0 viridis_0.6.5 mclust_6.1.1
## [34] RSpectra_0.16-2 cpm_2.3 tidyselect_1.2.1
## [37] digest_0.6.36 Rtsne_0.17 slam_0.1-52
## [40] dplyr_1.1.4 kernlab_0.9-32 changepoint_2.2.4
## [43] ade4_1.7-22 fastmap_1.2.0 grid_4.4.1
## [46] oompaData_3.1.3 colorspace_2.1-1 cli_3.6.3
## [49] magrittr_2.0.3 utf8_1.2.4 scales_1.3.0
## [52] rmarkdown_2.27 umap_0.2.10.0 igraph_2.0.3
## [55] nnet_7.3-19 reticulate_1.38.0 gridExtra_2.3
## [58] png_0.1-8 askpass_1.2.0 zoo_1.8-12
## [61] modeltools_0.2-23 evaluate_0.24.0 knitr_1.48
## [64] viridisLite_0.4.2 rlang_1.1.4 Rcpp_1.0.13
## [67] dendextend_1.17.1 glue_1.7.0 jsonlite_1.8.8
## [70] R6_2.5.1