--- title: "Using ClusVis with RMixtComp Output for Visualization" author: "Quentin Grimonprez" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Using ClusVis with RMixtComp Output for Visualization} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.align = "center", fig.width = 7, fig.height = 5) ``` ```{r, echo=FALSE} Sys.setenv(MC_DETERMINISTIC = 2) ``` ## ClusVis [ClusVis](https://cran.r-project.org/package=ClusVis) is an R package that performs a gaussian-based visualization of gaussian and non-gaussian Model-Based Clustering. This visualization is based on the probabilities of classification. See this [preprint](https://hal.science/hal-01949155v1/document) for more details about the method. It allows to visualize clusters as bivariate spherical gaussian. ## ClusVis and RMixtComp First, we load the required packages. ```{r, echo = TRUE, results = 'hide'} library(RMixtComp) library(ClusVis) ``` To illustrate the use of ClusVis with RMixtComp output, we use the iris dataset and the congress dataset. ### Example 1: iris dataset The iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. ```{r} data("iris") head(iris) ``` First, we learn a mixture model with 3 classes for the 4 measurements varaibles. ```{r} res <- mixtCompLearn(iris[, -5], nClass = 3, criterion = "BIC", nRun = 3, nCore = 1, verbose = FALSE) ``` Then, we apply the `clusvis` function. This function requires 2 parameters: the logarithm of the probabilities of classification of every individuals and the proportion of the mixture. ```{r} logTik <- getTik(res, log = TRUE) prop <- getProportion(res) resVisu <- clusvis(logTik, prop) ``` The results can be displayed using the `plotDensityClusVisu` function. The first graph is generated with the parameter `add.obs = TRUE`. It overlays on the most discriminative map the curve of iso-probabilities of classification and the cloud of observations. ```{r} plotDensityClusVisu(resVisu, add.obs = TRUE) ``` With `add.obs = FALSE`, the goal of the plot is to represents the overlap between the clusters. Each clusters is represented by its centers and a 95% confidence level border. The differene between entropies displayed in the title defines the accuracy of the representation. A difference closed to 0 means that the representation is accurate. ```{r} plotDensityClusVisu(resVisu, add.obs = FALSE) ``` Here, we note that two clusters are closed and so they contains flowers with similar measures whereas the other cluster contains flowers with very different measures from the two others. ### Example 2: congress dataset This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA in 1984. ```{r} data("congress") head(congress) ``` First, we change the format of the data. The vote "n" is refactored as 1 and "y" as 2. "democrat" is refactored as 1 and "republican" as 2. ```{r} ## MixtComp Format congress$V1 = refactorCategorical(congress$V1, c("democrat", "republican", "?"), c(1, 2, "?")) for(i in 2:ncol(congress)) congress[, i] = refactorCategorical(congress[, i], c("n", "y", "?"), c(1, 2, "?")) head(congress) ``` We run MixtComp with a Multinomial model for each variable. ```{r, results = 'hide'} model <- rep("Multinomial", ncol(congress)) names(model) = colnames(congress) res <- mixtCompLearn(congress, model = model, nClass = 4, criterion = "BIC", nRun = 3, nCore = 1) ``` As before, we extract the required parameters. ```{r} logTik <- getTik(res, log = TRUE) prop <- getProportion(res) head(logTik) ``` It is important to notice that there are a lot of `-Inf` values in the variable `logTik` because some probabilities to be in a cluster are exactly 0. If there are too many infinite values, it is a problem for the `cluvis` function. One way to avoid this problem is to replace infinite values with the logarithm of a epsilon. ```{r} logTik[is.infinite(logTik)] = log(1e-20) head(logTik) ``` Now, the `clusvis` function can be run. ```{r} resVisu <- clusvis(logTik, prop) ``` And the two associated plots generated. ```{r} plotDensityClusVisu(resVisu, add.obs = TRUE) ``` ```{r} plotDensityClusVisu(resVisu, add.obs = FALSE) ``` ```{r, echo=FALSE} Sys.unsetenv("MC_DETERMINISTIC") ```