iClusterVB

iClusterVB allows for fast integrative clustering and feature selection for high-dimensional data.

Using a variational Bayes approach, iClusterVB clusters mixed-type data, automatically determines the number of clusters, and performs feature selection in high-dimensional settings. These features address limitations of traditional clustering methods and offer an alternative, potentially faster, approach than MCMC algorithms, making iClusterVB a practical tool for contemporary data analysis.

Installation

You can install iClusterVB from CRAN with:

install.packages("iClusterVB")

You can install the development version of iClusterVB from GitHub with:

# install.packages("devtools")
devtools::install_github("AbdalkarimA/iClusterVB")

iClusterVB - The Main Function

Mandatory arguments

Optional arguments

Simulated Data

We will demonstrate the clustering and feature selection performance of iClusterVB using a simulated dataset comprising \(N = 240\) individuals and \(R = 4\) data views with different data types. Two views were continuous, one was count, and one was binary – a setup commonly found in genomics data where gene or mRNA expression (continuous), DNA copy number (count), and mutation presence (binary) are observed. The true number of clusters (\(K\)) was set to 4, with balanced cluster proportions (\(\pi_1 = 0.25, \pi_2 = 0.25, \pi_3 = 0.25, \pi_4 = 0.25\)). Each data view consisted of \(p_r = 500\) features (\(r = 1, \dots, 4\)), totaling \(p = \sum_{r=1}^4 p_r = 2000\) features across all views. Within each view, only 50 features (10%) were relevant for clustering, while the remaining features were noise. The relevant features were distributed across clusters as described in the table below:

| Data View | Cluster | Distribution |
|---|---|---|
| 1 (Continuous) | Cluster 1 | \(\mathcal{N}(10, 1)\) (Relevant) |
| 1 (Continuous) | Cluster 2 | \(\mathcal{N}(5, 1)\) (Relevant) |
| 1 (Continuous) | Cluster 3 | \(\mathcal{N}(-5, 1)\) (Relevant) |
| 1 (Continuous) | Cluster 4 | \(\mathcal{N}(-10, 1)\) (Relevant) |
| 1 (Continuous) | | \(\mathcal{N}(0, 1)\) (Noise) |
| 2 (Continuous) | Cluster 1 | \(\mathcal{N}(-10, 1)\) (Relevant) |
| 2 (Continuous) | Cluster 2 | \(\mathcal{N}(-5, 1)\) (Relevant) |
| 2 (Continuous) | Cluster 3 | \(\mathcal{N}(5, 1)\) (Relevant) |
| 2 (Continuous) | Cluster 4 | \(\mathcal{N}(10, 1)\) (Relevant) |
| 2 (Continuous) | | \(\mathcal{N}(0, 1)\) (Noise) |
| 3 (Binary) | Cluster 1 | \(\text{Bernoulli}(0.05)\) (Relevant) |
| 3 (Binary) | Cluster 2 | \(\text{Bernoulli}(0.2)\) (Relevant) |
| 3 (Binary) | Cluster 3 | \(\text{Bernoulli}(0.4)\) (Relevant) |
| 3 (Binary) | Cluster 4 | \(\text{Bernoulli}(0.6)\) (Relevant) |
| 3 (Binary) | | \(\text{Bernoulli}(0.1)\) (Noise) |
| 4 (Count) | Cluster 1 | \(\text{Poisson}(50)\) (Relevant) |
| 4 (Count) | Cluster 2 | \(\text{Poisson}(35)\) (Relevant) |
| 4 (Count) | Cluster 3 | \(\text{Poisson}(20)\) (Relevant) |
| 4 (Count) | Cluster 4 | \(\text{Poisson}(10)\) (Relevant) |
| 4 (Count) | | \(\text{Poisson}(2)\) (Noise) |

Distribution of relevant and noise features across clusters in each data view

The simulated dataset is included as a list in the package.
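
For intuition about the design described above, here is a minimal sketch of how a view like View 1 could be generated in base R. This is not the code used to build the packaged `sim_data`. The seed, the object names (`cluster_true`, `view1`), and the choice to put the relevant features in the first 50 columns are assumptions made for illustration.

```r
set.seed(1) # arbitrary seed for this sketch

N <- 240          # individuals
K <- 4            # true clusters
p <- 500          # features in the view
p_relevant <- 50  # features that carry cluster signal

# Balanced cluster labels: 60 individuals per cluster
cluster_true <- rep(seq_len(K), each = N / K)

# Cluster-specific means for the relevant features of View 1
mu <- c(10, 5, -5, -10)

# Start with noise everywhere: N(0, 1)
view1 <- matrix(rnorm(N * p, mean = 0, sd = 1), nrow = N, ncol = p)

# Overwrite the first 50 columns with N(mu_k, 1), where k is the row's cluster.
# The matrix is filled column by column, so the per-row means are repeated
# once for each relevant column.
view1[, seq_len(p_relevant)] <- matrix(
  rnorm(N * p_relevant,
        mean = rep(mu[cluster_true], times = p_relevant),
        sd = 1),
  nrow = N, ncol = p_relevant
)

# Average of the relevant features within each cluster
tapply(rowMeans(view1[, seq_len(p_relevant)]), cluster_true, mean)
```

The binary and count views could be sketched the same way by swapping `rnorm()` for `rbinom(..., size = 1, prob = ...)` or `rpois(..., lambda = ...)` with the parameters from the table.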

Data pre-processing

library(iClusterVB)

# Input data must be a list

dat1 <- list(gauss_1 = sim_data$continuous1_data,
             gauss_2 = sim_data$continuous2_data,
             multinomial_1 = sim_data$binary_data,
             poisson_1 = sim_data$count_data)

dist <- c("gaussian", "gaussian",
          "multinomial", "poisson")

# Re-code `0`s to `2`s. This must be done for feature selection 
# and clustering to work properly.
dat1$multinomial_1[dat1$multinomial_1 == 0] <- 2
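
# Before fitting, it can help to confirm that the recoding took effect and that
# every view describes the same individuals. The checks below are a small sketch
# on toy data, not part of the package: `toy_views` and its values are
# hypothetical, and the layout (one row per individual in every view) is an
# assumption.

```r
set.seed(7) # arbitrary seed for the toy Gaussian view

# Toy stand-ins for two views with 3 individuals each (hypothetical values)
toy_views <- list(
  gauss_1       = matrix(rnorm(6), nrow = 3),
  multinomial_1 = matrix(c(0, 1, 1, 0, 0, 1), nrow = 3)
)

# Same recoding as above: 0 becomes 2, so the categories are 1 and 2
toy_views$multinomial_1[toy_views$multinomial_1 == 0] <- 2

# Check 1: all views have the same number of rows
same_rows <- length(unique(vapply(toy_views, nrow, integer(1)))) == 1

# Check 2: no zeros remain in the categorical view
valid_categories <- all(toy_views$multinomial_1 %in% c(1, 2))

stopifnot(same_rows, valid_categories)
```

# The same two checks can be run on `dat1` by replacing `toy_views` with `dat1`.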

Running the model

set.seed(123)
fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
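  K = 8, # Maximum number of clusters; the algorithm determines how many are used
  initial_method = "VarSelLCM",
  VS_method = 1, # Variable Selection is on
  max_iter = 100, # Maximum number of CAVI iterations
  per = 100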
)
#> ------------------------------------------------------------
#> Pre-processing and initializing the model
#> ------------------------------------------------------------
#> ------------------------------------------------------------
#> Running the CAVI algorithm
#> ------------------------------------------------------------
#> iteration = 100 elbo = -43591384.761314

Comparing to True Cluster Membership

table(fit_iClusterVB$cluster, sim_data$cluster_true)
#>    
#>      1  2  3  4
#>   2 60  0  0  0
#>   4  0 60  0  0
#>   6  0  0 60  0
#>   8  0  0  0 60
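
The estimated labels (2, 4, 6, 8) differ from the true labels (1 to 4) because cluster labels are arbitrary; what matters is that each estimated cluster maps onto exactly one true cluster. A label-invariant way to quantify this is the adjusted Rand index. The sketch below computes it in base R on toy vectors that mirror the cross-tabulation above; `adjusted_rand_index()` is a helper written for this example, not a function exported by iClusterVB.

```r
# Toy vectors mirroring the table above (not read from the package objects)
estimated <- rep(c(2, 4, 6, 8), each = 60)
truth     <- rep(1:4, each = 60)

# Optional: relabel the occupied clusters as 1, 2, 3, 4 for readability
relabelled <- as.integer(factor(estimated))

# Adjusted Rand index from the contingency table of two labelings.
# Assumes at least one labeling has more than one cluster; otherwise the
# denominator is zero.
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  n_pairs <- choose(sum(tab), 2)
  agree   <- sum(choose(tab, 2))          # pairs grouped together in both
  row_sum <- sum(choose(rowSums(tab), 2)) # pairs grouped together in x
  col_sum <- sum(choose(colSums(tab), 2)) # pairs grouped together in y
  expected <- row_sum * col_sum / n_pairs
  (agree - expected) / ((row_sum + col_sum) / 2 - expected)
}

ari <- adjusted_rand_index(estimated, truth)
ari
```

With the fitted model, the same helper could be called as `adjusted_rand_index(fit_iClusterVB$cluster, sim_data$cluster_true)`; a value of 1 indicates perfect agreement up to relabeling.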

Summary of the Model

# We can obtain a summary using summary()
summary(fit_iClusterVB)
#> Total number of individuals:
#> [1] 240
#> 
#> User-inputted maximum number of clusters: 8
#> Number of clusters determined by algorithm: 4
#> 
#> Cluster Membership:
#>  2  4  6  8 
#> 60 60 60 60 
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 1 - gaussian
#> [1] "58 out of a total of 500"
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 2 - gaussian
#> [1] "59 out of a total of 500"
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 3 - multinomial
#> [1] "62 out of a total of 500"
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 4 - poisson
#> [1] "69 out of a total of 500"
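
The counts reported in the summary come from thresholding each feature's posterior inclusion probability at 0.5. As a rough illustration of that thresholding step only, the sketch below applies it to a made-up vector of probabilities. `rho_view` is hypothetical, and how to extract the real probabilities from `fit_iClusterVB` is not shown in this README, so no accessor is assumed here.

```r
set.seed(42) # arbitrary seed for this sketch

# Hypothetical inclusion probabilities for a 500-feature view:
# 50 features with strong support, 450 with weak support
rho_view <- c(runif(50, min = 0.9, max = 1), runif(450, min = 0, max = 0.1))

# Features whose inclusion probability exceeds the 0.5 threshold
selected <- which(rho_view > 0.5)

# Report in the same style as the summary output
sprintf("%d out of a total of %d", length(selected), length(rho_view))
```

Because the toy probabilities are well separated from 0.5, exactly the first 50 features are selected here; with real output, features near the threshold deserve a closer look (for example with `piplot()`).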

Generic Plots

plot(fit_iClusterVB)

Probability of Inclusion Plots

# The `piplot` function can be used to visualize the probability of inclusion

piplot(fit_iClusterVB)

Heat maps to visualize the clusters

# The `chmap` function can be used to display heat maps for each data view

list_of_plots <- chmap(
  fit_iClusterVB,
  rho = 0,
  cols = c("green", "blue", "purple", "red"),
  scale = "none"
)
# The `grid.arrange` function from gridExtra can be used to display all the 
# plots together
gridExtra::grid.arrange(grobs = list_of_plots, ncol = 2, nrow = 2)