---
title: "DCEM"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{DCEM}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Package Overview
Implements the Expectation Maximisation Algorithm for clustering the multivariate and univariate datasets. There are two versions of EM implemented-EM* (converge faster by avoiding revisiting the data) and EM. For more details on EM\*, see the 'References' section below.
The package has been tested with both real and simulated datasets. The package comes bundled with a dataset for demonstration (ionosphere_data.csv). More help about the package can be seen by typing `?DCEM` in the R console (after installing the package).
**Currently, data imputation is not supported and user has to handle the missing data before using the package.**
## Contact
### For Reporting Issues
[Issues](https://github.com/parichit/DCEM/issues)
### GitHub Repository Link
[Github Repository](https://github.com/parichit/DCEM)
## Installation Instructions
__Dependencies__ First, install all the required packages as follows:
install.packages(c("matrixcalc", "mvtnorm", "MASS", "Rcpp"))
### Installing from CRAN
Use install.packages() in the R console as follow:
```
install.packages("DCEM")
```
**_Installing from the Source Package_**
Download the source tar ball for DCEM from [Github](https://github.com/parichit/DCEM/releases) and install as follows:
```
R CMD install DCEM_2.0.5.tar.gz
```
## Quick Start
- For demonstration purpose, use the `dcem_test()` function from the R console. This function invokes the dcem_star_train() on the bundled `ionosphere_data`.
- The function ```dcem_test()``` returns a list containing the output i.e., posterior probabilities, meu, sigma, prior and cluster membership for data. The parameters can be accessed as follows where ```sample_out``` is the list containing the output:
### Call the dcem_test()
```
library("DCEM")
sample_out = dcem_test()
```
### Access the output
1. ```sample_out$prob```: A matrix of posterior-probabilities
1. ```sample_out$meu```: A matrix of cluster centers.
1. ````sample_out$sigma```
* For multivariate data: List of co-variance matrices for the Gaussian(s).
* For univariate data: Vector of standard deviation for the Gaussian(s))
1. ```sample_out$prior```: A vector of prior.
1. ```sample_out4membership```: A vector of cluster membership for data.
### Use EM* (improved EM algorithm) to cluster the Ionosphere data
DCEM comes bundeled with the Ionosphere data. The example below shows how to use EM* for clustering this data set.
```
# Make sure you have imported the package.
library("DCEM")
# Set the file path
data_file = file.path(trimws(getwd()), "data", "ionosphere_data.csv")
# Reading the input file into a dataframe.
ionosphere_data = read.csv2(
file = data_file,
sep = ",",
header = FALSE,
stringsAsFactors = FALSE
)
# Cleaning the data by removing the 35th and 2nd column as they contain the
# labels and 0's respectively.
ionosphere_data = trim_data("2, 35", ionosphere_data)
# Call the dcem_star_train() function on the cleaned data.
sample_out = dcem_star_train(ionosphere_data)
```
### Use EM-T (original EM algorithm) to cluster the Ionosphere data
The example below shows how to use EM-T for clustering the Ionosphere data set.
```
# Make sure you have imported the package.
library("DCEM")
# Set the file path
data_file = file.path(trimws(getwd()), "data", "ionosphere_data.csv")
# Reading the input file into a dataframe.
ionosphere_data = read.csv2(
file = data_file,
sep = ",",
header = FALSE,
stringsAsFactors = FALSE
)
# Cleaning the data by removing the 35th and 2nd column as they contain the
# labels and 0's respectively.
ionosphere_data = trim_data("2, 35", ionosphere_data)
# Call the dcem_star_train() function on the cleaned data.
sample_out = dcem_train(ionosphere_data)
```
### Parameters in the function call:
Both dcem_star_train() and dcem_train() calls share the same parameters except the argument 'threshold' which is only present in dcem_train(). This is because for EM*, threshold is empirically found to not affect the clustering results significantly. The function arguments are described below:
__* data (dataframe)__: Dataframe containing the user specified data.
__* threshold (decimal)__: Convergence threshold (if meu are within this
threshold then the algorithm stops and exit (default = 0.0001).
__* iteration_count (numeric)__: Number of iterations for which the algorithm should run, if the
convergence is not achieved then the algorithm stops and exit (default = 200).
__* num_clusters (numeric)__: The number of clusters (default = 2).
__* seed_meu (matrix)__: User specified set of meu to be used as initial centers (default = None).
__* seeding (string)__: The initialization scheme (choices = 'rand' or 'improved', default = rand).
## Initialization schemes
In case of iterative clustering algorithm like EM, choice of initial cluster centers can affect the rate of convergence in terms of execution time and number of iterations. Therefore, DCEM allows the users to choose from the following initialization schemes according to their requirement.
1. __Random initialization__: This is the (default) choice for both EM-T and EM* procedures. Cluster centers are selected by randomly sampling from the input data. It is fast, though it may not result in the best performance i.e., selected centers may result in the algorithm taking more iterations hence longer execution time. The following code illustrate how to use random initialization explicitly.
Set the seed and create a mixture of gaussians.
```
R> set.seed(49)
R> sample_uv_data = as.data.frame(c(rnorm(500, 5, 0.5), rnorm(1000, 20, 1),
+ rnorm(100, 31, 2)))
```
Randomly shuffle the data, set the seed and call the dcem\_train() function.
```
R> sample_uv_data = as.data.frame(sample_uv_data[sample(nrow(sample_uv_data)),])
R> set.seed(21)
R> sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3,
+ iteration_count = 100, threshold = 0.0001, seeding = "rand")
```
Note: The run with random initialization took 14 iterations to converge.
```
[1] "Specified threshold = 1e-04"
[1] "Specified iterations = 100"
[1] "Specified number of clusters = 3"
[1] "Using the improved Kmeans++ initialization scheme."
[1] "Convergence at iteration number: 14"
```
1. __Improved initialization__: This approach utilizes an advanced initialization procedure based on the k++ technique. It was originally proposed to improve the initialization of the K-means algorithm. When compared to the random initialization, this method has been empirically shown to improve both the speed and accuracy. It can be used as follows.
Use the same seed and call the dcem_train() function with seeding set to 'improved'.
```
R> set.seed(21)
R> sample_uv_out = dcem_train(sample_uv_data, num_clusters = 3,
+ iteration_count = 100, threshold = 0.0001, seeding = "improved")
```
Note: The run with improved initialization took 9 iterations to converge.
```
[1] "Specified threshold = 1e-04"
[1] "Specified iterations = 100"
[1] "Specified number of clusters = 3"
[1] "Using the improved Kmeans++ initialization scheme."
[1] "Convergence at iteration number: 9"
```