\documentclass{article}
\usepackage{url}
%\VignetteIndexEntry{Model based clustering via mode identification}
%\VignetteDepends{mvtnorm, zoo, parallel, class}
\usepackage{Sweave}
\addtolength\topmargin{-1in}
\addtolength\textheight{1.5in}
\addtolength\textwidth{1.5in}
\addtolength\oddsidemargin{-.75in}
\addtolength\evensidemargin{-.75in}
\addtolength\parskip{1ex}
\author{Surajit Ray and Yansong Cheng}
\title{Hierarchical Modal Association Clustering}
% \setkeys{Gin}{width=.8\textwidth,height=4in}
\renewcommand{\theenumi}{\roman{enumi}}
\renewcommand{\labelenumi}{(\theenumi)}
\begin{document}
\maketitle

\section{Introduction to Modal Clustering}

Model-based clustering has been widely used for clustering heterogeneous populations, but standard model-based clustering is often limited by the shape of the component densities. In this document we describe a mode-associated clustering approach (Li et al., 2007) that applies new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is solved efficiently by an EM-style algorithm, the Modal EM (MEM). This clustering method shares the major advantages of mixture-model-based clustering. Moreover, it requires no model fitting and ensures that every cluster corresponds to a bump of the density. A hierarchical clustering algorithm is developed by applying MEM recursively to kernel density estimators with increasing bandwidths. The kernel density estimate becomes smoother as the smoothing parameter increases, and more points tend to converge to the same mode. This suggests a hierarchical clustering approach based on a series of increasing smoothing parameters, namely Hierarchical Modal Association Clustering (HMAC).

Historically, cluster analysis techniques have been approached either from a fully parametric view (the model-based approach), e.g.\ mixture-model-based clustering, or from a distribution-free approach, e.g.\ linkage-based clustering. While the parametric paradigm provides an inferential framework and accounts for sampling variability, it often lacks the flexibility to accommodate complex clusters and is not scalable to high-dimensional data. On the other hand, distribution-free approaches are often fast and capable of uncovering complex clusters by making use of different distance measures, but the inferential framework is distinctly missing. Modal clustering combines the strengths of these two seemingly different approaches into a new paradigm of nonparametric clustering that accommodates flexible subpopulation structures, is computationally fast and scalable to high dimensions, and retains the much desired natural inferential framework of parametric mixtures.

\subsection{Parallel implementation of Modal Clustering}

In this package we also develop a methodology for parallelizing modal clustering. We simply divide the data (by samples) into as many partitions as the number of parallel nodes specified by the user; ideally one would use all available processors of one's machine or nodes of a cluster. The first step constructs modal clusters in parallel for each of these partitions; the resulting modes are then merged at higher bandwidths to produce the full hierarchical clustering.
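Both the serial and the parallel variants are driven by the MEM ascent described above. To fix ideas, the following is a minimal illustrative sketch of a single hill climb on a spherical Gaussian kernel density estimate; it is not the package's internal implementation, and the function name \texttt{mem.ascent} and the scalar bandwidth \texttt{h} are our own notation.

\begin{verbatim}
## Illustrative sketch: one MEM hill climb on a Gaussian kernel
## density estimate with scalar bandwidth h (not the internal
## implementation of phmac).
mem.ascent <- function(x0, X, h, tol = 1e-6, maxit = 200) {
  X <- as.matrix(X)
  for (it in 1:maxit) {
    ## E-step: posterior weight of each kernel component at x0
    w <- exp(-rowSums(sweep(X, 2, x0)^2) / (2 * h^2))
    w <- w / sum(w)
    ## M-step: move to the weighted mean, a point of higher density
    x1 <- colSums(w * X)
    if (sum((x1 - x0)^2) < tol^2) break
    x0 <- x1
  }
  x0  # approximate local mode of the kernel density estimate
}
\end{verbatim}

Running such an ascent from every sample point and grouping the points whose ascents end at the same mode gives one level of the clustering; HMAC repeats this over a sequence of increasing bandwidths.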
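The partition-and-merge scheme itself can be pictured as below. This is only a schematic: the helper \texttt{cluster.modes}, standing for the first-stage modal clustering of one partition, is hypothetical, and \texttt{phmac} performs all of this internally.

\begin{verbatim}
## Schematic of the two-stage pHMAC idea for a data matrix X and
## npart partitions (for illustration only).
parts  <- split(seq_len(nrow(X)), rep(1:npart, length.out = nrow(X)))
stage1 <- parallel::mclapply(parts,
            function(idx) cluster.modes(X[idx, ]),  # hypothetical helper
            mc.cores = npart)
## The modes found within each partition are then pooled and merged
## by continuing HMAC on them at the higher bandwidths.
\end{verbatim}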
Note that even when only one processor is available, partitioning a large dataset into several pieces gives the advantage of ``divide and conquer'' by significantly reducing the computational burden. The parallel hierarchical modal association clustering (pHMAC) algorithm is implemented in the main function \texttt{phmac} of this package.

\section{Use of functions}

In this section we demonstrate the use of the various functions in this package by applying them to a few example datasets. First, consider a dataset that resembles the broken parts of a disc, shown in the following figure. We simulated this dataset with non-Gaussian cluster structures; in fact, each cluster is embedded in a lower-dimensional manifold (a one-dimensional curve).

\begin{center}
<<fig=TRUE>>=
library(Modalclust)
data(disc2d)
plot(disc2d)
@
\end{center}

One can use modal clustering to cluster this dataset in the following three ways:
\begin{enumerate}
\item standard modal clustering, i.e.\ without partitioning;
\item partitioning the dataset into \textit{npart} partitions but running them sequentially on a single processor;
\item partitioning the dataset into \textit{npart} partitions and running them in parallel on \textit{npart} processors. This requires the package \textit{parallel}.
\end{enumerate}

To cluster \texttt{disc2d} in the non-partitioned mode (provided by the default parameters of \texttt{phmac}):
<<>>=
disc2d.hmac=phmac(disc2d,npart=1,parallel=F)
@

Alternatively, one can use the partitioned mode to cluster the same data. To partition the dataset into 2 partitions and run them sequentially on a single processor, use
\begin{verbatim}
> disc2d.part=phmac(disc2d,npart=2,parallel=F)
\end{verbatim}

\noindent To partition the dataset into 2 partitions and run them in parallel on 2 processors, use
<<>>=
disc2d.phmac=phmac(disc2d,npart=2,parallel=T)
#npart can be made equal to the number of processors the computer has.
@

Note that the outputs of the parallel and non-parallel versions will differ slightly, as the random partitions of the data may produce overlapping clusters in the first stage. We are in the process of developing methods to overcome this, but partitioning the data provides a huge computational advantage and may be the most practical option when clustering a large number of observations.

The function \texttt{phmac} returns an object of class ``hmac''. A summary of such ``hmac'' objects can be obtained by invoking the command
<<>>=
summary(disc2d.phmac)
@
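Because the partitioned and non-partitioned runs can differ slightly, it is often worth checking how closely their memberships agree. As shown in the next subsection, \texttt{hard.hmac} with \texttt{plot=F} returns the cluster membership of every observation, so a quick comparison is a cross-tabulation:

\begin{verbatim}
> m1 <- hard.hmac(disc2d.hmac, n.cluster=2, plot=F)
> m2 <- hard.hmac(disc2d.phmac, n.cluster=2, plot=F)
> table(m1, m2)
\end{verbatim}

Since cluster labels are arbitrary, agreement shows up as a permutation structure in this table rather than as a diagonal.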
\subsection{Plotting methods and extracting clusters at a specific level}

There are several plotting functions one can use to visualize the output of the \texttt{phmac} function, i.e.\ an ``hmac'' object. To visualize the full hierarchical tree, use the following command.

\setkeys{Gin}{width=\textwidth,height=.5\textwidth}
\begin{center}
<<fig=TRUE>>=
plot.hmac(disc2d.hmac)
@
\end{center}

A clearer picture of the tree structure, along with the original memberships, is visible in the following example of 80 data points with two big clusters and four subclusters. The hierarchical layers are visible in the following graph. Note that one pair of subclusters is not separable, as their modes merged.

<<fig=TRUE>>=
set.seed(20)
library(mvtnorm)
mix4=data.frame(rbind(rmvnorm(20,rep(0,4)), rmvnorm(20,rep(2,4)),
                      rmvnorm(20,rep(10,4)), rmvnorm(20,rep(13,4))))
mix4.hmac=phmac(mix4,npart=1)
plot(mix4.hmac)
@

For validation, one can also compare the clustering tree with user-supplied group labels, using the \texttt{userclus} option:

<<fig=TRUE>>=
plot(mix4.hmac,userclus=rep(c(1,2,3,4),each=20))
@

Moreover, one can specify the number of clusters, or the level from which one wishes to start the tree.

<<fig=TRUE>>=
plot(mix4.hmac,n.cluster=2)
@

Alternatively, one can use \texttt{hard.hmac} or \texttt{soft.hmac} to visualize or extract the hard or soft clustering at a specified level or for a user-specified number of clusters. To obtain the hard or soft cluster plot, either the number of clusters or the clustering level must be specified.

<<fig=TRUE>>=
par(mfrow=c(1,2))
hard.hmac(disc2d.hmac,n.cluster=2)
soft.hmac(disc2d.hmac,level=3)
par(mfrow=c(1,1))
@

With the \texttt{plot=F} option, \texttt{hard.hmac} returns the cluster membership of each observation, while \texttt{soft.hmac} returns the posterior probabilities of each point along with the boundary points.

<<>>=
member=hard.hmac(disc2d.hmac,n.cluster=2,plot=F)
member
member.soft=soft.hmac(disc2d.hmac,level=3,plot=F)
head(member.soft$post.prob)
@

\setkeys{Gin}{width=.6\textwidth,height=.6\textwidth}

For convenience, users can also identify a whole cluster of points, either by clicking on a point in the graph or by specifying a point as input to the \texttt{choose.cluster} function. The clicked or specified point, along with the other points in the same cluster, is highlighted.

\begin{center}
<<fig=TRUE>>=
choose.cluster(disc2d.hmac,n.cluster=2,x=c(0,0))
@
\end{center}

In addition, one can visualize the contour plot of the data with a specified number of clusters.

\begin{center}
<<fig=TRUE>>=
contour.hmac(disc2d.hmac,n.cluster=2,col=gray(0.7))
@
\end{center}

\section{Further Examples}

In the previous section we demonstrated the use of the functions in this package on a two-dimensional dataset. Our method is restricted neither to two dimensions nor to specific cluster shapes. Here we provide more examples, some on real data and some on simulated data, to demonstrate other features of the functions available in this package.

First we focus on two-dimensional data obtained from Illumina's high-throughput \textbf{GOLDEN GATE} genotyping technology [\url{http://www.illumina.com/technology/goldengate_genotyping_assay.ilmn}]. The data values are the actual intensity measurements made by the machine after normalization (background subtraction, etc.). This dataset is used by Illumina for making genotype calls. The data around the X- and Y-axes represent the two homozygous genotypes (e.g.\ AA and TT), while the cluster along the 45-degree line represents the heterozygous genotype (e.g.\ AT). Due to noisy reads, the data points often lie between the axes, and cluster detection is used to make automatic genotype calls. The scatter plot of the original data is shown below. The non-Gaussian density shape and the abundance of points near the axes play an important role in clustering these points, and modal clustering captures them very well.

\begin{center}
<<fig=TRUE>>=
data(cta20)
data(cta20.hmac)
plot(cta20,pch=16)
@
\end{center}
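The object \texttt{cta20.hmac} loaded above, and the \texttt{logcta20.hmac} fit on log-transformed intensities used below, are precomputed fits shipped with the package. They could be reproduced along the following lines; these calls are a sketch, and in particular the unit offset inside the log is our assumption rather than necessarily the transformation used for the shipped object:

\begin{verbatim}
> cta20.hmac    <- phmac(cta20, npart=1, parallel=F)
> logcta20.hmac <- phmac(log(cta20 + 1), npart=1, parallel=F)
\end{verbatim}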
One can always apply a log transformation, or more general transformations such as Box-Cox or logicle, but these are not designed to make clustered data normal. In fact, even after a log transformation the data are skewed, and hence standard clustering algorithms fail to capture the true clustering, whereas modal clustering captures the three clusters in the transformed scale as well. Here we provide the clustering of the data in the transformed and untransformed scales.

\setkeys{Gin}{width=\textwidth,height=.5\textwidth}
%\setkeys{Gin}{width=\textwidth,height=4in}
<<fig=TRUE>>=
data(logcta20.hmac)
par(mfrow=c(1,2))
hard.hmac(cta20.hmac,n.cluster=3)
hard.hmac(logcta20.hmac,n.cluster=3)
par(mfrow=c(1,1))
@

Now we demonstrate the output of the algorithm on a one-dimensional example.

<<fig=TRUE>>=
data(oned.hmac)
data(oned)
par(mfrow=c(1,2))
hist(oned,n=20,col="lavender")
hard.hmac(oned.hmac,level=3)
par(mfrow=c(1,1))
@

Finally, we demonstrate the output of our algorithm on data with more than two dimensions. The \emph{iris} data shipped with R has five columns; the last column is a categorical variable, so we use the first four columns to extract the clusters in the data. The \texttt{plot} function provides trees similar to those in the two-dimensional examples, while \texttt{hard.hmac} now produces a pairs plot across the dimensions.

<<fig=TRUE>>=
iris.hmac=phmac(iris[,-5],npart=1,parallel=F)
plot(iris.hmac,sep=.02)
@

\setkeys{Gin}{width=\textwidth,height=\textwidth}
<<fig=TRUE>>=
hard.hmac(iris.hmac,n.cluster=2)
@

\section*{Acknowledgement}

We thank Yeojin Chung and Jia Li for providing some of the initial code for the \textit{hmac} function used in this package.

\section*{References}
\begin{enumerate}
\item Li, J., Ray, S., and Lindsay, B. G., ``A nonparametric statistical approach to clustering via mode identification,'' \emph{Journal of Machine Learning Research}, 8(8):1687--1723, 2007.
\item Lindsay, B. G., Markatou, M., Ray, S., Yang, K., and Chen, S.-C., ``Quadratic distances on probabilities: A unified foundation,'' \emph{The Annals of Statistics}, 36(2):983--1006, 2008.
\end{enumerate}

\end{document}