%\VignetteIndexEntry{OOMPA ClassComparison}
%\VignetteKeywords{OOMPA,Class Comparison,Differential Expression,Multiple Testing}
%\VignetteDepends{oompaBase,ClassComparison}
%\VignettePackage{ClassComparison}
\documentclass{article}
\usepackage{hyperref}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\title{Class Comparison with OOMPA}
\author{Kevin R. Coombes}

\begin{document}

\maketitle
\tableofcontents

\section{Introduction}

OOMPA is a suite of object-oriented tools for processing and analyzing large
biological data sets, such as those arising from mRNA expression microarrays
or mass spectrometry proteomics. The \Rpackage{ClassComparison} package in
OOMPA provides tools to work on the ``class comparison'' problem. Class
comparison is one of the three primary types of microarray applications
described by Richard Simon and colleagues. The point of these problems is to
identify genes that behave differently in known classes; in other words, a
typical class comparison problem is to find the genes that are differentially
expressed between two types of samples.

\section{Getting Started}

No one will be surprised to learn that we start by loading the package into
the current R session:
<<>>=
library(ClassComparison)
@

The main functions and classes in the \Rpackage{ClassComparison} package work
either with data matrices or with \Robject{ExpressionSet} objects from the
BioConductor \Rpackage{Biobase} package. For the first set of examples in
this vignette, we will use simulated data representing two groups of samples:
<<>>=
set.seed(6781252) # for reproducibility
nGenes <- 5000
nSamp <- 15
nDif <- 150
delta <- 1
fake.class <- factor(rep(c('A', 'B'), each=nSamp))
fake.data <- matrix(rnorm(nGenes*nSamp*2), nrow=nGenes, ncol=2*nSamp)
fake.data[1:nDif, 1:nSamp] <- fake.data[1:nDif, 1:nSamp] + delta
fake.data[(nDif+1):(2*nDif), 1:nSamp] <- fake.data[(nDif+1):(2*nDif), 1:nSamp] - delta
@

\section{Gene-by-gene t-tests}

The simplest way to find differentially expressed genes is to perform a
two-sample t-test on each gene. The \Rclass{MultiTtest} class handles this
operation, with a summary that carefully ensures that you know which class is
associated with a positive t-statistic.
<<>>=
mtt <- MultiTtest(fake.data, fake.class)
summary(mtt)
@

\begin{figure}
<<fig=TRUE>>=
hist(mtt, breaks=101)
@
\caption{Histogram of the gene-by-gene two-sample t-statistics.}
\label{fig:hist}
\end{figure}

\section{Beta-uniform mixture models to account for multiple testing}

As everyone now knows, an inherent difficulty with performing a separate test
for each gene is that the $p$-values must be adjusted to account for multiple
testing. A simple approach models the set of $p$-values using a beta-uniform
mixture (BUM). We can perform this analysis with a single command:
<<>>=
bum <- Bum(mtt@p.values)
summary(bum)
@

The default output of the \Rfunction{summary} command is not very
enlightening, but we can get a graphical overview of the distribution. The
region below the horizontal blue line in Figure~\ref{fig:histbum} represents
the uniform component of the mixture (i.e., genes that are not differentially
expressed); the region between the blue line and the green curve represents
the beta component (i.e., genes that are differentially expressed).
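Although we do not need the details for what follows, it may help to see the
form of the model being fit. The BUM model of Pounds and Morris represents
the $p$-value density as the two-component mixture
\[
f(p) = \lambda + (1 - \lambda)\, a p^{a-1}, \qquad 0 < a < 1,
\]
where the uniform component (with weight $\lambda$) accounts for the genes
that are not differentially expressed, and the $\mathrm{Beta}(a, 1)$
component captures the excess of small $p$-values contributed by the genes
that are.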
If we set a threshold for significance using some cutoff on the $p$-value
(such as the one indicated by the vertical purple line in
Figure~\ref{fig:histbum}), then we can divide the area into four regions
representing true positives, false positives, true negatives, and false
negatives. These areas can then be used to estimate the false discovery rate
(FDR) as a function of the threshold (Figure~\ref{fig:imagebum}).

\begin{figure}
<<fig=TRUE>>=
hist(bum)
abline(v=0.05, col="purple", lwd=2)
@
\caption{Histogram of the $p$-values, showing the fitted BUM model; the
  vertical purple line marks a possible significance cutoff at $p = 0.05$.}
\label{fig:histbum}
\end{figure}

\begin{figure}
<<fig=TRUE>>=
image(bum)
@
\caption{Results of the BUM analysis: estimating the FDR as a function of
  the significance threshold.}
\label{fig:imagebum}
\end{figure}

The usual application of this idea is to choose a threshold that achieves a
desired level of FDR. For example, selecting genes with a $p$-value less than
<<>>=
cutoffSignificant(bum, alpha=0.10, by="FDR")
@
should keep the FDR below $10\%$. The number of such genes is easily obtained
with the command:
<<>>=
countSignificant(bum, alpha=0.10, by="FDR")
@
You can also get a logical vector that selects the significant genes:
<<>>=
selected <- selectSignificant(bum, alpha=0.10, by="FDR")
@

In our simulated example, the truly differentially expressed genes are
exactly the first $300$ genes. We can use this information to find out how
close we are to the truth; the achieved FDR in this example is quite close to
the target value of $10\%$.
<<>>=
truth <- rep(FALSE, nGenes)
truth[1:(2*nDif)] <- TRUE
sum(selected & truth)   # number of true positives
mean(!truth[selected])  # achieved FDR among the selected genes
@

\section{Wilcoxon rank-sum tests and empirical Bayes}

In many applications of microarrays, it is unclear how the data should be
transformed to achieve the approximate normality needed to justify a t-test.
It may be simpler to sidestep the transformation problem and use a
nonparametric method, such as the Wilcoxon rank-sum test, that depends only
on the ranks of the samples in the expression of each gene.
<<>>=
mw <- MultiWilcoxonTest(fake.data, fake.class)
summary(mw)
@

A histogram of the Wilcoxon statistics (Figure~\ref{fig:wil}) shows heavier
tails than expected by chance, suggesting that we ought to be able to pick
out some genes that are significantly different. To do this, we use an
empirical Bayes method originally suggested by Efron and Tibshirani. The idea
is that we can decompose the observed Wilcoxon statistics into a mixture of
those arising from the null distribution (which is Wilcoxon with parameters
determined by the number of samples in each group) and some other component
representing the differentially expressed genes. In that case, we can write
the observed distribution $f(x)$ in the form
\[
f(x) = \pi f_0(x) + (1 - \pi) f_1(x),
\]
where $f_0(x)$ is the known Wilcoxon distribution and $f_1(x)$ is unknown.
Since we can estimate $f(x)$ from the observed data, we can simply solve for
the unknown distribution $f_1(x)$, provided we know the mixing parameter
$\pi$, which represents the prior probability that a gene is not
differentially expressed.

The ``empirical'' part of this empirical Bayes method comes down to selecting
the prior $\pi$ after looking at the data. For example, if we start with
$\pi=1$, the posterior probability of being differentially expressed, viewed
as a function of the observed statistic, ends up taking on negative values
(Figure~\ref{fig:wil2}), which is rather unpleasant.
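To see why negative values can arise, apply Bayes' rule to the mixture
decomposition above: the posterior probability that a gene with observed
statistic $x$ is differentially expressed is
\[
P(\mbox{different} \mid x)
  = \frac{(1 - \pi) f_1(x)}{f(x)}
  = 1 - \frac{\pi f_0(x)}{f(x)}.
\]
If the prior $\pi$ is set too high, the product $\pi f_0(x)$ can exceed the
estimated density $f(x)$ for some values of $x$, driving this quantity below
zero.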
\begin{figure}
<<fig=TRUE>>=
hist(mw)
@
\caption{Histogram of the observed gene-by-gene Wilcoxon statistics.}
\label{fig:wil}
\end{figure}

\begin{figure}
<<fig=TRUE>>=
plot(mw)
abline(h=0)
@
\caption{Plot of the posterior probability of being differentially expressed,
  assuming \textit{a priori} that no genes are different.}
\label{fig:wil2}
\end{figure}

\begin{figure}
<<fig=TRUE>>=
plot(mw, prior=0.92, signif=0.9)
abline(h=0)
@
\caption{Plot of the posterior probability of being differentially expressed,
  assuming \textit{a priori} that $92\%$ of the genes are not different.}
\label{fig:wil3}
\end{figure}

By trial and error, we can find a value of $\pi$ that ensures that the
posterior probabilities remain positive (Figure~\ref{fig:wil3}). In this
case, something close to $0.94$ works reasonably well. We can then place a
threshold on the posterior probabilities to obtain a significance cutoff on
the Wilcoxon statistics.
<<>>=
cutoffSignificant(mw, prior=0.94, signif=0.8)
countSignificant(mw, prior=0.94, signif=0.8)
wilsel <- selectSignificant(mw, prior=0.94, signif=0.8)
sum(selected & wilsel)  # overlap with the BUM-selected genes
sum(truth & wilsel)     # number of true positives
@

\section{Permutation-based methods}

The \Rclass{Bum} model, as applied to the $p$-values from the gene-by-gene
t-tests of the \Rclass{MultiTtest} class or from other tests, assumes that
the genes are independent. This assumption is clearly false, and so various
researchers have proposed permutation-based methods that retain the
correlation structure between genes when estimating the distribution of the
$p$-values.

\subsection{Dudoit's method based on Westfall and Young}

Sandrine Dudoit and colleagues introduced the idea of using the
Westfall--Young step-down procedure to control the family-wise error rate
(FWER) in a microarray study. In our example, we can perform this analysis as
follows:
<<>>=
dudoit <- Dudoit(fake.data, fake.class, nPerm=100)
summary(dudoit)
@

\begin{figure}
<<fig=TRUE>>=
plot(dudoit)
@
\caption{Plot of the unadjusted (blue) and adjusted (black) $p$-values.}
\label{fig:dud}
\end{figure}

To get good results, we probably need more than $100$ permutations, but this
implementation (written entirely in R) is rather slow. The default plot
routine (Figure~\ref{fig:dud}) shows both the unadjusted and the adjusted
$p$-values.

In most cases, controlling the FWER is viewed as overly conservative, since
it tries to ensure that there are no false positive findings instead of
trying to estimate the number or fraction of false positives. In our example,
the Dudoit correction with FWER $= 10\%$ finds very few differentially
expressed genes:
<<>>=
countSignificant(dudoit, 0.10)
@

\section{Significance Analysis of Microarrays}

Significance Analysis of Microarrays (SAM) is an alternative procedure that
is also based on permutations, but it tries to control the FDR instead of the
FWER.
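Before looking at the output, it may help to recall the statistic that SAM
ranks genes by. As described by Tusher, Tibshirani, and Chu, SAM uses a
regularized t-statistic of the form
\[
d_i = \frac{\bar{x}_{iA} - \bar{x}_{iB}}{s_i + s_0},
\]
where $s_i$ is the usual standard error of the difference for gene $i$ and
the small ``fudge factor'' $s_0$ prevents genes with tiny variances from
dominating the rankings. The permutations supply the null distribution
against which the observed statistics are compared; consult the
\Rfunction{Sam} help page for the details of this implementation.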
We can get the results by:
<<>>=
sam <- Sam(fake.data, fake.class)
summary(sam)
@

\begin{figure}
<<fig=TRUE>>=
plot(sam, tracks=seq(0.5, 2, by=0.5))
@
\caption{Quantile-quantile plot of the observed t-statistics against the
  t-statistics expected from the permutation-based null distribution.}
\label{fig:sam}
\end{figure}

Based on Figure~\ref{fig:sam}, we can reasonably take a cutoff of $1$ to
define significance, which yields the following results:
<<>>=
cutoff <- 1
countSignificant(sam, cutoff)
sum(selectSignificant(sam, cutoff) & truth)  # number of true positives
@

\section{Other class comparison approaches}

The package contains several other methods for finding genes that are
differentially expressed between known classes (two of them are illustrated
in the sketch after this list):
\begin{enumerate}
\item Total Number of Misclassifications (\Rfunction{TNoM}): This method was
  introduced by Yakhini and Ben-Dor and applied by Bittner and colleagues in
  2000. It has probably seen fewer applications than it deserves, which this
  implementation may help rectify.
\item The smooth (regularized) t-test (\Rfunction{SmoothTtest}): This method
  was introduced by Baggerly and Coombes in 2001, although a large number of
  authors have proposed similar ideas. The basic idea is that (even after log
  transformation) genes of similar intensity appear to have similar variance,
  so one can borrow strength across genes to get better estimates of the
  variability, even in small microarray studies.
\item Linear models (\Rfunction{MultiLinearModel}): These can be constructed
  using the usual R formula interface, generalizing the analysis to one-way
  designs with more than two classes or to factorial designs.
\item Gene-by-gene paired t-tests (\Rfunction{MultiTtestPaired}).
\item Gene-by-gene t-tests assuming unequal variances in the two groups
  (\Rfunction{MultiTtestUnequal}).
\end{enumerate}
A future version of this vignette may include more examples of their use; for
now, you can read the examples in the help pages.
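In the meantime, here is a minimal, unevaluated sketch of two of these
interfaces. It assumes, based on the pattern of the examples above rather
than on anything verified here, that the two-class constructors
\Rfunction{TNoM} and \Rfunction{MultiTtestUnequal} follow the same
\texttt{(data, classes)} calling convention as \Rfunction{MultiTtest};
consult their help pages for the definitive signatures.
<<eval=FALSE>>=
# Hedged sketch: assumes TNoM and MultiTtestUnequal accept the same
# (data, classes) arguments as MultiTtest; see help("TNoM") and
# help("MultiTtestUnequal") for the authoritative interfaces.
tn <- TNoM(fake.data, fake.class)   # total number of misclassifications
summary(tn)
mtu <- MultiTtestUnequal(fake.data, fake.class)  # unequal-variance t-tests
summary(mtu)
@

\end{document}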