--- title: "Getting Started with the PCMBase R-package" author: "Venelin Mitov" date: '`r Sys.Date()`' output: rmarkdown::html_vignette: fig_caption: yes toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Getting started} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: REFERENCES.bib --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(ape) library(PCMBase) FLAGSuggestsAvailable <- PCMBase::RequireSuggestedPackages() options(rmarkdown.html_vignette.check_title = FALSE) ``` # Data The input data for the phylogenetic comparative methods covered in the PCMBase package consists of a phylogenetic tree of $N$ tips and a $k\times N$ matrix, $X$ of observed trait-values, where $k$ is the number of traits. The matrix $X$ can contain `NA`s corresponding to missing measurements and `NaN`'s corresponding to non-existing traits for some of the tips. # Models Given a number of traits $k$, a *model* is defined as a set of parameters with dimensionality possibly depending on $k$ and a rule stating how these parameters are used to calculate the model likelihood for an observed tree and data and/or to simulate data along a tree. Often, we use the term *model type* to denote a family of models sharing the same rule. ## The $\mathcal{G}_{LInv}$-family of models Currently, PCMBase supports Gaussian model types from the so called $\mathcal{G}_{LInv}$-family [@Mitov:2018fl]. Specifically, these are models representing branching stochastic processes and satisfying the following two conditions [@Mitov:2018fl]: 1. after a branching point on the tree the traits evolve independently in the two descending lineages, 2. the distribution of the trait $\vec{X}$, at time $t$ conditional on the value at time $s < t$ is Gaussian with the mean and variance satisfying 2.a The expectation depends linearly on the ancestral trait value, i.e. $\text{E}\big[{\vec{X}(t) \vert \vec{X}(s)}\big] = \vec{\omega}_{s,t} + \mathbf{\Phi}_{s,t} \vec{X}(s)$; 2.b The variance is invariant (does not depend on) with respect to the ancestral trait, i.e. $\text{V}\big[{\vec{X}(t) \vert \vec{X}(s)}\big] = \mathbf{V}_{s,t}$, where $\vec{\omega}$ and the matrices $\mathbf{\Phi}$, $\mathbf{V}$ may depend on $s$ and $t$ but do not depend on the previous trajectory of the trait $\vec{X}(\cdot)$. ## Example: Ornstein-Uhlenbeck model types As an example, let's consider a model representing a $k$-variate Ornstein-Uhlenbeck branching stochastic process. This is defined by the following stochastic differential equation: \begin{equation}\label{eq:mammals:OUprocess} d\vec{X}(t)=\mathbf{H}\big(\vec{\theta}-\vec{X}(t)\big)dt+\mathbf{\mathbf{\Sigma}}_{\chi} dW(t). \end{equation} \noindent In the above equation, $\vec{X}(t)$ is a $k$-dimensional real vector, $\mathbf{H}$ is a $k\times k$-dimensional eigen-decomposable real matrix, $\vec{\theta}$ is a $k$-dimensional real vector, $\mathbf{\Sigma}_{\chi}$ is a $k\times k$-dimensional real positive definite matrix and $W(t)$ denotes the $k$-dimensional standard Wiener process. Biologically, $\vec{X}(t)$ denotes the mean values of $k$ continuous traits in a species at a time $t$ from the root, the parameter $\mathbf{\Sigma}=\mathbf{\Sigma}_{\chi}\mathbf{\Sigma}_{\chi}^T$ defines the magnitude and shape of the momentary fluctuations in the mean vector due to genetic drift, the matrix $\mathbf{H}$ and the vector $\vec{\theta}$ specify the trajectory of the population mean through time. When $\mathbf{H}$ is the zero matrix, the process is equivalent to a Brownian motion process and the parameter $\vec{\theta}$ is irrelevant. When $\mathbf{H}$ has strictly positive eigenvalues, the population mean converges in the long term towards $\vec{\theta}$, although the trajectory of this convergence can be complex. So, the OU-model defines the following set of parameters: - $\vec{X}_{0}$ (coded `X0`) : a $k$-vector of initial values; - $\mathbf{H}$ (coded `H`) : a $k\times k$ matrix denoting the selection strength of the process; - $\vec{\theta}$ (coded `Theta`) : a $k$-vector of long-term optimal trait values; - $\mathbf{\Sigma}_{\chi}$ : (coded `Sigma_x`) : a $k\times k$ matrix denoting the an upper triangular factor or the stochastic drift variance-covariances; The rule defining how the parameters of the OU-model are used to calculate the model likelihood and to simulate data consists in the definition of the functions $\vec{\omega}$ and matrices $\mathbf{\Phi}$, $\mathbf{V}$ [@Mitov:2018fl]: \begin{equation}\label{eq:mammals:omegaPhiVOU} \begin{array}{l} \vec{\omega}_{s,t}=\bigg(\mathbf{I}-\text{Exp}\big(-(t-s)\mathbf{H}\big)\bigg)\vec{\theta} \\ \mathbf{\Phi}_{s,t}=\text{Exp}(-(t-s)\mathbf{H}) \\ \mathbf{V}_{s,t}=\int_{0}^{t-s}\text{Exp}(-v\mathbf{H})(\mathbf{\Sigma}_{\chi}\mathbf{\Sigma}_{\chi}^T)\text{Exp}(-v\mathbf{H}^T)dv \end{array} \end{equation} Together, $\vec{X}_{0}$, $\vec{\omega}$, $\mathbf{\Phi}$, $\mathbf{V}$ and the tree define a $kN$-variate Gaussian distribution for the vector of trait values at the tips. This is the defining property of all Gaussian phylogenetic models. Hence, calculating the model likelihood is equivalent to calculating the density of this Gaussian distribution at observed trait values, and simulating data under the model is equivalent to drawing a random sample from this distribution. The functions `PCMMean()` and `PCMVar()` allow to calculate the mean $kN$-vector and the $kN\times kN$ variance covariance matrix of this distribution. This can be useful, in particular, to compare two models by calculating a distance metric such as the Mahalanobis distance, or the Bhattacharyya distance. However, for big $k$ and/or $N$, it is inefficient to use these functions in combination with a general purpose multivariate normal implementation (e.g. the `mvtnorm::dmvnorm` and `mvtnorm::rmvnorm`), to calculate the likelihood or simulate data assuming an OU model. The main purpose of PCMBase package is to provide a generic and computationally efficient way to perform these two operations. ## Groups of model types It is convenient to group model types into smaller subsets with named elements. This allows to use simple names, such as letters, as aliases for otherwise very long model type names. For example, the PCMBase package defines a subset of six so called "default model types", which are commonly used in macroevolutionary studies. These are model types based on parameterizations of the $k$-variate Ornstein-Uhlenbeck (OU) process. All of these six models restrict $\mathbf{H}$ to have non-negative eigenvalues - a negative eigenvalue of $\mathbf{H}$ transforms the process into repulsion with respect to $\vec{\theta}$, which, while biologically plausible, is not identifiable in a ultrametric tree. The six default models are defined as follows: - $BM_{A}$ ($\mathbf{H}=0$, diagonal $\mathbf{\Sigma}$): BM, uncorrelated traits; - $BM_{B}$ ($\mathbf{H}=0$, symmetric $\mathbf{\Sigma}$): BM, correlated traits; - $OU_{C}$ (diagonal $\mathbf{H}$, diagonal $\mathbf{\Sigma}$): OU, uncorrelated traits; - $OU_{D}$ (diagonal $\mathbf{H}$, symmetric $\mathbf{\Sigma}$): OU, correlated traits, but simple (diagonal) selection strength matrix; - $OU_{E}$ (symmetric $\mathbf{H}$, symmetric $\mathbf{\Sigma}$): An OU with non-diagonal symmetric $\mathbf{H}$ and non-diagonal symmetric $\mathbf{\Sigma}$; - $OU_{F}$ (asymmetric $\mathbf{H}$, symmetric $\mathbf{\Sigma}$): An OU with non-diagonal asymmetric $\mathbf{H}$ and non-diagonal symmetric $\mathbf{\Sigma}$; Calling the function `PCMDefaultModelTypes()` returns a named vector of the technical class-names for these six models: ```{r} # scroll to the right in the following listing to see the full model type names # and their corresponding alias: PCMDefaultModelTypes() ``` The PCMBase package comes with numerous other predefined model types. At present, all these are members of the $\mathcal{G}_{LInv}$-family. A list of these model types is returned from calling the function `PCMModels()`. For each model type, it is possible to check how the conditional distribution of $\vec{X}$ at the end of a time interval of length $t$ is defined from an ancestral value $X_{0}$. In particular, for $\mathcal{G}_{LInv}$ models, this is the definition of the functions $\vec{\omega}$, $\mathbf{\Phi}$ and $\mathbf{V}$. For example, ```{r} PCMFindMethod(PCMDefaultModelTypes()["A"], "PCMCond") # The complex maths is implemented in the function PCMCondVOU. You can see its # R-code by typing : # PCMBase::PCMCondVOU ``` ## Creating PCM objects In the computer memory, models are represented by S3 objects, i.e. ordinary R-lists with a class attribute. The base S3 class of all models is called `"PCM"`, which is inherited by more specific model-classes. Let us create a BM PCM for two traits: ```{r} modelBM <- PCM(model = "BM", k = 2) ``` Printing the model object shows a short verbal description, the S3-class, the number of traits, k, the number of numerical parameters of the model, p, the model regimes and the current values of the parameters for each regime (more on regimes in the next sub-section): ```{r} modelBM ``` One may wonder why in the above description, p = 8 instead of 10 (see also `?PCMParamCount`). The reason is that both, the matrix Sigma and the matrix Sigmae, are symmetric matrices and their matching off-diagonal elements are counted only one time. ## Model regimes Model *regimes* are different models associated with different parts of the phylogenetic tree. This is a powerful concept allowing to model different evolutionary modes on different lineages on the tree. Let us create a 2-trait BM model with two regimes called a and b: ```{r} modelBM.ab <- PCM("BM", k = 2, regimes = c("a", "b")) modelBM.ab ``` Now, we can set some different values for the parameters of the model we've just created. First, let us specify an initial value vector different from the default 0-vector: ```{r} modelBM.ab$X0[] <- c(5, 2) ``` X0 is defined as a parameter with S3 class `class(modelBM.ab$X0)`. This specifies that `X0` is global vector parameter shared by all model regimes. This is also the reason, why the number of parameters is not the double of the number of parameters in the first model: ```{r} PCMParamCount(modelBM) PCMParamCount(modelBM.ab) ``` The other parameters, Sigma_x and Sigmae_x are local for each regime: ```{r} # in regime 'a' the traits evolve according to two independent BM processes (starting from the global vecto X0). modelBM.ab$Sigma_x[,, "a"] <- rbind(c(1.6, 0), c(0, 2.4)) modelBM.ab$Sigmae_x[,, "a"] <- rbind(c(.1, 0), c(0, .4)) # in regime 'b' there is a correlation between the traits modelBM.ab$Sigma_x[,, "b"] <- rbind(c(1.6, .8), c(.8, 2.4)) modelBM.ab$Sigmae_x[,, "b"] <- rbind(c(.1, 0), c(0, .4)) ``` The above way of setting values for model parameters, while human readable, is not handy during model fitting procedures, such as likelihood maximization. Thus, there is another way to set (or get) the model parameter values from a numerical vector: ```{r} param <- double(PCMParamCount(modelBM.ab)) # load the current model parameters into param PCMParamLoadOrStore(modelBM.ab, param, offset=0, load=FALSE) print(param) # modify slightly the model parameters param2 <- jitter(param) print(param2) # set the new parameter vector PCMParamLoadOrStore(modelBM.ab, param2, offset = 0, load=TRUE) print(modelBM.ab) ``` ## Printing models in the form of a table It can be handy to print the parameters of a model in the form of a table with rows corresponding to the different regimes. For that purpose we use the `PCMTable()` function, which generates a `PCMTable` object for a given `PCM` object. The `PCMTable` S3 class inherits from `data.table` and implements a print method that allows for beautiful formatting of matrix and vector parameters. Here is an example: ```{r, results='asis'} options(digits = 2) print(PCMTable(modelBM.ab), xtable = TRUE, type="html") ``` Check the help-page for the `PCMTable`-function for more details. ## Mixed Gaussian models One of the features of the package is the possibility to specify models in which different types of processes are associated with different regimes. We call these "mixed Gaussian" models. To create a mixed Gaussian model, we use the constructor function `MixedGaussian`: ```{r} model.OU.BM <- MixedGaussian( k = 3, modelTypes = c( BM = paste0("BM", "__Omitted_X0", "__UpperTriangularWithDiagonal_WithNonNegativeDiagonal_Sigma_x", "__Omitted_Sigmae_x"), OU = paste0("OU", "__Omitted_X0", "__H", "__Theta", "__UpperTriangularWithDiagonal_WithNonNegativeDiagonal_Sigma_x", "__Omitted_Sigmae_x")), mapping = c(a = 2, b = 1), Sigmae_x = structure( 0, class = c("MatrixParameter", "_Omitted"), description = "upper triangular factor of the non-phylogenetic variance-covariance")) ``` In the above, snippet, notice the long names for the BM and the OU model types. These are the model types that can be mapped to regimes. It is important to notice that both of these model types omit the parameter `X0`, which is a global parameter shared by all regimes. Similarly the two model types omit the parameter `Sigmae_x`. However, in this example, we specify that the whole mixed Gaussian model does not have a `Sigmae_x` parameter, i.e. the model assumes that the trait variation is completely explainable by the mixed Gaussian phylogenetic process and there is no non-heritable component. For more information on model parametrizations, see the [PCMBase parameterizations](https://venelin.github.io/PCMBase/articles/PCMParam.html) guide. For further information on mixed Gaussian models, see the `?MixedGaussian` help page. We can set the parameters of the model manually as follows: ```{r} model.OU.BM$X0[] <- c(NA, NA, NA) model.OU.BM$`a`$H[,,1] <- cbind( c(.1, -.7, .6), c(1.3, 2.2, -1.4), c(0.8, 0.2, 0.9)) model.OU.BM$`a`$Theta[] <- c(1.3, -.5, .2) model.OU.BM$`a`$Sigma_x[,,1] <- cbind( c(1, 0, 0), c(1.0, 0.5, 0), c(0.3, -.8, 1)) model.OU.BM$`b`$Sigma_x[,,1] <- cbind( c(0.8, 0, 0), c(1, 0.3, 0), c(0.4, 0.5, 0.3)) ``` We can print the mixed Gaussian model as a table: ```{r, results='asis'} print(PCMTable(model.OU.BM), xtable = TRUE, type="html") ``` # Simulating data on a phylogenetic tree The first functionality of the PCMBase package is to provide an easy way to simulate multiple trait data on a tree under a given (possibly multiple regime) PCM. For this example, first we simulate a birth death tree with two parts "a" and "b" using the `ape` R-package: ```{r} # make results reproducible set.seed(2, kind = "Mersenne-Twister", normal.kind = "Inversion") # number of regimes R <- 2 # number of extant tips N <- 100 tree.a <- PCMTree(rtree(n=N)) PCMTreeSetLabels(tree.a) PCMTreeSetPartRegimes(tree.a, part.regime = c(`101` = "a"), setPartition = TRUE) lstDesc <- PCMTreeListDescendants(tree.a) splitNode <- names(lstDesc)[which(sapply(lstDesc, length) > N/2 & sapply(lstDesc, length) < 2*N/3)][1] tree.ab <- PCMTreeInsertSingletons( tree.a, nodes = as.integer(splitNode), positions = PCMTreeGetBranchLength(tree.a, as.integer(splitNode))/2) PCMTreeSetPartRegimes( tree.ab, part.regime = structure(c("a", "b"), names = as.character(c(N+1, splitNode))), setPartition = TRUE) ``` ```{r, eval=FLAGSuggestsAvailable} if(requireNamespace("ggtree")) { palette <- PCMColorPalette(2, c("a", "b")) # Plot the tree with branches colored according to the regimes. # The following code works only if the ggtree package is installed, which is not on CRAN. # The tree would not be depicted correctly if ggtree is not installed. plTree <- PCMTreePlot(tree.ab) plTree <- plTree + ggtree::geom_nodelab(size = 2) plTree } ``` Now we can simulate data on the tree using the modelBM.ab$X0 as a starting value: ```{r} traits <- PCMSim(tree.ab, modelBM.ab, modelBM.ab$X0) ``` # Calculating likelihoods Calculating a model likelihood for a given tree and data is the other key functionality of the PCMBase package. ```{r} PCMLik(traits, tree.ab, modelBM.ab) ``` For faster and repeated likelihood evaluation, I recommend creating a likelihood function for a given data, tree and model object. Passing this function object to `optim` would save the need for pre-processing the data and tree at every likelihood evaluation. ```{r} # a function of a numerical parameter vector: likFun <- PCMCreateLikelihood(traits, tree.ab, modelBM.ab) likFun(param2) ``` # References