\documentclass{article} \setlength{\parindent}{0pt} % Eliminate the indent at the beginning of a new paragraph \setcounter{secnumdepth}{0} % Elimate the section numbering starting from a specific depth (see WikiBook) \usepackage[round,sort]{natbib} \usepackage{fixltx2e} \usepackage{graphicx} % To manage external pictures \usepackage{float} \usepackage{subfig} % To add subfigures within figures, with labels (see WikiBooks) \usepackage{verbatim} \usepackage[colorlinks=true,linkcolor=blue,citecolor=blue,urlcolor=blue]{hyperref} \usepackage{amssymb,amsbsy,amsmath} \usepackage{epsfig} \usepackage[left=3cm,top=3cm,bottom=3.5cm,right=3cm]{geometry} % For easy management of document margins \usepackage{fancyhdr} % To customize the header/footer (see WikiBooks) %\usepackage{rotating} \numberwithin{equation}{section} % Equation numbers relative to sections % --------------------------------------------------------------------------------------------------------------------------------------- % \VignetteIndexEntry{ABC} %\VignetteDepends{quantreg,nnet} %\VignettePackage{abc} %\documentclass{amsart} \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\pkg}[1]{{\texttt{#1}}} \newcommand{\class}[1]{{\textit{#1}}} \newcommand{\R}{{\normalfont\textsf{R }}{}} \begin{document} % <>= options(width = 60) options(SweaveHooks = list(fig = function() par(mar=c(3,3,1,0.5),mgp = c(2,1,0)))) @ \SweaveOpts{prefix.string=fig,include=F,keep.source=T,eps=FALSE,resolution=72} <>= options(continue=" ") @ %@% TO ELIMINATE THE "+" IN CONSECUTIVE SCRIPT LINES \title{Approximate Bayesian Computation (ABC) in R: A Vignette} \author{K Csill\'ery, L Lemaire, O Fran\c cois, MGB Blum} \date{\pkg{abc} version \Sexpr{packageDescription("abc")[["Version"]]} , \Sexpr{Sys.Date()} } \maketitle \tableofcontents \setcounter{footnote}{1} \footnotetext{This document is included as a vignette (a \LaTeX\ document created using the \R function \code{Sweave}) of the package \pkg{abc}. It is automatically dowloaded together with the package and can be accessed through \R typing \code{vignette("abc")}.} \newpage \setlength{\parskip}{4pt} % Space between paragraphs <>= rm(list=ls()) @ \section{Summary} An implementation of Approximate Bayesian Computation (ABC) methods in the \R language is available in the package \pkg{abc} with associated example data sets in the \pkg{abc.data} package. The aim of this vignette is to provide an extended overview of the capabilities of the package, with a detailed example of the analysis of real data. Some general information on installation procedures and object oriented programming in \R are given in Section \emph{Introduction}. The ABC methodology is briefly illustrated in Section \emph{A brief introduction to ABC}. A detailed example of an ABC analysis, where we aim to estimate the ancestral human population size is provided in Section \emph{A walk-though example: human demographic history}. Users already familiar with the \R and ABC methods may skip the first two Sections and start with the example. \section{Introduction} \subsection{Installation} \R is a open source software project and can be freely downloaded from the CRAN website. There are already several resources for an introduction to \R. The CRAN website link to contributed documentation also offers excellent introductions in several languages. Once \R is running the installation of additional packages is quite straightforward. To install the \pkg{abc} and \pkg{abc.data} packages from \R simply type: \vspace{2mm} \noindent \texttt{> install.packages("abc")} \texttt{> install.packages("abc.data")} \vspace{2mm} \noindent Once the packages are installed, thely need to be made accessible to the current \R session by the commands: <>= library(abc) require(abc.data) @ For online help facilities or the details of a particular command (such as the function \texttt{abc}) you can type: <>= help(package="abc") help(abc) @ The first command gives a brief summary of the available commands in the package, the second give more detailed information about a specific command. \R help files can also be used to illustrate how commands are executed, so they can be pasted into an \R session, or run as a whole with the following command: <>= example(abc) @ \subsection{Methods and classes in \R} This is only a brief reminder that expressions in \R manipulate objects, which may be data (a scalar or a matrix for example), but objects may also be functions, or more complex collections of objects. All objects have a class, which enables functions to act on them appropriately. Thus, for example, when the function \code{summary} is applied on an object of class \verb|"abc"| (i.e. an object that had been generated by the function \code{abc}), it would act differently than on a simple matrix. In fact, the function \code{summary} on an \verb|"abc"| object calls the function \code{summary.abc}. Likewise, the \code{plot}, \code{hist} and \code{print} functions will recognize the classes of objects generated by the various functions of the \pkg{abc} package and act accordingly. \section{A brief introduction to ABC} This section contains a short introduction to the ABC methodology. Recall that the main steps of an ABC analysis follow the general scheme of any Bayesian analysis: formulating a model, fitting the model to data (parameter estimation), and improving the model by checking its fit (posterior predictive checks) and comparing it to other models \citep{gelmanetal03, csilleryetal10}. In the following sections, we detail each of these steps and highlight the appropriate functions of the \pkg{abc} package \citep{csilleryetal12}. \subsection{Parameter inference} Suppose that we want to compute the posterior probability distribution of a univariate or multivariate parameter, $\theta$. A parameter value $\theta_i$, is sampled from its prior distribution to simulate a dataset $y_i$, for $i=1,\dots, n$ where $n$ is the number of simulations. A set of summary statistics $S(y_i)$ is computed from the simulated data and compared to the summary statistics obtained from the actual data $S(y_0)$ using a distance measure $d$. We consider the Euclidean distance for $d$, where each summary statistic is standardized by a robust estimate of the standard deviation (the median absolute deviation). If $d(S(y_i),S(y_0))$ (i.e. the distance between $S(y_i)$ and $S(y_0)$) is less than a given threshold, the parameter value $\theta_i$ is accepted. In order to set a threshold above which simulations are rejected, the user has to provide the tolerance rate, which is defined as the percentage of accepted simulation. The accepted $\theta_i$'s form a sample from an approximation of the posterior distribution. The estimation of the posterior distribution can be improved by the use of regression techniques (see below). The \pkg{abc} package implements three ABC algorithms for constructing the posterior distribution from the accepted $\theta_i$'s: a rejection method, and regression-based correction methods that use either local linear regression \citep{beaumontetal02} or neural networks \citep{blumfrancois10}. When the rejection method is selected, the accepted $\theta_i$'s are considered as a sample from the posterior distribution \citep{pritchardetal99}. The two regression methods implement an additional step to correct for the imperfect match between the accepted, $S(y_i)$, and observed summary statistics, $S(y_0)$, using the following regression equation in the vicinity of $S(y_0)$ $$ \theta_i=m(S(y_i))+ \epsilon_i, $$ where $m$ is the regression function, and the $\epsilon_i$'s are centered random variables with a common variance. Simulations that closely match $S(y_0)$ are given more weight by assigning to each simulation $(\theta_i,S(y_i))$ the weight $K[d(S(y_i),S(y_0))]$, where $K$ is a statistical kernel. The local linear model assumes a linear function for $m$ (\code{"loclinear"} method in the \code{abc} function), while neural networks (\code{"neuralnet"} method) account for the non-linearity of $m$ and allow users to reduce the dimension of the set of summary statistics. Once the regression is performed, a weighted sample from the posterior distribution is obtained by correcting the $\theta_i$'s as follows, $$ \theta_i^*=\hat{m}(S(y_0))+\hat{\epsilon_i}, $$ where $\hat{m}(\cdot)$ is the estimated conditional mean and the $\hat{\epsilon_i}$'s are the empirical residuals of the regression \citep{beaumontetal02}. Additionally, a correction for heteroscedasticity is applied (by default) $$ \theta_i^*=\hat{m}(S(y_0))+\frac{\hat{\sigma}(S(y_0))}{\hat{\sigma}(S(y_i))} \hat{\epsilon_i} $$ where $\hat{\sigma}(\cdot)$ is the estimated conditional standard deviation \citep{blumfrancois10}. When using the \code{"loclinear"} method, a warning about the collinearity of the design matrix of the regression might be issued. In that situation, we recommend to rather use the related \code{"ridge"} method that performs local-linear ridge regression and deals with the collinearity issue. Finally, we note that alternative algorithms exist that sample from an updated distribution that is closer in shape to the posterior than to the prior \citep{beaumontatal09, marjorametal03, wegmannetal10}. However, we do not implement these methods in \R \pkg{abc} because they require the repeated use of the simulation software. \subsection{Cross-validation} \R \pkg{abc} implements a leave-one-out cross-validation to evaluate the accuracy of parameter estimates and the robustness of the estimates to the tolerance rate. To perform cross-validation, the $i^{th}$ simulation is randomly selected as a validation simulation, its summary statistic(s) $S(y_i)$ are used as pseudo-observed summary statistics, and its parameters are estimated with the function \code{abc} using all simulations except the $i^{th}$ simulation. Ideally, the process is repeated $n$ times, where $n$ is the number of simulations (so-called $n$-fold cross-validation). However, performing an $n$-fold cross-validation could be very time consuming, so the cross-validation is often performed for a subset of a few 100 randomly selected simulations. The prediction error is calculated as $$ E_{\rm pred}=\frac{\sum_i(\tilde{\theta}_i-\theta_i)^2}{Var(\theta_i)}, $$ where $\theta_i$ is the true parameter value of the $i^{th}$ simulated data set and $\tilde{\theta}_i$ is the estimated parameter value (the posterior median, mean or mode). \subsection{Model selection} Model selection is part of any Bayesian analysis, and can be performed in an ABC framework. The aim is generally to estimate the posterior probability of a model $M$ as $Pr(M | S(y_0))$. Three different methods are implemented in the \pkg{abc} package. With the rejection method, the posterior probability of a given model is approximated by the proportion of accepted simulations given this model. The two other methods are based on multinomial logistic regression or neural networks. In these two approaches, the model indicator is treated as the response variable of a polychotomous regression, where the summary statistics are the independent variables \citep{fagundesetal2007,beaumont08}. Using neural networks can be efficient when highly dimensional statistics are used. Any of these methods are valid when the different models to be compared are, a priori, equally likely, and the same number of simulations are performed under each model. A further function, \code{expected.deviance}, is implemented to guide the model selection procedure. The function computes an approximate expected deviance from the posterior predictive distribution. Thus, in order to use the function, users have to re-use the simulation tool and to simulate data from the posterior parameter values. The method is particularly advantageous when it is used with one of the regression methods. Further details on the method can be found in \citep{francoislaval11} and fully worked out examples are provided in the package manual. \subsection{Model misclassification} A cross-validation tool is available for model selection as well. The objective is to evaluate if model selection with ABC is able to distinguish between the proposed models by making use of the existing simulations. The summary statistics from one of the simulations are considered as pseudo-observed summary statistics and classified using all the remaining simulations. Then, one expects that a large posterior probability should be assigned to the model that generated the pseudo-observed summary statistics. Two versions of the cross-validation are implemented. The first version is a ``hard'' model classification. We consider a given simulation as the pseudo-observed data and assign it to the model with the highest posterior model probability. This procedure is repeated for a given number of simulations for each model. The results are summarized in a so-called {\it confusion matrix} \citep{hastieetal03}. Each row of the confusion matrix represents the number of simulations under a true model, while each column represents the number of simulations under a model assigned by \code{postpr}. If all simulations had been correctly classified, only the diagonal elements of the matrix are non-zero. The second version is called ``soft'' classification. Here we do not assign a simulation to the model with the highest posterior probability, but average the posterior probabilities over many simulations for a given model. This procedure is again summarized as a matrix, which is similar to the confusion matrix. However, the elements of the matrix do not give model counts, but the average posterior probabilities across simulations for a given model. \subsection{Goodness-of-fit} A formal hypothesis testing procedure for goodness-of-fit is available since the version 2.0 of the package \pkg{abc}. The test statistic is the median of the distance between accepted summary statistics and observed ones. Instead of the mean, other choices are possible including mean or maximum values. To obtain the null distribution of the test statistic, we consider a given number of pseudo-observed summary statistics as in the cross-validation experiments. Ideally, simulations should be performed with the posterior distributions but it is more convenient to use the prior distribution because it does not require an additional use of the simulation software. In addition to the hypothesis-testing procedure, we provide a graphical method to check the Goodness-of-fit. Our advocated approach is based on Principal Component Analysis, which is useful for graphical display when there are many summary statistics. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{A walk-though example: human demographic history} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Getting started} In following sections we show how the \pkg{abc} package can be used for the inference and model comparison steps, and we indicate how generic tools of the \R package can be used for model checking. The package is a so-called ``post-processing tool'', so the simulations and the calculation of the summary statistics have to be performed by the user's preferred software. Alternatively, simulation software might be called in an \R session, which opens up the possibility for a highly interactive ABC analysis. For example, for coalescent models, users might want to use one of the many existing software for simulating genetic data such as \pkg{ms} \citep{hudson02} or \pkg{simcoal2} \citep{simcoal2}. Note that the package's data set \code{human}, for example, has been generated using \pkg{ms}. The exact code can be found in \texttt{inst/doc/runms.R} and may be re-run or modified by interested users. Since the release of the \pkg{phyclust} package, a function called \code{ms} is also available, which allows users to call \pkg{ms} directly from an \R session. The calculation of summary statistics for could either be done using our own \R code or, for \pkg{ms} outputs, one could use the software \pkg{msABC} \citep{pavlidisetal10} (\url{http://www.bio.lmu.de/~pavlidis/home/?Software:msABC}). In order to use the package certain \R objects always have to be prepared. These are a vector of the observed summary statistics, a matrix of the simulated summary statistics, where each row corresponds to a simulation and each column corresponds to a summary statistic, and a matrix of the simulated parameter values, where each row corresponds to a simulation and each column corresponds to a parameter. If users want to perform model selection, an additional vector with the model indices has to be ready. The \pkg{abc} package contains two examples: \code{musigma2} and \code{human}. The \code{musigma2} data is very simple, so users might prefer to try this first. The example is motivated by Anderson, Edgar's iris data, so the observed statistics are the mean and variance of the sepal of \emph{Iris setosa}, estimated from part of the \code{iris} data. The aim is to estimate two parameters: $\mu$ and $\sigma^2$. The advantage of this simple example is that the true posterior distributions are known, so the accuracy of parameter inference can be easily checked. The data set can be loaded using the following code: <<>>= library(abc) require(abc.data) data(musigma2) @ Tha data set contains five \R objects. The manual page of these \R objects can be accessed by typing the name of one of the five objects, such as \code{stat.obs}, which contains the observed summary statistics (\texttt{?stat.obs}). At the end of the manual page of the function \code{abc}, several examples can be found using the \code{musigma2} data set, so we leave the user to discover this simple example on their own. This example can be accessed by typing \texttt{?abc}. \subsection{The human data} The \code{human} data set is more realistic. Our aim is to estimate the human ancestral population size and to discriminate between different demographic models. We discuss this example in detail. \subsection{Background} Several population genetic studies have found evidence that African human populations have been expanding, while human populations outside of Africa went through a out-of-Africa population bottleneck and then expanded. We re-examined this problem by re-analyzing published data of 50 unlinked autosomal non-coding regions from a Hausa (Africa), a Chinese (Asia), and an Italian (Europe) population \citep{voightetal05}. Data is summarized using three summary statistics: the average nucleotide diversity ($\bar\pi$), and the mean and the variance of Tajima's D (Table 1 in \cite{voightetal05}). Both Tajima's D and $\bar\pi$ have been classically used to detect historical changes in population size. The observed summary statistics (our data!) can be accessed with the following code: <<>>= require(abc.data) data(human) stat.voight @ A negative Tajima's D signifies an excess of low frequency polymorphisms, indicating population size expansion. While a positive Tajima's D indicates low levels of both low and high frequency polymorphisms, thus a sign of a population bottleneck. In constant size populations, Tajima's D is expected to be zero. The data set \code{human} contains four objects, and \code{stat.voight} is just one of them. The manual page of all objects can be accessed by referring to one of its objects that we can achieved by typing for example, \texttt{?stat.voight}. Other objects of \code{human} contain the simulated summary statistics (\code{stat.3pops.sim}) and the model indices (\code{models}). Using these three objects, we can estimate the posterior probabilities of different demographic scenarios (see below) in the three human populations: Hausa, Italian, and Chinese. The fourth object, \code{par.italy.sim} may be used to estimate the ancestral population size of the European population (represented by an Italian sample). \subsection{The demographic models} Since we do not know which demographic model would be the most appropriate to estimate the ancestral human population size, first, we propose three different models to see which model is the most supported by the data. The three models of demographic history are: constant population size, exponential growth after a period of constant population size, and population bottleneck, where, after the bottleneck, the population recovers to its original size. All three models can be defined by a number of demographic parameters, such as population sizes, rates and timings of changes in population size, which we do not detail here. The data has been generated for the users, however, can easily be re-generated and the demographic parameters may be altered. We simulated $50,000$ data sets under each of the three demographic models using the software \pkg{ms} \citep{hudson02}, and calculated the three summary statistics in each model. The exact \R code to simulate the \code{human} data can be found in \verb|/inst/doc/runms.R|, thus can be viewed, modified and re-run by interested users, given that \pkg{ms} and its sister package \verb|sample_stats| are installed and configured. To illustrate the informativeness of our summary statistics we can plot them as a function of the model (Figure \ref{fig:ss}). <>= par(mfcol = c(1,3), mar=c(5,3,4,.5)) boxplot(stat.3pops.sim[,"pi"]~models, main="Mean nucleotide diversity") boxplot(stat.3pops.sim[,"TajD.m"]~models, main="Mean Tajima's D") boxplot(stat.3pops.sim[,"TajD.v"]~models, main="Var in Tajima's D") @ \setkeys{Gin}{width=5in, height=5in} \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Summary statistics under the three demographic models: population bottleneck, constant population size, exponential growth} \label{fig:ss} \end{figure} \subsection{Model selection} The main model selection function is called \code{postpr}. However, before applying this function on the real data, we perform a cross-validation for model selection (\code{cv4postpr}) to evaluate if ABC can, at all, distinguish between the three models. The cross-validation might take a long time to run. At this point, we might just want just a quick result to illustrate the use of the function, so we run only 10 cross-validation simulations, and summarize and the results using the following commands: <>= cv.modsel <- cv4postpr(models, stat.3pops.sim, nval=5, tol=.01, method="mnlogistic") s <- summary(cv.modsel) @ The resulting confusion matrix may also be plotted using the following command: <>= plot(cv.modsel, names.arg=c("Bottleneck", "Constant", "Exponential")) @ \setkeys{Gin}{width=4in, height=4in} \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Misclassification proportions for the tree models. If all tree models were classified correctly $100\%$ of the times, all three bars would have a single colour.} \label{fig:cv4postpr} \end{figure} <>= mytotal <- length(cv.modsel$cvsamples)/length(unique(models)) myexp <- s$conf.matrix$tol0.01[3,3] misclasstot <- 1-(sum(s$conf.matrix$tol0.01[1,1],s$conf.matrix$tol0.01[2,2],s$conf.matrix$tol0.01[3,3])/sum(s$conf.matrix$tol0.01)) @ Both the confusion matrix and the barplot (Figure \ref{fig:cv4postpr}) illustrate that ABC is able to distinguish between these three models. Notably, the exponential expansion model can be classified correctly the most frequently: \Sexpr{myexp} times out of \Sexpr{mytotal}. However, the total misclassification rate of \Sexpr{round(misclasstot,2)} prompts us to be cautious about the results obtained with model selection and suggests that additional data or summary statistics are required to better discriminate between the models. Remember that simulations are randomly selected, so every user will get slightly different figures for the misclassification rates, especially if \code{nval} is small. Now, we may calculate the posterior probabilities of each demographic scenario using the rejection (\verb|"rejection"|) and the multinomial logistic regression method (\verb|"mnlogistic"|) of the function \code{postpr} with a tolerance rate of $0.05\%$. The function \code{summary} prints out posterior model probabilities and ratios of model probabilities (the Bayes factors) in a user-friendly way: <<>>= modsel.ha <- postpr(stat.voight["hausa",], models, stat.3pops.sim, tol=.05, method="mnlogistic") modsel.it <- postpr(stat.voight["italian",], models, stat.3pops.sim, tol=.05, method="mnlogistic") modsel.ch <- postpr(stat.voight["chinese",], models, stat.3pops.sim, tol=.05, method="mnlogistic") summary(modsel.ha) summary(modsel.it) summary(modsel.ch) @ These results show that Hausa data supports best the model of exponential growth model, while the Italian and Chinoise data supports best the population bottleneck model. In the following, we only focus on the data set from Italy. \subsection{Goodness-of-fit} Before turning to parameter inference, it is important to check that the preferred model provides a good fit to the data. For the Italian data, we can plot the histogram of the null distribution under a bottleneck model on which we superimpose the observed value (Figure \ref{fig:distance}). <>= res.gfit.bott=gfit(target=stat.voight["italian",], sumstat=stat.3pops.sim[models=="bott",], statistic=mean, nb.replicate=100) plot(res.gfit.bott, main="Histogram under H0") @ \setkeys{Gin}{width=4in, height=4in} \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Histogram of the null distribution of the test statistic for goodness of fit assuming a bottleneck model.} \label{fig:distance} \end{figure} We can compute a P-value to test the fit to the bottleneck model. We can additionally check that the other demographic models do not provide a good fit to the Italian data, which would confirm the results obtained with model selection. <<>>= res.gfit.exp=gfit(target=stat.voight["italian",], sumstat=stat.3pops.sim[models=="exp",], statistic=mean, nb.replicate=100) res.gfit.const=gfit(target=stat.voight["italian",], sumstat=stat.3pops.sim[models=="const",], statistic=mean, nb.replicate=100) summary(res.gfit.bott) summary(res.gfit.exp) summary(res.gfit.const) @ Alternative graphical procedures are also possible to perform Goodness-of-fit. We can perform Principal Component Analysis (PCA) with 2 components to make new summary statistics. We can then display the $90\%$ (or something else equal to 1-\code{cprob}) enveloppe of the 2 PCs obtained with each demographic model (Figure \ref{fig:pca}). Simulations are performed a priori to avoid additional use of the simulation software but similar procedures can be used with posterior simulations. <>= gfitpca(target=stat.voight["italian",], sumstat=stat.3pops.sim, index=models, cprob=.1) @ \setkeys{Gin}{width=4in, height=4in} \begin{figure} \begin{center} <>= <> @ \end{center} \caption{$90\%$ enveloppe of the 2 Principal Components obtained with each demographic model.} \label{fig:pca} \end{figure} \subsection{Posterior predictive checks} To further confirm that the bottleneck model provided the best fit to the Italian data, we can also consider posterior predictive checks \cite{gelmanetal03}. Note that there is no specific function in \pkg{abc} for posterior predictive checks, nevertheless the task can be easily carried out. First, we estimate the posterior distributions of the parameters of the three models using the function \code{abc} (see details in the following section). Then, we sample a set of 1,000 multivariate parameters from their posterior distribution. Last, we obtain a sample from the distribution of the three summary statistics a posteriori by simulating data sets with the 1,000 sampled multivariate parameters using \pkg{ms}. The exact code to run \pkg{ms} again can be found in \verb|\inst\doc\runms4ppc.R|. By running this code one can re-generate the package's data set \code{ppc}. The results of the posterior predictive checks could best be dispayed as histograms: \setkeys{Gin}{width=5in, height=5in} <>= require(abc.data) data(ppc) mylabels <- c("Mean nucleotide diversity","Mean Tajima's D", "Var Tajima's D") par(mfrow = c(1,3), mar=c(5,2,4,0)) for (i in c(1:3)){ hist(post.bott[,i],breaks=40, xlab=mylabels[i], main="") abline(v = stat.voight["italian", i], col = 2) } @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Posterior predictive checks for the Italian data under the bottleneck model} \label{fig:ppc} \end{figure} The Figure \ref{fig:ppc} illustrates the posterior predictive checks under the bottleneck model. We can see that the bottleneck model is able to reproduce the observed values of the summary statistics in the Italian data. Note that the constant population size model is also able to reproduce the observed summary statistics of the Italian data. However, the model of exponential growth is unable to account for the observed level of heterozygosity when accounting for the remaining two summary statistics in the Italian sample (results not shown, but maybe generated by the user). Finally, note that such posterior predictive checks use the summary statistics twice; once for sampling from the posterior and once for comparing the marginal posterior predictive distributions to the observed values of the summary statistics. To avoid this circularity, we might consider using different summary statistics for posterior predictive checks than for parameter estimation. Nevertheless, this could be difficult in many applications, since we already used our ``best'' summary statistics for inference. Now we can move to our original objective, which is to estimate the ancestral population size. \subsection{Cross-validation} Now, we are almost ready to infer the ancestral population size under the bottleneck model for the Italian data. So, we select the simulated parameter values that correspond to the bottleneck model: <<>>= stat.italy.sim <- subset(stat.3pops.sim, subset=models=="bott") head(stat.italy.sim) @ The bottleneck model can be described with four parameters, which can be found in the four columns of \code{par.italy.sim}: the ancestral population size $N_a$, the ratio of the population sizes before and during the bottleneck (\code{a}), the duration of the bottleneck (\code{duration}), and the time since the beginning of the bottleneck (\code{start}). <<>>= head(par.italy.sim) @ Before moving to the inference step, we first assess if ABC is able to estimate the parameter $N_a$ at all. We use the function \code{cv4abc} to determine the accuracy of ABC and the sensitivity of estimates to the tolerance rate. The following code evaluates the accuracy of estimates of $N_a$ under three tolerance rates using the rejection and the local linear regression method. The \verb|"neuralnet"| method is not recommended for cross-validation, as it takes considerably longer to run than the other methods. Since the cross-validation may take a long time to run, we might prefer to run first a code with a smaller value of \code{nval} to compare different tolerance rates, and also different estimation methods. <<>>= cv.res.rej <- cv4abc(data.frame(Na=par.italy.sim[,"Ne"]), stat.italy.sim, nval=10, tols=c(.005,.01, 0.05), method="rejection") cv.res.reg <- cv4abc(data.frame(Na=par.italy.sim[,"Ne"]), stat.italy.sim, nval=10, tols=c(.005,.01, 0.05), method="loclinear") summary(cv.res.rej) summary(cv.res.reg) @ We can also plot the results using the following code: \setkeys{Gin}{width=6in, height=6in} <>= par(mfrow=c(1,2), mar=c(5,3,4,.5), cex=.8) plot(cv.res.rej, caption="Rejection") plot(cv.res.reg, caption="Local linear regression") @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Cross-validation for parameter inference. The colors correspond to different levels of the tolerance rate in an increasing order from red to yellow.} \label{fig:cv4abc} \end{figure} The function \code{summary} shows the prediction error under the three different tolerance rates. \code{plot} compares the two methods (\verb|"rejection"| and \verb|"loclinear"|) under three different tolerance rates. Users can choose how they want to summarize the posterior distribution by setting the argument \code{statistic} in of the function \code{cv4abc}. By default, the posterior distribution of summarized by its median. Thus, the plots shows the posterior distribution medians of $N_a$ for each cross-validation sample. Points of the cross-validation plot are scattered around the identity line indicating that $N_a$ can be well estimated using the three summary statistics (Figure \ref{fig:cv4abc}). Further, estimates were not only accurate for $N_a$, but also insensitive to the tolerance rate (Figure \ref{fig:cv4abc}). Accordingly, the prediction error is relatively low and independent of the tolerance rate. \subsection{Parameter inference} Now, we finaly estimate the posterior distribution of $N_a$ using the main function of the package: \code{abc}. The following code shows how to use \code{abc} with the method \verb|"neuralnet"|. We chose to use a \verb|"log"| transformation of the parameter. The correction for heteroscedasticity of performed by default. <<>>= res <- abc(target=stat.voight["italian",], param=data.frame(Na=par.italy.sim[, "Ne"]), sumstat=stat.italy.sim, tol=0.05, transf=c("log"), method="neuralnet") res @ The function \code{abc} returns an object of class \verb|"abc"|. The function \code{print} returns a simple summary of the object (see above). Using the function \code{summary} we can calculate summaries of the posterior distributions, such as the mode, mean, median, and credible intervals, taking into account the posterior weights, when appropriate. We obtain the following summary of the posterior distribution using the function \code{summary}: <<>>= summary(res) @ Two functions are available to visualize the results, \code{hist} and \code{plot}. \code{hist} simply displays a histogram of the weighted posterior sample (Figure \ref{fig:abchist}). <>= hist(res) @ \setkeys{Gin}{width=4in, height=4in} \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Histogram of the weighted posterior sample of $N_a$} \label{fig:abchist} \end{figure} The function \code{plot} can be used as a diagnostic tool when one of the regression methods is applied. We can run the following code to generate the diagnostic plots of the estimation of the posterior distribution of $N_a$: <>= par(cex=.8) plot(res, param=par.italy.sim[, "Ne"]) @ \setkeys{Gin}{width=6in, height=6in} \begin{figure} \begin{center} <>= <> @ \end{center} \caption{ABC regression diagnostics for the estimation of the posterior distribution of $N_a$} \label{fig:abc} \end{figure} Figure \ref{fig:abc} shows that \code{plot} returns various plots that allow the evaluation of the quality of estimation when one of the regression methods is used. The following four plots are generated: a density plot of the prior distribution, a density plot of the posterior distribution estimated with and without regression-based correction, a scatter plot of the Euclidean distances as a function of the parameter values, and a normal Q-Q plot of the residuals from the regression. When the heteroscedastic regression model is used, a normal Q-Q plot of the standardized residuals is displayed. Figure \ref{fig:abc} shows that the posterior distribution is very different from the prior distribution, confirming that the three summary statistics convey information about the ancestral population size (Figure \ref{fig:abc}, lower left panel). The upper right panel of Figure \ref{fig:abc} shows the distance between the simulated and observed summary statistics as a function of the prior values of $N_a$. Again, this plot confirms that the summary statistics convey information about $N_a$, because the distances corresponding to the accepted values are clustered and not spread around the prior range of $N_a$. The lower right panel displays a standard Q-Q plot of the residuals of the regression, $\hat{\epsilon_i}$. This plot serves as a regression diagnostic of the linear or non-linear regression when the method is \verb|"loclinear"| or \verb|"neuralnet"|. \section{Conclusions} A few of the capabilities of the \pkg{abc} package have been described. Inevitably, new applications will demand new features and might reveal unforeseen bugs. Please send comments or suggestions and report bugs to \texttt{kati.csillery@gmail.com} and/or \texttt{michael.blum@imag.fr}. \bibliographystyle{plainnat} \bibliography{abcvignette} \addcontentsline{toc}{section}{Bibliography} % To add bibliography to the TOC \end{document}