---
title: "Interfacing your model with R"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Interfacing your model with R}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
abstract: "This tutorial discusses how to interface models written in other programming languages with R, so that they can be fit with BayesianTools"
author: Florian Hartig
editor_options:
  chunk_output_type: console
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=5, fig.height=5, warning=FALSE, cache = F)
```

```{r, echo = F, message = F}
set.seed(123)
```

# Interfacing a model with BT - step-by-step guide

## Step 1: Create a runModel(par) function

A basic requirement for calibration in BT is that we are able to execute the model with a given set of parameters. Strictly speaking, BT will not see the model as such; it only requires a likelihood function with the interface likelihood(par), where par is a vector. Inside this function, however, you will typically run the model with the parameters par, where par may stay a vector or be transformed into another format, e.g. a data.frame, matrix or list.

What to do next depends on how your model is programmed - I have listed the options in order of convenience / speed. If your model has never been interfaced with R, you will likely have to move down to the last option.

1. Model in R, or R interface existing
2. Model can be compiled and linked as a dll
3. Model is in C/C++ and can be interfaced with Rcpp
4. Model can be compiled as an executable and accepts parameters via the command line
5. Model can be compiled as an executable and reads parameters via a parameter file
6. Model parameters are hard-coded in the executable

### Case 1 - model programmed in R

Usually, you will have to do nothing. Just make sure you can call your model like this:

```{r, eval = F}
runMyModel(par)
```

Typically this function will directly return the model outputs, so step 2 can be skipped.

### Case 2 - compiled dll, parameters are set via dll interface

If you have your model prepared as a dll, or you can prepare it that way, you can use the [dyn.load()](https://stat.ethz.ch/R-manual/R-devel/library/base/html/dynload.html) function to link R to your model

```{r, eval = F}
dyn.load(model)

runMyModel <- function(par){
  # model call here, e.g. via .C() or .Call()
  out <- ...
  # process out
  return(out)
}
```

Again, if you implement this, you will typically return the output directly via the dll and not write it to file, which means that step 2 can be skipped.

The tricky thing about this approach is that you have to code the interface to your dll, which in most programming languages means declaring your variables as external or something similar, so that they can be accessed from the outside. How this works will depend on the programming language.

### Case 3 - model programmed in C / C++, interfaced with Rcpp

Rcpp is a highly flexible environment for interfacing between R and C/C++. If your model is coded in C / C++, Rcpp offers the safest and most powerful way to connect it to R (much more flexible than a command line call or a dll). However, creating the interface may require some adjustments to the code, and there can be technical problems that are difficult to solve for beginners. I do not recommend attempting to interface an existing C/C++ model unless you have Rcpp experience, or at least very good C/C++ experience.
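To give a flavor of what an Rcpp interface looks like, here is a minimal sketch using Rcpp::cppFunction, which compiles a C++ function and exposes it to R; the model logic and parameter meanings are made up purely for illustration.

```{r, eval = F}
library(Rcpp)

# compile a toy C++ model and expose it to R as a regular function;
# the model body is a made-up placeholder
cppFunction('
NumericVector runMyModel(NumericVector par) {
  NumericVector out(10);
  for (int i = 0; i < 10; i++) {
    out[i] = par[0] + par[1] * i;  // placeholder model logic
  }
  return out;
}')

runMyModel(c(0.5, 1.2)) # returns a numeric vector of length 10
```

For a larger existing model, you would typically keep the C/C++ sources in separate files and compile them via Rcpp::sourceCpp or within a package, rather than inlining the code as above.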
If you do interface your model via Rcpp, you will typically return the output directly to R as well, which means that step 2 can be skipped.

### Case 4 - compiled executable, parameters set via command line (std I/O)

If your model is written in a compiled or interpreted language, and accepts parameters via std I/O, wrapping is usually nothing more than writing the system call in an R function. An example would be

```{r, eval = F}
runMyModel <- function(par){
  
  # create here a string with what you would write to call the model from the command line
  systemCall <- paste("model.exe", par[1], par[2])
  
  # intern = TRUE captures the output of the command as an R character vector
  out <- system(systemCall, intern = TRUE)
  
  # add code here to convert out into the appropriate R classes
  
  return(out)
}
```

Note: if you have problems with the system command, try system2. If the model returns its output via standard output, you can catch and convert it as above and skip step 2. If your model writes to file, go to step 2.

### Case 5 - compiled model, parameters set via parameter file or in any other way

Many models read parameters from a parameter file. In this case you want to do something like this

```{r, eval = F}
runMyModel <- function(par, returnData = NULL){
  
  writeParameters(par)
  
  system("Model.exe")
  
  if(!is.null(returnData)) return(readData(returnData)) # the readData function will be defined later
}

writeParameters <- function(par){
  # e.g.
  # read template parameter file
  # replace strings in template file
  # write parameter file
}
```

Depending on your problem, it can also make sense to define a setup function such as

```{r, eval = F}
setUpModel <- function(parameterTemplate, site, localConditions){
  # create the runModel, readData functions (see later) here
  return(list(runModel, readData))
}
```

How you implement the writeParameters function depends on the file format you use for the parameters. In general, you probably want to create a template parameter file that you use as a base, and in which you replace the parameter values that should change:

* If your parameter file is in an *.xml format*, check out the xml functions in R.
* If your parameter file is in a *general text format*, the best option is usually to create a template parameter file, place a unique string at the location of each parameter that you want to replace, and then use string replacement functions in R, e.g. [gsub](https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html), to replace this string (a sketch follows below).
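To illustrate the text-format option, here is a minimal sketch of such a writeParameters function; the template file name and the placeholder strings are made up for this example.

```{r, eval = F}
writeParameters <- function(par){
  # read the template parameter file (hypothetical name)
  x <- readLines("parameterTemplate.txt")
  # replace the unique placeholder strings by the actual parameter values
  x <- gsub("|PAR1|", par[1], x, fixed = TRUE)
  x <- gsub("|PAR2|", par[2], x, fixed = TRUE)
  # write the parameter file that the model executable reads
  writeLines(x, "parameters.txt")
}
```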
### Case 6 - compiled model, parameters cannot be changed

You have to change your model code to achieve one of the former options. If the model is in C/C++, going directly to Rcpp seems the best alternative.

## Step 2: Reading back data

If your model returns the output directly (which is highly preferable), you can skip this step.

For simple models, you might consider returning the model output directly from the runMyModel function. This is probably sensible for cases 1 and 4 above, i.e. if the model is already in R, or returns its output via the command line. More complicated models, however, produce a large number of outputs, and you typically don't need all of them. It is therefore more useful to write one or several separate readData or getData functions. The only two cases I will consider here are

* output returned via dll / Rcpp
* output written to file

*Model is a dll*

If the model is a dll file, the best solution is probably to implement appropriate getData functions in the source code that can then be called from R. If your model is in C and wrapped in a dll, interfacing it via Rcpp would probably be easier, because you can directly return R data.frames and other data structures.

*Model writes file output*

If the model writes file output, write a getData function that reads in the model outputs and returns the data in the desired format, typically the same format that you use to represent your field data.

```{r, eval = F}
getData <- function(type = X){
  data <- read.csv(xxx)
  # do some transformation
  # return data in desired format
}
```

### Testing the approach

From R, you should now be able to do something like this

```{r, eval = F}
par <- c(1, 2, 3, 4, ...)
runMyModel(par)
output <- getData(type = DesiredType)
plot(output)
```

## Step 3 (optional) - creating an R package from your code

The last step is optional, but we recommend taking it from the start, because there is really no downside to it. Instead of working with R code in several files that you run by hand, you can directly put all code into an R package. Creating and managing R packages is very easy, and it's easier to pass on your code, because everything, including the help, is in one package. To create an R package, follow the tutorial [here](http://biometry.github.io/APES/R/R70-PackageDevelopment.html). Remember to create good documentation using Roxygen.

# Speed optimization and parallelization

For running sensitivity analyses or calibrations, runtime is often an issue. Before you parallelize, make sure your model is as fast as possible.

## Easy things

* Are you compiling with maximum optimization (e.g. -O3 in C/C++)?
* If you have a spin-up phase, could you increase the time step during this phase?
* Could you increase the time step in general?
* Do you write unnecessary outputs that you could turn off (hard disk I/O is often slow)?

## Difficult things

* Make the model directly callable (Rcpp or dll) to avoid hard disk I/O.
* Is it possible to reduce initialization time (not only spin-up, but also reading in the forcings / drivers) by not ending the model executable after each run, but rather keeping it "waiting" for a new run?
* Code optimization: did you use a profiler (see the sketch after this list)? Read up on code optimization.
* Check for unnecessary calculations in your code / introduce compiler flags where appropriate.
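Regarding the profiler point: for the R part of your wrapper, base R ships with Rprof. Here is a minimal sketch; note that Rprof only sees time spent in R functions, so time inside a compiled executable called via system() will not be broken down further, and a C/C++ profiler would be needed for that.

```{r, eval = F}
Rprof("model_profile.out")        # start R's sampling profiler
runMyModel(par)                   # the wrapper from step 1
Rprof(NULL)                       # stop profiling
summaryRprof("model_profile.out") # time spent per R function
```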
## Parallelization

A possibility to speed up the runtime of your model is to run it on multiple cores (CPUs). To do so, you have two choices:

1. Parallelize the model itself
2. Parallelize the model call, so that BT can do several model evaluations in parallel

Which of the two makes more sense depends a lot on your problem. Parallelizing the model itself is interesting in particular for very large models, which could otherwise not be calibrated with MCMCs. However, this approach typically requires writing parallel C/C++ code and advanced programming skills, which is why we will not discuss it further here. The usual advice in parallel computing is in any case to parallelize the outer loops first, to minimize communication overhead, which suggests starting with parallelizing the model evaluations. This is also much easier to program.

Even here, there are two levels of parallelization possible:

1. Parallelize the call of several MCMC / SMC samplers
2. Parallelize within the MCMC / SMC samplers

Currently, BT only supports parallelization within MCMCs / SMCs, but it is easy to implement between-sampler parallelization by hand as well. Both approaches are described below.

### Within-sampler parallelization

Within-sampler parallelization is particularly useful for algorithms that can use a large number of cores in parallel, e.g. sensitivity analyses or SMC sampling. For MCMCs, it depends on the settings and the algorithm how much parallelization they can make use of. In general, MCMCs are Markovian, as the name says, i.e. they set up a chain of sequential model evaluations, and those calls can therefore not be fully parallelized. However, a number of samplers in the BT package use MCMC algorithms that can be partly parallelized, in particular the population MCMC algorithms DE/DEzs/DREAM/DREAMzs. For all these cases, BT will automatically use parallelization if the BT setup indicates that this is implemented.

How to do this? A first requirement is to have your model wrapped into an R function (see the previous section). If that is the case, R offers a number of options to run functions in parallel. The easiest is to use the parallel package that comes with the R core. For other packages, see the CRAN task view on [High Performance Computing](https://CRAN.R-project.org/view=HighPerformanceComputing).

As an example, assume we have the following, very simple model:

```{r}
mymodel <- function(x){
  output <- 0.2 * x + 0.1^x
  return(output)
}
```

To start a parallel computation, we first need to create a cluster object. Here we initiate a cluster with 2 CPUs.

```{r, eval = F}
library(parallel)
cl <- makeCluster(2)

runParallel <- function(parList){
  parSapply(cl, parList, mymodel)
}

runParallel(c(1, 2))
```

You could use this principle to build your own parallelized likelihood. However, something very similar to the previous loop is automatized in BayesianTools. You can directly create a parallel model evaluation function with the function generateParallelExecuter, or alternatively directly in createBayesianSetup

```{r, eval = F}
library(BayesianTools)
parModel <- generateParallelExecuter(mymodel)
```

If your model is thread-safe, you should be able to run this out of the box. I therefore recommend using the hand-coded parallelization only if the model is not thread-safe.

### Running several MCMCs in parallel

In addition to the within-chain parallelization, you can also run several MCMCs in parallel, and combine them later into a single McmcSamplerList

```{r}
library(BayesianTools)
ll <- generateTestDensityMultiNormal(sigma = "no correlation")
bayesianSetup <- createBayesianSetup(likelihood = ll, lower = rep(-10, 3), upper = rep(10, 3))

settings = list(iterations = 200)

# run the several MCMC chains either in separate R sessions, or via R's parallel packages
out1 <- runMCMC(bayesianSetup = bayesianSetup, sampler = "DEzs", settings = settings)
out2 <- runMCMC(bayesianSetup = bayesianSetup, sampler = "DEzs", settings = settings)

res <- createMcmcSamplerList(list(out1, out2))
plot(res)
```

### Thread safety

Thread safety quite generally means that multiple instances of your code can run on your hardware at the same time without interfering with each other. There are various things that can limit thread safety, for example

* writing outputs to file (several threads might write to the same file at the same time); a common workaround is sketched below
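Here is a minimal sketch of that workaround, under the assumption that your wrapper from step 1 reads and writes all its files in the current working directory: give each model call its own temporary directory, so parallel runs cannot collide.

```{r, eval = F}
# run each model call in its own temporary directory, so that
# parallel instances never read or write the same files
runMyModelSafe <- function(par){
  runDir <- tempfile("modelRun") # unique path for this call
  dir.create(runDir)
  oldDir <- setwd(runDir)
  on.exit({
    setwd(oldDir)
    unlink(runDir, recursive = TRUE) # clean up afterwards
  })
  runMyModel(par) # wrapper from step 1, operating inside runDir
}
```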