---
title: "Creating and visualizing spatial simulations of tumor growth using SITH" 
author: "Phillip B. Nicol"
date: "January 2 2021" 
output: 
  rmarkdown::html_vignette:
      toc: true
vignette: >
  %\VignetteIndexEntry{Simulating spatial tumor growth with SITH}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
references:
- id: waclaw2015
  title: A spatial model predicts that dispersal and cell turnover limit intratumour heterogeneity
  author:
  - family: Waclaw et. al.
  container-title: Nature
  DOI: https://doi.org/10.1038/nature14971
  publisher: Nature
  page: 261-264
  type: article-journal
  issued:
    year: 2015
- id: chkhaidze2019
  title: Spatially constrained tumor growth affects the patters of clonal selection and neutral drift in cancer genomic data 
  author:
  - family: Chkhaidze et. al.
  container-title: PLOS Computational Biology
  DOI: https://doi.org/10.1371/journal.pcbi.1007243
  publisher: PLOS Computational Biology
  type: article-journal
  issued:
    year: 2019
    
- id: greves2012
  title: Clonal Evolution in Cancer
  author:
  - family: Greaves and Maley
  container-title: Nature
  publisher: Nature
  type: article-journal
  issued:
    year: 2012
  page: 306-313
  
- id: stanta2018
  title: Overview on clinical relevance of intra-tumor heterogeneity.
  author:
  - family: Stanta and Bonin
  container-title: Fronteirs in Medicine
  publisher: Frontiers in Medicine
  type: article-journal
  issued:
    year: 2018

- id: opasic2019
  title: How many samples are needed to infer truly clonal mutations from heterogenous tumours?
  author: 
  - family: Opasic et. al. 
  container-title: BMC Cancer
  publisher: BMC Cancer
  type: article-journal
  DOI: https://doi.org/10.1186/s12885-019-5597-1
  issued:
    year: 2018
    
- id: kimura1964
  title: The number of alleles that can be maintained in a finite population
  author: 
  - family: Kimura and Crow 
  container-title: Genetics
  publisher: Genetics
  type: article-journal
  page: 725-738
  issued:
    year: 1964
    
- id: kuipers2016
  title: Tree inference for single-cell data 
  author: 
  - family: Kuipers and Beerenwinkel
  container-title: Genome Biology
  publisher: Genome Biology
  type: article-journal
  page: 86
  issued:
    year: 2016
    
- id: gillespie1977
  title: Exact stochastic simulation of coupled chemical reactions
  author: 
  - family: Gillespie
  container-title: The Journal of Physical Chemistry
  publisher: The Journal of Physical Chemistry
  type: article-journal
  page: 2340-2361
  issued:
    year: 1977
    
- id: cooper2000
  title: The Cell, A Molecular Approach
  author: 
  - family: Cooper
  container-title: Sinauer Associates
  publisher: Sinauer Associates 
  type: book
  issued:
    year: 2000

link-citations: yes 
---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(width = 800)
options(rmarkdown.html_vignette.check_title = FALSE)
```

## Introduction and basic model 

The cells within a cancerous tumor are usually highly diverse. An average tumor contains hundreds of thousands of mutations spread throughout billions of cancer cells, although it is thought that only a small percentage of these mutations are "drivers" which facilitate the progression of cancer into later stages [@greves2012]. A lack of understanding about the evolutionary process which results in the observed intratumor heterogeneity is a major obstacle preventing the development of effective cancer therapies [@stanta2018].

Tumor growth is inherently a three-dimensional process. For this reason, the most accurate models of intratumor heterogeneity will factor in the effects that spatial growth has on the diversity of the tumor. One of the most popular models was introduced by @waclaw2015, which models spatial tumor growth as a multi-type branching process where cells occupy sites on a 3D lattice. Similar models have been studied in more recent literature (See @chkhaidze2019, @opasic2019). However, the software is either unavailable or designed to run at the command line, which is inconvenient for `R` users. 

This package, a Spatial model of Intra-tumor Heterogeneity (SITH), implements a 3D lattice-based model of tumor growth (see Model details) that can be used entirely from the `R` console. The package allows for 3D interactive visualizations of the tumor, as well as a comprehensive summary of the spatial distribution of mutants within the tumor. `SITH` also contains functions allowing users to create simulated datasets from the generated tumor. 


### Model details 
Tumor growth is modeled as an exponential birth-death process where cells occupy sites on the three-dimensional integer lattice. Each cell is given a birth rate $b$ and a death rate $d$ such that the time until cell replication or death is exponentially distributed with parameters $b$ and $d$, respectively. A cell can replicate if at least one of the six sites adjacent to it is unoccupied. Each time cell replication occurs, both daughter cells acquire $Pois(u)$ mutations, for some chosen (usually small) $u$. We follow the "infinite alleles model" [@kimura1964] in assuming that each alteration occurs only once, so that each mutation event induces a unique allele in the population. 

An alteration is a driver mutation with probability $du$ and is a passenger otherwise. To model the selective advantage conferred to driver mutations, the birth rate of a cell depends on the number of driver mutations present in its genotype. Following a common assumption [@waclaw2015], a cell with $k$ driver mutations is given a birth rate $bs^k$, where $s \geq 1$ and $b$ is the initial, or wild-type, birth rate. The death rate remains the same as the dynamics of the model are completely determined by $b/d$. Since cancerous tumors are believed to often begin with a single mutated cell [@cooper2000], the initial state of the model is a single cell at the origin with $b > d$. 

For more information on how the model is simulated, see the appendix. 

## Simulating spatial tumor growth 

As always, we begin by loading the package and setting a seed. 

```{r setup, echo = -3}
set.seed(1126490984)
library(SITH)
rgl::setupKnitr()
```

The function `simulateTumor()` implements the lattice based model of tumor growth and mutation. $N$ (population size) $b,d,u,du,s$ are all inputs to the function, although they all have default values (see user manual). 

```{r}
out <- simulateTumor(max_pop = 250000, verbose = FALSE)
```

`simulateTumor()` returns a list containing useful information for analyzing the results of the model.
```{r}
names(out)
head(out$cell_ids)
head(out$muts)
```

As we can see, the `cell_ids` data frame contains the $(x,y,z)$ coordinates of the cells, the allele ID number, the number of mutations in each cell, and the Euclidean distance from the origin (computed for convenience). The `muts` data frame contains the ID number for each mutation and its corresponding mutation allele frequency (MAF) in the total population. 

There is other useful information contained in `out` such as a phylogenetic tree so that an "order of mutations" can be defined. See the user manual for the details on all the returned components. 

## Visualization 
`SITH` provides functions to visualize the simulated tumor. It is **strongly** recommended that the user has installed the `rgl` package. With `rgl`, the function `visualizeTumor()` will generate an interactive 3D plot of the simulated tumor. If `rgl` is not installed, then a static plot will be made with `scatterplot3d()`. 

```{r, echo = FALSE, results = "hide"}
visualizeTumor(out, background = "white")
```


```{r, rgl = TRUE, fig.width = 7, fig.height = 5, echo = -2}
visualizeTumor(out, background = "white")
rgl::view3d(zoom = 0.66)
```

Another option is to use `plot.type = "heat"`, which colors cells on a scale from blue to red, depending on the number of mutations within the cell. This allows for the user to observe regions of high mutation counts. 

```{r, rgl = TRUE, fig.width = 7, fig.height = 5, echo = -2}
visualizeTumor(out, background = "white", plot.type = "heat")
rgl::view3d(zoom = 0.66)
```

One can easily look inside the tumor by plotting a (static) cross-section with the `plotSlice()` function. One can slice in any coordinate direction and level with the `slice.dim` and `level` arguments (see the user manual). 

```{r, fig.width = 7.5, fig.height = 4}
par(mfrow = c(1,2))
plotSlice(tumor = out)
plotSlice(tumor = out, plot.type = "heat")
```

## Spatial distribution of mutants
One of the main reasons for using a spatial simulation of tumor growth is to investigate biases in the distribution of mutants throughout the tumor. There are countless questions that can be asked, and hopefully the return value of `simulateTumor()` will give enough information to answer most of these. To get a quick summary of the spatial distribution of mutations, `SITH` includes `spatialDistribution()`. This includes a plot of the number of mutations as a function of Euclidean distance, a histogram of the number of mutations, and the mutations with the largest MAF. We can compare the similarity of two cells $A$ and $B$ by viewing their genotypes as a binary vector (with component $i$ on if mutation $i$ is present in the cell) and computing the *jaccard index* $J(A,B) = |A \cap B|/|A \cup B|$. `spatialDistribution()` also estimates the average Jaccard index of cells as a function of the Euclidean distance between them. 

```{r, fig.width = 7.5, fig.height = 6}
sp <- spatialDistribution(tumor = out)
```

Note that the list `sp` also contains all of the raw data needed to generate the plots. 

```{r}
names(sp)
```


## Simulating sequencing data 
An exciting application of the spatial models of intratumor heterogeneity is to use them to generate synthetic sequencing data sets. 

Recently, there have been a myriad of algorithms designed to infer clonal composition or tumor phylogeny from single cell sequencing or bulk sequencing data sets (for example, @kuipers2016). It is not well understood what the impact of spatially biased sampling methods will have on these methods. Using simulated data sets from spatial models could be helpful in determining which inference algorithms are likely to be useful in practice. 

`SITH` contains several functions that simulate single cell sequencing and bulk sampling, allowing the user to get synthetic sequencing data sets from the tumor. 


### Single cell sequencing 

`randomSingleCells()` will take random cells from the tumor and report the list of mutations present in each cell. Due to artifacts of sequencing technology, it is expected that there are a large number of false negatives. To account for this, the `fnr` parameter introduces false negatives into the data set at the specified rate (there is also a `fpr` parameter for false positives). The function returns a binary matrix, where a $1$ indicates that a mutation is present. 
```{r}
Scs <- randomSingleCells(tumor = out, ncells = 5, fnr = 0.1)
head(Scs)
```

The user can also sequence a single cell at a specified $(x,y,z)$ location using the `singleCell()` function with argument `pos = (x,y,z)`. See the user manual for details. 

### Bulk sampling 
One can also simulate bulk sampling by taking a collection of cells and calculating the variant allele frequency (VAF) of each mutation. One way to define a collection of cells is to take a $n \times n \times n$ cube. Function `randomBulkSamples` will choose random locations for the cubes. Note that argument `cube.length` must be odd in order for the cube to have a well-defined center. In practice, mutations that fall a certain VAF ($\approx 0.05$) are either ignored or undetected due to technological artifacts. To account for this, the `threshold` argument åwill ignore mutations below a certain VAF. The user can also specify a desired sequencing depth by adjusting the `coverage` parameter. If a coverage of $C$ is specified, then the number of reads for each base is sampled $M \sim Pois(C)$. Then the number of variant calls is sampled from a binomial where the number of trials is $M$ and the probability of success is equal to the true VAF of the sample. If you don't want to simulate deep sequencing then leave `coverage` blank or set it to $0$.  The return of the function is a matrix of VAFs. 

```{r}
Bulks <- randomBulkSamples(tumor = out, nsamples = 5, cube.length = 7, coverage=100, threshold = 0.05)
head(Bulks)
```

If instead one would like to choose the location of the cube, `bulkSample()` can be used and `pos` can be set to the cube center. 

One can also simulate a long, thin needle passing through the tumor to collect a sample, as described in @chkhaidze2019. This is implemented as `randomNeedles()` and takes the same arguments as `randomBulkSamples` (minus `cube.length`).

```{r}
Needs <- randomNeedles(tumor = out, nsamples = 5, threshold = 0.05)
head(Needs)
```

There is currently no function allowing the user to select the location of the user, although this could be included in future versions. 

## User-defined mutations 

By default, `simulateTumor()` assumes the "infinite alleles hypothesis." This means that each mutation only occurs once (so that every new mutation is unique). While this is useful for investigating how genetic diversity is organized within an *individual* tumor, it doesn't allow the user any control over what mutations can occur and at what rate. For example, the user may want to simulate a population of tumors according to an underlying disease model. 

### Defining a disease model

Suppose that there are $n$ genetic mutations of interests. A fully expressive model would need to define $2^n$ birth rates corresponding to each possible subsets of mutations that can be present in a cell's genome. While this is possible for small $n$, it quickly becomes intractable. For this reason, we choose a more restricting parameterization. 

Our model assumes that there is a directed acyclic graph (DAG) $G$ governing the order in which mutations can occur. The vertices of $G$ are the mutations and the edges constrain the order in which the mutations occur. In particular, suppose that a cell's genotype contains mutations $V' \subset V$. During division, we assume that a cell can acquire any mutation $m$ for which there is an edge $(v',m) \in E$ such that $v' \in V'$. Furthermore, we assume that each edge $e = (v,w)$ has a mutation rate $u_e$ which gives the probability that a cell with mutation $v$ acquires mutation $w$ during division. When crossing edge $e$, the birth rate of the cell multiplied by $s_e$ (the selective advantage). 

To summarize the above paragraph, we require a DAG $G$, probabilities for cross each edge $u_e$, and selective advantages for crossing each edge $s_e$. 

The simplest way to encode the above information is with a matrix with $|E|$ rows and $4$ columns. Function `progressionChain()` generates a simple example where $G$ is a linear chain with $n$ vertices:

```{r}
library(SITH)
set.seed(278115205)

n <- 5
G <- progressionChain(5) 
G
```

The matrix returned by `progressionChain()` is the input that is required to `simulateTumor()`. Of course, the user can change `head` and `tail` to create more complicated DAGs. 

Once we have obtained a matrix of the desired form, we can plug it right into the disease model parameter `simulateTumor()`. 

```{r}
G[,3] <- rep(0.02, nrow(G))
out <- simulateTumor(max_pop = 1000, disease_model=G)
```

The `out` list is exactly the same as the returned by `simulateTumor()`. In particular all of the downstream functions described in the first vignette apply to the output of `simulateTumorUDT()`. 

We defined the disease model `G` to be a linear chain with $5$ vertices. We should thus expect mutation $0$ to be less frequent than mutation $1$, since $0$ is required for $1$ to occur. 

```{r}
out$muts 
```

Let's also take a look at a 2D cross section 

```{r, fig.width = 7.5, fig.height = 6}
plotSlice(out) 
```

### Converting from igraph

To define complicated networks, it is probably easier to obtain the matrix `G` from an `igraph` object. We provide functionality to do this. Let's start by whipping up a simple DAG:

```{r, fig.width = 7.5, fig.height = 6}
if (!requireNamespace("igraph", quietly = TRUE)) {
  stop("igraph needed to compile the vignette.",
      call. = FALSE)
}

el <- matrix(as.character(c(0,1,0,2,2,3,3,4,2,5,1,5)), ncol = 2, byrow = TRUE)
iG <- igraph::graph_from_edgelist(el)
plot(iG)
```
We'll assume that $0$ is the initial mutation (similar to the original model). 

Now we can convert this object into a matrix that can be used by `simulateTumor()` using `progressionDAG_from_igraph()`.

```{r}
G <- progressionDAG_from_igraph(iG)
G
```

And now we can simulate a tumor according to these constraints:

```{r}
out <- simulateTumor(max_pop=5000,disease_model=G, verbose = FALSE)
out$genotypes 
```


### Modeling loss-of-function mutations 

A loss-of-function mutation usually requires both copies of the gene to be mutated. A tumor cell may gain a selective advantage if a loss-of-function mutation occurs in the tumor suppressor gene *TP53*. However, it is possible that the cell is at a disadvantage if only one of the two copies are mutated. This might be referred to as "crossing the fitness valley." 

`simulateTumor()` is well-suited for studying a model of loss-of-function mutations. 

We can model mutations in *TP53* as follows: state $0$ corresponds to the wild-type allele, state $1$ corresponds to a cell with $1$ copy of *TP53*, and cell $2$ corresponds to a cell with both copies of *TP53* mutated. Furthermore, we will assume that a cell with only $1$ mutated copy of *TP53* has a selective disadvantage. However, a cell with mutations in both copies of *TP53* has a selective advantage. For simplicity, let's assume that both copies can't be lost in a single event, so that state $2$ cells can only arise from state $1$ cells. Using the functionality described above, we can simulate this model in just a few lines of code. 

```{r}
# Make a chain: 0 -> 1 -> 2
G <- progressionChain(2)

#Give a disadvantage to a cell in state 1 and an advantage to a cell in state 2 
G[1,4] <- 0.9; G[2,4] <- 1.4
G[,3] <- c(0.005,0.005)
G

#Simulate the model with N = 250000 cells
out <- simulateTumor(disease_model=G)
```

Let's see what a cross section looks like:
```{r, fig.width = 7.5, fig.height = 6}
plotSlice(out)
```

Cells with one copy of the gene mutated appear to spread out somewhat evenly while cells with both copies mutated are clustered together. Remember, if you are ever confused about which color is which just check:

```{r}
out$color_scheme 
```

Let's also check the MAFs 

```{r}
out$muts

out$genotypes
```

Of course, all of the other functions that we used in the introduction vignette can be applied to analyze this model as well. 


## Appendix
### The simulation algorithm 
The model is simulated using a Gillespie algorithm [@gillespie1977]. Given a population of $N$ cells at time $t$, cell $i$ is chosen to replicate with probability $b_i/\sum_{j=1}^N (b_j + d_j)$ (provided a free space is available) and die with probability $d_i/\sum_{j=1}^N(b_j + d_j)$. After an event is selected, the time is updated to be $t + X$, where 
$$ X \sim Expo \left( \sum_{j=1}^N b_j + d_j \right)$$
Our simulation algorithm approximates the parameter of the exponential distribution with 
$p_{max} = N \max_j (b_j + d_j)$. 

A standard approach is to use inverse transform sampling to select a cell for replication or death. However, this requires computing a cumulative sum over the birth and death rates for all unique alleles in the population, which is likely to scale linearly with $N$. We circumvent this issue by using rejection sampling. On each iteration, a random cell $i$ is selected uniformly from the population. Next, we obtain a sample $u$ from the distribution over $[0, p_{max}]$, where $p_{max}$ is as above. If $u < b_i + d_i$, then cell $i$ is selected to replicate or die. Otherwise, we proceed to the next iteration. Although there are contrived examples where rejection sampling is inefficient, the expected run-time is nearly constant for reasonable parameter values. 

### Session information
```{r}
sessionInfo()
```


# References