--- title: "SAVER Tutorial" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: TRUE vignette: > %\VignetteIndexEntry{SAVER Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r eval_saver, include = FALSE} # Whether or not to evaluate saver. The generated vignette was run setting it # to be TRUE but since running requires multiple cores, this was set to be # FALSE for purposes of submission to CRAN. eval.saver <- FALSE ``` ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = eval.saver) ``` SAVER (Single-cell Analyses Via Expression Recovery) is a method for denoising single-cell RNA sequencing data by borrowing information across genes and cells. Here, we demonstrate how to use the method. For more information, take a look at the [Github page](https://github.com/mohuangx/SAVER) and the [paper](https://doi.org/10.1038/s41592-018-0033-z). ## Installation You can install the most recent updates of SAVER from github with: ```{r, eval = FALSE} # install.packages("devtools") devtools::install_github("mohuangx/SAVER") ``` To install the stable release of SAVER: ```{r, eval = FALSE} # install.packages("devtools") devtools::install_github("mohuangx/SAVER@*release") ``` ## Getting Started Once we have the package installed, we can load the package. ```{r, eval = TRUE} library(SAVER) packageVersion("SAVER") ``` In this tutorial, we will run SAVER on the [mouse cortex data](https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt) from [Zeisel (2015)](http://www.sciencemag.org/content/early/2015/02/18/science.aaa1934.abstract). ```{r} data.path <- "../data/expression_mRNA_17-Aug-2014.txt" # Need to remove first 10 rows of metadata raw.data <- read.table(data.path, header = FALSE, skip = 11, row.names = 1, check.names = FALSE) cortex <- as.matrix(raw.data[, -1]) cellnames <- read.table(data.path, skip = 7, nrows = 1, row.names = 1, stringsAsFactors = FALSE) colnames(cortex) <- cellnames[-1] dim(cortex) ``` ### Preprocessing The input to SAVER is a matrix of gene expression counts with genes along the rows and cells along the columns. SAVER was developed for use on UMI count data but it seems to work well with non-UMI TPM data. Standard quality control such as removing low quality cells and filtering out genes with low expression should be performed prior to running SAVER. Here, the Zeisel dataset has already been preprocessed. ### Running SAVER The main function to run SAVER is the `saver` function. Since SAVER is computationally intensive for large datasets, we recommend running it in parallel on a cluster either by creating your own parallel environment or by specifying `ncores`. We will run SAVER on the Zeisel dataset across 12 cores. ```{r} cortex.saver <- saver(cortex, ncores = 12) str(cortex.saver) ``` `cortex.saver` is a saver object with the following components: * `estimate` gives the library size normalized SAVER estimates. * `se` gives the standard error of the estimates * `info` gives the more information about the run. If you are only interested in the estimate, you can run `saver` setting `estimates.only = TRUE`. For example, ```{r, eval=FALSE} cortex.saver <- saver(cortex, ncores = 12, estimates.only = TRUE) ``` ### Using SAVER estimates The SAVER estimates represent the library size normalized posterior means of the recovered gene expression. They can be used as input to many of the common downstream analysis methods such as [Seurat](https://satijalab.org/seurat/) and [Monocle](http://cole-trapnell-lab.github.io/monocle-release/). ## Additional Settings ### Normalization By default, SAVER takes in an unnormalized count matrix and performs library size normalization during the denoising step. However, if your data is already normalized or normalization is not desired, you can set `size.factor = 1`. In addition, if you want to use custom cell size factors, you can set `size.factor` to be a vector of size factors. ### Predicting certain genes SAVER has two main steps: the first is the prediction step and the second is a shrinkage step. The prediction step is the time-consuming part of SAVER. If you are only interested in a handful of genes, you can specify those genes and generate predictions for those genes only and choose to return the entire data matrix or for those genes only. For example, ```{r, eval=FALSE} # Identify the indices of the genes of interest genes <- c("Thy1", "Mbp", "Stim2", "Psmc6", "Rps19") genes.ind <- which(rownames(cortex) %in% genes) # Generate predictions for those genes and return entire dataset cortex.saver.genes <- saver(cortex, pred.genes = genes.ind, estimates.only = TRUE) # Generate predictions for those genes and return only those genes cortex.saver.genes.only <- saver(cortex, pred.genes = genes.ind, pred.genes.only = TRUE, estimates.only = TRUE) ``` ### Using other predictions The main contribution of SAVER is the shrinkage step after generating the predictions. The shrinkage step allows the algorithm to adaptively evaluate the quality of the predictions and weigh the estimates either towards the predictions or the observed value. SAVER performs a Lasso Poisson regression for each gene using other genes as covariates to generate the predictions. However, if you want to use other predictions such as [MAGIC](https://github.com/KrishnaswamyLab/MAGIC) or [DCA](https://github.com/theislab/dca), you can input the results in matrix format into the `mu` parameter of the `saver` function. ### Creating your own parallel backend By default, if you specify `ncores` without already having a parallel backend set up, `saver` will make a default cluster using the `makeCluster` function in the parallel package and register a parallel backend via the `registerDoParallel` function in the doParallel package. However, if you want to set up a more complicated backend, such as using MPI across nodes, you can do so and run `saver` without specifying the `ncores` option. ### Combining SAVER objects Running SAVER on large datasets may run into memory issues, especially when parallelized across cores where each core has limited memory. One solution would be to register a parallel backend using SOCK or MPI to parallelize across nodes. Another solution would be to split the genes into multiple runs using `pred.genes` and `pred.genes.only`. For example, say we have a dataset `x` with 10,000 genes. We can split this up into 4 runs on different machines and combine using the `combine.saver` function. We recommend setting `do.fast = FALSE` when splitting the dataset. ```{r, eval=FALSE} saver1 <- saver(x, pred.genes = 1:2500, pred.genes.only = TRUE, do.fast = FALSE) saver2 <- saver(x, pred.genes = 2501:5000, pred.genes.only = TRUE, do.fast = FALSE) saver3 <- saver(x, pred.genes = 5001:7500, pred.genes.only = TRUE, do.fast = FALSE) saver4 <- saver(x, pred.genes = 7501:10000, pred.genes.only = TRUE, do.fast = FALSE) saver.all <- combine.saver(list(saver1, saver2, saver3, saver4)) ``` ### Sampling example Oftentimes for downstream analysis, you may want to sample from the posterior distributions to account for estimation uncertainty. This is similar to performing multiple imputation to obtain multiple imputed datasets. To sample from the posterior distributions, we use the function `sample.saver`, which takes in the `saver` result and outputs sampled datasets. For example, ```{r, eval=FALSE} samp1 <- sample.saver(saver1, rep = 1, seed = 50) samp5 <- sample.saver(saver1, rep = 5, seed = 50) ``` ### Correlation example Because the SAVER estimates contain uncertainty, correlations between genes and cells cannot be directly calculated using the SAVER estimates. To adjust for the uncertainty, we provide the functions `cor.genes` and `cor.cells` to construct gene-to-gene and cell-to-cell correlation matrices respectively for the SAVER output. These functions take in the `saver` result and outputs the gene-to-gene or cell-to-cell correlation matrix. For example, ```{r, eval=FALSE} saver1.cor.gene <- cor.genes(saver1) saver1.cor.cell <- cor.cells(saver1) ```