--- title: "Creating Linear Regression Matrices from Segment Data" author: "James Dalgleish" date: "July 25, 2018" output: rmarkdown::html_vignette vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Linear Regression/Postprocess} %\VignetteEncoding{UTF-8} --- ```{r setup, include=F} knitr::opts_chunk$set(echo = F) if(Sys.info()['sysname']=="Windows"){groupdir<-"W:/"} else {groupdir<-"/data/CCRBioinfo/"} #knitr::opts_knit$set(root.dir = paste0(groupdir,"dalgleishjl/hicnv/hicnv/vignettes/")) # #setwd("~") #nbl_input_matrix<-readRDS("NBLTCGA_merged_df_aggregated_by_bin_fixed_comparisonv4.rds") #getwd() nbl_input_matrix<-readRDS("NBL_sample_matched_input_matrix.rds") #nbl_result_matrix<-readRDS("nbl_result_matrix_full.rds") ``` From our previous work, we created a small input matrix, with segmented 1Mb regions as our row labels and with sample names from TARGET data as our column labels. We can read that in using the following code: ```{r,echo=T,warning=F,message=F} library(CNVScope) ``` ```{r,echo=T} nbl_input_matrix[1:5,1:5] ``` calcVecLMs() comes standard in the CNVScope package. It allows calculation of the matrix with parallel processing using mclapply, but larger matrices will require a bit more power, and thus we use slurm_apply, from the rslurm pacakge to distribute the work over multiple cores. Our particular establishment has a limit approximating 1000 jobs, so it's best not to use more than that unless your cluster will support it. Conversely, you should use less if you can't submit that many individual jobs in a job array in your cluster. In this particular example, I've removed rows where there is no segmentation data, across the board using colSums(). ```{r, eval=F,echo=T} library(parallel) nbl_slurm_object_test_zero_removed<-calcVecLMs(bin_data =as.data.frame(t(nbl_input_matrix[which(rowSds(as.matrix(nbl_input_matrix))!=0.0),])),use_slurm = T,n_nodes = 975,memory_per_node = "32g",walltime = "04:00:00",cpus_on_each_node = 2,job_finished = F,slurmjob = NULL) ``` Saving the slurm object is essential as it will be required when you retrieve your results. ```{r, eval=F,echo=T} saveRDS(nbl_slurm_object_test_zero_removed,"nbl_slurm_object_test_zero_removed.rds") ``` Retrieving the data is as simple as using rslurm::get_slurm_out() on the saved slurm object and coercing it into a matrix with the original number of columns. The slurm object must have been read with readRDS() previously or done in the same session. For the purposes of making this tutorial, we have chosen to work on a small version of the whole matrix to make 5MB CRAN documentation limits. Previous versions of the tutorial included the whole matrix, but we leave that to the user to construct at this point. For reproducibility, one can find the original full data matrix at https://github.com/jamesdalg/CNVScope_public_data. ```{r, eval=F,echo=T} library(matrixStats) nbl_result_matrix<-matrix(as.numeric(unlist( get_slurm_out(nbl_slurm_object_test_zero_removed))),ncol=ncol(as.data.frame(t(nbl_input_matrix[which(rowSds(as.matrix(nbl_input_matrix))!=0.0),])) ) ) saveRDS(nbl_result_matrix,"nbl_result_matrix_full.rds") saveRDS(nbl_result_matrix[1:25,1:25],"nbl_result_matrix_small.rds") ``` ```{r, echo=F,include=T,eval=F} #source download locations #download.file("https://github.com/jamesdalg/CNScope_public_data/blob/master/nbl_result_matrix_full.rds?raw=true","nbl_result_matrix_full.rds") #download.file("https://github.com/jamesdalg/CNScope_public_data/blob/master/nbl_result_matrix_sign_corrected.rds","nbl_result_matrix_sign_corrected.rds") #download.file("https://www.dropbox.com/s/sevuhos976c6guu/nbl_result_matrix_full.rds?dl=1","nbl_result_matrix_full.rds") #download.file("https://www.dropbox.com/s/85hii5cd5epuuby/nbl_result_matrix_sign_corrected.rds?dl=1","nbl_result_matrix_sign_corrected.rds") nbl_result_matrix<-readRDS("nbl_result_matrix_full.rds") nbl_result_matrix_sign_corrected<-readRDS("nbl_result_matrix_sign_corrected.rds") ``` You'll notice that there are no signs in this matrix (they're just negative log p-values, which are always positive). We'll have to assign signs by the correlation matrix next, then we will chunk the large matrix into smaller, flattened matrices that the shiny app can handle. For lower capacity machines/clusters, an alternative may be using the cor function. In order to perfrom sign correction,fix the "Inf values" to a viewable value, and restore column and row names, postProcessLinRegMatrix() can be applied, yielding a final full matrix of the entire genome (Chr1->ChrX on both Axes). 300 has been used, although something a bit smaller will reduce saturation issues depending on the disparity between the lowest values in the matrix and 300. We'll plot the result below, using complexheatmap and a custom designed function that takes large asymmetric distributions of values and pushes them into the [0,1] colorspace with white at 0.5, corresponding to zero, values between 0 and 0.5 corresponding to negative values, and values from 0.5 to 1 corresponding to positive values (signedRescale). ```{r} ``` ```{r,eval=T,echo=T} nbl_result_matrix_small<-readRDS("nbl_result_matrix_small.rds") nbl_result_matrix_small[1:5,1:5] nbl_result_matrix_sign_corrected<-postProcessLinRegMatrix(input_matrix = nbl_input_matrix[1:25,1:25],LM_mat = nbl_result_matrix_small,cor_type = "pearson",inf_replacement_val = 300) nbl_result_matrix_sign_corrected[1:5,1:5] ``` ```{r,echo=T} nbl_result_matrix_sign_corrected[1:5,1:5] if (requireNamespace("ComplexHeatmap", quietly = TRUE) & requireNamespace("circlize", quietly = TRUE)) { ComplexHeatmap::Heatmap(signedRescale(as.matrix(nbl_result_matrix_sign_corrected)), col = circlize::colorRamp2(c(0,0.5,1),c("blue","white","red")), cluster_rows = F,cluster_columns = F, show_heatmap_legend = F, show_column_names = F, show_row_names = F) } else { print("ComplexHeatmap not installed.\n Please install ComplexHeatmap in order to create this plot.") } ``` Finally, the whole genome matrix is too big to plot interactively without crashing most browsers using the plotly package. We'll need to break things apart a bit. A final function will write chromosomal pair heatmaps to disk with genes from ensembl (hg19 coordinates) in encoded for each square in the matrix. Please only use this function on the WHOLE matrix, not on the small subset we have provided in documentation. ```{r,eval=F,echo=T} if(!dir.exists("nbl_matrix_set")){dir.create("nbl_matrix_set")} #setwd("nbl_matrix_set") doMC::registerDoMC() #use ONLY the whole matrix with chromosomes 1-X, not the small subset provided for documentation purposes. createChromosomalMatrixSet(whole_genome_mat=nbl_result_matrix_sign_corrected,output_dir="nbl_matrix_set",prefix="nbl_") ``` ```{r,echo=T,eval=F} list.files("nbl_matrix_set") ``` There should be 529 of these particular files upon running the code. If there are not, don't hesitate to run the code again. It can happen on a cluster. It is built to detect when chromosomal matrix is already written to disk.