--- title: "A User Manual for PRANA" author: "Seungjun Ahn" date: "July 22nd, 2024 (Version 1.0.5)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{UserManualPRANA} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- We introduce a Pseudo-value Regression Approach for Network Analysis (PRANA) [1]. To our knowledge, this is the first attempt of utilizing a regression modeling for the differential network (DN) analysis by collective gene expression levels under two experimental conditions ($\textit{e.g.}$ 'current' vs. 'non-current smokers' or 'high-risk' vs. 'low-risk'). We start from the mutual information (MI) criteria, followed by pseudo-value calculations, which are then entered into a robust regrssion model. ```{r, include = FALSE} options(rmarkdown.html_vignette.check_title = FALSE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Requirements Please install these R packages prior to use PRANA. Note that minet package is available in Bioconductor. Please see below for further instruction. Of important note (as of July 2024), dnapath R package is archived in CRAN as of May 2024, which no longer becomes a requirement for the installation of PRANA. Instead, run_aracne() function is available in our package as of version 1.0.5 to obtain mutual information (MI) estimate via ARACNE for network estimation step. ```{r} # library(dplyr) # library(parallel) # To use mclapply() when reestimating the association matrix. # library(robustbase) # To fit a robust regression # library(minet) # To estimate network via ARACNE # Please run the following lines to install minet package # from Bioconductor in your R console: #if (!require("BiocManager", quietly = TRUE)) # install.packages("BiocManager") #BiocManager::install("minet") ``` ```{r setup} library(PRANA) ``` ### Example: Load the COPDGene study data from PRANA R package: In the original article [1], a real-data analysis was performed to showcase the utility of PRANA. This contains clinical and expression data for 406 samples and 28 COPD-related genes that were highlighted in a recent genome-wide association study [2]. The full data is available from the Gene Expression Database with accession number GSE158699 [3]. ```{r} data(combinedCOPDdat_RGO) # A complete data containing expression and clinical data. ``` ### Data processing for the example data from COPDGene study: The main variable of our interest in this analysis is the current smoking status. We obtain the indices of subjects who are 'current' vs. 'non-current smokers.' These indices are used to dichotomize expression dataset into 'current (Group B)' and 'non-current smokers (Group A).' This is important as the we estimate the group-specific $p \times p$ association matrices. ```{r} # A gene expression data part of the downloaded data. rnaseqdat <- combinedCOPDdat_RGO[ , 8:ncol(combinedCOPDdat_RGO)] rnaseqdat <- as.data.frame(apply(rnaseqdat, 2, as.numeric)) # A clinical data with additional covariates sorted by current smoking groups: # The first column is ID, so do not include. phenodat <- combinedCOPDdat_RGO[order(combinedCOPDdat_RGO$currentsmoking), 2:7] # Indices of non-current smoker (namely Group A) index_grpA <- which(combinedCOPDdat_RGO$currentsmoking == 0) # Indices of current smoker (namely Group B) index_grpB <- which(combinedCOPDdat_RGO$currentsmoking == 1) ``` ### Apply the PRANA: Once the data processing is done, we are off to apply the PRANA. In `PRANA` function, please make sure you specify the expression and clinical data separately. Additionally, you will need to provide the indices for each group of a binary indicator variable that is deemed as a main predictor variable. ```{r} PRANAres <- PRANA(RNASeqdat = rnaseqdat, clindat = phenodat, groupA = index_grpA, groupB = index_grpB) ``` ### Some supporting features after the use of PRANA: In this package, there are some additional R functions that can help you locating the name of genes that are significantly differentially connected (DC) and names and adjusted p-values via the empirical Bayes adjustment [4] for all variables and for a specific variable. ```{r} # This is useful when we want to create a table with adjusted p-values only. adjpval(PRANAres) ``` Back to the description of the original article by Ahn $\textit{et al.}$, the main interest is in the current smoking status to declare whether a gene is DC between current and non-current smokers at the user-specified significance level. Thus, we subset the vector of p-values for current smoking status from the `adjpvaltab` object, aka an object with adjusted p-values and gene names using `adjpval` function above. ```{r} # Create an object to keep the table with adjusted p-values using adjpval() function. adjptab <- adjpval(PRANAres) ``` `adjpval_specific_var` is a handy function to retrieve adjusted p-values with gene names for a single variable of interest. ```{r} # NOTE: Please do NOT forget to provide a name of variable with the quotation marks! adjpval_specific_var(adjptab = adjptab, varname = "currentsmoking") ``` If you would like to quickly know which genes are significantly DC for your main binary grouping variable, please use `sigDCGtab` to proceed. This will return a table with adjusted p-values and gene names. ```{r} # NOTE: Please do NOT forget to provide a name of variable with the quotation marks! sigDCGtab <- sigDCGtab(adjptab = adjptab, groupvar = "currentsmoking", alpha = 0.05) sigDCGtab ``` Lastly, a function below, called `sigDCGnames`, will be useful when you just want to return the names of the significantly DC genes from PRANA at the 0.05 significance level. ```{r} # NOTE: Please do NOT forget to provide a name of variable with the quotation marks! sigDCGnames <- sigDCGnames(adjptab = adjptab, groupvar = "currentsmoking", alpha = 0.05) sigDCGnames ``` As an additional step, we can use `rename_genes` function to rename Entrez gene IDs into gene symbols. However, it is currently not available since `dnapath` package is currently archived. A user may manually install the archived version of `dnapath` from the CRAN, and follow the R code below. ```{r} #rename_genes(sigDCGnames, to = "symbol", species = "human") ``` \newpage ## References [1] Ahn S, Grimes T, Datta S. (2023). A pseudo-value regression approach for differential network analysis of co-expression data. $\textit{BMC Bioinformatics}$. **24**(1), 8. [2] Sakornsakolpat P, Prokopenko D, Lamontagne M, and et al. (2019). Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. $\textit{Nature Genetics}$, **51**(3), 494–505. [3] Wang Z, Masoomi A, Xu Z, and et al. (2021). Improved prediction of smoking status via isoform-aware RNA-seq deep learning models. $\textit{PLoS Computational Biology}$, **17**(10), e1009433. [4] Datta S, Datta S. (2005). Empirical Bayes screening of many p-values with applications to microarray studies. $\textit{Bioinformatics}$, **21**(9), 1987–1994.