--- title: "PDN Vignette" author: "Javier Cabrera, Amanda Kowk and Zhenbang Wang" date: "August 15, 2017" output: rmarkdown::html_vignette vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Introduction to Package PDN} %\usepackage[UTF-8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Introduction A Personalized Disease Network (PDN) is a connected graph that represents the disease evolution of an individual patient. We use diagnoses, medical interventions, and their dates to build a PDN. Each node in the PDN represents an event (e.g. Heart failure) and each node is connected to another by a directed edge if the dates fulfill a certain criterion. PDN were introduced by Javier Cabrera and proved to be effective for improving goodness of fit of survival models, dominating individual comorbidity variables. (Cabrera et. al. 2016) ## Installation Like many other R packages, the simplest way to obtain glmnet is to install it directly from CRAN. Type the following command in R console: ```{r install, eval=FALSE} install.packages("PDN", repos = "https://cran.us.r-project.org") ``` Users may change the repos options depending on their locations and preferences. Other options such as the directories where to install the packages can be altered in the command. For more details, see help(install.packages). Here the R package has been downloaded and installed to the default directories. Alternatively, users can download the package source at "https://cran.r-project.org/package=PDN" and type Unix commands to install it to the desired location. ## Quick Overview The purpose of this section is to give users a general sense of the package, including the components, what they do and some basic usage. We will briefly go over the main functions, see the basic operations and have a look at the outputs. Users may have a better idea after this section what functions are available, which one to choose, or at least where to seek help. More details are given in later sections. First, we load the glmnet package: ```{r load, echo=TRUE} library(PDN) ``` The package contains five functions: __buildnetworks__, __datecut__, __draw.PDN__, __draw.PDN.circle__ and __draw.PDN.cluster__, and three sample data sets: __comorbidity_data__, __demographic_data__ and __survival_data__. You can visit the help page of the package using the following command: ```{r help, eval=FALSE} help(package = "PDN") ``` For __comorbidity_data__, its columns represent different diagnoses, its rows represent different patients, and the entries represent the time difference between the respective event and January 1st 1970. NA means the patient does not have the respective diagnosis. ```{r comorbidity_data, echo=TRUE} summary(comorbidity_data) ``` For __demographic_data__, it contains the demographic information of each patients It has five columns which are sex, race, hispan, dshyr and prime. ```{r demographic_data, echo=TRUE} summary(demographic_data) ``` For __survival_data__, it contains information of survival days from admission to death, as well as censor information regarding whether the patient is dead or not. ```{r survival_data, echo=TRUE} summary(survival_data) ``` ## Datecuts In order to create a PDN based on __comorbidity_data__, we need to find the criterion for each pair of diagnoses that define the connection of the network. Generally, it is important to choose an appropriate cutoff point such that it captures information of the interaction between the two diagnoses. For CMTHY and HTN, the distribution of the time difference is shown below: ```{r hist, echo=FALSE, fig.height = 3, fig.width = 5, fig.align = "center"} abc = comorbidity_data$CMTHY - comorbidity_data$HTN abc = abc[!is.na(abc)] bins=seq(floor(min(abc)),ceiling(max(abc)+1000),500) hist(abc,breaks=bins, yaxt="n", xaxt="n", main = "", ylab="Number of Patients", xlab="Time difference between CMTHY and HTN in days", col="grey" ) axis(side=2, at=axTicks(2), labels=formatC(axTicks(2), format="d", big.mark=',')) axis(side=1, at=axTicks(1), labels=formatC(axTicks(1), format="d", big.mark=',')) box() ``` We used Cox PH regression to find the best criterion for cutoff point following the rule proposed by (Cabrera et. al. 2016). ```{r datecut, echo=TRUE, warning=FALSE} k1 = datecut(comorbidity_data,survival_data[,1],survival_data[,2]) k1[1:20] ``` ## Build Networks By using function buildnetworks, we past __comorbidity_data__ and criterion information we get from the datecut into the function, then we will get the adjacency matrix for the network shown below: ```{r network, echo=TRUE} a = buildnetworks(comorbidity_data,k1) a[1:10,1:10] ``` ## Visualize Networks In our case, the nodes have chronological order, and we opt to incorporate this order in the network visualization for a more informative and clean result. Below left is the example PDN from function __draw.PDN.circle__. PDN for individual patient: ```{r plotnetwork, echo=FALSE, fig.align="center", fig.height=3, fig.width=5} datark = t(apply(comorbidity_data,1,rank)) dak = sort(datark[1,]) draw.PDN.circle(a[1,],dak) title("PDN for Patient 1") ``` Here is the example using the functioin __draw.PDN__, which utilize network and ggplot2 for better visualization: ```{r plotnetwork2, echo=FALSE, fig.align="center", fig.height=4, fig.width=5} nn = names(comorbidity_data) draw.PDN(a[7,],nn) ``` By looping each patient through draw.PDN.circle, we obtain the visualizations for the set of patients: ```{r plotnetwork1, echo=FALSE, fig.height = 5, fig.width = 7, fig.align = "center"} par(mfrow=c(4,5)) for(i in 1 : 20){ dak = apply(datark,2,sort) draw.PDN.circle(a[i,],dak[i,]) title(main=paste("Patient",i)) } ``` Here is an example of using __draw.PDN.cluster__ to visulaize network of cluster, we use different colors to represent weighted connection of each nodes. ```{r plotnetwork3, echo=FALSE, fig.align="center", fig.height=5, fig.width=7, warning=FALSE} ##Clustering Example k1 = datecut(comorbidity_data,survival_data[,1],survival_data[,2]) a = buildnetworks(comorbidity_data,k1) datark = t(apply(comorbidity_data,1,rank)) require(survival) zsq = NULL for(i in 1:ncol(a)){ a1 = (summary(coxph(Surv(as.numeric(survival_data[,1]),survival_data[,2]) ~ a[,i], data=as.data.frame(a)))$coefficient[,4])^2 zsq = cbind(zsq,a1) } zsq = as.numeric(zsq) wi=zsq/sum(zsq,na.rm=TRUE) wi[wi<10^ -3]=10^ -3 wi[is.na(wi)]=10^ -3 #weighted matrix wa = NULL for(i in 1:ncol(a)){ wa = cbind(wa,a[,i]*wi[i]) } #PCA pr.out=prcomp(wa) x.svd=svd(wa) x.score1 <- wa %*% x.svd$v x.score2 <- x.svd$u %*% diag(x.svd$d) ##HC cluster using the first 8 PCA scores dp<-dist(x.score2[,1:8]) hcp<-hclust(dp, method="ward.D") ##Plot Network s1=rev(sort(apply(a[cutree(hcp,3)==3,],2,mean)))[1:50] dak = sort(apply(datark[cutree(hcp,3)==3,],2,mean,na.rm=TRUE)) names(dak) = unlist(strsplit(names(dak),"DAT")) draw.PDN.cluster(s1,dak) ``` ## Analyze Networks In the below example using PDN network information to perform regression, we combine comorbidity information and the adjacency matrix from buildnetworks as our design matrix which is shown below: ```{r glmnet, echo=TRUE} x = cbind(!is.na(comorbidity_data),a) x[1:10,1:10] ``` We then use the __glmnet__ package to perform k-fold cross-validation for the Cox model. We extract two optimal lambdas, that is lambda.min: the ?? at which the minimal MSE is achieved, and lambda.1se: the largest ?? at which the MSE is within one standard error of the minimal MSE. We can check the active covariates in our model and see their coefficients based on the lambda we have chosen. ```{r glmnet1, echo=TRUE} require(survival) require(glmnet) y=Surv(survival_data[,1],survival_data[,2]) glm1 = cv.glmnet(x,y,family="cox",alpha=0.8) b = glmnet(x,y,family="cox",alpha=0.8,lambda=c(glm1$lambda.min,glm1$lambda.1se))$beta b[1:20,] ``` Through stepwise model selection based on the AIC criterion, we can see that the network variables dominate individual comorbidity variables. We conclude that the network information is significant for improving the model compared to just comorbidity information. ```{r glmnet2, eval=FALSE} isel = b[,2]!=0 bcox = coxph(y~.,data=data.frame(x[,isel])) step1 = step(bcox) step1$anova ```