--- title: "Fusion Learning Vignette" author: "Yuan Zhong and Xin Gao" date: "`r Sys.Date()`" output: rmarkdown::html_vignette fig_caption: yes vignette: > %\VignetteIndexEntry{Fusion Learning Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- > [Introduction](#intro) > [Objectives](#obj) > [Data Input](#data) > [Main Functions](#main) > [Fusion Learning Algorithm](#algo) > [Unbalanced datasets](#unbalanced) > [Model Selection Criterion](#model) > [Data Applications](#app) ```{r include = FALSE} library(ggplot2) #library(ggalt) #library(ggfortify) #library(ggpubr) library(FusionLearn) #library(lattice) set.seed(123) cex = 0.5 ``` ##Introduction `FusionLearn` package implements a new learning algorithm to integrate information from different experimental platforms. The algorithm applies the grouped penalization method to select important predictors across multiple datasets. This package can be installed directly from CRAN. ```{r eval=FALSE} install.packages("FusionLearn",repos = "http://cran.us.r-project.org") ``` Users can also obtain more information about this package at https://CRAN.R-project.org/package=FusionLearn. ## Objectives In biological studies, different types of experiments are often conducted to analyze the same biological process, such as disease progression, immune response, etc. High dimensional data from different technologies can be generated to select important biological markers. The conventional approach, such as the penalized likelihood methods can be employed to perform variable selection on high dimensional data using regression model with the biological states as the response variables and the markers measurements as the predictor variables. However, the measurements across different experimental platforms or techniques follow different distributions and models. Measurements generated by different techniques from the same subject can be correlated. The traditional approach is to perform single platform analysis and select a subset of biomarkers based on each of the datasets. However, the selected lists often do not have an agreement across different platforms, and the correlation among the different data sources are not taken into consideration. To address this problem, we develop this R package to integrate correlated information across different platforms and select a consolidated list of biomarkers. #### High dimensionality For each platform, a generalized linear model with a linear or logistic link is constructed to model the relationship between the predictors and the responses. For each subject, the response variable measures the biological process of interest, such as the disease status or some quantitative measurement of the disease. The predictors across different platforms are the measurements of thousands of biomarkers. For the same predictor, its regression coefficients from different platforms are grouped into a vector. Then a penalty (LASSO or SCAD) is added to to the $L_2$ norm of the estimated vector parameter. By doing so, the estimated predictor effect with small $L_2$ norm will be shrunk to zero and deemed to be unimportant predictors. In this case, even if the effect sizes of the same predictor differ from one platform to another, the algorithm will make the decision based on the predictor's overall $L_2$ norm of the effect sizes across all platforms. In high dimension scenario of $p>n$, the penalized estimation will enforce the sparsity on the fitted model to avoid overfitting by thresholding the small predictor effect to zero. #### Heterogeneity It is often observed that genes perform differently on the different experimental platforms. Such discrepancy could be due to either biological reasons or data inconsistency caused by random variation. Instead of searching for genes with consistent performances across all of the platforms, we select the genes that their overall effect sizes measured by $L_2$ norm are great. Based on the overall effect size, the most important genes are selected to produce the candidate list. #### Dependent vs Independent datasets In addition, the algorithm offers great flexibility in terms of modeling independent or correlated datasets. If the observations across the platforms are independent, the sample sizes can be different across the datasets. The full likelihood is used to model the independent data. If the samples from different platforms are the same or related, the observations are correlated and the sample sizes will be the same across the platforms. The overall model fitting of all the platforms is evaluated by pseudo-likelihood. The Godambe information matrix of the pseudo-likelihood captures and reflects the covariance structure among different platforms. The new algorithm is able to handle unbalanced data sets across different platforms: 1) the algorithm can deal with the scenario that some predictors are only measured in some but not all platforms by adjusting the penalty size according to the number of platforms in which the predictor is involved; 2) the algorithm allows the sample size to vary from platform to platform if the samples are independent across different platforms. \vspace{8pt} The overview of the fusion learning algorithm applied on a cross-platform data set. Figure \ref{fig:01} presents an overview of the fusion learning algorithm applied on multiple-platform datasets. The algorithm maximizes the following objective function: \begin{eqnarray} Q(\beta) = l_I(\beta) - n \sum_{j=1}^p \Omega_{\lambda_n}(||\beta^{(j)}||), \end{eqnarray} where $l_I(\beta)$ is the integrated pseudo loglikelihood, and $\Omega_{\lambda_n}$ represents the penalty function with the tuning parameter $\lambda_n$. The penalty function imposes penalty on the $L_2$ -norm of the vector of the grouped coefficients ($\beta^{(j)} = (\beta_{1j}, \beta_{2j}, \cdots, \beta_{mj})^T$) for each of thepredictors $M_j$, $j=1,\dots,p.$ |Platforms | Response | $M_1$ | $\dots$ | $M_j$ | $\dots$ | $M_p$ | |:---------|:----------|:-------|:--------|:-------|:--------|:------| |1 | $y_1 \,\text{with}\, g_1(\mu_1) \sim$ | $x_{11}\beta_{11}$ | $\dots$ | $+x_{1j}\beta_{1j}$ | $\dots$ | $+x_{1p}\beta_{1p}$ | |2 | $y_2 \,\text{with}\,g_2(\mu_2) \sim$ | $x_{21}\beta_{21}$ | $\dots$ | $+x_{2j}\beta_{2j}$ | $\dots$ | $+x_{2p}\beta_{2p}$ | |$\cdots$ | | | | | | | |m | $y_m\,\text{with}\,g_m(\mu_m) \sim$ | $x_{m1}\beta_{m1}$ | $\dots$ | $+x_{mj}\beta_{mj}$ | $\dots$ | $+x_{mp}\beta_{mp}$ | ## Data Input The overall data consists of datasets from multiple platforms. For the $j$th platform, the dataset consists of a response vector $y_j$ of dimension $n_j$ and a predictor matrix $x_j$ of dimension $n_j \times p.$ Across different platforms, the measurements can be taken from the same samples or related samples, therefore the measurements are dependent across the platforms. If so, the samples across different experiments should be aligned and the sample size remains the same for all experiments. If the samples from different platforms are unrelated, the observations are independent across the platforms. The sample sizes $n_j$ can be different for different platforms. The predictors across different platforms should be aligned. We present the simulation of the datasets from four correlated platforms. In each platform, there are 100 samples, and each observation consists of 10 covariates as the candidate variables. The datasets across all platforms are jointly generated as follow. ```{r} library(MASS) library(mvtnorm) m <- 4 N <- 100 p <- 10 ``` ```{r include=TRUE} ## parameters rho <- 0.3 Rho <- rho*matrix(1,4,4) + diag(1-rho,4,4) sigma <- c(2,1,3,2) Sigma <- Rho * sigma * rep(sigma, each = nrow(Rho)) mu = c(-1,1,0.5,-0.5) ## data information name_pf = c("platform 1","platform 2", "platform 3","platform 4") name_id = paste("ID", seq(1:N)) name_cov = paste("M", seq(1:p)) ``` ```{r } x <- array(NA,dim = c(N,m,p)) for(k in 1:p){ x[,,k] <- mvrnorm(N,mu,Sigma) } ``` ```{r include = FALSE} dimnames(x) = list(name_id, name_pf, name_cov) ``` \vspace{8pt} ```{r echo=FALSE} print("Platform 1") print(round(x[1:5,1,1:5],3)) print("Platform 2") print(round(x[1:5,2,1:5],3)) print("Platform 3") print(round(x[1:5,3,1:5],3)) print("Platform 4") print(round(x[1:5,4,1:5],3)) ``` ```{r echo= FALSE} tp <- 5 beta <- array(0,dim = c(p,m)) for (j in 1:m){ beta[1:tp,j]<- c(1,1.5,1.2,2,1.8)*j beta[(tp+1):p,j]<-rep(0,(p-tp)) } rownames(beta) <- name_cov e <- rmvnorm(N,mu,Sigma) x1 = x[,1,] x2 = x[,2,] x3 = x[,3,] x4 = x[,4,] y1 = x1 %*% beta[,1] + e[,1] y2 = x2 %*% beta[,2] + e[,2] y3 = x3 %*% beta[,3] + e[,3] y4 = x4 %*% beta[,4] + e[,4] ``` To use the fusion learning functions on these datasets from four platforms, the response vectors and the predictors with the aligned order need to be listed. ```{r} x <- list(x1, x2, x3, x4) y <- list(y1, y2, y3, y4) ``` ## Main Functions `FusionLearn` package provides three functions `fusionbase`, `fusionbinary`, and `fusionmixed`, which can be employed to combine continuous, binary, or mixtures of continuous and binary responses from different platforms. ### Fusion Learning Algorithm #### Continuous Responses `fusionbase` is the function applied to the datasets with all continuous response variables. In the function `fusionbase`, users can set the penalty parameter `lambda` and choose the penalty functions such as `methods = scad` or `methods = lasso`. The function input also include $N$, the sample sizes for each platform and $p$, the number of the predictors and $m$, the number of the platforms. If the sample sizes are all equal, $N$ is given as a single value. If the sample sizes are different, $N$ is specified as a vector. If some predictors are only measured in some but not all of the platforms, then the datasets do not have complete information and users can specify the option `Complete = FALSE` . Detailed examples about this type of missing information are shown in Section 4.2. ```{r} result <- fusionbase(x, y, lambda = 0.9, N = N ,p = p, m = m, methods = "scad", Complete = TRUE) print(result) ## partial estimated parameters ``` The result of the algorithm yields 5 non-zero coefficient vectors and 5 zero coefficient vectors for the 10 predictors. Based on the penalized estimation results, the predictors with non-zero regression coefficients are selected as important predictors. Users can further use the model selection function `fusionbase.fit` in this package to obtain the value of the pseudo likelihood informaiton criterion for the model. The algorithm outputs include `beta`, `method`, `iteration`, and `threshold`. In the output `beta`, the estimated coefficients are listed. For the example above, the algorithm correctly selects all the five true non-zero coefficients. The figures show the non-zero coefficients and the linear models are well fitted with selected predictors. ```{r fig.align='center', fig.width= 4.5, fig.height = 3.5, echo = FALSE, fig.cap="Learning example of simulated continuous responses"} para <- data.frame(result$beta) rownames(para) <- name_cov g <- ggplot(para, aes(x=1:10)) g + geom_area(aes(y=X3+0.05, fill="Platform 3")) + geom_area(aes(y=X4+0.02, fill="Platform 4")) + geom_area(aes(y=X1*1.2+0.01, fill="Platform 1")) + geom_area(aes(y=X2*0.7+0.06, fill="Platform 2")) + labs(title="Estimated Coefficients", subtitle="Non-zero v.s. zero coefficients", caption="Continuous responses", y = "Estimation", x = " ") + scale_x_continuous( name = "Predictors", breaks = 1:10, labels = name_cov) + scale_y_discrete(labels = NULL) #+theme(panel.background = element_rect(fill = "white", colour = "grey50")) ``` #### Binary Responses If the responses from different platforms are all binary variables, users can use the function `fusionbinay`. Most function inputs are similar to the inputs of the function `fusionbase`. Users can specify the link functions `link = "logit"` or `link = "probit"` for the binary response variables. ```{r echo = FALSE} y1. <- y1 > 0 y2. <- y2 > 0 y3. <- y3 > 0 y4. <- y4 > 0 y.bin <- list(y1. ,y2., y3., y4.) ``` ```{r} result <- fusionbinary(x, y.bin, lambda = 0.15, N = N, p = p, m = m, methods = "scad", link = "logit") print(result) ``` ```{r fig.align='center',fig.width= 4.5, fig.height = 3.5, echo = FALSE, fig.cap = "Learning example of simulated binary responses"} para <- data.frame(result$beta) rownames(para) <- name_cov g <- ggplot(para, aes(x=1:10)) g + geom_area(aes(y=X1*1.4+0.01, fill="Platform 1")) + geom_area(aes(y=X2*1.2+0.015, fill="Platform 2")) + geom_area(aes(y=X3*1.1+0.005, fill="Platform 3")) + geom_area(aes(y=X4*0.8+0.02, fill="Platform 4")) + labs(title="Estimated Coefficients", subtitle="Non-zero v.s. zero coefficients", caption="Binary responses", y = "Estimation", x = " ") + scale_x_continuous( name = "Predictors", breaks = 1:10, labels = name_cov) + scale_y_discrete(labels = NULL) #+theme(panel.background = element_rect(fill = "white", colour = "grey50")) table(beta[,1]!=0, result$beta[,1]!=0) ``` #### Mixed-type Responses If the responses across multiple platforms contain both binary and continuous types, users can use the function `fusionmixed`. Besides all the inputs similarly required by `fusionbase`, the function requires the specification of the numbers of the platforms for each type of reponse (`m1`: the number of the platforms with continuous responses; `m2`: the number of platforms with binary responses). The `link` option is used to specify the link function for the binary response variables. ```{r echo = FALSE} y.mixed <- list(y1,y2, y3>0, y4>0) ``` ```{r} result <- fusionmixed(x, y.mixed, lambda = 0.4, N = N, p = p, m1 = 2, m2 = 2, methods = "scad", link = "logit") print(result) ``` ```{r fig.align='center', fig.width= 4.5, fig.height = 3.5, echo = FALSE, fig.cap = "Learning example of simulated mixed responses"} para <- data.frame(result$beta) rownames(para) <- name_cov g <- ggplot(para, aes(x=1:10)) g + geom_area(aes(y=X1+0.01, fill="Platform 1")) + geom_area(aes(y=X2*0.7+0.015, fill="Platform 2")) + geom_area(aes(y=X3*0.7+0.005, fill="Platform 3")) + geom_area(aes(y=X4*0.7+0.02, fill="Platform 4")) + labs(title="Estimated Coefficients", subtitle="Non-zero v.s. zero coefficients", caption="Mixed responses", y = "Estimation", x = " ") + scale_x_continuous( name = "Predictors", breaks = 1:10, labels = name_cov) + scale_y_discrete(labels = NULL) #+theme(panel.background = element_rect(fill = "white", colour = "grey50")) table(beta[,1]!=0, result$beta[,1]!=0) ``` ## Unbalanced datasets In practice, some predictors may not be measured in all of the platforms. Therefore, some of the datasets may not have a complete set of the predictors. The fusion learning package can handle this type of missing predictors. For example, the dataset in `platform 2` contains some predictors with all `NA` values, which means these predictors are not measured in the `platform 2`. In this case, the group penalization estimation will adjust the penalty with respect to the number of platoforms that predictor is involved in. ```{r echo=FALSE} x2. <- x2 x2.[,2] <- NA print(cbind(round(x2.[1:5,1:5],3))) ``` ```{r echo = FALSE} x <- list(x1, x2., x3, x4) y <- list(y1, y2, y3, y4) ``` Since the second predictor is not measured `platform 2`, the results of corresponding coefficients are `NA`. For predictors or responses missing at random, various imputation techniques and existing R packages can be employed to impute the missing predictors or responses. ```{r} result <- fusionbase(x, y, lambda = 1.3, N = N, p = p, m = m, methods = "scad", Complete = FALSE) print(result$beta) ``` ## Model Selection Criterion ```{r include = F} m <- 4 N <- 500 p <- 1000 ## parameters rho <- 0.3 Rho <- rho*matrix(1,4,4) + diag(1-rho,4,4) sigma <- c(2,1,3,2) Sigma <- Rho * sigma * rep(sigma, each = nrow(Rho)) mu =c(0,0,0,0) ## data information name_pf = c("platform 1","platform 2", "platform 3","platform 4") name_id = paste("ID", seq(1:N)) name_cov = paste("M", seq(1:p)) ## Main x <- array(NA,dim = c(N,m,p)) for(k in 1:p){ x[,,k] <- mvrnorm(N,mu,Sigma) } dimnames(x) = list(name_id, name_pf, name_cov) tp <- 40 beta <- array(0,dim = c(p,m)) for (j in 1:m){ beta[1:tp,j]<- runif(tp, 1,2) beta[(tp+1):p,j]<-rep(0,(p-tp)) } beta[c(2,4,6,10,16,20),] <- - beta[c(2,4,6,10,16,20),] rownames(beta) <- name_cov rho <- 0.3 Rho <- rho*matrix(1,4,4) + diag(1-rho,4,4) sigma <- c(2,1,3,2) Sigma <- Rho * sigma * rep(sigma, each = nrow(Rho)) e <- rmvnorm(N,mu,Sigma) x1 = x[,1,] x2 = x[,2,] x3 = x[,3,] x4 = x[,4,] y1 = x1 %*% beta[,1] + e[,1] y2 = x2 %*% beta[,2] + e[,2] y3 = x3 %*% beta[,3] + e[,3] y4 = x4 %*% beta[,4] + e[,4] x <- list(x1, x2, x3, x4) y <- list(y1, y2, y3, y4) ``` The performance of the variable selection depends on the size of the penalty parameter $\lambda_n.$ Larger $\lambda_n$ leads to more sparse models. To choose the size of the penalty parameter, users can use model selection criterion. The selection criterion included in `FusionLearn`is the pseudo-Bayesian information criterion (pseudo-BIC). \begin{eqnarray}\label{eq:02} \text{pseudo-BIC}(s) = -2l_I(\hat{\beta}_I) + d_s^* \gamma_n, \end{eqnarray} with $-2l_I(\hat{\beta}_I)$ denoting the pseudo loglikelihood, $d_s^{*}$ measuring the model complexity and $\gamma_n$ being the penalty on the model complexity. Three functions `fusionbase.fit`, `fusionbinary.fit`, and `fusionmixed.fit` provide the model selection criteria. In the next example, the simulation study is conducted in high dimensional settings. Suppose there are four correlated platforms with mixed-type responses, in which two platforms have continuous responses and two have binary responses. In each platform, we have 500 samples and 1000 predictors. Therefore, the dimension of parameters in each platform is more than the sample size. For a grid of penalty parameter values from $0.1$ to $2$, function `fusionbase.fit` can calculate the pseudo-Bayesian information criterion (pseudo-BIC) for each value of the penalty parameter. If the platforms are correlated, the function requires the input `depen = "CORR"` and the sample sizes across all platforms must be the same and the samples must be aligned across the platforms. The algorithm estimated the Godambe information matrix to capture the dpendence structure across different platforms. There is a multiplicative factor `a` to the second term in the information criterion \ref{eq:02}. It has a default value of one and users can specify higer values. Increasing the value of `a` is to impose heavier penalty on the model complexity in the model selection criterion. ```{r fig.align='center', echo = TRUE, fig.width= 4.5, fig.height = 3.5, fig.cap = "Example of model selection process in the fusion learning algorithm", warning = F} lambda <- c(0.19, 0.2, 0.3, 0.5, 0.6, 1.75, 2) model <- fusionbase.fit(x, y, lambda, N = 500, p = 1000, m = 4, depen = "CORR") print(model) ``` According to the results, when the penalty parameter `lambda` is $0.5$, the selected model has the minimum pseudo-BIC, and the minimum value remains in the range between $0.19$ and $2.00$. When the penalty parameter is too small, the selected model still has the number of predictors exceeding the number of samples, so that the information matrix can be singular. If the penalty parameter is too large, all predictors' coefficients will be penalized to zero. Therefore, we can set `lambda = 0.5` in the function `fusionbase` in the next step. ```{r} result <- fusionbase(x, y, lambda = 0.5, N = 500, p = 1000, m = 4) ``` The function `fusionbase` provides the results based on the penalty parameter `lambda = 0.5`. The estimation results indicate that the algorithm correctly identifies the true non-zero and true zero coefficients. ```{r echo=F} result_table = table(beta[,1]!=0,result$beta[,1]!=0 ) colnames(result_table) <- c("Est zero", "Est non-zero") rownames(result_table) <- c("True zero", "True non-zero") result_table ``` ## Data Applications We provide two examples in `FusionLearn` package. The stock index data is a special application of the fusion learning algorithm on financial data. In this example, the predictors are the 34 stocks, and the three responses are three different market indices. We treat each market index together with the 34 stocks as the dataset from one platform. This data is an example that the financial analysts may be interested in finding the common set of stocks which are influential to all of the three market indices. Since the platforms are correlated, we include the option `depen = "CORR"` in `fusionbase.fit`. The plot shows the minimum pseudo-BIC reaches when `lambda = 3.34`. The selected stocks are listed below. ```{r fig.align='center', fig.width= 4.5, fig.height = 3.5, fig.cap = "Model selection for the function `fusionbase.fit`"} ##analysis of the stock index data #Responses contain indices "VIX","GSPC", and "DJI" y <- list(stockindexVIX[,1],stockindexGSPC[,1],stockindexDJI[,1]) #Predictors include 46 stocks x <- list(stockindexVIX[,2:47],stockindexGSPC[,2:47],stockindexDJI[,2:47]) ##Implementing the model selection algorithm based on the psuedolikelihood ##information criteria model <- fusionbase.fit(x,y,seq(0.03,5,length.out = 10),232,46,3,depen="CORR") print(model) lambda <- model[which.min(model[,2]),1] result <- fusionbase(x,y,lambda,232,46,3) ##Identify the significant predictors for the three indices id <- which(result$beta[,1]!=0)+1 colnames(stockindexVIX)[id] ``` In the second example, we consider two microarray experiments and both of the experiments were conducted to investigate the estrogen status of breast cancer patients. We treat each experiment as one platform. Since the two experiments are independently conducted, the samples are independent across the two platforms and the sample sizes are different. In particular, one dataset contains 98 observations, and another one includes 286 observations. The predictors in each dataset are 1000 mock-version gene expressions. In the end, the algorithm selects three most important biomarkers to differentiate the estrogen status of the patients. ```{r fig.align='center'} ##Analysis of the gene data y = list(mockgene1[,2],mockgene2[,2]) ## responses "status" x = list(mockgene1[,3:502],mockgene2[,3:502]) ## 500 predictors ##Implementing fusion learning algorithm lambda <- seq(0.1,0.5, length.out = 10) result <- fusionbinary(x,y,0.3,N=c(98,286),500,2) id <- which(result$beta[,1]!=0)+2 genename <- colnames(mockgene1)[id] print(genename) ```