--- title: "Multivariate tools for compositional data analysis: the ToolsForCoDA package" author: - Jan Graffelman - Dpt. of Statistics, Universitat Politecnica de Catalunya; Dpt. of Biostatistics, University of Washington date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: ToolsForCoDa.bib vignette: > %\VignetteIndexEntry{Multivariate tools for compositional data analysis: the ToolsForCoDA package} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) knitr::opts_chunk$set(fig.align = "center") knitr::opts_chunk$set(warnings = FALSE) knitr::opts_chunk$set(message = FALSE) knitr::opts_chunk$set(fig.width = 6, fig.height = 6) rm(list=ls()) ``` ## Introduction
The **ToolsForCoDa** package was originally created in order to provide functions for a canonical correlation analysis with compositional data (@Graffelman2018), based on the centred logratio (clr) transformation of the compositions. Posteriorly, it has been extended with additional tools for the multivariate analysis of compositional data in the R environment. Currently, this package (version 1.0.9) provides functionality for * log-ratio principal component analysis (LR-PCA). * log-ratio canonical correlation analysis (LR-CCO). * log-ratio discriminant analysis (LR-LDA). Both CCO and LDA rely on the inversion of a covariance matrix. The covariance matrix of the clr transformed compostions is structurally singular. The programs `lrcco` and `lrlda` resolve this with the use of a generalized inverse. Functionality for the analysis of compositional data in the R environment can be found in the packages **compositions** (@compositions), **robCompositions** (@robCompositions), **easyCODA** (@easyCODA) and **zCompositions** (@zCompositions). For further reading on compositional data, see the seminal text on compositional data by Aitchison (@Aitchison) and several recent statistical textbooks (@Pawlowsky, @Filzmoser, @Greenacre, @VanDenBoogaart). The remainder of this vignette shows an example session showing how to perform the three aforementioned types of analysis. 1. [Installation](#installation) 2. [LR-PCA](#lrpca) 3. [LR-CCO](#lrpco) 4. [LR-LDA](#lrlda) ## 1. Installation ```{r preinstall} #install.packages("ToolsForCoDa") library(calibrate) library(Correlplot) library(ToolsForCoDa) ``` ## 2. Logratio principal component analysis (LR-PCA) We consider the composition of 37 Pinot Noir samples, consisting of the concentrations of Cd, Mo, Mn, Ni, Cu, Al, Ba, Cr, Sr, Pb, B, Mg, Si, Na, Ca, P, K and an evaluation of the wine's aroma. (@FrankKowalski). ```{r pinotnoir} data(PinotNoir) head(PinotNoir) Aroma <- PinotNoir[,c("Aroma")] ``` We apply closure to the chemical concentrations by division by their total, and use `lrpca` to do perform LR-PCA. ```{r closure} X <- as.matrix(PinotNoir[,1:17]) Xc <- X/rowSums(X) out.lrpca <- lrpca(Xc) ``` We study the decomposition of compositional variance, and the decay of the LR-PCA eigenvalues by means of a screeplot ```{r decomposition} round(out.lrpca$decom,2) plot(1:ncol(out.lrpca$decom), out.lrpca$decom[1,],type="b",main="Scree-plot", xlab="PC",ylab="Eigenvalue") ``` We construct a covariance biplot, using `jointlim` to establish sensible limits for the x and y axes. Column markers for the clr transformed variables are multiplied by a constant (2.5) for a better visualization, and the amount of explained variance is indicated on the coordinate axes. ```{r biplot} lims <- jointlim(out.lrpca$Fs,2.5*out.lrpca$Gp) bplot(out.lrpca$Fs,2.5*out.lrpca$Gp,rowlab="",collab=colnames(Xc),rowch=1,colch=NA, xl=lims$xlim,yl=lims$ylim,main="Covariance biplot") pc1lab <- paste("PC1 (",toString(round(100*out.lrpca$decom[2,1],1)),"%)",sep="") pc2lab <- paste("PC2 (",toString(round(100*out.lrpca$decom[2,2],1)),"%)",sep="") text(1,-0.1,pc1lab,cex=0.75) text(0.1,1.5,pc2lab,cex=0.75,srt=90) ``` This biplot reveals that the logratio $\ln{(Na/Pb)}$ has a large variance and is tightly correlated to the first principal component. The variable `Aroma` correlates with the first principal components ```{r aroma} cor(Aroma,out.lrpca$Fs[,1:2]) ``` and as the biplot suggests, `Aroma` correlates positively with the logratio $\ln{(Cr/Sr)}$ ```{r correlation} logScSr <- log(Xc[,c("Cr")]/Xc[,c("Sr")]) cor(Aroma,logScSr) ``` We note function `lrpca` also calculates condition indices, which may prove useful for detecting proportionality or one-dimensional relationships (@Graffelman2021). ## 3. Logratio canonical correlation analysys (LR-CCO) Two examples of LR-CCO are given below. The first example concerns a small artificial data set, where both the X and Y set are compositional, and is described in Section 3.1 of Graffelman et al. (2018). The second example concerns major oxides compositions of bentonites, where the X set is compositional and Y set is not. ### 3.1 Artificial data We first load two artificial 3-part compositions. ```{r artificial} data("Artificial") Xsim.com <- Artificial$Xsim.com Ysim.com <- Artificial$Ysim.com colnames(Xsim.com) <- paste("X",1:3,sep="") colnames(Ysim.com) <- paste("Y",1:3,sep="") ``` We make the ternary diagrams of the two sets of compositions ```{r ternaries, fig.width = 6, fig.height = 3} opar <- par(mfrow=c(1,2),mar=c(2,2,1,0)+0.5,pty="s") par(mfg=c(1,1)) ternaryplot(Xsim.com,pch=1) par(mfg=c(1,2)) ternaryplot(Ysim.com,pch=1) par(opar) ``` We perform the compositional canonical analysis: ```{r lrcco} out.lrcco <- lrcco(Xsim.com,Ysim.com) ``` And we reproduce the results in Table 1 of Graffelman et al. (2018). The canonical correlations are obtained as ```{r cancortab} round(diag(out.lrcco$ccor),digits=3) ``` The canonical weights of the X set and the Y set are obtained by: ```{r canweights} out.lrcco$A out.lrcco$B ``` The canonical loadings of the X set and the Y set are obtained by ```{r canloadings} out.lrcco$Rxu out.lrcco$Ryv ``` The adequacy coefficients of the X set and the Y set: ```{r canadequacy} out.lrcco$fitXs out.lrcco$fitYs ``` The redundancy coefficients of the X set and the Y set ```{r canredundancey} out.lrcco$fitXp out.lrcco$fitYp ``` Finally, we make the full set of biplots for LR-CCO given in Figure 2 (@Graffelman2018). In each biplot, the canonical variates are multiplied by a convenient scalar to facilitate the visualization. ```{r panelbiplots} opar <- par(mfrow=c(2,2),mar=c(2,2,1,0)+0.5,mgp=c(2,1,0)) par(mfg=c(1,1)) # # Figure A # Z <- rbind(out.lrcco$Fs,out.lrcco$Gp) plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="A") arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1) arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1) text(out.lrcco$Fs[,1],out.lrcco$Fs[,2], c(expression(X[1]),expression(X[2]),expression(X[3]))) text(out.lrcco$Gp[,1],out.lrcco$Gp[,2], c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1)) grid() fa <- 0.15 points(fa*out.lrcco$U[,1],fa*out.lrcco$U[,2]) par(mfg=c(1,2)) # # Figure B # Z <- rbind(out.lrcco$Fp,out.lrcco$Gs) plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="B") arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1) arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1) text(out.lrcco$Fp[,1],out.lrcco$Fp[,2], c(expression(X[1]),expression(X[2]),expression(X[3]))) text(out.lrcco$Gs[,1],out.lrcco$Gs[,2], c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1)) grid() fa <- 0.25 points(fa*out.lrcco$V[,1],fa*out.lrcco$V[,2]) # # Standardizing the clr transformed data # Xstan.clr <- scale(clrmat(Xsim.com)) Ystan.clr <- scale(clrmat(Ysim.com)) res.stan.cco <- canocov(Xstan.clr,Ystan.clr) par(mfg=c(2,1)) # # Figure C # Z <- rbind(res.stan.cco$Fs,res.stan.cco$Gp) plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="C") arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1) arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1) text(res.stan.cco$Fs[,1],res.stan.cco$Fs[,2], c(expression(X[1]),expression(X[2]),expression(X[3]))) text(res.stan.cco$Gp[,1],res.stan.cco$Gp[,2], c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1)) grid() fa <- 0.2 points(fa*res.stan.cco$U[,1],fa*res.stan.cco$U[,2]) circle() par(mfg=c(2,2)) # # Figure D # Z <- rbind(res.stan.cco$Fp,res.stan.cco$Gs) plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="D") arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1) arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1) text(res.stan.cco$Fp[,1],res.stan.cco$Fp[,2], c(expression(X[1]),expression(X[2]),expression(X[3]))) text(res.stan.cco$Gs[,1],res.stan.cco$Gs[,2], c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1)) grid() fa <- 0.25 points(fa*res.stan.cco$V[,1],fa*res.stan.cco$V[,2]) circle() par(opar) ``` Panel A shows the logratios $\log{(x_2/x_3)}$ and $\log{(y_1/y_2)}$ to have long links that run parallel to the first canonical variate with the largest canonical correlation; these logratios are highly correlated. The canonical biplot shows the association between the two sets of compositions, which is not visible in the ternary diagrams above. ### 3.2 Canonical analysis of bentonites In this subsection we treat the canonical analysis of bentonites. The X set concerns the concentrations of 9 major oxides, measured in 14 samples in the US (@Cadrin). A canonical analysis of this data set has been previously described (@Reyment), and is extended here with biplots. The Y set concerns two isotopes, $\delta D$ and $\delta 18O$. ```{r bentonites} data("bentonites") head(bentonites) ``` We clr-transform and column-center the major oxides, after deletion of MnO which is outlying and had many zeros, which were replaced with 0.001. We standardize the isotopes. ```{r clrbentonites} X <- bentonites[,1:9] X <- X[,-4] X <- X/rowSums(X) Y <- scale(bentonites[,10:11]) Xclr <- clrmat(X) cco <- canocov(Xclr,Y) ``` The two canonical correlations are large: ```{r twocancor} diag(cco$ccor) ``` We construct a biplot of the data: ```{r biplotbentonites} plot(cco$Fs[,1],cco$Fs[,2],col="red",pch=NA,xlab="First principal axis", ylab="Second principal axis",xlim=c(-1,1),ylim=c(-1,1),asp=1) textxy(cco$Fs[,1],cco$Fs[,2],colnames(X),cex=0.75) arrows(0,0,cco$Gp[,1],cco$Gp[,2],angle=10,col="blue",length=0.1) arrows(0,0,cco$Fs[,1],cco$Fs[,2],angle=10,col="red",length=0.1) text(cco$Gp[,1],cco$Gp[,2],colnames(Y),pos=c(3,1)) fa <- 0.45 points(fa*cco$U[,1],fa*cco$U[,2]) textxy(fa*cco$U[,1],fa*cco$U[,2],1:14) ``` We overplot the biplot with the canonical X-variates, which allows one to inspect the original samples (@Graffelman2005). For plotting, the canonical variate is scaled with a convenient scaling factor (here 0.45). This factor does not affect the interpretation of the biplot, but gives the samples a convenient spread. The logratio $\log{(Na/Mg)}$ (among others) almost coincides with the first canonical variate, which correlates with $\delta 18O$. However, interpretation should proceed with care because of the small sample size. ```{r lognaca} lnNaCa <- log(X[,c("Na")]/X[,c("Mg")]) cor(Y[,c("d18O")],lnNaCa) ``` ## 4. Logratio discriminant analysis (LR-LDA) We use archeological data from the UK (@Tubb) to illustrate LR-LDA. This dataset consists of measurements of nine oxides on 48 archeological samples from three regions in the UK. We first prepare the data: ```{r tubbdata} data(Tubb) head(Tubb) site <- factor(Tubb$site) Oxides <- as.matrix(Tubb[,2:10]) rownames(Oxides) <- Tubb$Sample Oxides <- Oxides/rowSums(Oxides) ``` Next, we carry out LR-LDA by passing the compositions in `Oxides` to the function `lrlda`. Internally, `lrlda` applies the clr transformation of the data. ```{r lrldatubbdata} out.lrlda <- lrlda(Oxides,site) ``` The group sizes are obtained with: ```{r groupsizes} out.lrlda$gsizes ``` The group mean vectors of the clr transformed compositions are given by: ```{r groupmeans} out.lrlda$Mclr ``` The scores of the linear discriminant function are obtained by: ```{r ldscores} head(out.lrlda$LD) ``` The confusion matrix for the training observations is: ```{r confusion} out.lrlda$CM ``` Posterior probabilities for the classifications are obtained by ```{r posteriorprobs} head(round(out.lrlda$prob.posterior,4)) ``` We extract biplot coordinates for group centers, individual observations and variables, and construct the LDA biplot. ```{r biplotcoordinates} Fp <- out.lrlda$Fp Gs <- out.lrlda$Gs LD <- out.lrlda$LD colvec <- rep(NA,nrow(Oxides)) colvec[site=="G"] <- "red" colvec[site=="NF"] <- "green" colvec[site=="W"] <- "blue" lims <- jointlim(LD,Fp) opar <- par(bty="n",xaxt="n",yaxt="n") plot(Fp[,1],Fp[,2],asp=1,pch=17,xlab="",ylab="",col=c("red","green","blue"), xlim=lims$xlim,ylim=lims$ylim,cex=1.25) points(LD[,1],LD[,2],col=colvec) origin() arrows(0,0,10*Gs[,1],10*Gs[,2],angle = 10, length = 0.1) textxy(10*Gs[,1],10*Gs[,2],colnames(Oxides)) par(opar) legend("topleft",c("G","NF","W"),col=c("red","green","blue"),pch=1,cex=0.5) ``` The LR-LDA biplot shows perfect separation of the three UK regions and suggests that a single logratio like $\log{(MgO/Al2O3)}$ (among other possibilities) is capable of discriminating the three regions. A boxplot of this logratio confirms this. ```{r lrmgoal2o3} lrMgOAl2O2 <- Oxides[,c("MgO")]/Oxides[,c("Al2O3")] boxplot(lrMgOAl2O2~site,col=c("red","green","blue")) ```
## References