--- title: "douconca" output: rmarkdown::html_vignette bibliography: dcCAbibliography.bib link-citations: yes vignette: > %\VignetteIndexEntry{douconca} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(douconca) ``` # The douconca package {.unnumbered} The aim of the `douconca` package is to help ecologists unravel trait-environment relationships from an abundance data table with associated multi-trait and multi-environment data tables. A popular method to such aim is RLQ [@doledec1996RLQ;@dray2014RLQ], which also a three-tables method. RLQ is available in the R-package `ade4`[@dray2007ade4; @thioulouse2018multivariate]. The `douconca` package provides an alternative method, termed double-constrained correspondence analysis (dc-CA), which is a natural extension of the commonly used method of community-weighted means (CWMs) regression [@Braak2018dcCA;@kleyer2012assessing]. As dc-CA is based on both regression analysis and ordination (factorial analysis), it allows for the usual forms of testing, model building, and of biplots of resulting scores. There are a number of recent applications that also contain discussion on the value of dc-CA compared to RLQ [@peng2021double;@gobbi2022hay;@pinho2021functional], summarized in this document on [ResearchGate](https://www.researchgate.net/publication/379731819_dc-CA_RLQ). The `douconca` package has a `formula`-interface to specify the dc-CA model, and `scores`, `anova`, `plot` and `predict` functions, mostly fairly similar to those in the `vegan` package [@oksanen_vegan2024] on to which `douconca` is based. # Double constained correspondence analysis As RLQ, dc-CA seeks for an ordination (i.e. a low-dimensional representation of) of the multi-trait, multi-environment relationships, but dc-CA differs from RLQ in that dc-CA is based on regression with the traits and environmental variables as predictors, whereas RLQ is based on co-variance. The dc-CA method thus allows for variation-partitioning and the type of model-building that is familiar to users of regression analysis, whereas RLQ does not. A dc-CA axis consists of two regression models (linear combinations), one of traits and the other of environmental predictors, the fitted values of which can be thought of as a composite trait and a composite environmental gradient. The response variables of these regressions are species niche centroids and CWMs of such composites. This circularity is typical for any eigenvalue ordination method. The linear combinations maximizes the fourth-corner correlation between the composite trait and composite environmental gradient. The dc-CA eigenvalues are squared fourth-corner correlations, but also variances, namely the amounts of variation in the abundance data that the consecutive axes explain [@Braak2018dcCA]. Statistical testing is by the max test [@Braak2012max_test], evaluated by extensive simulation [@Braak2019MEE]. This test combines two permutations tests, one permuting sites and the other permuting species, the maximum P-value of which is the final P-value. As in the `vegan` package, the permutations are specified via the `permute` package [@permute2022], so as to allow for analysis of hierarchical and nested data designs [@gobbi2022hay]. In `douconca`, a dc-CA models is specified by two formulas: a `formula` for the sites (rows) with environmental predictors and a `formula` for the species (columns) with trait predictors, which both may contain factors, quantitative variables and transformations thereof, and interactions, like in any (generalized) linear regression model. The formulas specify the constraints applied to the site and species scores; without constraints dc-CA is simply correspondence analysis. The double constrained version of principal components analysis also exists and is available in the Canoco software [@Braak2018canoco], but has less appeal in ecological applications as it lacks ecological realism, ease of interpretation and the link to methods, such as CWM-regression and fourth-corner correlation analysis, which have proven to be useful in trait-based ecology. # Example data and questions We use the `dune_trait_env` data in the package to illustrate dc-CA. It consists of the abundances of 28 plant species in 20 meadows (plots, here called sites), trait data for these plant species and environmental data of these sites. ```{r} library(douconca) data("dune_trait_env") names(dune_trait_env) dim(dune_trait_env$comm[, -1]) ## without the variable "Sites" dim(dune_trait_env$traits) dim(dune_trait_env$envir) names(dune_trait_env$traits) names(dune_trait_env$envir) ``` There are five morphological traits (from the `LEDA` trait database) and four ecological traits (Ellenberg indicator values for moisture, acidity, nutrients and light). There are five environmental variables and two sets of two spatial coordinates, which are approximately equal. The `X` and `Y` are the coordinates of the plot. The lot-variables are the center of the meadow where the sample has been taken. The type of questions that dc-CA is able to address is: * How many dimenstions are needed to represent the major part of the trait-environment relations? * Is the trait-environment relationship statistically significant? * How many dimensions are statistically significant? * What is the importance of a variable in a dc-CA axis? * What is trait-structured variation and which of the trait sets has the larger such variation? * Which of the trait sets (morphological versus ecological) is more closely related to the environmental variables? ## Basic analysis The next code gives a basic dc-CA analysis. The response matrix or data frame must be numerical, with columns representing the species. The first variable (`Sites`) must therefore be deleted. ```{r} Y <- dune_trait_env$comm[, -1] # must delete "Sites" mod <- dc_CA(formulaEnv = ~ A1 + Moist + Use + Manure + Mag, formulaTraits = ~ SLA + Height + LDMC + Seedmass + Lifespan, response = Y, dataEnv = dune_trait_env$envir, dataTraits = dune_trait_env$traits) ``` In `douconca`, dc-CA is calculated in two steps that provide useful information each. Step 1 of the dc-CA algorithm summarizes the canonical correspondence analysis (CCA) of the transposed response matrix on to the trait data using `formulaTraits = ~ SLA + Height + LDMC + Seedmass + Lifespan`. The morphological traits in this formula explain 28.85% of the total inertia (variance) in the abundance data `Y`. This inertia (0.6776) is called the trait-structured variation. Inertia is in general a weighted variance, but in this case it is thus unweighted as sites have equal weight in the analysis, because `divideBySiteTotals` is true by defaults. Formally, it is the total (unweighted) variance in the community weighted means of orthonormalized traits (the traits are orthonormalized in Step 1). The trait-structured variation is further analyzed in Step 2 using redundancy analysis (RDA). Step 2 shows that 65.73% of this variation can be explained by the environmental variables using `formulaEnv = ~ A1 + Moist + Use + Manure + Mag`. The constrained axes of this RDA are also the dc-CA eigenvalues: ```{r} mod$eigenvalues ``` The first axis explains `r round(100*mod$eigenvalues[1]/ sum(mod$eigenvalues),0)`% of the trait-environment variance and this axis is dominated by moisture and by SLA and Seedmass, as judged by the size of their regression coefficient and (optimistic) t-value on this axis in the print of the model. The default `plot` shows the intra-set correlations of the variables with the axis, but t-values can be visualized with ```{r, fig.width=7} plot(mod,gradient_description = "t") ``` ## Statistical testing There are two-ways to statistically test the model: (1) the omnibus test (using all five dimensions) is obtained with `anova(mod)`, giving a P-value of about 0.02 and (2) a test per dc-CA axis, obtained by ```{r} set.seed(1) anova(mod, by = "axis") ``` In the test per axis, the first axis has P-values of 0.09 and 0.001 at the species- and site-level, respectively, so that the P-value of the max test is 0.09. A little more explanation may be instructive. The species-level test consists of testing the (weighted) regression of the species-niche-centroids with respect to orthonormalized environmental variables against the traits. The site-level test consists of testing the (in this case, unweighted) regression of the community-weighted means of orthonormalized traits against the environmental variables. Both tests are carried out by permutation, the first by permuting species in the trait data, the second by permuting sites in the environmental data. A new dc-CA is carried out for each permuted data set. For a full description see under Details in the help system. ## Fitted values and predictions There are three kinds of fitted values (and of predictions for new data): * fitted traits per site, obtained with `predict(mod, type = "traits" )` * fitted environmental values per species, obtained with `predict(mod, type = "env")` * fitted abundances, obtained with `predict(mod, type = "response" )` The fitted traits per site are simply fitted community-weighted means and the fitted environmental values are fitted species-niche centroids. Note that 10% of the fitted abundance values is negative in our example. Negative values indicate likely absences or low abundance values of species with the specified traits and environmental values. # Which set of traits is most closely related to abundance and to the environment? In this section, we pose the question whether the set of morphological (functional) traits is more or less related to species abundance and to the environmental variables than the set of ecological traits. ```{r} mod_e <- dc_CA(formulaEnv = ~ A1 + Moist + Manure + Use + Mag, formulaTraits = ~ F + R + N + L, response = Y, dataEnv = dune_trait_env$envir, dataTraits = dune_trait_env$traits) ``` The entry `traits_explain` is `r round(mod_e$inertia["traits_explain", "weighted variance" ], 3)`, which is the variance in the abundance data that is explained by the traits. It can directly be compared to the corresponding entry in the previous model, which is `r round(mod$inertia["traits_explain", "weighted variance" ], 3)`. The entry `constraintsTE` is the variance in abundance data that is explained by traits and environmental variables jointly. Its value in the second model is higher than that in the first. On closer examination of the results, the second eigenvalue of the last model is even higher than the first one of the first model and, indeed, the first two dc-CA axes are significant as can be seen from an `anova`: ```{r} anova(mod_e, by = "axis")$max ``` The fourth-corner correlation of the best linear combination of the ecological traits with the best linear combination of the environmental variables is ```{r} round(sqrt(mod_e$eigenvalues[1]), 2) ``` compared to `r round(sqrt(mod$eigenvalues[1]), 2)` for the best linear combination of the morphological traits. In conclusion, the ecological traits explain more of the abundance data and are closer related to the environmental variables. # Do the morphological traits contribute after accounting for the ecological traits? Do the morphological traits carry important additional information on the species (beyond their ecological traits) for understanding which species occur where? (i.e. for understanding the species-environment relationships). To address this question, specify the ecological traits as `Condition` in the trait formula and perform an `anova` of the resulting model. ```{r} mod_mGe <- dc_CA(formulaEnv = ~ A1 + Moist + Manure + Use + Mag, formulaTraits = ~ SLA + Height + LDMC + Seedmass + Lifespan + Condition(F+R+N+L), response = Y, dataEnv = dune_trait_env$envir, dataTraits = dune_trait_env$traits, verbose = FALSE) anova(mod_mGe, by= "axis")$max ``` As jugded by the test of significance by axes, the morphological traits contribute little. # One trait: CWM regression without inflated type I error. ## Introduction CWM regression is know to suffer from serious type I error inflation in statistical testing [@peres2017linking; @lepvs2023differences]. This section shows how to perform CWM regression of a single trait using dc-CA with a `max` test to guard against type I error inflation. This test does not suffer from the, sometimes extreme, conservativeness of the ZS (Zelený & Schaffers) modified test [@Braak2018simple; @lepvs2023differences]. We also show the equivalence of the site-level test with that of a CWM-regression and the equivalence of the dc-CA and CWM regression coefficients. On the way, we give examples of the `scores` and `fCWM_SNC` functions in the `douconca` package. ## Testing the relationship between LDMC and the environmental variables ```{r} mod_LDMC <- dc_CA(formulaEnv = ~ A1 + Moist + Manure + Use + Mag, formulaTraits = ~ LDMC, response = Y, dataEnv = dune_trait_env$envir, dataTraits = dune_trait_env$trait, verbose = FALSE) anova(mod_LDMC) ``` The P-values of the species-level and site-level permutation tests are 0.396 and 0.028, respectively, so that the final P-value is 0.396. There is thus no evidence that the trait LDMC is related to the environmental variables in the model. We now show that performing CWM-regression only would lead to the opposite conclusion. For this, we first calculate the CWMs of LDMC, using the function `fCWM_SNC`, the arguments of which are similar to that of the `dc_CA` function. ```{r} CWMSNC_LDMC <- fCWM_SNC(formulaEnv = ~ A1 + Moist + Manure + Use + Mag, formulaTraits = ~ LDMC, response = Y, dataEnv = dune_trait_env$envir, dataTraits = dune_trait_env$trait, verbose = FALSE) ``` The result, `CWMSNC_LDMC`, is a list containing the CWMs of LDMC, among other items. We combine the (community-weighted) mean LDMC to the environmental data, apply linear regresion and compare the model with the null model using `anova`. ```{r} envCWM <- cbind(dune_trait_env$envir, CWMSNC_LDMC$CWM) lmLDMC <- lm(LDMC ~ A1 + Moist + Manure + Use + Mag, data = envCWM) anova(lmLDMC, lm(LDMC ~ 1, data = envCWM)) ``` resulting in a P-value of 0.0279, which is in agreement with the P-value of the site-level permutation test of this model. CWM-regression of LDMC shows evidence for a relationship with the environmental variables whereas there is in fact very little evidence as shown by dc-CA. ## The coefficients of a CWM-regression are proportional to those of dc-CA The introduction said that dc-CA extends CWM-regression to multiple traits. We now show that the regression coefficients issued by dc-CA with a single trait are, up to a scaling constant, identical to those of a linear CWM-regression of this trait. We first extract the regression coefficients from the dc-CA model using the `scores` function. The second line calculates the regression coefficients by dividing the standardize regression coefficients from `dc_CA` by the standard deviation of each environmental variable. ```{r} (regr_table <- scores(mod_LDMC, display = "reg")) coefs_LDMC_dcCA <- regr_table[, "dcCA1"] / regr_table[, "SDS"] range(coef(lmLDMC)[-1] / coefs_LDMC_dcCA) ``` The result shows that the two sets of coefficients are equal up to a constant of proportionality, here 154.4359. The t-values are also equal: ```{r} cbind(summary(lmLDMC)$coefficients[-1, "t value", drop = FALSE], scores(mod_LDMC, display = "tval")) ``` # References