---
title: "douconca"
output: 
  rmarkdown::html_vignette
bibliography: dcCAbibliography.bib
link-citations: yes
vignette: >
  %\VignetteIndexEntry{douconca}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(douconca)
```
# The douconca package {.unnumbered}

The aim of the `douconca` package is to help ecologists unravel trait-environment
relationships from an abundance data table with
associated multi-trait and multi-environment data tables.
A popular method to such aim is RLQ [@doledec1996RLQ;@dray2014RLQ],
which also a three-tables method. RLQ is available in the R-package `ade4`[@dray2007ade4; @thioulouse2018multivariate].
The `douconca` package provides an alternative method, termed 
double-constrained correspondence analysis (dc-CA), which is a natural 
extension of the commonly used method of community-weighted means (CWMs) regression
[@Braak2018dcCA;@kleyer2012assessing]. 
As dc-CA is based on both regression analysis and ordination
(factorial analysis), it allows for the usual forms of testing, model building,
and of biplots of resulting scores.
There are a number of recent applications that also contain discussion
on the value of dc-CA compared to RLQ
[@peng2021double;@gobbi2022hay;@pinho2021functional], summarized in this document on
[ResearchGate](https://www.researchgate.net/publication/379731819_dc-CA_RLQ).

The `douconca` package has a `formula`-interface to specify the dc-CA model, 
and `scores`, `anova`, `plot` and `predict` functions, mostly fairly similar to
those in the `vegan` package [@oksanen_vegan2024] on to which `douconca` is based.

# Double constained correspondence analysis

As RLQ, dc-CA seeks for an ordination (i.e. a low-dimensional representation of)
of the multi-trait, multi-environment relationships, but dc-CA differs from RLQ in that
dc-CA is based on regression with the traits and environmental variables as predictors,
whereas RLQ is based on co-variance.
The dc-CA method thus allows for variation-partitioning and 
the type of model-building that is familiar to users of regression analysis,
whereas RLQ does not.

A dc-CA axis consists of two regression models 
(linear combinations), one of traits and the other of environmental predictors,
the fitted values of which can be thought of as 
a composite trait and a composite environmental gradient.
The response variables of these regressions are 
species niche centroids and CWMs of such composites.
This circularity is typical for any eigenvalue ordination method.
The linear combinations maximizes
the fourth-corner correlation between 
the composite trait and composite environmental gradient. 
The dc-CA eigenvalues are squared fourth-corner correlations, but also 
variances, namely the amounts of variation in the abundance 
data that the consecutive axes explain [@Braak2018dcCA].

Statistical testing is by the max test [@Braak2012max_test],
evaluated by extensive simulation [@Braak2019MEE].
This test combines two permutations tests, one
permuting sites and the other permuting species, the maximum P-value
of which is the final P-value. As in the `vegan` package, the permutations are
specified via the `permute` package [@permute2022], so as to allow for analysis of
hierarchical and nested data designs [@gobbi2022hay].

In `douconca`, a dc-CA models is specified by two formulas:
a `formula` for the sites (rows) with environmental predictors and a
`formula` for the species (columns) with trait predictors, which both 
may contain factors, quantitative variables and transformations thereof, and interactions,
like in any (generalized) linear regression model. 
The formulas specify the constraints applied to
the site and species scores; without constraints dc-CA is simply
correspondence analysis.

The double constrained version of principal components analysis also exists and is
available in the Canoco software [@Braak2018canoco], but has less appeal in ecological applications as it lacks ecological realism, ease of interpretation and the link to methods, 
such as CWM-regression and fourth-corner correlation analysis, which have proven to be 
useful in trait-based ecology. 


# Example data and questions

We use the `dune_trait_env` data in the package to illustrate dc-CA. It consists of
the abundances of 28 plant species in 20 meadows (plots, here called sites),
trait data for these plant species and environmental data of these sites.

```{r}
library(douconca)
data("dune_trait_env")
names(dune_trait_env)
dim(dune_trait_env$comm[, -1]) ## without the variable "Sites"
dim(dune_trait_env$traits)
dim(dune_trait_env$envir)
names(dune_trait_env$traits)
names(dune_trait_env$envir)
```
There are five morphological traits (from the `LEDA` trait database)
and four ecological traits (Ellenberg indicator values for moisture, 
acidity, nutrients and light).

There are five environmental variables and two sets of two spatial coordinates,
which are approximately equal. The `X` and `Y` are the coordinates of the plot.
The lot-variables are the center of the meadow where the sample has been taken. 

The type of questions that dc-CA is able to address is:

* How many dimenstions are needed to represent the major part of the trait-environment relations?
* Is the trait-environment relationship statistically significant?
* How many dimensions are statistically significant?
* What is the importance of a variable in a dc-CA axis?
* What is trait-structured variation and
which of the trait sets has the larger such variation?
* Which of the  trait sets (morphological versus ecological) is more closely related to
the environmental variables?

## Basic analysis

The next code gives a basic dc-CA analysis. 
The response matrix or data frame must be numerical, with columns representing the species.
The first variable (`Sites`) must therefore be deleted. 

```{r}
Y <- dune_trait_env$comm[, -1] # must delete "Sites"
mod <- dc_CA(formulaEnv = ~ A1 + Moist + Use + Manure + Mag,
             formulaTraits = ~ SLA + Height + LDMC + Seedmass + Lifespan,
             response = Y,
             dataEnv = dune_trait_env$envir,
             dataTraits = dune_trait_env$traits)
```

In `douconca`, dc-CA is calculated in two steps that provide useful information each. 
Step 1 of the dc-CA algorithm summarizes the canonical correspondence analysis (CCA)
of the transposed response matrix on to the trait data using 
`formulaTraits = ~ SLA + Height + LDMC + Seedmass + Lifespan`. 
The morphological traits in this formula explain 28.85% of the total inertia (variance)
in the abundance data `Y`. This inertia (0.6776) is called the trait-structured 
variation. Inertia is in general a weighted variance, but in this case
it is thus unweighted as sites have equal
weight in the analysis, because `divideBySiteTotals` is true by defaults.
Formally, it is the total (unweighted) variance in the community weighted means of orthonormalized 
traits (the traits are orthonormalized in Step 1).
The trait-structured variation is further analyzed in Step 2 using redundancy analysis (RDA).
Step 2 shows that 65.73% of this variation can be explained by the environmental variables
using 
`formulaEnv = ~ A1 + Moist + Use + Manure + Mag`. The constrained axes of this RDA are
also the dc-CA eigenvalues:

```{r}
mod$eigenvalues
```

The first axis explains `r round(100*mod$eigenvalues[1]/ sum(mod$eigenvalues),0)`% 
of the trait-environment variance and this axis is dominated by moisture and by SLA and Seedmass,
as judged by the size of their regression coefficient and
(optimistic) t-value on this axis in the print of the model. The default `plot` shows the intra-set correlations of the
variables with the axis, but t-values can be visualized with 

```{r, fig.width=7}
plot(mod,gradient_description = "t")
```

## Statistical testing

There are two-ways to statistically test the model: (1) the omnibus test (using all five dimensions)
is obtained with `anova(mod)`, giving a P-value of about 0.02 and (2) a test per dc-CA axis, obtained by 

```{r}
set.seed(1)
anova(mod, by = "axis")
```

In the test per axis, the first axis has P-values of 0.09 and 0.001
at the species- and site-level, respectively, so that the P-value of the max test is 0.09. 
A little more explanation may be instructive.
The species-level test consists of testing the 
(weighted) regression of the species-niche-centroids with respect to
orthonormalized environmental variables against the traits. 
The site-level test consists of testing the (in this case, unweighted) regression of 
the community-weighted means of orthonormalized traits against the environmental variables.
Both tests are carried out by permutation, the first by permuting species in the trait data,
the second by permuting sites in the environmental data.
A new dc-CA is carried out for each permuted data set. For a full description see under Details
in the help system.

## Fitted values and predictions

There are three kinds of fitted values (and of predictions for new data):

* fitted traits per site, obtained with `predict(mod, type = "traits" )`
* fitted environmental values per species, obtained with `predict(mod, type = "env")`
* fitted abundances, obtained with `predict(mod, type = "response" )`

The fitted traits per site are simply fitted community-weighted means and the fitted 
environmental values are fitted species-niche centroids. 
Note that 10% of the fitted abundance values is negative in our example. 
Negative values indicate likely absences or low abundance values of species with the
specified traits and environmental values.

# Which set of traits is most closely related to abundance and to the environment?

In this section, we pose the question whether the set of morphological (functional)
traits is more or less related to species abundance and to the environmental
variables than the set of ecological traits.

```{r}
mod_e <-  dc_CA(formulaEnv = ~ A1 + Moist + Manure + Use + Mag,
                formulaTraits = ~ F + R + N + L,
                response = Y,
                dataEnv = dune_trait_env$envir,
                dataTraits = dune_trait_env$traits)
```

The entry `traits_explain` is `r round(mod_e$inertia["traits_explain", "weighted variance" ], 3)`, 
which is the variance in the abundance data that is explained by the traits. 
It can directly be compared to the corresponding entry in the previous model, which is
`r round(mod$inertia["traits_explain", "weighted variance" ], 3)`. The entry 
`constraintsTE` is the variance in abundance data that is explained by 
traits and environmental variables jointly. 
Its value in the second model is higher than that in the first. 
On closer examination of the results, the second eigenvalue of the last model is
even higher than the first one of the first model and, indeed, the first two 
dc-CA axes are significant as can be seen from an `anova`:

```{r}
anova(mod_e, by = "axis")$max
```

The fourth-corner correlation of the best linear combination of the
ecological traits with the best linear combination of the environmental variables is 

```{r}
round(sqrt(mod_e$eigenvalues[1]), 2)
```

compared to `r round(sqrt(mod$eigenvalues[1]), 2)` for the best
linear combination of the morphological traits.
In conclusion, the ecological traits explain more of the abundance data and are closer
related to the environmental variables. 

# Do the morphological traits contribute after accounting for the ecological traits?

Do the morphological traits carry important additional information
on the species (beyond their ecological traits)
for understanding which species occur where? 
(i.e. for understanding the species-environment relationships).
To address this question, specify the ecological traits as `Condition` in
the trait formula and perform an `anova` of the resulting model.

```{r}
mod_mGe <-  dc_CA(formulaEnv = ~ A1 + Moist + Manure + Use + Mag,
                 formulaTraits = 
                   ~ SLA + Height + LDMC + Seedmass + Lifespan + Condition(F+R+N+L),
                 response = Y,
                 dataEnv = dune_trait_env$envir,
                 dataTraits = dune_trait_env$traits, verbose = FALSE)
anova(mod_mGe, by= "axis")$max
```

As jugded by the test of significance by axes, the morphological traits
contribute little.

# One trait: CWM regression without inflated type I error.

## Introduction

CWM regression is know to suffer from serious type I error inflation
in statistical testing [@peres2017linking; @lepvs2023differences]. 
This section shows how to perform CWM regression of a single trait
using dc-CA with a `max` test to guard against type I error inflation.
This test does not suffer from the, sometimes extreme,
conservativeness of the 
ZS (Zelený & Schaffers) modified test [@Braak2018simple; @lepvs2023differences].

We also show the equivalence of the site-level test with that of a CWM-regression and
the equivalence of the dc-CA and CWM regression coefficients.
On the way, we give examples of the `scores` and `fCWM_SNC` functions in the `douconca` package.

## Testing the relationship between LDMC and the environmental variables

```{r}
mod_LDMC <- dc_CA(formulaEnv = ~ A1 + Moist + Manure + Use + Mag,
                   formulaTraits = ~ LDMC,
                   response = Y, 
                   dataEnv = dune_trait_env$envir,
                   dataTraits = dune_trait_env$trait, verbose = FALSE)
anova(mod_LDMC)
```

The P-values of the species-level  and site-level permutation tests are
0.396 and 0.028, respectively, so that the final P-value is 0.396.
There is thus no evidence that the trait LDMC is related to the 
environmental variables in the model. 

We now show that performing CWM-regression only would lead to the opposite conclusion.
For this, we first calculate the CWMs of LDMC, using the function `fCWM_SNC`,
the arguments of which are similar to that of the `dc_CA` function.

```{r}
CWMSNC_LDMC <- fCWM_SNC(formulaEnv = ~ A1 + Moist + Manure + Use + Mag,
                      formulaTraits = ~ LDMC,
                      response = Y, 
                      dataEnv = dune_trait_env$envir,
                      dataTraits = dune_trait_env$trait, verbose = FALSE)
```
The result, `CWMSNC_LDMC`, is a list containing the CWMs of LDMC, among other items. 
We combine the (community-weighted) mean LDMC to the environmental data, 
apply linear regresion and compare the model with the null model using `anova`.

```{r}
envCWM <- cbind(dune_trait_env$envir, CWMSNC_LDMC$CWM)
lmLDMC <- lm(LDMC ~ A1 + Moist + Manure + Use + Mag, data = envCWM)
anova(lmLDMC, lm(LDMC ~ 1, data = envCWM))
```

resulting in a P-value of 0.0279, which is in agreement with the
P-value of the site-level permutation test of this model. 
CWM-regression of LDMC shows evidence for a relationship with the 
environmental variables whereas there is in fact very little evidence
as shown by dc-CA.

## The coefficients of a CWM-regression are proportional to those of dc-CA

The introduction said that dc-CA extends CWM-regression to multiple traits.
We now show that the regression coefficients issued by dc-CA with a single trait 
are, up to a scaling constant, identical to those of a linear CWM-regression
of this trait.

We first extract the regression coefficients from the dc-CA model using the `scores` function.
The second line calculates the regression coefficients by
dividing the standardize regression coefficients from `dc_CA` by the standard deviation
of each environmental variable.

```{r}
(regr_table <- scores(mod_LDMC, display = "reg"))
coefs_LDMC_dcCA <- regr_table[, "dcCA1"] / regr_table[, "SDS"]
range(coef(lmLDMC)[-1] / coefs_LDMC_dcCA)
```

The result shows that the two sets of coefficients are equal
up to a constant of proportionality, here 154.4359.
The t-values are also equal:

```{r}
cbind(summary(lmLDMC)$coefficients[-1, "t value", drop = FALSE],
scores(mod_LDMC, display = "tval"))
```
# References