---
title: "GCalignR Step by Step"
author: "Meinolf Ottensmann, Martin A. Stoffel, Hazel J. Nichols and Joseph I. Hoffman"
output: rmarkdown::html_vignette
bibliography: bibliography.bib
vignette: >
    %\VignetteIndexEntry{GCalignR Step by Step}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---
 
```{r, echo = FALSE}
library(knitr)
knitr::opts_chunk$set(cache = FALSE, fig.width = 8, fig.height = 8,highlight = TRUE,comment = "#>",strip.white = TRUE,collapse = TRUE, fig.align = "center", cache = F, warning = F, message = F)
```

## Introduction
We developed `GCalignR` primarily as an alignment tool for GC-FID data based on peak retention times, but other types of data that contain retention time information for peaks (e.g. GC-MS) are supported as well. `GCalignR` implements a fast and objective method to cluster putatively homologous substances prior to multivariate statistical analyses. Using sophisticated visualisations, the resulting alignments can then be fine tuned.
The input format of `GCalignR` is a peak list, which is comprised of retention times and arbitrary variables (e.g. peak height, peak area) that characterise each peak in a dataset. 

The implemented algorithm is purely based on retention time data, which is why the quality of the alignment is highly dependent on the quality of the raw data and the parameters used for the initial peak detection. In other words: The clearer the peaks are which were extracted from the chromatograms, the better the alignment will be. `GCalignR` has been created for situations where the main interest of the research is in exploring broader patterns rather then the specific function of a certain chemical, which is unlikely to be determined correctly in all cases when just retention times are used. Also, we recommend to double-check the resulting alignment with mass-spectrometry data, where available. Furthermore, replicates that were analysed using both GC-MS and GC-FID with identical gas chromatography settings can be aligned together and may be used to identify peaks and validate alignment results. 

This vignette gives an quick introduction into using `GCalignR`, whereas a more detailed description of the background is found in our manuscript[@Ottensmann.2017] and in the second vignette of the package *"GCalignR: How does the Algorithm work?"*.


## GCalignR workflow in a larger context

In the flow diagram below, we visualized the functionality of GCalignR within a complete workflow of analysing chemical data. After (1) analysis of the chemical samples with GC-FID, an often proprietary software is used to extract a list of peaks (retention times, peak area, also often peak height and other variables). Steps (3)-(7) are the alignment steps within `GCalignR`, detailed below. After alignment and normalisation, the output can be used as input for multivariate statistics in other packages such as vegan (8). 

```{r, out.width = 650, fig.retina = NULL, echo = FALSE, fig.cap="Extended Workflow using GCalignR in the analysis of chemical similarity patterns."}
knitr::include_graphics("workflow.png") 
```

## Installation

The development version can be downloaded from [GitHub](https://github.com/mottensmann/GCalignR) with the following code:

```{r, eval = FALSE} 
install.packages("devtools") 
devtools::install_github("mottensmann/GCalignR", build_vignettes = TRUE) 
```

```{r} 
library("GCalignR") 
```

The package documentation can be accessed with:

```{r, eval = FALSE} 
?GCalignR # documentation
```

## The functions below form the core of `GCalignR`:

* `align_chromatograms`: 
The entire alignment procedure is implemented within this function based on three parameter values described below. Additional parameters allow optional processing steps.

* `plot.GCalign`: 
Diagnostic plots summarise the aligned dataset.

* `print.GCalign`: 
Summarises the alignment process and lists all arguments to the function call.

* `gc_heatmap`: 
Visualises alignment results and enables to quickly inspect the distribution of substances (i.e aligned peaks) within the dataset. Furthermore, the deviation from mean retention times of a given substance can be used to detect potential issues with the alignment.

* `check_input`: 
Checks the format of the input data for conformity with the requirements and highlights violations.

* `norm_peaks`: 
Normalises the abundance measure of peaks by calculating relative intensities within samples. 

* `as.data.frame.GCalign`:
Outputs aligned datasets within a list of data frames for each variable in the input data.


## The alignment algorithm
We developed an alignment procedure that involves three sequential steps to align and finally match peaks belonging to putatively homologous substances across samples. The first 
step is to align each sample to a reference sample while maximising overall similarity
through linear shifts of retention times. This procedure is often described in the literature as 'full alignment'. In the second step, individual peaks are sorted into
rows based on close similarity of their retention times, a procedure that is often referred to as 'partial alignment'. Finally, there is still a chance that homologous peaks can
be sorted into different, but adjacent, rows in different samples, depending on the
variability of their retention times Consequently, a third step merges rows representing putatively homologous substances.

The alignment algorithm implemented in the `align_chromatograms` function contains the following steps:
(Here we refer to a peak list as all extracted peaks from a given sample chromatogram)

1. The first step in the alignment procedure consists of an algorithm that corrects systematic linear shifts between peaks of a query sample and a fixed reference to
account for systematic shifts in retention times among samples. By default the sample that is most similar on average to all other samples is automatically selected as reference. With respect to the user-defined parameter `max_linear_shift` linear shifts shifts are applied to all retention times of a sample to maximise similarity to the reference.

2. The second step in the alignment procedure aligns individual peaks across samples by comparing the peak retention times of each sample consecutively with the mean of all previous samples The parameter `max_diff_peak2mean` specifies the allowed deviation between the retention time of a peak and the mean of previous retention times within the same row. If the deviation is larger than allowed, matrix operations are conducted to sort the peaks accordingly.

3.The third step in the alignment procedure accounts for the fact that a number of homologous peaks will be sorted into multiple rows that can be subsequently merged The maximum mean difference between two retention time rows can be specified with the `min_diff_peak2peak` argument. 

### Optional steps:

4. Delete peaks that occur in just one sample by setting the `delete_single_peak` argument to `TRUE`

5. Delete all peaks that occur in negative control samples by specifying their names as argument to `blanks`
   
## Input data

The statistical analysis of GC-FID or GC-MS data is usually based on the detection of peaks (i.e. substances) within chromatograms instead of using the whole profile. Peak can be integrated using proprietary software or free programs. The peak data derived from a chromatogram usually contains the retention time of a given peak plus additional information such as the area under the peak or its height which are used in the subsequent analysis. 
`GCalignR` uses only the retention times (and not the mass-spectra, which may not be available, e.g. when using gas-chromatography coupled to a flame ionization detector (FID)) to align the peaks across individuals for subsequent chemometric analysis and pattern detection. The simple assumption is that peaks with similar retention times represent the same substances. However, it is recommended to verify this assumption by comparing also the mass-spectra (if available) of the substances of interest.
The input peak list used in `GCalignR` is a plain text file, whereby all elements should be separated by tabs (with sep = "/t") or any other separator, which has to be specified with the `sep` argument (see `?read.table` for a list of separators). The decimal separator has to be the point. 

* The first row contains the sample names, whereby every name must be unique. Names should not contain whitespaces and use the underscore as the only special character. 

* The second row
contains the names of the data columns. For example, if both the retention time of a peak as well as the area under the peak have been extracted you would write `RT area` into the second line. You can however add more variables such as the peak height. The variable names should not
contain whitespaces and use the underscore as the only special character. 

* Starting with the third row, the actual peak data is included, whereby single samples are concatenated horizontally. Each chromatogram needs to consist of the same number of columns, whereby at least one containing retention times is required. In order to use the functionality of the function `norm_peaks` an additional column containing a concentration measure (e.g. height or area of peaks) is required. The decimal separator has to be the point (not comma!).

```{r, out.width = 650, fig.retina = NULL, echo = FALSE,fig.align='center', fig.cap="Example of a input datafile. Only the head of the file is shown."}
knitr::include_graphics("example.jpg") 
```

Naturally, not all chromatograms contain the same number of peaks.

```{r, out.width = 650, fig.retina = NULL, echo = FALSE, fig.cap="Tail of the input file"} 
knitr::include_graphics("example2.jpg") 
```

### Alternative input from R

Alternatively to reading a text file, `GCalignR` also accepts lists of data frames. The name of each list element corresponds to the identity of a sample and the data frame contains the peak data for this sample (again, the minimum number of columns is one column for the retention time and one column for another variable such as the area under the peak or it's height is required for using `norm_peaks`). All column names within the data frames have to be the same and named consistently. The attached dataset `peak_data` contains data from skin swabs of Antarctic Fur Seals *Arctocephalus gazella* [@stoffel2015chemical] and can be used as template.


```{r} 
data("peak_data")
length(peak_data) # number of individuals, i.e. number of list elements
names(peak_data) # names of individuals, i.e. names of list elements 
head(peak_data[[1]]) # column names and data, i.e. one data.frame of list element 
```

## GCalignR workflow

### Check the input

To check the data formatting for the most common errors, use the `check_input` function. This will test for conformity with the the main requirements of the aligning algorithm and give a warning message if these aren't met. When a text file is used as input, the decimal has to be a point (not a comma). However, there could potentially be a variety of different error sources so it is advisable to make  sure by yourself that the data is in the correct format.

```{r}
# if plot = T, a histogram of peaks is plotted
check_input(data = peak_data,plot = F)  
```

The alignment procedure adjust peaks that represent putatively homologous substances that differ to varying degree in retention times among samples. In order to define sensible values for the the parameters `min_diff_peak2peak` it is advisable to investigate how far peaks are commonly separated within chromatograms. The function `peak_interspace` plots a histogram of all between peak retention time differences within samples and combines this estimates for the complete dataset. Here, the difference between close peaks is of interest and and the function allows to plot certain quantile ranges of the distribution by specifying the parameter `quantile_range`.

```{r, fig.width=7, fig.height = 5, fig.cap="Figure 4: Histogram of Peak interspaces"}
peak_interspace(data = peak_data, rt_col_name = "time",
                quantile_range = c(0, 0.8), quantiles = 0.05)
```

The histogram shows the distribution of retention-time 'spaces' between peaks. Most peaks are around 0.05 minutes apart from each other. From the distribution we want to infer the potential error-margin around a peak,that GCalignR will correct. We can see from the histogram that very few peaks are closer together than 0.03 minutes. By looking at the original chromatograms it becomes clear that these peaks are usually substances with low concentration which show a 'double peak', i.e. two peaks appear for what we believe is just a single substance. Therefore, we decide to take a value of 0.03 for the `min_diff_peak2peak` parameter below. It is noteworthy that this does not set a strict threshold and substances with a smaller difference in mean retention times can still be formed during the alignment (because those peaks are known to exist within samples). Therefore, we suggest to carefully check the aligned data and revise the initial peak calling if required. 

### Align chromatograms

The core function in `GCalignR` is `align_chromatograms`, which will align the peak lists 
with the algorithm described above. See `?align_chromatograms` for a detailed description of the arguments. The alignment process will take a few minutes to several hours on a standard computer depending on mainly three factors; (1) the number of samples and (2) the number of peaks per sample and (3) the number of substances that are classified during the alignment procedure. Progress bars and run time estimates (where possible) indicate the progress of the alignment run.  

```{r, eval = FALSE}
peak_data <- peak_data[1:4] # subset for speed reasons
peak_data_aligned <- align_chromatograms(data = peak_data, # input data
    rt_col_name = "time", # retention time variable name 
    rt_cutoff_low = 15, # remove peaks below 15 Minutes
    rt_cutoff_high = 45, # remove peaks exceeding 45 Minutes
    reference = NULL, # choose automatically 
    max_linear_shift = 0.05, # max. shift for linear corrections
    max_diff_peak2mean = 0.03, # max. distance of a peak to the mean across samples
    min_diff_peak2peak = 0.03, # min. expected distance between peaks 
    blanks = "C2", # negative control
    delete_single_peak = TRUE, # delete peaks that are present in just one sample 
    write_output = NULL) # add variable names to write aligned data to text files
```

The aligned data matrices are now stored in data frames which can be accessed as follows:

```{r, eval = FALSE}
peak_data_aligned$aligned$time # to access the aligned retention times
peak_data_aligned$aligned$area # to access the aligned area data
```

The package includes the already aligned data set for all samples using default alignment parameter settings.

```{r}
data("aligned_peak_data") 
``` 

### Visual diagnostics for the aligned data

The `gc_heatmap` function can be used to visualise aligned datasets. A white filling indicates the absence of a peak in a sample, when using the default option of a binary heatmap. The basic rationale of the alignment is to sort the substances with very similar retention times together, as they represent putatively homologous substances. The heatmap shows how single peaks for the individuals deviate from the mean retention time of a substance. As a rule of thumb: The larger the deviation, the less likely it is the same substance, but retention time variation can be high for certain peaks. Therefore, retention time variations are highlighted with respect to mean and outliers are highlighted.
By default, the a threshold value matching the parameter `max_diff_peak2peak` is used to show deviations. However, deviations might be higher after pairs of rows where merged with respect to the value of `min_diff_peak2peak`. Check out the documentation with `help("gc_heatmap")` for further possibilities.

```{r,message=FALSE,fig.width=8,fig.height=8, fig.cap="Heatmap showing the distribution of peaks and the variability within substances"}
gc_heatmap(aligned_peak_data,threshold = 0.03) 
```

The plot function shows a four figure plot for the aligned data. The first histogram shows the number of peaks per sample before and after alignment. The number of peaks is much smaller after the alignment as peaks have been deleted which were present in the control samples, as well as peaks that were found in just a single individual. The histograms on the bottom left shows the full chromatogram shifts (the first step in the algorithm). The bottom middle shows how much peaks vary around their means across samples. The histogram on the bottom right shows how many peaks are shared across samples. In this case, there is just a single substance present in all samples, while often 10-12 samples share a single substance (the mode of the distribution).

```{r,message=FALSE,fig.width=8,fig.height=8, fig.cap="Diagnostic plot of the aligned dataset"}
plot(aligned_peak_data,which_plot = "all") # Plots, can be invoked separetely
```

Using the `print`function, provides a verbal summary of the alignment procedure is retrieved and the function call can be retraced.

```{r, eval=TRUE}
print(aligned_peak_data) 
```

### Normalise peaks and log+1 transformation

`norm_peaks` is used to standardize the concentration of peaks across samples to obtain the relative abundance. This is an essential step prior to the analysis if the absolute concentration of chemicals varies across samples. Note that this step is also required when retention time cut-offs, single peak deletion or blank peak removal was applied, even if the data already contained a measure of relative abundance. The output is a list of data frames containing the relative abundance of peaks for every individual.

```{r}
## normalise area and return a data frame
scent <- norm_peaks(aligned_peak_data, conc_col_name = "area",rt_col_name = "time",out = "data.frame") 
## common transformation for abundance data to reduce the extent of mean-variance trends
scent <- log(scent + 1) 
```

### Visualise patterns by ordination plots using the *vegan* package

`vegan` offers a variety of useful function for the analysis of multivariate abundance data such as the scent profiles handled here. Check out `help("vegan")` for a first overview. 

### Non-metric multidimensional scaling 
```{r, results = 'hide'}
## GCalignR contains factors for the chemical dataset
data("peak_factors") 
## keep order of rows consistent
scent <- scent[match(row.names(peak_factors),row.names(scent)),] 
## NMDS using Bray-Curtis dissimilarities
scent_nmds <- vegan::metaMDS(comm = scent, distance = "bray")
## get x and y coordinates
scent_nmds <- as.data.frame(scent_nmds[["points"]])  
## add the colony as a factor to each sample
scent_nmds <- cbind(scent_nmds,colony = peak_factors[["colony"]])
```

```{r,fig.align="center", fig.cap="NMDS plot"}
## ordiplot with ggplot2
library(ggplot2)
ggplot(data = scent_nmds,aes(MDS1,MDS2,color = colony)) +
    geom_point() + 
    theme_void() + 
    scale_color_manual(values = c("blue","red")) +
    theme(panel.background = element_rect(colour = "black", size = 1.25,
            fill = NA), aspect.ratio = 1, legend.position = "none")
```

### Multivariate analysis using *adonis2*

*Update July 2024: vegan::adonis2 is replacing the now defunct function vegan::adonis*

Using **adonis2** and **betadisper** we can immediately do multivariate statistics, showing that the two colonies differ significantly. This illustrates a location effect.

```{r}
## colony effect
vegan::adonis2(scent ~ peak_factors$colony,permutations = 999)
## no dispersion effect
anova(vegan::betadisper(vegan::vegdist(scent),peak_factors$colony))
```

## References