--- title: "Understanding genome polarisation output files" author: "Natália Martínková, Stuart J.E. Baird" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true bibliography: refs.bib vignette: > %\VignetteIndexEntry{Understanding genome polarisation output files} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#" ) ``` Genome polarisation [@Baird2023] uses the *diem* algorithm to estimate which allele of a single nucleotide polymorphic marker belongs to which side of a barrier to gene flow based on whole-genome associations. The key result is then the **marker polarity** (whether to read the marker as in the *diem* input file or whether to flip the homozygous state labels 0↔2), **marker diagnostic index** (how relevant the marker is with respect to the barrier to gene flow), and **marker polarity support** (how certain we are regarding whether to flip the state labels or not). This information is stored in a file *MarkerDiagnosticsWithOptimalPolarities.txt* with columns Marker, newPolarity, DI, and Support. However, the `diem` function generates other outputs, which help to control the genome polarisation analyses and to interpret the results. Some of this information is identical to the function return value. The documentation for the return values can be found by running `?diem`. This vignette explains the outputs saved to files. # Obligatory outputs {#defaults} Three output files will be saved to the working directory or to the path specified in the `verbose` argument. Those are: * MarkerDiagnosticsWitOptimalPolarities.txt, * HIwithOptimalPolarities.txt, and * I4withOptimalPolarities.txt. ## MarkerDiagnosticsWithOptimalPolarities {#markerDiagnostics} The *MarkerDiagnosticsWithOptimalPolarities.txt* is a tab-delimited table with four columns and the number of rows equal to the number of markers across all compartments used to run genome polarisation with `diem`. This is the key results file identifying the marker relevance with respect to the detected barrier to gene flow. The first column **Marker** is an index of the marker that is successive in the files as they are ordered in the `files` argument in the `diem` call. The value in the Marker column will correspond to the respective index in the file ending with *includedSites.txt* if the input files were generated by the `vcf2diem` function. The second column **newPolarity** contains TRUE/FALSE values, specifying which allele in the specific marker belongs to which side of the barrier. When the value is `FALSE`, encoded as 0 in Eq. 22 [@Baird2023], the allele encoded as 0 in its homozygous state in the input file belongs to the barrier side associated with low values of hybrid index of the individuals. When the newPolarity value for a marker is `TRUE`, the homozygous states need to be flipped 0↔2 to correctly associate with the barrier sides. In terms of interpretation, this means that the allele encoded as 2 in its homozygous state is associated with low values of the hybrid index [sic!]. Which specific alleles these are for any marker can be found in the *includedSites.txt* file from the `vcf2diem`. The third column **DI** is numeric with values of the diagnostic index of the markers as specified in Eq. 20 [@Baird2023]. The diagnostic index is a log likelihood, so its values are always negative. Polarised markers with high diagnostic index characterize the barrier to gene flow. Such markers will tend to be fixed for one allele on either side of the barrier. Conversely, markers with low diagnostic index will have signal orthogonal to the detected barrier to gene flow. They could reflect ancestral genetic variation, standing variation not contributing to the barrier to gene flow, or artifacts such as overmerging of repeated motifs onto the reference. The fourth column **Support** is numeric and the values represent polarity support for the markers as given in Eq. 21 [@Baird2023]. The support is the gain in likelihood when the correct marker 0↔2 'flipping' is compared to its reverse. High marker support gives confidence that the marker polarity reflects the detected barrier to gene flow. In general, markers with high diagnostic index tend to also have high support. ## HIwithOptimalPolarities {#HI} The *HIwithOptimalPolarities.txt* is a tab-delimited one-column matrix of individual hybrid indices with row names corresponding to the individual indices. Note that the hybrid indices are calculated from the polarised genotypes for all individuals found in the input files, not only those specified in the `ChosenInds` argument of the `diem` function. The hybrid indices are calculated as genome admixture from summaries of polarised genotypes as in Eq. 7 [@Baird2023]. > CAUTION:These hybrid indices are calculated *without taking into account the diagnosticity of markers*! As such they may not show barrier signal well at all. Hybrid indices and genome polarisation diagrams should be calculated and plotted after the user decides on a diagnostic index threshold. The ideal threshold will be specific to your dataset. Check `vignette("diemr-diagnostic-index-expecation-maximisation-in-r")` for guidelines how to calculate it. If you choose to ignore this warning and include all markers when calculating hybrid indices you will find the range of hybrid indices is small, and centred on 0.5. This is because most genome sites are close to invariant, and so `diem` will flip homozygous state labels 0↔2 at random. A purely random answer will give hybrid index equal to 0.5. You can rescale hybrid indices with a large random influence using Eq. 11 [@Baird2023], so they lie on the interval $[0,1]$. ## I4withOptimalPolarities {#I4} The *I4withOptimalPolarities.txt* file contains genome-wide summaries of genomic states for all input genomes (Eq. 4 in [@Baird2023]). It is a tab-delimited table with four columns representing the unknown state (column name "_") that cannot be encoded as a genotype in terms of the two most frequent alleles, the second column representing homozygote encoding 0, the third column heterozygote encoding 1 and the fourth column homozygote encoding 2. The rows are all individuals in the input files, with row names corresponding to indices. The 4-genomic state counts can be used to calculate the hybrid index, observed heterozygosity and error rate for all individuals. The equations 7, 9, and 10 [@Baird2023] are implemented in the function `pHetErrOnStateCount` that can be applied to the rows of the I4withOptimalPolarities. However, note that the ${}^4\mathbf{I}$ in the output was calculated from all markers (see [CAUTION](#HI) in the previous section). We advice the users to preferentially filter their markers based on their diagnosticity after the `diem` analysis and recalculate the [hybrid indices](#HI) and the [${}^4\mathbf{I}$](#I4) from the filtered, polarised genotypes. # Optional outputs The `diem` function can output additional files useful for tracking algorithm convergence during iterations and ensuring repeatable initialisation of genome polarisation. The flowchart in [Figure 1](#fig1) specifies what processes generate the files and how the user controls what files will be stored and where.