Inferring a Tree of Blobs with TINNIK

library(MSCquartets)

Why use TINNIK?

TINNIK infers a “tree of blobs” under the network multispecies coalescent model. The resulting unrooted topological tree partially depicts the species relationships that led to a collection of gene trees, showing only the cut edges of the network (ones which join otherwise unconnected pieces of the network). Any cycles or more complicated blobs formed by reticulations in the network are contracted to multifurcations (polytomies) in this tree.

The tree of blobs thus isolates those parts of the network where reticulations have made relationships non-tree-like. A researcher might then apply other methods to investigate the structure of each blob, perhaps by reducing to consideration of a smaller number of taxa. However, with current methods inferring complicated blob structure accurately may be difficult or impossible. Since the TINNIK algorithm is statistically consistent under the NMSC model regardless of the unknown blob structure, it may provide the strongest network inference possible without making assumptions on the unknown network structure.

Many current methods of network inference are statistically consistent only if the unknown network is level-1 (each blob is as simple as possible, containing only a single reticulation). They can only return a level-1 network as output, and may give no indication as to whether a level-1 network is an adequate model for the data. When a level-1 method is used for network inference, checking that TINNIK’s results are consistent with the level-1 output can provide a researcher some justification for that assumption.

Preparing the input data

TINNIK requires input in the form of a collection of gene trees on a common collection of taxa. Thus from multilocus sequence data it is necessary to first align each gene’s sequences and use standard phylogenetic methods and software (e.g., IQtree, RAxML) to obtain an unrooted topological tree for each gene. While these trees are not strictly data, as they are themselves inferred, TINNIK treats them as such. If the gene trees are rooted, or have edge lengths, that information is simply ignored.

We work with an example data set of 1730 gene trees of 7 Papionini species, extracted from Vanderpool et al. (2020) and analyzed by TINNIK in Allman et al. (2024). Gene trees in Newick format are read from a text file to a multiphylo object using an ape function:

# read text file of gene trees supplied with MSCquartets package

gts=read.tree(file = system.file("extdata","dataPapioniniVanderpool",package="MSCquartets"))

Notes:

While some missing taxa on gene trees can be handled by TINNIK, it is necessary that each subset of 4 taxa appears on at least 1 gene tree and desirable that it appears on many. Statistical tests are performed for each such set, and the amount of data for these is determined by the count of trees displaying the set.

All understanding of TINNIK’s statistical behavior is established by assuming that the input gene trees are a true sample under the NMSC model. In practice, inferred gene trees are used, which likely contain some error. In particular, as with other network inference methods using inferred gene trees as input, widespread lack of resolution in the gene trees is unlikely to lead to a good analysis,

An initial analysis

The first steps of TINNIK are to count occurrences of each quartet tree topology across all gene trees, for each set of 4 taxa, and then to apply two hypothesis tests to these counts to judge their fit to 4-taxon star and resolved species trees, producing two p-values, “p_star” and “p_T3”

The easiest way to do this is to run the commands below, where we save the table of quartet counts and associated p-values as “pT”. The TINNIK command here does a full analysis, which we generally should not take as our final one, as we are using default nominal levels for the hypothesis tests.


# perform initial TINNIK analysis for gene trees, using defaults

output=TINNIK(gts)
#> Analyzing 7 taxa: Cercocebus_atys, Macaca_fascicularis, Macaca_mulatta, Macaca_nemestrina, Mandrillus_leucophaeus, Papio_anubis, Theropithecus_gelada
#> Counting occurrences of displayed quartets for 35 four-taxon subsets of 7 taxa across 1730 gene trees.
#> Warning in quartetTable(genetrees, taxanames, epsilon = epsilon): Some taxa
#> missing from some trees.
#> Applying hypothesis test for model T3 to 35 quartets.
#> Applying hypothesis test for star tree model to 35 quartets.

# save table of quartet information and p-values

pT=output$pTable

The TINNIK command produced three plots. The first is a simplex plot of the hypothesis test results, at the specified levels. Each set of 4 taxa is represented by one plotted symbol, with those near the 3 lines indicating a tree-like relationship, and those farther from the lines a putative non-tree-like one. Those near the centroid of the simplex indicate a star-like relationship. (See Allman, Mitchell, and Rhodes (2021) for a more complete explanation of these plots.)

The second simplex plot shows the outcome of the TINNIK inference rule, where some of the quartets initially viewed as tree-like (T-quartets) from the hypothesis test are subsequently flagged as coming off a blob (B-quartets). (See Allman et al. (2022) or Allman et al. (2024) for a more complete explanation.)

The final plot shows TINNIK’s inferred tree of blobs, with 5- and 4- multifurcations joined by an edge, indicating 4- and 5-blobs. Note this result is dependent on the levels of the hypothesis tests shown in the subtitle of the plot. The default values are \(\alpha=.05\) for the test with null hypothesis “the quartet has a tree-like relationship”, and \(\beta=.95\) for the test with null hypothesis “the quartet has a star-like relationship”.

Notes:

Since this data set is relatively small, it does not take long to produce the table of quartet counts and p-values, and we could simply recalculate it for additional runs. For a large data set (many taxa, many gene trees) producing this table is the most significant factor in run-time, so saving it for reuse is wise.

The table “pT” can be printed directly in R, or printed in a nicer format with quartetTablePrint. However, its rows should not be reordered, as the TINNIK algorithm uses an indexing function to access them quickly.

Instead of the T3 hypothesis test for tree-likeness, the analysis can be run using the cut test (see the TINNIK documentation). Doing so may produce fewer initial B-quartets, and thus a more resolved tree of blobs, but is more appropriate only for what might be considered extreme network structures (see Allman et al. (2024)).

Varying the test levels

An initial run of TINNIK with default test levels is rarely a sufficient analysis. One should always vary the test levels \(\alpha\) and \(\beta\) to judge robustness of the inferred tree of blobs to their choice.

We generally have little understanding of the error that might be in the gene trees, and making different choices of the levels can be used to address this somewhat. Setting \(\alpha=0\) will lead to all quartets being judged as tree-like, and increasing \(\alpha\) will potentially increase the number considered to have putative 4-blobs. In a noisy data set, the default \(\alpha=.05\) may result in many putative 4-blobs, and a smaller value of \(\alpha\) may be needed for a useful inference. Since hypothesis testing is done on each set of 4 taxa independently, even with no gene tree error merely having more taxa present makes the erroneous judgment of some putative 4-blobs more likely.

Similarly, a value of \(\beta=1\) will result in no quartet relationship being judged as star-like, but decreasing \(\beta\) may result in more star-like quartets.

Smaller \(\alpha\)s and/or larger \(\beta\)s tend to produce more resolution in the inferred tree of blobs.

For our example data set, we see no symbols plotted near the centroid in the first simplex plot, so only values of \(\beta\) very close to 0 are likely to result in any judgment of star trees. An examination of “pT” confirms this, as the maximum in the “p_star” column is \(\approx.112e-32\). To see the effect of a tiny \(\beta\), we reuse the tabulated quartet information in “pT” and enter:


TINNIK(pT, alpha=.05, beta=1e-40)
#> Some points (green) not rejected as star, but rejected as tree.

Here the first simplex plot shows a few symbols where the star-tree is now not rejected. However, as these are treated by the TINNIK algorithm as B-quartets, we see no change in the next simplex plot or the inferred tree of blobs from our initial TINNIK run.

Since the original simplex plot shows rejection of tree-likeness for some quartets plotted near the line segments, varying \(\alpha\) is likely to be more interesting. Using a smaller \(\alpha\), to potentially increase the number of quartets which are judged as tree-like, we enter:


TINNIK(pT, alpha=.01, beta=.95)

The first simplex plot indeed shows fewer red triangles (initial B-quartets) and more blue circles, while the second plot has fewer gold squares (all B-quartets) and more green “x”s, producing an inferred tree of blobs with more resolution — in fact, a fully-resolved tree.

This leads us to try a choice of \(\alpha\) intermediate to the previous ones, so we enter:


TINNIK(pT, alpha=.02, beta=.95)

Here the simplex plots and the tree of blobs vary from either of the earlier ones, showing one of the multifurcations in our initial analysis, but resolving the other. Further varying \(\alpha\) can give intervals over which these analyses are stable.

Note:

In reporting any TINNIK analysis it is essential to report the test levels used, and we strongly recommend exploring with values beyond the defaults.

References

Allman, E. S., H. Baños, J. D. Mitchell, and J. A. Rhodes. 2022. “The Tree of Blobs of a Species Network: Identifiability Under the Coalescent.” Journal of Mathematical Biology 86 (1): 10. https://doi.org/10.1007/s00285-022-01838-9.
———. 2024. “TINNiK: Inference of the Tree of Blobs of a Species Network Under the Coalescent.” bioRxiv. https://doi.org/10.1101/2024.04.20.590418.
Allman, E. S., J. D. Mitchell, and J. A Rhodes. 2021. Gene Tree Discord, Simplex Plots, and Statistical Tests under the Coalescent.” Systematic Biology 71 (4): 929–42. https://doi.org/10.1093/sysbio/syab008.
Vanderpool, D., B. Q. Minh, R. Lanfear, D. Hughes, S. Murali, R. A. Harris, M. Raveendran, et al. 2020. “Primate Phylogenomics Uncovers Multiple Rapid Radiations and Ancient Interspecific Introgression.” PLOS Biology 18 (12): 1–27. https://doi.org/10.1371/journal.pbio.3000954.