Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions
Seminar Room 1, Newton Institute
Abstract
Distance-based phylogenetic reconstruction methods rely heavily on accurate pairwise distance estimates. There are two separate sources of error in this estimation process: (1) the relatively short sequence alignments used to obtain distance estimates induce a "stochastic error" corresponding to estimation of model parameters from finite data; (2) model misspecification leads to a "fixed error" which does not depend on sequence length. It is common practice to assume some substitution model over the sequence data and use an additive substitution rate function for that model when computing pairwise distances. In the providential case when the assumed model coincides with the true model, which is typically unknown, the distance estimates will not be afflicted with fixed error. But even then, there is no reason to enforce a zero fixed error a priori when this causes elevated rates of stochastic error, especially in the case of short sequence alignments. This work challenges this paradigm of "using the most additive distance function at any cost". We do this by studying the contribution and effect of both fixed and stochastic error in distance estimation. We present a formal framework for quantifying the fixed error associated with a specific distance function and a given phylogenetic tree in a homogeneous substitution model. As an example, we study the behavior of the Jukes-Cantor distance formula in homogeneous instances of Kimura's two-parameter substitution model. The effects of fixed error are observed through analytic results and experiments on simulated data. In addition, we compare the performance of various distance functions on biological sequences. We evaluate reconstruction accuracy by comparing the reconstructed trees to an independently validated species tree.
Our study indicates that simple distance functions often outperform more sophisticated ones, even though the given sequence data appear to fit poorly the substitution model that the simple functions assume.
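
The fixed error described in the abstract can be illustrated with a small numerical sketch. It is not the framework presented in the talk; it only applies the standard Jukes-Cantor and Kimura two-parameter formulas to the expected (infinite-sequence) site differences, so no stochastic error is involved. The function names, the parameterization by `kappa` (transition rate divided by the rate of each transversion type), and the use of a three-taxon path to measure non-additivity are assumptions made for illustration.

```python
import math

def k2p_expected_differences(d, kappa):
    """Expected transition (P) and transversion (Q) proportions under K2P.

    d     -- expected substitutions per site along the path
    kappa -- alpha / beta (assumed parameterization; kappa = 1 is Jukes-Cantor)
    """
    beta_t = d / (kappa + 2.0)       # d = (alpha + 2 * beta) * t
    alpha_t = kappa * beta_t
    P = 0.25 + 0.25 * math.exp(-4 * beta_t) - 0.5 * math.exp(-2 * (alpha_t + beta_t))
    Q = 0.5 - 0.5 * math.exp(-4 * beta_t)
    return P, Q

def jc_distance(p):
    """Jukes-Cantor distance from the total proportion of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)

def k2p_distance(P, Q):
    """Kimura two-parameter distance from transition/transversion proportions."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

def jc_on_k2p(d, kappa):
    """JC distance computed on noise-free data generated by a K2P model."""
    P, Q = k2p_expected_differences(d, kappa)
    return jc_distance(P + Q)

def additivity_gap(d1, d2, kappa):
    """Deviation from additivity on a path x - y - z with edge lengths d1, d2.

    An additive distance function gives a gap of zero.
    """
    return jc_on_k2p(d1, kappa) + jc_on_k2p(d2, kappa) - jc_on_k2p(d1 + d2, kappa)

if __name__ == "__main__":
    for kappa in (1.0, 2.0, 4.0, 10.0):
        print(kappa, jc_on_k2p(0.5, kappa), additivity_gap(0.3, 0.4, kappa))
```

When `kappa` is 1 the two models coincide and the gap vanishes; for larger `kappa` the Jukes-Cantor formula underestimates the true distance and is no longer additive, and this deviation persists no matter how long the alignment is.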