Most statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion, i.e., an indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of indels have not been fully investigated.
We prove that treating indels as missing data can be inconsistent for a general (and rather simple) model of sequence evolution, even when given the true alignment. We also prove that the true tree can be identified solely from the pattern of gaps in the true alignment (that is, character states can be ignored).
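To make concrete what "the pattern of gaps" means when character states are ignored, the following minimal sketch reduces an alignment to a presence/absence encoding. The function name `gap_pattern`, the gap symbol `-`, and the toy alignment are assumptions made for illustration; they are not taken from the paper, and the sketch shows only the data reduction, not the tree identification argument itself.

```python
def gap_pattern(alignment):
    """Map each aligned sequence to a 0/1 string: 1 = residue present, 0 = gap.

    `alignment` is a dict from taxon name to aligned sequence. The gap
    symbol '-' is an assumption of this sketch.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return {
        name: "".join("0" if ch == "-" else "1" for ch in seq)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment (illustrative only, not data from the paper).
toy = {
    "taxon_A": "ACG--TA",
    "taxon_B": "ACG--TT",
    "taxon_C": "A-GTCTA",
    "taxon_D": "A-GTCAA",
}

patterns = gap_pattern(toy)
for name, pattern in patterns.items():
    print(name, pattern)
```

In this toy input, taxa A and B share one gap pattern and taxa C and D share another, even though their nucleotides differ; the encoding discards the nucleotide identities entirely and keeps only where gaps occur.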
Our results show that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model. Moreover, the pattern of gaps in an accurate alignment may give substantial information about the underlying phylogeny, over and above what is present in the character states. These observations suggest that the recent focus on developing statistical methods that treat indel events properly is an important direction for phylogeny estimation.