Multiple-sequence functional annotation and the generalized hidden Markov phylogeny
Phylogenetic shadowing is a new comparative genomics principle that enables the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. In particular, we show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe SHADOWER, our implementation of such a prediction system. We find that SHADOWER outperforms previously reported ab initio gene finders, including comparative human-mouse approaches, on a small sample of diverse exonic regions. Finally, we report an empirical analysis of SHADOWER's performance, which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation.
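To make the idea behind phylogenetic shadowing concrete, the following minimal sketch scores a single alignment column by comparing its likelihood under a slowly evolving (conserved) model with its likelihood under a faster (neutral) model. It is an illustration only, not the GHMP or the SHADOWER implementation: the Jukes-Cantor substitution model, the toy tree `TOY_TREE`, its branch lengths, the rate values `slow` and `fast`, and all function names are assumptions made for this example.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, column, rate):
    """Felsenstein pruning: for each base at `node`, the probability of the
    observed leaves below it. A leaf is a species name (str); an internal
    node is a list of (child, branch_length) pairs."""
    if isinstance(node, str):
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in node:
        child_partial = partial(child, column, rate)
        for b in BASES:
            result[b] *= sum(
                jc_prob(b, c, rate * length) * child_partial[c] for c in BASES
            )
    return result

def column_likelihood(tree, column, rate):
    """Likelihood of one alignment column, with a uniform base prior at the root."""
    return sum(0.25 * p for p in partial(tree, column, rate).values())

def log_odds_conserved(tree, column, slow=0.2, fast=1.0):
    """Log-likelihood ratio of the slow (conserved) versus fast (neutral) rate.
    The rate values are hypothetical, chosen only for illustration."""
    return math.log(
        column_likelihood(tree, column, slow) / column_likelihood(tree, column, fast)
    )

# Hypothetical three-species tree; branch lengths are invented for this example.
TOY_TREE = [([("human", 0.1), ("chimp", 0.1)], 0.05), ("baboon", 0.2)]

if __name__ == "__main__":
    identical = {"human": "A", "chimp": "A", "baboon": "A"}
    variable = {"human": "A", "chimp": "C", "baboon": "G"}
    print(log_odds_conserved(TOY_TREE, identical))  # positive: favors conservation
    print(log_odds_conserved(TOY_TREE, variable))   # negative: favors neutral rate
```

A column in which all species agree yields a positive score, and a column with several substitutions yields a negative one. In a full model, per-column scores of this kind would be combined with a state structure over the alignment rather than used in isolation.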