Parameterizing sequence alignment with an explicit evolutionary model reduces homologous overextension artifacts

Elena Rivas
Howard Hughes Medical Institute

The alignment scoring systems in standard sequence homology search programs correspond to implicit probabilistic models of sequence evolution. Using one fixed score system (BLOSUM62 with some gap open/extend costs, for example) corresponds to making the unrealistic assumption that all sequence relationships have diverged for the same amount of time. Adoption of explicit time-dependent evolutionary models for sequence alignment scoring has been hindered by algorithmic complexity and technical difficulty.
Here, we have studied models of sequence evolution that describe the occurrence of insertion and deletion events over time, in a fashion compatible with the parameterization of standard profile HMM methods and, by extension, other standard sequence alignment methods that use affine gap costs. We observe that preserving a convenient macroscopic formulation (such as that of BLAST or profile HMMs) requires making important compromises regarding the nature of the microscopic (instantaneous) evolutionary events allowed. We have identified several such “affine-compatible” evolutionary models. We have implemented these evolutionary models in a new pair HMM program to perform multiple sequence local alignments, and in the profile HMM program HMMER used for homology detection and alignment. We test different aspects of the search performance of these “optimized branch length” models, including detection (discrimination of homologous from nonhomologous sequences) and coverage (discrimination of residues in a homologous region from nonhomologous flanking residues). Contrary to our expectations, we find that a single fixed-time long branch parameterization suffices for detecting both distant and close relationships; an optimal branch length parameterization only provides a gain in detection measures when homologous regions are very short. In contrast, we do see a substantive improvement in coverage measures. Optimal branch parameterization reduces a known artifact called “homologous overextension”, in which local alignments erroneously extend through flanking nonhomologous residues.
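The point that a fixed score system implies a single divergence time can be illustrated with a toy calculation. The sketch below is not one of the affine-compatible models from the talk and ignores insertions and deletions entirely. It assumes a hypothetical symmetric substitution model over 20 residues with a uniform background (a Jukes-Cantor-style simplification), with time t measured in expected substitutions per site. The function names are illustrative only.

```python
import math

K = 20  # alphabet size; uniform background frequencies are a simplifying assumption


def p_same(t, k=K):
    """Probability that a residue is unchanged after time t (t in expected substitutions per site)."""
    return 1.0 / k + (k - 1) / k * math.exp(-k * t / (k - 1))


def p_diff(t, k=K):
    """Probability of observing one specific different residue after time t."""
    return 1.0 / k - 1.0 / k * math.exp(-k * t / (k - 1))


def log_odds_scores(t, k=K):
    """Return (match, mismatch) log-odds scores in bits for divergence time t > 0."""
    q = 1.0 / k  # background probability of any residue under the uniform assumption
    match = math.log2(p_same(t, k) / q)
    mismatch = math.log2(p_diff(t, k) / q)
    return match, mismatch


if __name__ == "__main__":
    # Each choice of t yields a different score pair; a fixed matrix pins t to one value.
    for t in (0.1, 0.5, 1.0, 2.0):
        m, x = log_odds_scores(t)
        print(f"t={t:.1f}  match={m:+.2f} bits  mismatch={x:+.2f} bits")
```

Under these assumptions, short times give large match rewards and severe mismatch penalties, while both scores shrink toward zero as t grows. Choosing one score matrix therefore amounts to choosing one value of t for every comparison.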

Presentation (PDF File)
