Co-estimation of sequence alignments and trees

Siavash Mirarabbaygi
University of Texas at Austin
Computer Science

The quality of multiple sequence alignment (MSA) affects many bioinformatics analyses, including phylogenetic reconstruction. While many MSA tools have been developed, performance studies focusing on large datasets and on datasets with high rates of evolution have shown mixed results. One promising approach for such datasets is co-estimation of alignments and phylogenetic trees. SATé-I and SATé-II (Liu et al., 2011) were among the first co-estimation tools that could accurately align up to 28,000 sequences, even those that evolve at high rates. Yet phylogenetic analyses of datasets containing more than 100,000 sequences are now being attempted, and little is known about how well alignment methods perform on such ultra-large datasets. We present PASTA, “Practical Alignments using SATé and TrAnsitivity”, a new MSA tool based on SATé but with better accuracy and improved scalability. PASTA begins with an alignment and tree estimated using a very simple profile HMM-based technique and then re-aligns the sequences using the tree. If desired, a new tree can be estimated on the new alignment, and the algorithm can iterate. We demonstrate PASTA’s speed and accuracy on a collection of biological and simulated datasets, including a 200K-sequence RNASim dataset, which PASTA aligns in less than 24 hours on a 12-core machine. PASTA has better accuracy than all other methods we tested, including SATé, and is many times faster than SATé. Unlike SATé and most other tools, PASTA can analyze datasets with up to 1M sequences in a reasonable time.
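
The iteration described above (start from an initial alignment and tree, re-align using the tree, optionally re-estimate the tree, repeat) can be sketched as a small driver loop. This is only an illustrative outline under stated assumptions, not PASTA's implementation: the names `coestimate`, `pad_align`, `star_tree`, and `toy_initial` are hypothetical, and the toy stand-ins merely pad sequences with gaps and emit a star-shaped Newick string so that the loop can run end to end. A real pipeline would plug in an actual aligner and tree estimator.

```python
from typing import Callable, Dict, Tuple

Sequences = Dict[str, str]  # name -> unaligned sequence
Alignment = Dict[str, str]  # name -> gapped sequence, all of equal length
Tree = str                  # placeholder representation, e.g. a Newick string


def coestimate(
    seqs: Sequences,
    initial: Callable[[Sequences], Tuple[Alignment, Tree]],
    realign: Callable[[Sequences, Tree], Alignment],
    estimate_tree: Callable[[Alignment], Tree],
    iterations: int = 3,
) -> Tuple[Alignment, Tree]:
    """Alternate between tree-guided re-alignment and tree re-estimation."""
    alignment, tree = initial(seqs)           # starting alignment and tree
    for _ in range(iterations):
        alignment = realign(seqs, tree)       # re-align using the current tree
        tree = estimate_tree(alignment)       # new tree on the new alignment
    return alignment, tree


# --- Toy stand-ins (hypothetical; NOT real alignment or tree methods) ---

def pad_align(seqs: Sequences, tree: Tree = "") -> Alignment:
    """Right-pad every sequence with gaps; ignores the tree entirely."""
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}


def star_tree(alignment: Alignment) -> Tree:
    """Return an unresolved star tree over the taxon names, in Newick form."""
    return "(" + ",".join(sorted(alignment)) + ");"


def toy_initial(seqs: Sequences) -> Tuple[Alignment, Tree]:
    aln = pad_align(seqs)
    return aln, star_tree(aln)


alignment, tree = coestimate(
    {"a": "ACGT", "b": "ACG", "c": "AC"},
    toy_initial, pad_align, star_tree, iterations=2,
)
```

Passing the three steps in as functions keeps the loop independent of any particular aligner or tree estimator, which mirrors the abstract's description of the iteration as optional and repeatable.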
