Abstract

Phylogenetics without multiple sequence alignment

Mark Ragan
University of Queensland

In phylogenetics, the aim of multiple sequence alignment is to generate a position-by-position (column-by-column) hypothesis of homology along the full length of the molecular sequences under consideration. This alignment can then serve as input to a tree-inference program. However, multiple alignment is computationally hard, and it does not extend naturally to cases in which the sequences have been rearranged relative to each other, have been misassembled (or not assembled in the first place), or contain regions of lateral origin. In this presentation I explore alternative approaches that begin with the extraction of short, perfectly or near-perfectly matching character strings, variously known as words, k-mers or n-grams. Using synthetic and empirical data, I will survey the major alignment-free approaches in phylogenetics, consider their performance and robustness under various scenarios of sequence evolution, and comment on their computational scalability.
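
To make the idea of word extraction concrete, the following is a minimal sketch of one simple alignment-free distance. It is not taken from the talk: the choice of Jaccard distance on k-mer sets, and the function names `kmers`, `jaccard_distance` and `distance_matrix`, are assumptions made purely for illustration. Only exact (perfectly matching) words are considered here.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings (words) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k):
    """Illustrative distance: 1 - (shared k-mers / all k-mers)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k; treat them as identical.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs, k):
    """Pairwise distances for a dict mapping name -> sequence.

    No alignment is computed; each pair is compared only through
    the k-mers the two sequences share.
    """
    names = sorted(seqs)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(seqs[x], seqs[y], k)
        dist[(x, y)] = dist[(y, x)] = d
    return dist
```

A matrix of this kind could in principle be passed to a distance-based tree-inference method such as neighbour-joining. Because the k-mer sets ignore where in the sequence each word occurs, the measure is largely insensitive to rearrangement, which is part of the motivation for word-based approaches.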