High-quality draft assemblies of large and small genomes from massively parallel DNA sequence data

David Jaffe
Broad Institute

We report the development of a new algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel sequencing (MPS) data from fifteen vertebrate genomes. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity and coverage of the genome. In particular, the base accuracy is high (≥ 99.95%) and the scaffold sizes (e.g. N50 size = 11.5 Mb for human and 17.4 Mb for mouse) are similar to those obtained with capillary-based sequencing. High-quality assembly of large genomes remains a key challenge of the field, yet the assembly of small genomes is also often difficult: it is presently limited by defects in amplification-based MPS data, including limited read length and uneven coverage. Unamplified single-molecule sequencing data, which have complementary properties, can now be generated on the Pacific Biosciences platform, and at current yields this is highly practical for small genomes. We demonstrate hybrid (Illumina plus Pacific Biosciences) assemblies of bacterial genomes. These assemblies are much better than the Illumina-only assemblies of the same genomes; in particular, they close nearly all small gaps.
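
For readers unfamiliar with the N50 statistic used for the scaffold sizes quoted above, the following is a minimal Python sketch of how it is commonly computed. It is illustrative only and is not taken from ALLPATHS-LG; the function name `n50` and the example lengths are assumptions made here. It uses the usual definition: the largest length L such that scaffolds of length at least L together cover at least half of the total assembled length.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Hypothetical helper for illustration; not part of any assembler.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings coverage to >= 50%.
        if 2 * running >= total:
            return length


# Made-up example lengths (arbitrary units), not real assembly data.
example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example))
```

Because N50 depends only on the sorted length distribution, it measures contiguity and says nothing about base accuracy or correctness of the joins, which is why the abstract reports those separately.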


Back to Workshop I: Next-generation Sequencing Technology and Algorithms for Primary Data Analysis