Visualizing large data sets from next-generation sequencing

Jim Kent
University of California, Santa Cruz (UC Santa Cruz)

The explosive growth of large data sets fueled by next-generation sequencing has led to several new developments in the UCSC Genome Browser in visualization, search, and distributed data architecture. We've developed ways of condensing multiple data sets into the same display space using clustering and transparent overlay approaches. Related data, such as the same experiment done on multiple cell lines or with multiple antibodies, can be combined into a single track. We have a flexible system for selecting which data to display, either through full-text or field-by-field searches of track descriptions and metadata, or by browsing a hierarchy of tracks. Since the data volume has grown to the point where uploading it to a central repository has become problematic, we've developed and integrated a series of file formats that feature indexing and compression, so that the bulk of the data can reside remotely while only the part relevant to the displayed window in the genome is transferred over the internet. Using a combination of careful format design, caching, and parallel fetching of remote data, we are able to get near-local performance from remote "track data hubs."
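
The remote-access idea above (an index over independently compressed blocks, fetching only the blocks that overlap the displayed window, caching, and parallel fetching) can be illustrated with a minimal sketch. This is not the Genome Browser's actual code or file format: the names `build_remote_file`, `RemoteTrack`, and `BLOCK_SIZE` are hypothetical, the "remote" file is an in-memory byte string, and `_range_request` merely stands in for an HTTP byte-range request.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1000  # bases per compressed block (hypothetical choice)


def build_remote_file(values, block_size=BLOCK_SIZE):
    """Pack per-base values (ints 0..255) into independently compressed blocks.

    Returns (blob, index), where each index entry is
    (start_base, end_base, byte_offset, byte_length).
    """
    blob = bytearray()
    index = []
    for start in range(0, len(values), block_size):
        chunk = bytes(values[start:start + block_size])
        packed = zlib.compress(chunk)
        index.append((start, start + len(chunk), len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


class RemoteTrack:
    """Toy stand-in for a remotely hosted, indexed, compressed track file."""

    def __init__(self, blob, index):
        self._blob = blob            # pretend this lives on a remote server
        self.index = index           # small index, assumed fetched up front
        self.bytes_transferred = 0   # counts simulated network traffic
        self._cache = {}             # start_base -> decompressed block
        self._lock = threading.Lock()

    def _range_request(self, offset, length):
        # Stand-in for an HTTP Range request; only these bytes "cross the network".
        with self._lock:
            self.bytes_transferred += length
        return self._blob[offset:offset + length]

    def _load_block(self, entry):
        start, _end, offset, length = entry
        if start not in self._cache:
            self._cache[start] = zlib.decompress(self._range_request(offset, length))
        return self._cache[start]

    def fetch(self, win_start, win_end):
        """Return values for the half-open window [win_start, win_end)."""
        needed = [e for e in self.index if e[0] < win_end and e[1] > win_start]
        # Fetch the overlapping blocks in parallel; cached blocks cost nothing.
        with ThreadPoolExecutor(max_workers=4) as pool:
            blocks = list(pool.map(self._load_block, needed))
        out = bytearray()
        for (start, end, _off, _len), data in zip(needed, blocks):
            lo = max(win_start, start) - start
            hi = min(win_end, end) - start
            out += data[lo:hi]
        return bytes(out)
```

In this sketch, a call such as `RemoteTrack(blob, index).fetch(2500, 3200)` touches only the two blocks overlapping that window, and a later request inside the same region is served entirely from the cache. A real deployment would replace `_range_request` with actual byte-range requests against a web server.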

Workshop I: Next-generation Sequencing Technology and Algorithms for Primary Data Analysis