Sparse-aware methods for analyzing high-throughput sequencing assays

Joseph Paulson
Genentech, Inc.

High-throughput genomic assays, especially those that profile the microbiome, have unique technical artifacts that require specialized analysis techniques. While data generation is no longer a challenge, statistical and computational challenges remain in analyzing the associated Big Data. For example, most differential abundance analyses use single-factor methods that do not control for confounding variables when calculating associations. This can lead to incorrect conclusions about changes in bacterial species composition, or to misidentification of changes in bacterial species associated with changes in host phenotype or in the local environment. Additionally, few studies incorporate the multiple sources of biomedical multi-omic data that are available, and many methods that aim to analyze microbial communities holistically are computationally intensive. We present some statistical solutions, namely zero-inflated mixed models, to under-sampling, a technical artifact in metagenomic and large heterogeneous RNA-Seq data. Applying these solutions to biomedical data will require privacy protections in the future.

Back to Algorithmic Challenges in Protecting Privacy for Biomedical Data