Integrating ChIP-seq and DNase-seq ENCODE Data from Multiple Cell Types Using Self-Organizing Maps

Ali Mortazavi
University of California, Irvine (UCI)

A fundamental property of transcriptional regulation in metazoa is its cell-type specificity. These regulatory differences are established by elaborate molecular machinery consisting of transcription factors, cofactors, and chromatin-modifying complexes. Within the ENCODE project, a systematic collection of histone modification ChIP-seq and DNase-hypersensitivity datasets has mapped the chromatin landscape of transcriptionally active, inactive, and inaccessible regions across multiple cell types. Of particular interest are elements that change chromatin state in specific patterns among the different cell types. Based on RNA output patterns and cis-acting regulatory module structure, we expect potentially thousands of relatively small cohorts of similarly marked cis-regulatory regions, each with a specific chromatin profile across the participating cell types. To find these cohorts and relate them to function and to transcription factor binding, we use a self-organizing map (SOM), an unsupervised machine-learning method for clustering, visualizing, and mining high-dimensional data. One strength of SOMs is that they can cluster such cohorts at both global and increasingly local levels. Here we use a large, fine-grained SOM constructed from ENCODE Tier 1 and Tier 2 ChIP-seq histone mark and DNase-seq datasets to cluster the genome into 1350 coherent units, and we show that these units capture both global and cell-type-specific chromatin profiles. Using gene ontology analysis, we probe the functional significance of these chromatin patterns and show their correlation with gene expression. We compare the SOM built from our segmentation results to SOMs built from the Hidden Markov Model (HMM)-based methods used within ENCODE and show that our SOM identifies distinct, biologically interesting cohorts at a finer granularity than the HMMs.
By performing global hierarchical clustering of the SOM, we identified distinctive H3K27me3 signals in map units dominated by developmental transcription factors that differed between untransformed samples and cancerous cell lines in the ENCODE collection.
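As a rough illustration of the clustering idea described above, the following is a minimal NumPy sketch of a generic SOM, not the implementation used in this work. The abstract does not specify the map topology, training schedule, or data layout, so everything here is an assumption: the function names `train_som` and `assign_units`, the rectangular 3 x 3 grid, the linear decay schedules, and the toy matrix in which rows stand for genomic segments and columns for hypothetical mark-by-cell-type signals.

```python
import numpy as np

def train_som(data, rows, cols, n_iter=2000, lr0=0.5, sigma0=None, seed=0):
    """Train a rectangular SOM with online updates.

    Returns unit weights of shape (rows, cols, dim). All defaults are
    illustrative assumptions, not values taken from the abstract.
    """
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    # Initialize each map unit from a randomly chosen data row.
    init = rng.integers(0, n, size=rows * cols)
    weights = data[init].reshape(rows, cols, dim).astype(float)
    # Grid coordinates of every unit, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood shrinks to a floor
        x = data[rng.integers(n)]
        # Best-matching unit (BMU): the unit whose weights are closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighborhood on the grid, centered on the BMU.
        grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Pull the BMU and its grid neighbors toward x.
        weights += lr * h[..., None] * (x - weights)
    return weights

def assign_units(data, weights):
    """Return the (row, col) grid position of the BMU for each data row."""
    rows, cols, dim = weights.shape
    flat = weights.reshape(-1, dim)
    d = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)
    idx = np.argmin(d, axis=1)
    return np.stack(np.unravel_index(idx, (rows, cols)), axis=1)

# Toy data (hypothetical): 200 segments x 6 columns, read as
# 3 marks in cell type 1 followed by the same 3 marks in cell type 2.
rng = np.random.default_rng(1)
profile_a = np.array([5.0, 5.0, 5.0, 0.0, 0.0, 0.0])  # marked only in cell type 1
profile_b = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])  # marked only in cell type 2
signal = np.vstack([
    profile_a + 0.1 * rng.standard_normal((100, 6)),
    profile_b + 0.1 * rng.standard_normal((100, 6)),
])

weights = train_som(signal, rows=3, cols=3)
units = assign_units(signal, weights)
```

In this sketch each segment ends up assigned to one map unit, and segments with similar profiles across cell types tend to land on the same or neighboring units. A map at the scale described in the abstract would use far more units and real signal tracks in place of the toy matrix.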

Ali Mortazavi1,2*, Shirley Pepke3,4*, Georgi Marinov4, Barbara Wold4,5

1. Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA 92697

2. Center for Complex Biological Systems, University of California Irvine, Irvine, CA 92697

3. Center for Advanced Computing Research, California Institute of Technology, Pasadena, CA 91125

4. Division of Biology, California Institute of Technology, Pasadena, CA 91125

5. Beckman Institute, California Institute of Technology, Pasadena, CA 91125

Workshop II: Transcriptomics and Epigenomics