Optimal Sparse Segment Identification with Applications in Copy Number Variation Analysis

Hongzhe Li
University of Pennsylvania

Motivated by DNA copy number variation (CNV) analysis based on high-density single nucleotide polymorphism (SNP) data, we consider two problems of detecting and identifying sparse short segments in long one-dimensional sequences of data with additive Gaussian noise, where the number, lengths, locations, and population frequencies of the segments are unknown. The first problem is to identify the CNVs in a single sample. We present a statistical characterization of the identifiable region of a segment, within which it is possible to reliably separate the segment from noise. We develop an efficient likelihood ratio selection (LRS) procedure for identifying the segments in one sample and present its asymptotic optimality, in the sense that LRS can separate the signal segments from the noise as long as the signal segments are in the identifiable regions. The second problem is to simultaneously identify both rare and common CNVs based on a large set of population samples. We propose a proportion adaptive segment selection (PASS) procedure that automatically and optimally adjusts to the unknown proportions of carriers of the segment variants. PASS is shown to have desirable theoretical and numerical properties. The proposed methods are demonstrated with simulations and with an analysis of a neuroblastoma data set to identify CNVs in neuroblastoma patients. The results show that the LRS procedure yields greater power for detecting the true segments than some standard signal identification methods, and that PASS gains significant power in detecting CNVs by pooling information from multiple samples. Extensions to segment identification with next-generation sequencing data will also be discussed.

