A comparison of error models for gene expression data with applications to clustering

Kay Tatsuoka
SmithKlineBeecham
Cheminformatics

Co-authors: Steve Clark, Jason Ruan, David Gruben, Mary Brawner, Robert Knowlton, Jeff Mooney, Shawn O'Brien, Frances Stewart

Microarrays have arrived as a means of high-throughput screening of genes and potential targets. Our lab processes thousands of experiments per year. The resulting data contain genuine outliers at both high and low signal levels, caused by, for example, dust or dropouts. In addition, variability is inherent in all array-based gene expression data. This natural variation is due to differences in how genes respond to the specific experimental conditions of the array. In a production environment, automated detection of outliers and quantification of variability are required both for accurate measurement of low signals and for building high-quality databases for further data mining.

We describe our methodology for identifying outliers and providing p-values and confidence limits for gene expression measurements. We describe the role of individual spot quality indices in outlier identification as well as the use of replication. We describe experiments that verify that our confidence intervals give accurate coverage, and we compare our coverage with that given by other error models. We have found that our error model allows us to detect 2-fold changes with 99.5% confidence.
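As an illustration only, a coverage check of the kind mentioned above can be sketched as a small simulation. The sketch assumes a much simpler error model than the one in this work: additive Gaussian noise with known standard deviation on the log2 ratio scale. The function name `coverage` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def coverage(true_log_ratio=1.0, sigma=0.3, n_reps=3,
             level=0.995, n_trials=20000, seed=0):
    """Estimate how often a nominal `level` confidence interval
    contains the true log2 ratio.

    Assumption (not from the abstract): replicate log2 ratios are
    Gaussian around the true value with known standard deviation sigma.
    """
    rng = random.Random(seed)
    # Two-sided normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n_reps ** 0.5
    hits = 0
    for _ in range(n_trials):
        # Simulate one spot measured n_reps times.
        obs = [rng.gauss(true_log_ratio, sigma) for _ in range(n_reps)]
        m = mean(obs)
        if m - half_width <= true_log_ratio <= m + half_width:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    # A true log2 ratio of 1.0 corresponds to a 2-fold change.
    print(f"empirical coverage: {coverage():.4f}")
```

If the assumed error model matches the simulated data, the empirical coverage should sit close to the nominal level; a mismatched error model would show up as over- or under-coverage.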

Inference from data on multiple cell lines or multiple time points typically relies on a single clustering and does not incorporate the uncertainty in microarray data. We propose methodology to analyse these data that incorporates our error models and provides confidence statements about the classification of genes of unknown function. The value of the methodology is illustrated via simulations and on real datasets, and it is applied to k-means and hierarchical clustering. For k-means, a simple simulation shows a 20 percent improvement in the misclassification rate. The techniques used come from machine learning and, in the case of hierarchical clustering, from phylogenetic analysis of DNA and amino acid sequences.
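One way to attach confidence statements to a clustering, sketched here under stated assumptions rather than as the method of this work, is to perturb each expression profile according to an error model, recluster many times, and record how often each pair of genes lands in the same cluster. This is similar in spirit to resampling-based support values in phylogenetics. The Gaussian per-measurement error, the function names `kmeans` and `co_clustering_support`, and the toy data are all hypothetical.

```python
import random


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def co_clustering_support(profiles, sigmas, k, n_resamples=200, seed=0):
    """Fraction of perturbed reclusterings in which genes i and j
    share a cluster.

    Assumption: each measurement has independent Gaussian error with
    the standard deviation given in `sigmas` (same shape as `profiles`).
    """
    rng = random.Random(seed)
    n = len(profiles)
    together = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        # Draw one plausible dataset from the assumed error model.
        noisy = [tuple(rng.gauss(v, s) for v, s in zip(p, sd))
                 for p, sd in zip(profiles, sigmas)]
        labels = kmeans(noisy, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[c / n_resamples for c in row] for row in together]


if __name__ == "__main__":
    # Toy data: two well-separated groups of three genes each.
    profiles = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
    sigmas = [(0.1, 0.1)] * len(profiles)
    support = co_clustering_support(profiles, sigmas, k=2)
    print(support[0][1], support[0][3])
```

Counting pairwise co-occurrence avoids having to match cluster labels across resamples, since k-means labels are arbitrary from run to run. Pairs with support near 1 are grouped together robustly; intermediate values flag genes whose classification is sensitive to measurement error.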

