Data structures and algorithms for tractable statistical mining of very high dimensional data.

Andrew Moore
Carnegie Mellon University

This talk is about recent work on new ways to exploit preprocessed views of data tables for tractably solving high dimensional statistical queries. We'll describe deployments of these new algorithms in the realms of social networks, high throughput drug screening and fast spatial detection unnatural disease outbreaks.

In recent years, several groups have looked at methods for pre-storing general sufficient statistics of the data in spatial data structures such as kd-trees and ball-trees so that statistical operations involving aggregation, convolution and contingency tables become fast for large high dimensional datasets. In this talk we will summarize some of these approaches for the case of high dimensional data. We will show how nonparametric classification methods such as k-nearest-neighbor can be accelerated non-approximately even in up to a million dimensions. This work combines ball-trees with the insight that it is possible to answer k-nearest-neighbor classification queries exactly without ever needing to explicitly find the k-nearest neighbors. We will examine higher-order statistical queries such as spatial scanning over very multivariate time series, such as retail data from 25,000 pharmacies around the United States. And if time permits we will discuss the case of sparse high dimensional data and the computational issues in tractably building very large probabilistic models from such data.


Presentation (PDF File)
Video of Talk (RealPlayer File)

Back to Long Programs