Validation and Reproducibility by Geometry, for Unsupervised Learning

Marina Meila
University of Washington
Statistics

Machine learning is many times faster than humans at finding patterns, yet the task of validating these patterns as "meaningful" is still left to the human expert or to further experiments. In this talk I will present three instances in which geometric knowledge is used to augment unsupervised learning with data-driven validation.

In the case of clustering, I will demonstrate a new method to guarantee that a clustering is approximately correct, without requiring knowledge about the data distribution. This framework is similar to PAC bounds in supervised learning; unlike PAC bounds, the bounds for clustering can be calculated exactly and can be of direct practical utility.

In manifold learning, I will present implementable solutions to the following well-known problems. Embedding algorithms often distort distances and other geometric properties of the data; we introduce a statistically grounded method to estimate and then cancel out these distortions, thus effectively preserving the distances in the original data. This method is based on augmenting the algorithm output with a Riemannian metric, i.e. with the information that allows one to reconstruct the original geometry.
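
As an illustration only, and not necessarily the estimator presented in the talk, the following sketch shows one way a Riemannian metric could be attached to an embedding. It assumes a matrix L that approximates the Laplace-Beltrami operator (sign convention: div grad, e.g. a row-stochastic kernel matrix minus the identity) and relies on the identity <grad f, grad g> = 1/2 [L(fg) - f L(g) - g L(f)]. The function name dual_metric and its arguments are hypothetical.

```python
import numpy as np

def dual_metric(L, Y):
    """Estimate the dual (inverse) metric at every embedded point.

    L : (n, n) array, assumed to approximate the Laplace-Beltrami operator
        with the 'div grad' sign convention (e.g. P - I, P row-stochastic).
    Y : (n, m) array of embedding coordinates.
    Returns H of shape (n, m, m), where H[k] is a symmetric matrix
    estimating the inner products of the coordinate gradients at point k.
    """
    n, m = Y.shape
    LY = L @ Y  # L applied to each embedding coordinate
    H = np.empty((n, m, m))
    for i in range(m):
        for j in range(i, m):
            # 1/2 [ L(y_i y_j) - y_i L(y_j) - y_j L(y_i) ], evaluated pointwise
            h = 0.5 * (L @ (Y[:, i] * Y[:, j])
                       - Y[:, i] * LY[:, j]
                       - Y[:, j] * LY[:, i])
            H[:, i, j] = h
            H[:, j, i] = h
    return H
```

Under these assumptions, the metric at each point would be obtained as a pseudoinverse of H[k] (for example with np.linalg.pinv, restricted to the leading directions when the intrinsic dimension is smaller than m), and distances measured in the embedding can then be corrected with it.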

The embedding coordinates obtained by dimension reduction are often identified, by visual inspection, with interpretable properties of the data. The third and last part of the talk will describe a method to semi-automate this process. The human expert provides a dictionary of meaningful functions, and our algorithm selects a subset of these that can parametrize a manifold via an arbitrary smooth non-linear transformation.

Joint work with Dominique Perrault-Joncas, James McQueen, Yu-chia Chen, Samson Koelle, Hanyu Zhang

Workshop III: Validation and Guarantees in Learning Physical Models: from Patterns to Governing Equations to Laws of Nature