Non-linear dimension reduction in the age of big data

Marina Meila
University of Washington

Dimension reduction is used to compress large, high-dimensional data, to discover predictive features, or simply to understand the data-generating process. Manifold learning is the most natural approach for the latter goal whenever the data can be well described by a small number of parameters.

Accurate manifold learning typically requires very large sample sizes, yet many existing implementations are not scalable, which has led to the commonly held belief that manifold learning algorithms are not practical for today's data. Another well-known drawback of low-dimensional non-linear mappings is that they distort the geometric properties of the original data, such as distances and angles. These unpredictable, algorithm-dependent distortions make it unsafe to pipeline the output of a manifold learning algorithm into other data analysis algorithms, limiting the use of these techniques in engineering and the sciences.

This talk will show how both limitations can be overcome. I will present a statistically founded methodology to estimate and then cancel out the distortions introduced by any embedding algorithm, thus effectively preserving the distances in the original data. This method also helps infer other manifold properties in a principled and semi-automatic way, hence with minimal reliance on human experts and visualization. On the computational side, I will demonstrate that, with careful use of sparse data structures, manifold learning can scale to data sets of millions of points or more.
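The distortion problem described above is easy to observe directly. As a minimal illustration (not the speaker's method, just a standard off-the-shelf embedding), the sketch below embeds a synthetic swiss-roll manifold with scikit-learn's Laplacian eigenmaps implementation and compares pairwise distances before and after: the distance ratios vary widely, showing the embedding is far from isometric. The data set, neighborhood size, and embedding choice are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding
from scipy.spatial.distance import pdist

# Sample a 2D manifold (swiss roll) embedded in 3D ambient space.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Embed into 2D with Laplacian eigenmaps (spectral embedding).
Y = SpectralEmbedding(n_components=2, n_neighbors=10,
                      random_state=0).fit_transform(X)

# Ratio of embedded to original pairwise distances. For an isometric
# map this would be a single constant; here it spreads over a wide
# range, i.e. distances and angles are distorted unpredictably.
ratio = pdist(Y) / pdist(X)
print("distance-ratio spread:", ratio.min(), "to", ratio.max())
```

Correcting for this distortion, rather than avoiding it, is the point of the metric-estimation methodology the talk presents.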

Joint work with Dominique Perrault-Joncas, James McQueen, Jacob VanderPlas, Zhongyue Zhang, Yu-Chia Chen, Grace Telford, Samson Koelle, Alon Milchgrub, Hanyu Zhang, Alvaro Vasquez-Maiagoytia


Back to Workshop I: Big Data Meets Large-Scale Computing