Abstract

Geometric clustering in kernel embedding spaces for document corpora organization

Stephane Lafon
Google Inc.

Collections of documents are associated with a graph whose structure is analyzed and organized via "diffusion coordinates". In the corresponding embedding space, one can run simple geometric algorithms, such as k-means clustering or ball covering, to obtain a meaningful organization of the corpus. The dual approach is to consider the set of words contained in these documents as the data of interest, leading to an automatic lexicon analysis and concept extraction scheme. The problem of finding an "optimal" clustering of the documents in the embedding space is reduced to simple questions of matrix approximation and completion. This provides a rigorous justification for kernel k-means. In addition, this idea goes beyond the diffusion framework and the classical symmetric kernel setting, as it allows one to deal with arbitrary directed graphs and kernels.
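
As a rough illustration of the pipeline described above (document graph, diffusion coordinates, then k-means in the embedding space), here is a minimal Python sketch. It is not the speaker's implementation. The cosine-similarity kernel on term counts, the toy `docs` matrix, the function names `diffusion_coordinates` and `kmeans`, and the farthest-point initialization are all assumptions chosen for illustration, and the sketch covers only the classical symmetric kernel case, not the directed-graph extension.

```python
import numpy as np

def diffusion_coordinates(doc_term, n_coords=2, t=1):
    """Embed documents using eigenvectors of a random walk on the document graph."""
    X = np.asarray(doc_term, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = X @ X.T                          # assumed kernel: cosine similarity
    deg = K.sum(axis=1)                  # node degrees in the weighted graph
    # Symmetric conjugate of the Markov matrix P = D^{-1} K (same eigenvalues).
    S = K / np.sqrt(np.outer(deg, deg))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]       # sort eigenpairs in descending order
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(deg)[:, None]   # right eigenvectors of P
    # Drop the trivial constant eigenvector; scale the rest by eigenvalue^t.
    return psi[:, 1:n_coords + 1] * vals[1:n_coords + 1] ** t

def kmeans(Y, k, n_iter=20):
    """Plain Lloyd iterations with a deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(k - 1):
        d = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):      # leave empty clusters where they are
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

# Hypothetical toy corpus: rows are documents, columns are term counts.
# The first three documents share one vocabulary, the last three another,
# and the final column is a word common to all of them.
docs = np.array([
    [2, 1, 1, 0, 0, 0, 1],
    [1, 2, 1, 0, 0, 0, 1],
    [1, 1, 2, 0, 0, 0, 1],
    [0, 0, 0, 2, 1, 1, 1],
    [0, 0, 0, 1, 2, 1, 1],
    [0, 0, 0, 1, 1, 2, 1],
])
labels = kmeans(diffusion_coordinates(docs, n_coords=1), k=2)
print(labels)
```

Applying the same functions to the transpose of `docs` gives a sketch of the dual view, in which words rather than documents are embedded and clustered.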
