Data-driven dictionary definition for diverse document domains

Michael Mahoney
Yahoo! Research

Data from numerous application domains can be represented as a term-document matrix. Examples include term-document data, recommendation system data, individual-gene data, and temporally- or spectrally-resolved image data. A common method of analysis of such document matrix data is the Singular Value Decomposition, which may be thought of as providing a dictionary of axes of maximum variation in the data. A difficulty with the applicability of this and related methods is that the dictionary elements, being linear combinations of actual data elements, often do not lend themselves to simple interpretation. We describe a CUR matrix decomposition, i.e., a low-rank matrix decomposition that is explicitly expressed in terms of a small number of actual rows and actual columns of the data matrix, that provides provably-good relative error guarantees (as opposed to the much coarser additive error guarantees of previous CUR decompositions) when compared with the best rank-k approximation. In addition, we show that when applied to diverse document domains, the extracted rows and columns have an immediate interpretation as dictionary elements and can be used successfully for structure extraction and prediction.

Audio (MP3 File, Podcast Ready) Presentation (PowerPoint File)

Back to Document Space