TUTORIAL - Variable Latent Semantic Indexing

Prabhakar Raghavan
Yahoo! Research

Latent Semantic Indexing (LSI) is a classical method for producing optimal low-rank approximations of a term-document matrix. However, in the context of a particular query distribution, the approximation thus produced need not be optimal. We propose VLSI, a new query-dependent (or "variable") low-rank approximation that minimizes approximation error for any specified query distribution. With this tool, it is possible to tailor the LSI technique to particular settings, often resulting in vastly improved approximations at much lower dimensionality. We validate this method via a series of experiments on classical corpora, showing that VLSI typically performs similarly to LSI with an order of magnitude fewer dimensions.
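
The contrast between the two approximations can be illustrated with a small NumPy sketch. It is not taken from the talk: it assumes the query-dependent error is the expected squared norm of q^T (D - D'), with the query distribution represented by a sample of query vectors, and the function names (lsi_approx, query_weighted_approx, expected_query_error) are hypothetical. Under that assumption, the best rank-k approximation comes from projecting D onto the top right singular vectors of Q^T D rather than of D itself.

```python
import numpy as np

def lsi_approx(D, k):
    """Classical LSI: best rank-k approximation of D in Frobenius norm."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def query_weighted_approx(D, Q, k):
    """Rank-k approximation tuned to sample queries (an assumed objective).

    D: terms x documents matrix.
    Q: terms x n_queries matrix whose columns are sample queries.
    Minimizes the mean of ||q^T (D - D_hat)||^2 over the sample queries
    by projecting D onto the top-k right singular vectors of Q^T D.
    """
    _, _, Vt = np.linalg.svd(Q.T @ D, full_matrices=False)
    Vk = Vt[:k].T                      # documents x k
    return D @ Vk @ Vk.T

def expected_query_error(D, D_hat, Q):
    """Mean squared error of query-document scores over the sample queries."""
    R = Q.T @ (D - D_hat)              # n_queries x documents
    return float(np.mean(np.sum(R ** 2, axis=1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.random((30, 20))           # synthetic term-document matrix
    Q = np.zeros((30, 200))
    Q[:5] = rng.random((5, 200))       # queries that only use the first 5 terms
    k = 3
    print("LSI error on queries:     ", expected_query_error(D, lsi_approx(D, k), Q))
    print("weighted error on queries:", expected_query_error(D, query_weighted_approx(D, Q, k), Q))
```

When the queries concentrate on a few terms, as in the synthetic example, the query-weighted approximation can never do worse than LSI on the query error at the same rank, while LSI remains the better approximation of D as a whole.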




Part of the Graduate Summer School: Intelligent Extraction of Information from Graphs and High Dimensional Data