Scalable inference in probabilistic topic models

David Blei
Princeton University
Computer Science

Probabilistic topic modeling provides an important suite of tools for
the unsupervised analysis of large collections of documents. Topic
models uncover the underlying themes of the documents, and then use
those themes to aid in exploration, search, and prediction. However,
traditional topic modeling algorithms require multiple passes through
the collection. They come with a significant computational
burden, and much research on scaling up topic models has gone into
developing distributed variants. In this talk, I will describe a
different strategy---a topic modeling algorithm that can analyze
documents arriving in a stream and that does not require repeated
views of the same document. An analysis of 3.3M articles from
Wikipedia shows that the on-line approach fits topic models that are
as good as or better than those found with the traditional batch
approach, and fits them in a fraction of the time.

(This is joint work with Matthew Hoffman and Francis Bach.)
