Probabilistic topic modeling provides an important suite of tools for the unsupervised analysis of large collections of documents. Topic models uncover the underlying themes of the documents, and then use those themes to aid in exploration, search, and prediction. However, traditional topic modeling algorithms require multiple passes through the collection, which imposes a significant computational burden, and much research on scaling up topic models has gone into developing distributed variants. In this talk, I will describe a different strategy: a topic modeling algorithm that can analyze documents arriving in a stream and that does not require repeated views of the same document. An analysis of 3.3M articles from Wikipedia shows that the on-line approach fits topic models that are as good as or better than those found with the traditional batch approach, and fits them in a fraction of the time.
(This is joint work with Matthew Hoffman and Francis Bach.)