Probabilistic topic modeling provides a suite of tools for analyzing large collections of electronic documents. With a collection as input, topic modeling algorithms uncover its underlying themes and decompose its documents according to those themes. We can use topic models to explore the thematic structure of a large collection of documents or to solve a variety of prediction problems about text.
Topic models are based on hierarchical mixed-membership models, statistical models in which each document expresses a shared set of components (called topics) in its own proportions. The computational problem is to condition on a collection of observed documents and estimate the posterior distribution of the topics and per-document proportions. In modern data sets, this amounts to posterior inference with billions of latent variables.
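To make the mixed-membership idea concrete, here is a minimal sketch of the generative process such a model assumes. It takes the common choice of Dirichlet priors on both the topics and the per-document proportions (as in latent Dirichlet allocation); that choice, the function names, and the hyperparameters `alpha` and `eta` are illustrative assumptions rather than details given in this abstract.

```python
import random


def sample_dirichlet(rng, concentration, dim):
    """Draw from a symmetric Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, eta=0.5, seed=0):
    """Simulate documents from a mixed-membership topic model.

    alpha and eta are hypothetical symmetric Dirichlet hyperparameters.
    """
    rng = random.Random(seed)
    # Topics are shared by the whole collection: each one is a
    # distribution over the vocabulary.
    topics = [sample_dirichlet(rng, eta, vocab_size) for _ in range(n_topics)]
    proportions, docs = [], []
    for _ in range(n_docs):
        # Each document mixes the shared topics in its own proportions.
        theta = sample_dirichlet(rng, alpha, n_topics)
        # Every word first picks a topic, then a word from that topic.
        assignments = rng.choices(range(n_topics), weights=theta, k=doc_len)
        words = [rng.choices(range(vocab_size), weights=topics[z])[0]
                 for z in assignments]
        proportions.append(theta)
        docs.append(words)
    return topics, proportions, docs
```

Calling `generate_corpus(5, 30, 3, 12)` would return three topics over a 12-word vocabulary, five sets of proportions, and five documents of 30 word ids each. Posterior inference runs this process in reverse: given only the documents, it estimates the topics and the proportions.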
How can we cope with such data? In this talk I will describe stochastic variational inference, a general algorithm for approximating posterior distributions that are conditioned on massive data sets. Stochastic inference is easily applied to a large class of hierarchical models, including time-series models, factor models, and Bayesian nonparametric models. I will demonstrate its application to topic models fit to millions of articles. Stochastic inference opens the door to scalable Bayesian computation for modern data analysis.
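As a rough illustration of how the algorithm proceeds, the sketch below applies stochastic variational inference to a topic model, assuming latent Dirichlet allocation with symmetric Dirichlet priors. Each step samples one document, fits that document's local parameters, forms the topic parameters that would be optimal if the whole collection consisted of copies of that document, and blends them into the current estimate with a decreasing step size. The function names, the hyperparameters `alpha`, `eta`, `tau`, and `kappa`, and the hand-written `digamma` are illustrative assumptions, not code from the paper.

```python
import math
import random


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic expansion."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def expected_log_dirichlet(params):
    """E[log p] under a Dirichlet with the given parameters."""
    total = digamma(sum(params))
    return [digamma(p) - total for p in params]


def local_step(doc, elog_beta, n_topics, alpha, n_iters=20):
    """Fit one document's variational parameters (doc maps word id -> count)."""
    gamma = [1.0] * n_topics
    phi = {}
    for _ in range(n_iters):
        elog_theta = expected_log_dirichlet(gamma)
        new_gamma = [alpha] * n_topics
        for w, count in doc.items():
            logits = [elog_theta[k] + elog_beta[k][w] for k in range(n_topics)]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            norm = sum(weights)
            phi[w] = [v / norm for v in weights]
            for k in range(n_topics):
                new_gamma[k] += count * phi[w][k]
        gamma = new_gamma
    return gamma, phi


def svi_lda(docs, n_topics, vocab_size, n_steps,
            alpha=0.5, eta=0.5, tau=1.0, kappa=0.7, seed=0):
    """Return variational Dirichlet parameters lam[k][w] for the topics."""
    rng = random.Random(seed)
    # Random initialization near 1 to break the symmetry between topics.
    lam = [[rng.gammavariate(100.0, 0.01) for _ in range(vocab_size)]
           for _ in range(n_topics)]
    n_docs = len(docs)
    for t in range(n_steps):
        doc = docs[rng.randrange(n_docs)]  # subsample a single document
        elog_beta = [expected_log_dirichlet(row) for row in lam]
        _, phi = local_step(doc, elog_beta, n_topics, alpha)
        rho = (t + 1 + tau) ** (-kappa)  # decreasing step size
        for k in range(n_topics):
            for w in range(vocab_size):
                stat = doc[w] * phi[w][k] if w in doc else 0.0
                # Estimate as if the sampled document were seen n_docs times.
                lam_hat = eta + n_docs * stat
                lam[k][w] = (1.0 - rho) * lam[k][w] + rho * lam_hat
    return lam
```

Each document is passed as a dictionary from word id to count, for example `svi_lda([{0: 3, 1: 2}, {2: 4, 3: 1}], n_topics=2, vocab_size=4, n_steps=100)`. Because every step touches only one document, the cost per step does not grow with the size of the collection, which is what makes the approach practical for massive data sets.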
This is joint work, based on the following paper:
M. Hoffman, D. Blei, J. Paisley, and C. Wang. Stochastic variational inference. Journal of Machine Learning Research, 14:1303-1347, 2013.