Abstract

Modeling Science: Topic Models of Scientific Journals and Other Large Document Collections

David Blei
Princeton University
Computer Science

A surge of recent research in machine learning and statistics has developed new techniques for finding patterns of words in document collections using hierarchical probabilistic models. These models are called "topic models" because the word patterns often reflect the underlying topics that permeate the documents; however, topic models also apply naturally to data such as images and biological sequences.

After reviewing the basics of topic modeling, I will describe two related lines of research in this field, which extend the current state of the art.

First, while previous topic models have assumed that the corpus is static, many document collections actually change over time: scientific articles, emails, and search queries reflect evolving content, and it is important to model the corresponding evolution of the underlying topics. For example, an article about biology in 1885 will exhibit significantly different word frequencies than one in 2005. I will describe probabilistic models designed to capture the dynamics of topics as they evolve over time.
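
As a rough illustration of the idea of a topic that drifts over time, the sketch below lets one topic's word scores take a small Gaussian random step at each time slice and maps the scores to word probabilities with a softmax. This particular construction, the toy vocabulary, the starting scores, and the names `evolve_topic` and `drift_std` are assumptions made for illustration; the abstract does not specify the models at this level of detail.

```python
import math
import random


def softmax(values):
    """Map real-valued scores to a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_topic(initial_scores, num_steps, drift_std, rng):
    """Return one word distribution per time step for a single topic.

    Each step perturbs the previous scores with Gaussian noise, so
    adjacent time slices get similar, but not identical, distributions.
    """
    scores = list(initial_scores)
    history = [softmax(scores)]
    for _ in range(num_steps - 1):
        scores = [s + rng.gauss(0.0, drift_std) for s in scores]
        history.append(softmax(scores))
    return history


# Toy vocabulary and starting scores (made up for illustration).
vocabulary = ["cell", "gene", "species", "specimen"]
rng = random.Random(0)
topic_over_time = evolve_topic(
    [0.0, -1.0, 1.0, 1.5], num_steps=5, drift_std=0.3, rng=rng
)
for step, distribution in enumerate(topic_over_time):
    print(step, [round(p, 3) for p in distribution])
```

A smaller `drift_std` ties each time slice more tightly to the previous one; a larger value lets the word frequencies of the topic change more quickly.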

Second, previous models have assumed that the occurrences of the different latent topics are independent. In many document collections, however, the presence of one topic may be correlated with the presence of another. For example, a document about sports is more likely to also be about health than about international finance. I will describe a probabilistic topic model that can capture such correlations between the hidden topics.
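
To make the notion of correlated topics concrete, the sketch below draws per-document topic scores from a multivariate Gaussian with a hand-picked covariance and turns them into topic proportions with a softmax. This is one possible way to encode correlations, assumed here for illustration only; the topic labels, the covariance values, and the helper names (`cholesky`, `draw_document`, `correlation`) are hypothetical and are not taken from the abstract.

```python
import math
import random

TOPICS = ["sports", "health", "finance"]  # hypothetical topic labels

# Assumed covariance of the topic scores: sports and health move together,
# finance moves weakly against both. These numbers are made up.
COVARIANCE = [
    [1.0, 0.8, -0.3],
    [0.8, 1.0, -0.3],
    [-0.3, -0.3, 1.0],
]


def cholesky(matrix):
    """Lower-triangular factor L of a positive-definite matrix (L L^T = matrix)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def softmax(values):
    """Map real-valued scores to a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def draw_document(mean, lower, rng):
    """Draw correlated topic scores and the resulting topic proportions."""
    noise = [rng.gauss(0.0, 1.0) for _ in mean]
    scores = [
        m + sum(lower[i][k] * noise[k] for k in range(i + 1))
        for i, m in enumerate(mean)
    ]
    return scores, softmax(scores)


def correlation(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
lower = cholesky(COVARIANCE)
documents = [draw_document([0.0, 0.0, 0.0], lower, rng) for _ in range(2000)]

sports = [scores[0] for scores, _ in documents]
health = [scores[1] for scores, _ in documents]
finance = [scores[2] for scores, _ in documents]

sports_health = correlation(sports, health)
sports_finance = correlation(sports, finance)
print("sports/health score correlation:", round(sports_health, 2))
print("sports/finance score correlation:", round(sports_finance, 2))
```

Under these assumed settings, documents whose sports score is high tend to have a high health score as well, which an independence assumption across topics cannot express.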

In addition to giving quantitative, predictive models of a corpus, topic models provide a qualitative window into the structure of a large document collection. This perspective allows a user to explore a corpus in a topic-guided fashion. We demonstrate the capabilities of these new models on the archives of the journal Science, founded in 1880 by Thomas Edison. Our models are built on the noisy text from JSTOR, an online scholarly journal archive; this text is the output of an optical character recognition engine run over the original bound journals.

(joint work with J. Lafferty)
