Abstract

Document Representations for Topic-Adaptation in Statistical Language Modeling

Sanjeev Khudanpur
Johns Hopkins University
Electrical and Computer Engineering

Statistical language models assign probabilities to word sequences, with the goal of giving high probability to grammatical, linguistically plausible, or otherwise likely sequences. They are widely used in human language technologies such as automatic speech recognition, statistical machine translation, information retrieval, and spelling correction. An important consideration in such models is the topic of the discourse or document, since it significantly affects the probability of content-bearing words (as opposed to grammatical "function words"). In this presentation, we will discuss document representations used to capture such statistical dependencies, including vector space models and latent semantic analysis; metrics in document space, such as cosine similarity; and mechanisms for incorporating document-level conditioning in a statistical language model, including linear or log-linear interpolation and maximum entropy. Empirical results from our own past work will be presented to motivate and illustrate some of the ideas developed by us and by others in the field.
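As a rough illustration of two of the ingredients named above, the sketch below computes cosine similarity between bag-of-words document vectors and linearly interpolates a topic-specific unigram model with a background model. It is a toy example under stated assumptions, not the method presented in the talk: the function names, the add-alpha smoothing, the whitespace tokenization, and the restriction to unigrams are all choices made here for brevity.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to a sparse term-count vector (hypothetical helper).

    Tokenization is a plain lowercase whitespace split, chosen for brevity.
    """
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors.

    Returns 0.0 if either vector is empty.
    """
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def unigram_model(counts, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary.

    `counts` must be a Counter so that unseen words count as zero.
    The smoothing scheme is an assumption of this sketch.
    """
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def interpolate(p_topic, p_background, lam):
    """Linear interpolation: lam * p_topic(w) + (1 - lam) * p_background(w).

    Both models must be defined over the same vocabulary.
    """
    return {w: lam * p_topic[w] + (1.0 - lam) * p_background[w]
            for w in p_background}
```

In use, one might rank training documents by `cosine_similarity` to the document being processed, estimate `p_topic` from the closest ones, and mix it with a background model using a weight `lam` tuned on held-out data. Because both inputs are proper distributions over the same vocabulary, the interpolated model is one as well.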

Biosketch:
Sanjeev Khudanpur is an Assistant Professor in the Department of Electrical & Computer Engineering and a member of the Center for Language and Speech Processing at Johns Hopkins University. He obtained a B.Tech. from the Indian Institute of Technology, Bombay, in 1988, and a Ph.D. from the University of Maryland, College Park, in 1997, both in Electrical Engineering. His research applies information-theoretic and statistical methods to problems in human language technology, including automatic speech recognition, machine translation, and information retrieval. He is particularly interested in maximum entropy and related techniques for model estimation from sparse data.
