Document Space

January 23 - 27, 2006

Organizing Committee:

Damianos Karakos (Johns Hopkins University, Center for Language and Speech Processing)
Mauro Maggioni (Yale University, Mathematics/Program in Applied Mathematics)
David Marchette (Naval Surface Warfare Center)
Carey Priebe (Johns Hopkins University, Center for Imaging Science/Applied Mathematics and Statistics)

Scientific Background

Processing and managing the ever-increasing amount of spoken and written information is a major challenge for statisticians, computer scientists, engineers and linguists; the situation is further aggravated by the explosive growth of the web, the largest known electronic document collection.

There is a pressing need for high-accuracy Information Retrieval (IR) systems, Speech Recognition systems, and "smart" Natural Language Processing (NLP) systems. For tackling many problems in these fields, most approaches rely on:

    • Well-established statistical techniques, sometimes borrowed from the analysis of numerical data,
    • Fast, ad hoc techniques that appear to work "well" but are not grounded in a solid understanding of how language is structured, and
    • High-complexity algorithms from Computational Linguistics that exploit the syntactic structure of language but which do not scale well with the amount of information that needs to be processed in emerging applications.

This workshop on Document Space aims to bring together researchers in Mathematics, Statistics, Electrical Engineering, Computer Science and Linguistics; the hope is that a unified theory describing "document space" will emerge and become the vehicle for developing algorithms that tackle the challenges mentioned above efficiently, in terms of both accuracy and computational complexity.

Text documents are sequences of words, usually with high syntactic structure, where the number of distinct words per document ranges from a few hundred to a few thousand. Much effort has been devoted to finding (e.g., through statistical means) useful low-dimensional representations of these inherently high-dimensional documents that would facilitate NLP tasks such as document categorization, question answering, machine translation, unstructured information management, etc. Moreover, many of these tasks can be formulated as problems of clustering, outlier detection, and statistical modeling. Many important questions arise:

    • What is the best way to perform dimensionality reduction? The fact that documents can have diverse features in terms of vocabulary, genre, style, etc., makes the mapping into a common space very challenging. Is there a single best metric for measuring similarity between documents? Documents can be similar in many ways (in terms of content, style, etc.); how do different vector representations facilitate different similarity judgments?
    • How can the semantics of each word be incorporated into the analysis and representation? For example, there are many cases where related documents share very few common words (e.g., due to synonymy). On the other hand, documents with high vocabulary overlap are not necessarily on the same topic.
    • It has been argued that sub-corpus dependent feature extraction (that is, document feature computation that depends on collective features of a subset of the corpus) yields far better retrieval results than feature extraction that depends only on each document independently. Hence, efficient representation of documents in a common space becomes a "hard" problem: in principle, one would have to consider all possible subsets of a corpus in order to find the one that yields the best feature selection.
    • There is a natural duality between the symbolic and stochastic approaches in NLP, which has been exploited in order to organize document corpora. Symbolic information can be used to define coordinates and/or similarities between documents, and conversely the stochastic approach can lead to the definition of symbolic information. As above, this correspondence is relative to different subsets, of both documents and symbols, and organizing and fully exploiting it, with efficient algorithms, is challenging.
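
The vocabulary-overlap issue raised in the questions above can be illustrated with a minimal sketch. It assumes the simplest possible vector representation (raw word counts) and cosine similarity as the metric; neither is prescribed by the workshop, and the four toy documents and all function names are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to its word counts (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents (assumptions, not taken from any corpus).
# Same topic, but no shared words because of synonymy:
doc_car = bag_of_words("the car needs fuel")
doc_auto = bag_of_words("an automobile requires gasoline")
# Different senses of "bank", but high vocabulary overlap:
doc_bank_river = bag_of_words("the bank of the river")
doc_bank_city = bag_of_words("the bank of the city")

sim_synonyms = cosine_similarity(doc_car, doc_auto)
sim_overlap = cosine_similarity(doc_bank_river, doc_bank_city)
```

Under these assumptions, `sim_synonyms` is 0.0 even though the two documents are about the same thing, while `sim_overlap` is 6/7 (about 0.86) even though the shared words say little about topic. A representation that incorporates word semantics would be expected to narrow this gap.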

We expect that this workshop will lead the way toward well-justified answers (in terms of theory and experimental results) to the questions above, and, hopefully, contribute to a better understanding of the rich medium of language.

Speakers

Michael Berry (University of Tennessee)
David Blei (Princeton University)
Eugene Charniak (Brown University)
Ronald Coifman (Yale University)
John Conroy (IDA Center for Computing Sciences)
Nello Cristianini (UC Davis)
Jason Eisner (Johns Hopkins University)
Djoerd Hiemstra (Universiteit Twente)
David Horn (Tel Aviv University)
Piotr Indyk (Massachusetts Institute of Technology)
Frederick Jelinek (Johns Hopkins University)
Peter Jones (Yale University)
Damianos Karakos (Johns Hopkins University)
Sanjeev Khudanpur (Johns Hopkins University)
John Lafferty (Carnegie Mellon University)
Stephane Lafon (Google Inc.)
Mauro Maggioni (Yale University)
Michael Mahoney (Yahoo! Research)
David Marchette (Naval Surface Warfare Center)
Carey Priebe (Johns Hopkins University)
Andrew Tomkins (Yahoo! Research)
Michael Trosset (College of William and Mary)

Contact Us:

Institute for Pure and Applied Mathematics (IPAM)
Attn: DS2006
460 Portola Plaza
Los Angeles CA 90095-7121
Phone: 310 825-4755
Fax: 310 825-4756
Email: ipam@ucla.edu
Website: http://www.ipam.ucla.edu/programs/ds2006/

