January 23–27, 2006

The processing and management of the ever-increasing amount of spoken and written information pose a formidable challenge for statisticians, computer scientists, engineers, and linguists; the situation is further aggravated by the explosive growth of the web, the largest known electronic document collection.

There is a pressing need for high-accuracy Information Retrieval (IR) systems, Speech Recognition systems, and “smart” Natural Language Processing (NLP) systems. Most approaches to problems in these fields rely on one of the following:

- Well-established statistical techniques, sometimes borrowed from the analysis of numerical data,
- Fast, ad hoc techniques that appear to work “well” but are not grounded in a solid understanding of how language is structured, and
- High-complexity algorithms from Computational Linguistics that exploit the syntactic structure of language but do not scale to the amount of information that must be processed in emerging applications.

This workshop on Document Space aims to bring together researchers in Mathematics, Statistics, Electrical Engineering, Computer Science, and Linguistics; the hope is that a unified theory of “document space” will emerge and serve as a vehicle for developing algorithms that meet the above challenges efficiently, in terms of both accuracy and computational complexity.

Text documents are sequences of words, usually with rich syntactic structure, where the number of distinct words per document ranges from a few hundred to a few thousand. Much effort has been devoted to finding (e.g., through statistical means) useful low-dimensional representations of these inherently high-dimensional documents, representations that would facilitate NLP tasks such as document categorization, question answering, machine translation, and unstructured information management. Moreover, many of these tasks can be formulated as problems of clustering, outlier detection, and statistical modeling. Many important questions arise:

- What is the best way to perform dimensionality reduction? The fact that documents have diverse features in terms of vocabulary, genre, style, etc., makes the mapping into a common space very challenging. Is there a single best metric for measuring similarity between documents? Documents can be similar in many ways (in content, style, etc.); how do different vector representations facilitate different similarity judgments?
- How can the semantics of each word be incorporated into the analysis and representation? For example, there are many cases where related documents share very few common words (e.g., due to synonymy). On the other hand, documents with high vocabulary overlap are not necessarily on the same topic.
- It has been argued that sub-corpus-dependent feature extraction (that is, document feature computation that depends on collective features of a subset of the corpus) yields far better retrieval results than feature computation that depends on each document independently. Hence, efficiently representing documents in a common space becomes a “hard” problem: in principle, one would have to consider all possible subsets of a corpus in order to find the one that yields the best feature selection.
- There is a natural duality between the symbolic and stochastic approaches in NLP, and it has been exploited to organize document corpora. Symbolic information can be used to define coordinates and/or similarities between documents, and conversely the stochastic approach can lead to the definition of symbolic information. As above, this correspondence is relative to different subsets of both documents and symbols, and organizing and fully exploiting it with efficient algorithms is challenging.
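The questions about vector representations and similarity metrics can be made concrete with the standard bag-of-words model and cosine similarity. The following is a minimal sketch using only the Python standard library; the toy documents and the raw term-count weighting are illustrative assumptions, not a method proposed by the workshop:

```python
# A minimal sketch of the bag-of-words vector representation and cosine
# similarity.  Toy documents and raw term counts are illustrative only.
import math
from collections import Counter

def bow(text):
    """Map a document to a sparse term-count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

d1 = bow("the car is driven on the road")
d2 = bow("the automobile is driven on the highway")  # related topic, few shared content words
d3 = bow("the court ruled on the road tax case")     # shared words, different topic

print(cosine(d1, d2))
print(cosine(d1, d3))
```

Note that the synonym pairs (car/automobile, road/highway) contribute nothing to the score of the related pair, while function-word overlap inflates the score of the unrelated pair, which is exactly the synonymy problem raised in the second bullet above.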
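The sub-corpus dependence of feature extraction can already be seen in the standard inverse document frequency (IDF) weight: the same term receives different weights depending on which subset of the corpus it is computed over. A minimal sketch, with a toy corpus and a smoothed IDF formula that are illustrative assumptions:

```python
# A minimal sketch of corpus-dependent term weighting: the IDF of a term
# changes with the sub-corpus over which it is computed.  The toy corpus
# and the smoothed IDF variant are illustrative assumptions.
import math

def idf(term, docs):
    """Smoothed inverse document frequency of `term` over the collection `docs`."""
    df = sum(1 for d in docs if term in d.lower().split())
    return math.log((1 + len(docs)) / (1 + df))

corpus = [
    "statistical methods for document retrieval",
    "syntactic structure of natural language",
    "statistical language models for speech recognition",
    "dimensionality reduction for document clustering",
]

# Over the full corpus, "statistical" appears in 2 of 4 documents ...
full_weight = idf("statistical", corpus)
# ... but over a sub-corpus of two language-oriented documents it appears in 1 of 2:
sub_weight = idf("statistical", corpus[1:3])
print(full_weight, sub_weight)
```

Since every choice of sub-corpus yields a different weighting, picking the subset that best discriminates documents is, as noted above, combinatorial in principle.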

We expect that this workshop will lead the way toward well-justified answers (in terms of theory and experimental results) to the questions above, and, hopefully, contribute to a better understanding of the rich medium of language.

Damianos Karakos
(Johns Hopkins University, Center for Language and Speech Processing)

Mauro Maggioni
(Yale University, Mathematics/Program in Applied Mathematics)

David Marchette
(Naval Surface Warfare Center)

Carey Priebe, Chair
(Johns Hopkins University, Center for Imaging Science/Applied Mathematics and Statistics)