Abstract

Trading Spaces: Measures of Document Proximity and Methods for Embedding Them

Michael Trosset
College of William and Mary
Mathematics

What is (or should be) associated with the phrase “document space”? I will argue that
the crucial concept involved in transforming a corpus of documents into a document space
is the notion of document proximity, i.e., a way of measuring the (dis)similarity of a pair of
documents.

In theory, document proximities might be obtained by direct comparison of actual documents;
more commonly, attributes of each document are quantified and then proximities
are computed from a mediating vector space model. Certain methods that might be used
for text mining, e.g., KNN classification and various methods for hierarchical agglomerative
clustering, operate directly on proximities; most methods, however, operate in the familiar
setting of an inner product space, typically a Euclidean space of relatively low dimension.
This document space is usually not the mediating vector space model. Accordingly, a second
crucial concept in the construction of a document space is the embedding of document
proximities in a Euclidean space.
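
As a rough illustration of the two steps distinguished above, the following Python sketch first computes proximities from a small vector space model and then embeds them in a low-dimensional Euclidean space. The term-count matrix, the choice of cosine dissimilarity, and the use of classical multidimensional scaling are all assumptions made for the example; the abstract does not commit to any particular proximity measure or embedding method.

```python
import numpy as np


def cosine_dissimilarity(X):
    """Step 1 (proximity): pairwise 1 - cos(x_i, x_j) between the rows of X.

    Assumes every row of X is nonzero.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / norms                      # unit-length document vectors
    D = 1.0 - U @ U.T                  # cosine dissimilarity matrix
    np.fill_diagonal(D, 0.0)           # remove rounding noise on the diagonal
    return np.clip(D, 0.0, None)       # guard against tiny negative values


def classical_mds(D, dim=2):
    """Step 2 (embedding): classical (Torgerson) scaling of dissimilarities D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared dissimilarities
    evals, evecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:dim]        # indices of the dim largest eigenvalues
    lam = np.clip(evals[order], 0.0, None)       # discard negative eigenvalues
    return evecs[:, order] * np.sqrt(lam)        # n x dim coordinate matrix


# Hypothetical mediating vector space model: 4 documents x 5 term counts.
X = np.array(
    [
        [2, 1, 0, 0, 1],
        [1, 2, 0, 0, 0],
        [0, 0, 3, 1, 0],
        [0, 0, 1, 2, 1],
    ],
    dtype=float,
)

D = cosine_dissimilarity(X)   # proximities computed from the vector space model
Y = classical_mds(D, dim=2)   # document space: one 2-D point per document
print(np.round(Y, 3))
```

Because the two functions are independent, a different proximity measure can be substituted without touching the embedding step, and a different embedding method can be applied to the same proximities.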

The intimately connected—but conceptually distinct—activities of computing proximities
and embedding proximities are often conflated. A fundamental goal of this presentation
is to decouple them. I will argue that much of the emphasis that has been placed on developing
new embedding methods might more properly be placed on developing appropriate
proximity measures, and that the latter is one of the central challenges of text mining.
In the process, I will endeavor to use the concepts of proximity and embedding to elucidate
how certain methods for statistical learning have (or have not) been utilized in text mining.
