Mathematics of Knowledge and Search Engines

September 10 - December 14, 2007

Overview

The rise of the search engine as a major tool for searches on the internet has spawned a large and growing industry that has changed modern commerce, education, and the study of scientific, financial, and social data bases. The underpinnings of these search engines are mathematical algorithms which are well adapted to large and rapid computations, mainly from linear algebra. While the impact of this industry has been enormous, there is a parallel development in the applications of these methods to other related problems concerning the extraction of knowledge from large databases. This long program at IPAM will be devoted to new mathematics and methodologies of knowledge engines: the mathematical procedures used to extract knowledge from large databases. While this includes topics related to search engines it is mainly devoted to the more general problem of finding features in a database or using defined features to search within a database. It is expected that this program will be of interest to a large number of scientific fields, including pure and applied mathematics, statistics, bioinformatics, and engineering. We also anticipate the study of applications to finance, the social sciences, and the humanities. Some of the topics to be investigated are listed below:

One set of problems consists of searching a data set that is presented as either a graph (perhaps with weighted edges) or a set of points in Euclidean space, perhaps of very high dimension. Many of the presently used algorithms for search engines use some variant of spectral graph theory, where one uses eigenfunctions of a Laplace operator (or some related object) to start searching. An exciting development is the related idea of Diffusion Geometry, where one has a given set in Euclidean space. While some of these ideas appear in a variety of contexts of data analysis, such as spectral graph theory, manifold learning, nonlinear principal components and kernel methods, one can augment these approaches by showing that the diffusion inference distances are key intrinsic geometric quantities linking spectral theory of the Markov process (Laplace operator or Kernels) to the corresponding “geometry of information”. It has recently been shown that one explanation for the robustness of such algorithms is that on domains in Euclidean space of dimension d with volume bounded by one (or manifolds with the correct assumption on curvature bounds), there are at any point in the domain d eigenfunctions for Laplacian (Dirichlet or Neumann) that not only provide local coordinates on B(x, e distance(x, boundary)) but map that ball to (approximately) a ball of radius one. In other words the eigenfunctions play a role similar to the classical Riemann Mapping Function of complex analysis. Another area is to use various PDE’s to define an initial diffusion, where the PDE could vary depending on location. Another area of interest here is the preprocessing stage, which requires recent advances (e.g. the work of Assaf Naor) on embeddings of metric spaces into Euclidean space.
The problems of numerical linear algebra that arise include computations of eigenfunctions and eigenvalues (or some analogue when one is not dealing with a Laplace operator), and squaring of matrices. An overriding theme is how to build sparse approximating matrices while respecting the appropriate features. This means that all problems are much more severe than simply approximating in the operator norm. One methodology, presently under intensive development, is to adapt multiscale techniques related to the work of Beylkin, Coifman, Rokhlin. It turns out that for many of these numerical issues a goal is to find the appropriate scales defined by a matrix, often in a coordinate system given by a nonlinear change of coordinates.
A third important area is to develop methodologies for steerable searches (“Dynamic Navigation”). In a certain sense this can be seen as the unifying goal of the entire area. This is of outsized importance in the world of commerce, where one has already seen a profusion of specialized search engines, but the next step is to allow searches which are more flexible in the sense that the goal can be changed on the fly.
An evolving area of search is developing effective similarity measures for different types of data, particularly data that is unstructured or which has a non-traditional type of structure. Some examples are images, music, video, patient databases.

Organizing Committee

Ronald Coifman (Yale University)
Yuval Kluger (New York University)
Yann LeCun (New York University)
Vladimir Rokhlin (Yale University)
Karin Verspoor (Los Alamos National Laboratory)