Semantic Hashing for Fast Document Retrieval

Ruslan Salakhutdinov
University of Toronto
Computer Science

In this talk I will first describe an efficient way of learning models with "deep" architectures, composed of multiple levels of non-linear operations. I will show that this framework can be successfully applied to a variety of machine learning problems, including nonlinear dimensionality reduction, information retrieval, building good generative models of highly structured input data, and learning good classification and regression models.

In the second half of the talk I will concentrate on the problem of document retrieval and show how to learn a deep model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the model performs "Semantic Hashing": documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. I will demonstrate that this way of extending the efficiency of hash-coding to approximate matching achieves good retrieval performance and is considerably faster than most existing information retrieval algorithms.
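The retrieval step described above can be sketched in a few lines: given a document's binary address, enumerate every address within a small Hamming radius and collect the documents stored there. This is a minimal illustration, not the talk's actual system; the binary codes would come from the deep model, whereas here the memory table and 4-bit codes are made-up toy data.

```python
from itertools import combinations

def hamming_ball(address: int, n_bits: int, radius: int):
    """Yield every n_bits-wide address within `radius` bit flips of `address`."""
    yield address
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = address
            for b in bits:
                flipped ^= 1 << b  # flip one chosen bit
            yield flipped

# Toy memory: address -> document ids. In Semantic Hashing these addresses
# would be the binary codes inferred by the deepest layer of the model.
memory = {0b0011: ["doc_a"], 0b0111: ["doc_b"], 0b1100: ["doc_c"]}

def retrieve(query_address: int, n_bits: int = 4, radius: int = 1):
    """Collect documents stored at addresses differing by at most `radius` bits."""
    hits = []
    for addr in hamming_ball(query_address, n_bits, radius):
        hits.extend(memory.get(addr, []))
    return hits
```

For a query hashed to `0b0011`, `retrieve` returns `doc_a` (exact match) and `doc_b` (one bit away), without ever comparing the query against the full document collection; the cost depends only on the code length and radius, not on corpus size.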

Joint work with Geoffrey Hinton

Presentation (PDF File)

Back to Workshops II: Numerical Tools and Fast Algorithms for Massive Data Mining, Search Engines and Applications