Recent Advances in Minhashing

Mark Manasse
Microsoft Research

We have been revisiting some of the fundamental ideas in locality-sensitive hashing, looking for ways to combine the better aspects of minhashing and of random hyperplane projection, in light of Monika Rauch-Henzinger's work last year comparing the two.

The principal differences between the two have been that

1) Minhashing measures a scaled L1 distance, while projection (for unit vectors) measures a scaled L2 distance, and

2) Minhashing mostly ignores multiplicity.
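To make the first point concrete, here is a minimal sketch (my illustration, not code from the talk) of the two estimators, using the familiar collision probabilities: two sets share a minimum under a random hash function with probability equal to their Jaccard resemblance, while two vectors fall on the same side of a random hyperplane with probability 1 - θ/π.

```python
import random

random.seed(0)

# --- Minhashing: collision probability equals the Jaccard resemblance ---
P = (1 << 61) - 1  # a Mersenne prime, for random linear hash functions
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(2000)]

def minhash_agreement(s1, s2):
    """Fraction of random hash functions under which s1 and s2 attain the
    same minimum -- an estimator of |s1 & s2| / |s1 | s2|."""
    agree = 0
    for a, b in HASHES:
        m1 = min((a * x + b) % P for x in s1)
        m2 = min((a * x + b) % P for x in s2)
        agree += (m1 == m2)
    return agree / len(HASHES)

A, B = set(range(60)), set(range(30, 90))
true_jaccard = len(A & B) / len(A | B)  # 30 / 90 = 1/3
est_jaccard = minhash_agreement(A, B)   # close to 1/3

# --- Random hyperplane projection: collision probability is 1 - angle/pi ---
def hyperplane_agreement(u, v, trials=2000):
    """Fraction of random hyperplanes placing u and v on the same side."""
    agree = 0
    for _ in range(trials):
        r = [random.gauss(0, 1) for _ in u]
        same = (sum(a * b for a, b in zip(r, u)) >= 0) == \
               (sum(a * b for a, b in zip(r, v)) >= 0)
        agree += same
    return agree / trials

# orthogonal unit vectors: angle pi/2, so agreement should be near 1/2
est_ortho = hyperplane_agreement((1.0, 0.0), (0.0, 1.0))
```

The first estimator is governed by set overlap (an L1-style quantity on indicator vectors), the second by the angle between the vectors, which for unit vectors is monotone in their L2 distance.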

The last point isn't quite true: we have known for years how to use multiplicity, at a performance cost roughly linear in the multiplicity (ignoring the cost of the sorting step needed to compute multiplicities).
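The linear-cost scheme can be sketched as follows (my reconstruction of the replication trick, not the authors' code): tag each copy of an element with its index, so that a multiset becomes an ordinary set whose plain Jaccard resemblance equals the weighted resemblance of the original multisets.

```python
def expand(multiset):
    """Replace an element of multiplicity m with m distinct tagged copies.
    Cost is linear in the multiplicity."""
    return {(e, i) for e, m in multiset.items() for i in range(m)}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

def weighted_jaccard(m1, m2):
    """sum of min multiplicities over sum of max multiplicities."""
    keys = set(m1) | set(m2)
    num = sum(min(m1.get(e, 0), m2.get(e, 0)) for e in keys)
    den = sum(max(m1.get(e, 0), m2.get(e, 0)) for e in keys)
    return num / den

A = {"the": 3, "cat": 1, "sat": 2}
B = {"the": 1, "cat": 2, "mat": 2}
# plain Jaccard of the expanded sets equals the weighted Jaccard: 2/9
same = jaccard(expand(A), expand(B)) == weighted_jaccard(A, B)
```

Ordinary minhashing applied to the expanded sets then estimates the weighted resemblance, but at a cost proportional to the total multiplicity.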

In the last year, great advances have been made in applying multiplicity more cheaply. First, Gollapudi and Panigrahy showed that arbitrary multiplicities greater than or equal to one can be applied in time logarithmic in the multiplicity. More recently, we have shown how to apply arbitrary non-negative floating-point weights in time independent of the weight.

Building on that, we applied one of these techniques back to unweighted minhashing, and found an expected four-fold reduction in the consumption of pseudorandomness compared to the best previous work. Practical implementations (developed with the assistance of Peter Montgomery) led us to some clever hacks exploiting the primitivity of 2 mod 67, and to the rediscovery of work by Leiserson, Prokop, and Randall applying de Bruijn sequences to index a 1-bit in a computer word.
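Both hacks mentioned above are standard bit tricks for locating the lowest set bit of a word in constant time; a sketch of each (my illustration of the underlying ideas, not the implementation from the talk):

```python
# First isolate the lowest set bit with x & -x (two's complement).

# (1) Since 2 is a primitive root mod 67, the powers 2^0 .. 2^63 are all
# distinct mod 67, so a small table indexed by (x & -x) % 67 recovers the
# bit position for 64-bit words.
MOD67 = {pow(2, k, 67): k for k in range(64)}

def lowest_bit_mod67(x):
    """Index of the lowest set bit of a nonzero 64-bit integer."""
    return MOD67[(x & -x) % 67]

# (2) The de Bruijn method (Leiserson, Prokop, and Randall): multiplying an
# isolated bit 2^i by a de Bruijn constant slides a distinct 5-bit window
# into the top bits of a 32-bit word, which indexes a lookup table.
DEBRUIJN = 0x077CB531  # a binary de Bruijn sequence B(2, 5)
TABLE = [0] * 32
for i in range(32):
    TABLE[((DEBRUIJN << i) & 0xFFFFFFFF) >> 27] = i

def lowest_bit_debruijn(x):
    """Index of the lowest set bit of a nonzero 32-bit integer."""
    return TABLE[(((x & -x) * DEBRUIJN) & 0xFFFFFFFF) >> 27]
```

The mod-67 version trades a division for table space covering 64-bit words; the de Bruijn version uses only a multiply and shift, which is why it tends to win on hardware where integer division is slow.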

Mark Manasse, Frank McSherry, Kunal Talwar

Microsoft Research


Back to Workshop I: Dynamic Searches and Knowledge Building