AI Seminar: The Power of Theory in the Practice of Hashing with Focus on Similarity Estimation


Mikkel Thorup, Professor, Algorithms & Complexity Section, Department of Computer Science, University of Copenhagen.


Hash functions have become ubiquitous tools in modern data analysis, e.g., the construction of small randomized sketches of large data streams. We like to think of abstract hash functions, assigning independent uniformly random hash values to keys, but in practice, we have to choose a hash function that only has an element of randomness, e.g., 2-independence. While this works for sufficiently random input, the real world has structured data where such simple hash functions fail, calling for the need of more powerful hash functions. In this talk, we focus hashing for set similarity, which is an integral component in the analysis of Big Data. The basic idea is to use the same hash function to do coordinated sampling from different sets. Depending on the context, we want subsets sampled without replacement, or fixed-length vectors of samples that may be with replacement. The latter is used as input to support vector machines (SVMs) and locality sensitive hashing (LSH). The most efficient constructions require very powerful hash functions that are also needed for efficient size estimation. We discuss the interplay between the hash functions and the algorithms using them. Finally, we present experiments on both real and synthetic data where standard 2-independent hashing yield systematically poor similarity estimates, while the right theoretical choice is sharply concentrated, and faster than standard cryptographic hash functions with no proven guarantees.