Have you wondered how Google and Pinterest can find visually similar images so quickly? How apps like Shazam can identify the song stuck in your head from just a few hums?
This post is intended to provide a quick overview of the multi-indexing approach to fuzzy lookup, as well as some practical tips on how to optimize it.
Hashing
All the above services use some sort of hashing to reduce a complicated input into a much smaller amount of data that will be used for lookup. Usually this hash is encoded as binary, where each bit is meaningful. For instance, a simple audio hashing function might record whether the dominant frequency decreased (0) or increased (1) at each second, then concatenate all the 0’s and 1’s.
The hashing function should express what data you care about (the song being hummed) and ignore everything else (who is humming it and how well). However, a different problem like speaker identification would need a totally different hashing function – one that expresses features about the person humming and not what they are humming. If your hashing function does not suit the problem, no fuzzy lookup method can save you.
Multi-indexing
Once you have a good binary hash with n bits, you might start by creating a big lookup table from hash to result. In this example, we use n=8 for simplicity, but in reality, most datasets require using at least 32 bits, or else there will be too many hash collisions. In this table and the rest of the blog post, columns that you can efficiently query by are in bold.
This works great as long as we don’t need fuzzy matching. However, if we want hashes that are off by one bit like 10111101 to match A as well, we need to start making more queries. Here, our only option is brute force by making 9 queries: the original 10111101 as well as 00111101, 11111101, etc. – each one with one of the 8 bits toggled. If we also want hashes that are off by 2 bits to match (a Hamming distance of 2), we would need to make every possible query with up to 2 bits toggled – that’s 37 lookups to get one result!
This gets out of hand quickly as n and the Hamming distance k increase. Multi-indexing addresses this by splitting the hash into (k + 1) pieces. For example, with a Hamming distance of k = 1, the table above would become:
This allows you look up each partial hash individually. Now, we can take the two halves of our query, 1011 and 1101, and query them against indices 0 and 1, respectively, yielding result A and its complete hash. Any hash within distance k is guaranteed to return a match since at least one of its (k + 1) pieces is an exact match. This requires only (k + 1) queries, a huge improvement over the nearly-exponential number of queries the brute force approach required!
Once you have a list of results, there are two steps left to do:
- Filter out the duplicates. If your query were exactly A’s hash, you would have received two copies of A in your results – one for each index
- Filter out hashes that are outside the lookup distance. Multi-indexing guarantees that you get all results within the lookup distance, but you might get a few more. Check that the Hamming distance between the query and each result’s hash is small enough
Getting the most out of your indices
There are two main challenges you may encounter once you’ve got a multi-index table working.
1. My queries return so many results that I’m encountering performance issues, but almost all of those results are outside the fuzzy Hamming distance!
The first option you should consider is decreasing the number of indices, thereby increasing the number of bits per partial hash. For instance, if you have 230 results hashed and are using indices of 20 bits each, you should expect at least 210 spurious results per partial hash lookup. Using indices of 30 or more bits could bring that down to a much more manageable number.
It is also entirely possible that you get too many spurious results even though your indices are of a healthy size. This means that the entropy of each index is lower than it should be, either because the bits have low variance (too frequently 0 or too frequently 1) or because bits are highly correlated.
You can start to address the correlation between bits by choosing to split your hash into partial hashes better. For instance, if consecutive bits are correlated, you might choose one partial hash to be even bits and the other to be odd bits.
For complicated hashing functions, it is worth a data exploration to see how you can optimally assign bits to partial hashes. By using a greedy algorithm to swap bits between indices, we were able to improve lookup speed of one of our tables by roughly a factor of 5. This table used 256 bits with 8 indices. We originally broke the hash into consecutive 32-bit sequences before realizing that consecutive bits were often highly correlated.
This hash had high correlation (red) between sequential bits, which one can see in its correlation matrix (left). This meant that randomly shuffling bits greatly reduced correlations (center). A further optimization avoiding placing highly correlated bits in the same index (right).
Close-ups of the covariance matrix for the first partial hash show a huge decrease in the number of highly correlated bits.
If the entropy of the entire hash is fundamentally too low, you should focus on improving your hashing function.
If the above options don’t apply or suffice, you should rethink your design to minimize the number of queries you need to do. For instance, when hashing video frames, we were able to record a single row for a sequence of consecutive frames with identical hashes rather than a row per frame, reducing the number of repeats, especially on more frequent hashes. A more general approach that further reduces spurious results is to pair your multi-index table with an exact lookup table and store only unique hashes in the multi-index table:
In this way, you can get a smaller number of rows at first, then filter down to only the hashes within the right Hamming distance before querying the full results. This can greatly reduce the number of rows you receive, but makes the query a 2-step process.
2. My fuzzy lookup misses some things I want to match together, but already gets false positives for things I don’t want to match together!
The heart of the problem is that some matches have too large a Hamming distance, and that some non-matches have too small a Hamming distance. The only way to address this through multi-indexing is by removing or duplicating bits.
You may want to remove bits for two possible reasons:
- The bit represents information you don’t care about for this lookup problem
- The bit is redundant or highly correlated with another bit
By removing such a bit, you stop penalizing potential matches for disagreeing on a superfluous piece of data. If removing these bits is successful in bringing your true positive rate up, you may trade those gains for a reduction in false positives by reducing the fuzzy match distance.
In some circumstances, you may want to duplicate a bit by putting it in multiple indices. This is useful if one bit of the hashing function is especially important. If the bit is absolutely essential for a match, you should put it in each index.
If these approaches don’t work, your only hope is to improve your hashing function.
The moral of the story:
- Choose indices of appropriate size; probably log2(m) or slightly greater, where m is the number of results you expect to hash
- Split your hash in a way such that each partial hash gets bits that are as uncorrelated as possible
- Avoid hashing redundant results, possibly by pairing your multi-index table with an exact lookup table
- Remove redundant or uninformative bits from your partial hashes
- Duplicate essential bits across all partial hashes
- Reconsider your hashing function