
Learning Hash Codes via Hamming Distance Targets

We achieved major recall and performance boosts against state-of-the-art methods for content-based image retrieval and approximate nearest neighbors tasks.

We recently submitted our paper Learning Hash Codes via Hamming Distance Targets to arXiv. This was a revamp and generalization of our previous work, CHASM (Convolutional Hashing for Automated Scene Matching). We achieved major recall and performance boosts against state-of-the-art methods for content-based image retrieval and approximate nearest neighbors tasks. Our method can train any differentiable model to hash for similarity search.

Similarity search with binary hash codes

Let’s start with everyone’s favorite example: ImageNet. A common information retrieval task selects 100 ImageNet classes and requires hashing “query” and “dataset” images to compare against each other. Methods seek to maximize the mean average precision (MAP) of the top 1000 dataset results by hash distance, such that most of the 1000 nearest dataset images to each query image come from the same ImageNet class.
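To make the metric concrete, here is a minimal NumPy sketch of MAP over the top 1000 results ranked by hash distance. The array names and the stable tie-breaking are illustrative assumptions, not the benchmark's reference implementation.

```python
import numpy as np

def average_precision_at_k(query_label, ranked_labels, k=1000):
    """Average precision over the top-k dataset results for one query."""
    ranked_labels = np.asarray(ranked_labels)[:k]
    relevant = ranked_labels == query_label
    if not relevant.any():
        return 0.0
    precision = np.cumsum(relevant) / np.arange(1, len(ranked_labels) + 1)
    return float(precision[relevant].mean())

def mean_average_precision(query_labels, hash_distances, dataset_labels, k=1000):
    """MAP@k: rank the dataset by hash distance for each query, then average the APs."""
    aps = []
    for label, dists in zip(query_labels, hash_distances):
        order = np.argsort(dists, kind="stable")  # nearest dataset hashes first
        aps.append(average_precision_at_k(label, np.asarray(dataset_labels)[order], k))
    return float(np.mean(aps))
```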

This is an interesting challenge because training requires a differentiable loss, whereas the final hash is discrete. Trained models must either binarize their last layer into 0s and 1s (usually by taking its sign) or, like FaceNet, pair up with a nearest-neighbors method such as k-d trees or Jegou et al.'s Product Quantization for Nearest Neighbor Search.
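For the hashing route, the step from a trained last layer to a hash code really is just taking signs. A quick sketch, where the 16-bit size and random stand-in embeddings are only for illustration:

```python
import numpy as np

def binarize(embedding):
    """Turn a real-valued last layer into a 0/1 hash code by taking its sign."""
    return (np.asarray(embedding) > 0).astype(np.uint8)

def hamming_distance(hash_a, hash_b):
    """Number of bits on which two hash codes differ."""
    return int(np.count_nonzero(hash_a != hash_b))

# Stand-ins for the last-layer outputs of a trained model on two images.
query = np.random.randn(16)
candidate = np.random.randn(16)
print(hamming_distance(binarize(query), binarize(candidate)))
```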

Insight 1: It’s not a classification task

While information retrieval on ImageNet is reminiscent of classification, its optimization goal is actually quite different. Every image retrieval paper we looked at implicitly treated similarity search as if it were a classification task.

Some papers make this assumption by using cross-entropy terms, asserting that the probability that two images with last layers $\mathbf{y_1}$ and $\mathbf{y_2}$ are similar is something like

$$\frac{1}{1 + e^{-\alpha \, (\mathbf{y_1} \cdot \mathbf{y_2})}}$$

The issue here is that the model uses hashes at inference time, not the asserted probabilities. An example of this is Cao et al.’s HashNet: Deep Learning to Hash by Continuation.
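In code, that style of loss looks roughly like the sketch below. This is a schematic PyTorch version, not HashNet's exact formulation; `alpha` and the pairing interface are placeholders.

```python
import torch
import torch.nn.functional as F

def similarity_cross_entropy(y1, y2, similar, alpha=1.0):
    """Cross-entropy loss asserting P(similar) = sigmoid(alpha * <y1, y2>).

    y1, y2:  (batch, bits) last-layer embeddings for paired images.
    similar: (batch,) float tensor, 1.0 for similar pairs, 0.0 for dissimilar pairs.
    """
    logits = alpha * (y1 * y2).sum(dim=1)
    return F.binary_cross_entropy_with_logits(logits, similar)
```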

Other papers make this assumption by simply training a classification model with an encoding layer, then hoping that the binarized encoding is a good hash. The flaw here is that, while the floating-point encoding contains all information used to classify the image, its binarized version might not make a good hash. Bits may be highly imbalanced, and there is no guarantee that binarizing the encoding preserves much of the information. An example of this is Lin et al.’s Deep Learning of Binary Hash Codes for Fast Image Retrieval.

Finally, a few papers make this assumption by first choosing a target hash for each class, then trying to minimize the distance between each image and its class’s target hash. This is actually a pretty good idea for ImageNet, but leaves something to be desired: it only works naturally for classification, rather than more general similarity search tasks, where similarity can be non-transitive and asymmetric. An example of this is Lu et al.’s Deep Binary Representation for Efficient Image Retrieval. This seems to be the second best performing method after ours.

We instead choose a loss function that easily extends to non-transitive, asymmetric similarity search tasks without training a classification model. I’ll elaborate on this in the next section.

Insight 2: There is a natural way to compare floating-point embeddings to binarized hashes

Previous papers have tried to wrestle floating-point embeddings into binarized hashes through a variety of means. Some add “binarization” loss terms, punishing the model for creating embeddings that are far from -1 or 1. Others learn “by continuation”, producing an embedding by passing its inputs through a tanh function that sharpens during training. The result, which these papers tout, is that their floating-point embeddings always lie close to $\pm 1$. They do this to make Euclidean distance correspond more closely to Hamming distance (the number of bits that differ): if your embedding consists only of -1s and 1s, then Hamming distance is simply a quarter of the squared Euclidean distance.
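Roughly, those two tricks look like this in code. This is only a sketch; the exact penalty and the schedule for `beta` vary from paper to paper.

```python
import torch

def binarization_penalty(y):
    """One common flavor: punish embedding components for straying from ±1."""
    return ((y.abs() - 1.0) ** 2).mean()

def continuation_activation(y, beta):
    """Learning 'by continuation': tanh(beta * y), with beta increased over
    training so the activation gradually approaches sign(y)."""
    return torch.tanh(beta * y)
```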

However, this is actually a disaster for training. First of all, forcing all outputs to $\pm 1$ does not even change their binarized values; 0.5 and 0.999 both binarize to 1. Also, it shrinks gradients for embedding components near $\pm 1$, forcing the model to learn from an ever-shrinking gray area of remaining values near 0.

We resolve this by avoiding the contrived usage of Euclidean distance altogether. Instead, we use a sound statistical model for Hamming distance based on embeddings, making two approximations that turn out to be very accurate. First, we assume that the model's last layer produces embeddings that consist of independent random unit normals (which we encourage with a gentle batch normalization). If we pick a random embedding $\mathbf{y_i}$, this implies that $\mathbf{z_i} = \mathbf{y_i} / \|\mathbf{y_i}\|_2$ is a random point on the unit hypersphere. We can then simply evaluate the angle $\theta = \arccos(\mathbf{z_i} \cdot \mathbf{z_j})$ between $\mathbf{z_i}$ and $\mathbf{z_j}$.

The probability that such vectors differ in sign on a particular component is $\frac{\theta}{\pi}$, so we make our second approximation: that each component differs in sign independently. This implies that the probability for the Hamming distance between $\mathbf{z_i}$ and $\mathbf{z_j}$ to be $d$ follows a binomial distribution:

$$P(\text{Hamming distance}=d) = B\left(d \;\middle|\; n, \frac{\theta}{\pi}\right)$$

where $n$ is the number of bits in the hash.
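To get a feel for how tight those two approximations are, here is a small self-contained check. The 16-bit size, the 0.8 correlation level used to fake "similar" pairs, and the sample count are arbitrary choices for the demo, not values from the paper.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n_bits, trials = 16, 50_000

# Approximation 1: last-layer embeddings are independent unit normals.
y_i = rng.standard_normal((trials, n_bits))
# Make a correlated partner so the pairs behave like similar inputs.
y_j = 0.8 * y_i + 0.6 * rng.standard_normal((trials, n_bits))

# Model: given the angle theta, Hamming distance ~ Binomial(n, theta / pi).
z_i = y_i / np.linalg.norm(y_i, axis=1, keepdims=True)
z_j = y_j / np.linalg.norm(y_j, axis=1, keepdims=True)
theta = np.arccos(np.clip((z_i * z_j).sum(axis=1), -1.0, 1.0))
predicted = binom.pmf(np.arange(n_bits + 1)[:, None], n_bits, theta / np.pi).mean(axis=1)

# Reality: count actual sign disagreements between the raw embeddings.
actual = np.bincount((np.sign(y_i) != np.sign(y_j)).sum(axis=1), minlength=n_bits + 1) / trials

print(np.round(predicted, 3))  # the two distributions should be close
print(np.round(actual, 3))
```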

This allows us to use a very accurate loss function for the true optimization goal: the chance for an input to be within a target Hamming distance of the inputs it is similar to (and outside that distance from dissimilar inputs). Using the natural geometry of the Hamming embedding, we achieve far better results than previous work.
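As a rough illustration, a loss in that spirit can be written directly from the binomial model. This is a simplified sketch rather than the exact loss from our paper; the target distance and the cross-entropy-style penalty are placeholders.

```python
import math
import torch

def hamming_target_loss(z_i, z_j, similar, n_bits, target_d):
    """Loss on the modeled probability of landing within `target_d` bits.

    z_i, z_j: L2-normalized embeddings, shape (batch, n_bits).
    similar:  float tensor, 1.0 for similar pairs and 0.0 for dissimilar ones.
    """
    theta = torch.acos((z_i * z_j).sum(dim=1).clamp(-1 + 1e-6, 1 - 1e-6))
    p_bit = theta / math.pi                        # P(any given bit differs)
    d = torch.arange(target_d + 1, dtype=z_i.dtype)
    # log Binomial(d | n_bits, p_bit); summing over d <= target_d gives the CDF.
    log_pmf = (
        torch.lgamma(torch.tensor(n_bits + 1.0))
        - torch.lgamma(d + 1)
        - torch.lgamma(n_bits - d + 1)
        + d[None, :] * torch.log(p_bit[:, None])
        + (n_bits - d[None, :]) * torch.log1p(-p_bit[:, None])
    )
    p_within = log_pmf.exp().sum(dim=1).clamp(1e-6, 1 - 1e-6)
    # Push similar pairs inside the target Hamming distance, dissimilar pairs outside.
    return -(similar * p_within.log() + (1 - similar) * (1 - p_within).log()).mean()
```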

Insight 3: It’s important to structure your training batches right

Imagine this: you train your model to hash, passing in 32 pairs of random images from your training set, and averaging the 32 pairwise loss terms. Since you have 100 ImageNet classes, each batch consists of 31.68 dissimilar pairs and 0.32 similar pairs on average. How accurate is the gradient?

The bottleneck is learning about similar images. If the random error contributed by each similar pair is $\sigma$, the expected random error in each batch is $\sigma / \sqrt{0.32}$, even greater than $\sigma$. This is a tremendous amount of noise, making it incredibly slow for the model to learn anything.

We can first improve by comparing every image in the batch to every other image. This takes $\binom{64}{2}$ comparisons, which will be only slightly slower than the original 32 comparisons since we still only need to run the model once on each input. This gives us 2016 pairwise comparisons, with 1995.84 dissimilar pairs and 20.16 similar pairs on average. Our random error is now somewhere between $\sigma/\sqrt{0.32}$ and $\sigma/\sqrt{20.16}$ (probably closer to the latter), a big improvement.
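A sketch of the all-pairs trick, assuming some pairwise loss term `pair_loss` that returns one value per pair (a placeholder interface, not our paper's API):

```python
import torch

def all_pairs_loss(embeddings, labels, pair_loss):
    """One forward pass per image, then every (64 choose 2) pairwise comparison."""
    n = embeddings.shape[0]
    i, j = torch.triu_indices(n, n, offset=1)      # all unordered pairs in the batch
    similar = (labels[i] == labels[j]).float()
    return pair_loss(embeddings[i], embeddings[j], similar).mean()
```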

But we can do even better by constructing each batch around random similar pairs. By first choosing 32 random images, then for each one choosing a random image it is similar to, we get 51.84 similar pairs on average, 32 of which are independent. This reduces our random error to between $\sigma/\sqrt{32}$ and $\sigma/\sqrt{51.84}$, another big improvement.

Under reasonable conditions, this improves training speed by a factor of 10.
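One simple way to build such a batch is sketched below; the `images_by_class` interface and uniform class sampling are assumptions for the sketch, not the paper's exact sampler.

```python
import random

def build_batch(images_by_class, num_anchors=32):
    """Sample 32 anchors plus one similar image each, guaranteeing at least
    32 similar pairs per 64-image batch (~52 similar pairs in expectation)."""
    classes = list(images_by_class)
    batch = []
    for _ in range(num_anchors):
        c = random.choice(classes)                 # each class needs >= 2 images
        anchor, positive = random.sample(images_by_class[c], 2)
        batch.extend([anchor, positive])
    return batch
```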

Discussion

Read our paper for the full story! We boosted the ImageNet retrieval benchmark from 73.3% to 85.3% MAP for 16-bit hashes and performed straight-up approximate nearest neighbors with 2 to 8 times fewer distance comparisons than previously state-of-the-art methods at the same recall.