
Multi-Indexing for Fuzzy Lookup

Have you wondered how Google and Pinterest can find visually similar images so quickly? How apps like Shazam can identify the song stuck in your head from just a few hums?

This post is intended to provide a quick overview of the multi-indexing approach to fuzzy lookup, as well as some practical tips on how to optimize it.

Hashing

All the above services use some sort of hashing to reduce a complicated input into a much smaller amount of data that will be used for lookup. Usually this hash is encoded as binary, where each bit is meaningful. For instance, a simple audio hashing function might record whether the dominant frequency decreased (0) or increased (1) at each second, then concatenate all the 0’s and 1’s.
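
To make that concrete, here is a toy sketch of such an audio hash (not the scheme any of these services actually use); it assumes you already have one dominant frequency per second:

def audio_hash(dominant_freqs):
    # Bit i is 1 if the dominant frequency rose from second i to second i + 1.
    bits = ['1' if b > a else '0' for a, b in zip(dominant_freqs, dominant_freqs[1:])]
    return ''.join(bits)

print(audio_hash([220, 247, 262, 247, 220]))  # '1100'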

The hashing function should express what data you care about (the song being hummed) and ignore everything else (who is humming it and how well). However, a different problem like speaker identification would need a totally different hashing function – one that expresses features about the person humming and not what they are humming. If your hashing function does not suit the problem, no fuzzy lookup method can save you.

Multi-indexing

Once you have a good binary hash of n bits, you might start by creating a big lookup table from hash to result. In this example we use n = 8 for simplicity, but in reality most datasets require at least 32 bits, or else there will be too many hash collisions. In such a table, the hash column is the only one you can query efficiently.

This works great as long as we don’t need fuzzy matching. However, if we also want hashes that are off by one bit, like 10111101, to match A, we need to start making more queries. Here, our only option is brute force: 9 queries in total – the original 10111101 plus 00111101, 11111101, and so on, each with one of the 8 bits toggled. If we also want hashes that are off by 2 bits to match (a Hamming distance of 2), we would need to make every possible query with up to 2 bits toggled – that’s 37 lookups to get one result!
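
For a sense of where those numbers come from, here is a quick sketch that enumerates every query the brute-force approach needs:

from itertools import combinations

def brute_force_queries(hash_str, k):
    # Every variant of hash_str with at most k bits toggled.
    queries = set()
    for distance in range(k + 1):
        for positions in combinations(range(len(hash_str)), distance):
            bits = list(hash_str)
            for p in positions:
                bits[p] = '1' if bits[p] == '0' else '0'
            queries.add(''.join(bits))
    return queries

print(len(brute_force_queries('10111101', 1)))  # 9
print(len(brute_force_queries('10111101', 2)))  # 37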

This gets out of hand quickly as n and the Hamming distance k increase. Multi-indexing addresses this by splitting the hash into (k + 1) pieces. For example, with a Hamming distance of k = 1, the 8-bit hash above would be split into two 4-bit partial hashes, each stored in its own index.

This lets you look up each partial hash individually. Now we can take the two halves of our query, 1011 and 1101, and query them against indices 0 and 1, respectively, yielding result A and its complete hash. Any hash within distance k is guaranteed to return a match, since at least one of its (k + 1) pieces is an exact match. This requires only (k + 1) queries, a huge improvement over the combinatorial number of queries the brute-force approach required!
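
Here is a minimal in-memory sketch of that scheme, with Python dictionaries standing in for real database indices. Since the full example table isn’t reproduced here, assume A’s hash is 10111111, one bit away from the query above:

from collections import defaultdict

def split(h, num_pieces):
    size = len(h) // num_pieces
    return [h[i * size:(i + 1) * size] for i in range(num_pieces)]

def build_indices(hash_to_result, num_pieces):
    # One dict per piece, mapping partial hash -> list of (full hash, result).
    indices = [defaultdict(list) for _ in range(num_pieces)]
    for full_hash, result in hash_to_result.items():
        for i, piece in enumerate(split(full_hash, num_pieces)):
            indices[i][piece].append((full_hash, result))
    return indices

def candidate_matches(indices, query, num_pieces):
    candidates = []
    for i, piece in enumerate(split(query, num_pieces)):
        candidates.extend(indices[i].get(piece, []))
    return candidates

indices = build_indices({'10111111': 'A'}, num_pieces=2)
print(candidate_matches(indices, '10111101', 2))  # [('10111111', 'A')]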

Once you have a list of results, there are two steps left to do:

  • Filter out the duplicates. If your query were exactly A’s hash, you would have received two copies of A in your results – one for each index.
  • Filter out hashes that are outside the lookup distance. Multi-indexing guarantees that you get all results within the lookup distance, but you may get a few extras. Check that the Hamming distance between the query and each result’s hash is small enough, as in the sketch below.
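
Continuing the toy example above, a sketch of that post-filtering step:

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def filter_candidates(candidates, query, k):
    # Deduplicate by full hash, then drop anything outside Hamming distance k.
    unique = dict(candidates)
    return [(h, r) for h, r in unique.items() if hamming_distance(query, h) <= k]

print(filter_candidates(candidate_matches(indices, '10111101', 2), '10111101', k=1))
# [('10111111', 'A')]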

Getting the most out of your indices

There are two main challenges you may encounter once you’ve got a multi-index table working.

1. My queries return so many results that I’m encountering performance issues, but almost all of those results are outside the fuzzy Hamming distance!

The first option you should consider is decreasing the number of indices, thereby increasing the number of bits per partial hash. For instance, if you have 2^30 results hashed and are using indices of 20 bits each, you should expect at least 2^10 spurious results per partial hash lookup. Using indices of 30 or more bits could bring that down to a much more manageable number.
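
As a quick back-of-the-envelope check (assuming the partial hashes are roughly uniformly distributed, which is exactly the assumption the next paragraphs question):

num_hashes = 2 ** 30                 # results stored in the table
for bits_per_index in (20, 30):
    expected = num_hashes / 2 ** bits_per_index
    print(bits_per_index, expected)  # 20 -> 1024.0 spurious rows per lookup, 30 -> 1.0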

It is also entirely possible that you get too many spurious results even though your indices are of a healthy size. This means that the entropy of each index is lower than it should be, either because the bits have low variance (too frequently 0 or too frequently 1) or because bits are highly correlated.

You can start to address correlation between bits by being smarter about how you split the hash into partial hashes. For instance, if consecutive bits are correlated, you might put the even bits in one partial hash and the odd bits in the other.
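
For example, a one-line sketch of that even/odd split for an 8-bit hash:

def split_even_odd(h):
    # Even-position bits form one partial hash, odd-position bits the other.
    return h[0::2], h[1::2]

print(split_even_odd('10111101'))  # ('1110', '0111')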

For complicated hashing functions, it is worth some data exploration to see how to assign bits to partial hashes optimally. By using a greedy algorithm to swap bits between indices, we were able to improve the lookup speed of one of our tables by roughly a factor of 5. This table used 256 bits with 8 indices. We originally broke the hash into consecutive 32-bit sequences before realizing that consecutive bits were often highly correlated.

This hash had high correlation (red) between sequential bits, as one can see in its correlation matrix (left). This meant that randomly shuffling bits greatly reduced correlations (center). A further optimization avoided placing highly correlated bits in the same index (right).

Close-ups of the covariance matrix for the first partial hash show a huge decrease in the number of highly correlated bits.
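
Here is a rough, unoptimized sketch of one way such a greedy bit reassignment could look; sample_hashes is assumed to be a 0/1 NumPy matrix with one row per hashed item:

import numpy as np

def within_index_correlation(corr, assignment, num_indices):
    # Sum of |correlation| over pairs of bits assigned to the same index.
    total = 0.0
    for idx in range(num_indices):
        bits = np.flatnonzero(assignment == idx)
        sub = np.abs(corr[np.ix_(bits, bits)])
        total += (sub.sum() - np.trace(sub)) / 2
    return total

def greedy_assign(sample_hashes, num_indices, num_passes=3):
    num_bits = sample_hashes.shape[1]
    corr = np.nan_to_num(np.corrcoef(sample_hashes, rowvar=False))
    assignment = np.arange(num_bits) % num_indices      # start with a round-robin split
    best = within_index_correlation(corr, assignment, num_indices)
    for _ in range(num_passes):
        for i in range(num_bits):
            for j in range(i + 1, num_bits):
                if assignment[i] == assignment[j]:
                    continue
                assignment[i], assignment[j] = assignment[j], assignment[i]
                cost = within_index_correlation(corr, assignment, num_indices)
                if cost < best:
                    best = cost                          # keep the swap
                else:
                    assignment[i], assignment[j] = assignment[j], assignment[i]
    return assignment                                    # assignment[b] = index for bit b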

If the entropy of the entire hash is fundamentally too low, you should focus on improving your hashing function.

If the above options don’t apply or suffice, you should rethink your design to minimize the number of queries you need to do. For instance, when hashing video frames, we were able to record a single row for a sequence of consecutive frames with identical hashes rather than a row per frame, reducing the number of repeats, especially for more frequent hashes. A more general approach that further reduces spurious results is to pair your multi-index table with an exact lookup table and store only unique hashes in the multi-index table.

In this way, you can get a smaller number of rows at first, then filter down to only the hashes within the right Hamming distance before querying the full results. This can greatly reduce the number of rows you receive, but makes the query a 2-step process.
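
A sketch of that two-step layout, again with in-memory dictionaries standing in for the two tables and reusing split and hamming_distance from the earlier sketches:

from collections import defaultdict

multi_index = [defaultdict(set) for _ in range(2)]  # partial hash -> unique full hashes
exact_table = defaultdict(list)                     # full hash -> every matching result

def insert(full_hash, result):
    exact_table[full_hash].append(result)
    for i, piece in enumerate(split(full_hash, 2)):
        multi_index[i][piece].add(full_hash)

def fuzzy_lookup(query, k):
    # Step 1: collect candidate hashes from the multi-index table.
    candidates = set()
    for i, piece in enumerate(split(query, 2)):
        candidates |= multi_index[i].get(piece, set())
    # Step 2: keep only hashes within distance k, then fetch the full results.
    return {h: exact_table[h] for h in candidates if hamming_distance(query, h) <= k}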

2. My fuzzy lookup misses some things I want to match together, but already gets false positives for things I don’t want to match together!

The heart of the problem is that some matches have too large a Hamming distance, and that some non-matches have too small a Hamming distance. The only way to address this through multi-indexing is by removing or duplicating bits.

You may want to remove bits for two possible reasons:

  • The bit represents information you don’t care about for this lookup problem
  • The bit is redundant or highly correlated with another bit

By removing such a bit, you stop penalizing potential matches for disagreeing on a superfluous piece of data. If removing these bits is successful in bringing your true positive rate up, you may trade those gains for a reduction in false positives by reducing the fuzzy match distance.

In some circumstances, you may want to duplicate a bit by putting it in multiple indices. This is useful if one bit of the hashing function is especially important. If the bit is absolutely essential for a match, you should put it in each index.
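
As a tiny sketch of that last idea, copying one essential bit into every partial hash:

def split_with_essential_bit(h, num_pieces, essential_pos):
    # The bit at essential_pos is prepended to every piece, so it must always match.
    essential = h[essential_pos]
    return [essential + piece for piece in split(h, num_pieces)]

print(split_with_essential_bit('10111101', 2, essential_pos=0))  # ['11011', '11101']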

If these approaches don’t work, your only hope is to improve your hashing function.

The moral of the story:

  • Choose indices of appropriate size; probably log2(m) or slightly greater, where m is the number of results you expect to hash
  • Split your hash in a way such that each partial hash gets bits that are as uncorrelated as possible
  • Avoid hashing redundant results, possibly by pairing your multi-index table with an exact lookup table
  • Remove redundant or uninformative bits from your partial hashes
  • Duplicate essential bits across all partial hashes
  • Reconsider your hashing function

Step-by-step Guide to Deploying Deep Learning Models

Or, I just trained a machine learning model – now what?

This post goes over a quick and dirty way to deploy a trained machine learning model to production.

ML in production

When we first entered the machine learning space here at Hive, we already had millions of ground truth labeled images, allowing us to train a state-of-the-art deep convolutional image classification model from scratch (i.e. from randomized weights) in under a week, specialized for our use case. The more typical ML use case, though, involves on the order of hundreds of images, for which I would recommend fine-tuning an existing model. For instance, https://www.tensorflow.org/tutorials/image_retraining has a great tutorial on how to fine-tune an ImageNet model (trained on 1.2M images, 1000 classes) to classify a sample dataset of flowers (3647 images, 5 classes).

For a quick tl;dr of the linked TensorFlow tutorial, after installing bazel and TensorFlow, you would need to run the following code, which takes around 30 mins to build and 5 minutes to train:

(
cd "$HOME" && \
curl -O http://download.tensorflow.org/example_images/flower_photos.tgz && \
tar xzf flower_photos.tgz ;
) && \
bazel build tensorflow/examples/image_retraining:retrain \
          tensorflow/examples/image_retraining:label_image \
&& \
bazel-bin/tensorflow/examples/image_retraining/retrain \
  --image_dir "$HOME"/flower_photos \
  --how_many_training_steps=200 \
&& \
bazel-bin/tensorflow/examples/image_retraining/label_image \
  --graph=/tmp/output_graph.pb \
  --labels=/tmp/output_labels.txt \
  --output_layer=final_result:0 \
  --image=$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg

Alternatively, if you have Docker installed, you can use this prebuilt Docker image like so:

sudo docker run -it --net=host liubowei/simple-ml-serving:latest /bin/bash

>>> cat test.sh && bash test.sh

which puts you into an interactive shell inside the container and runs the above command; you can also follow along with the rest of this post inside the container if you wish.

Now, TensorFlow has saved the model information into /tmp/output_graph.pb and /tmp/output_labels.txt, which are passed above as command-line parameters to the label_image.py script. Google’s image_recognition tutorial also links to another inference script, but we will be sticking with label_image.py for now.

Converting one-shot inference to online inference (TensorFlow)

If we just want to accept file names from standard input, one per line, we can do “online” inference quite easily:

while read line ; do
bazel-bin/tensorflow/examples/image_retraining/label_image \
--graph=/tmp/output_graph.pb --labels=/tmp/output_labels.txt \
--output_layer=final_result:0 \
--image="$line" ;
done

From a performance standpoint, though, this is terrible – we are reloading the neural net, the weights, the entire TensorFlow framework, and Python itself for every input example!

We can do better. Let’s start by editing the label_image.py script — for me, this is located in bazel-bin/tensorflow/examples/image_retraining/label_image.runfiles/org_tensorflow/tensorflow/examples/image_retraining/label_image.py.

Let’s change the lines

141:  run_graph(image_data, labels, FLAGS.input_layer, FLAGS.output_layer,
142:        FLAGS.num_top_predictions)

TO

141:  for line in sys.stdin:
142:    run_graph(load_image(line), labels, FLAGS.input_layer,
143:        FLAGS.output_layer, FLAGS.num_top_predictions)

This is indeed a lot faster, but this is still not the best we can do!

The reason is the with tf.Session() as sess construction on line 100. TensorFlow is essentially loading all the computation into memory every time run_graph is called. This becomes apparent once you start trying to do inference on the GPU — you can see the GPU memory go up and down as TensorFlow loads and unloads the model parameters to and from the GPU. As far as I know, this construction is not present in other ML frameworks like Caffe or Pytorch.

The solution is then to pull the with statement out, and pass in a sess variable to run_graph:


def run_graph(image_data, labels, input_layer_name, output_layer_name,
              num_top_predictions, sess):
    # Feed the image_data as input to the graph.
    #   predictions will contain a two-dimensional array, where one
    #   dimension represents the input image count, and the other has
    #   predictions per class
    softmax_tensor = sess.graph.get_tensor_by_name(output_layer_name)
    predictions, = sess.run(softmax_tensor, {input_layer_name: image_data})
    # Sort to show labels in order of confidence
    top_k = predictions.argsort()[-num_top_predictions:][::-1]
    for node_id in top_k:
      human_string = labels[node_id]
      score = predictions[node_id]
      print('%s (score = %.5f)' % (human_string, score))
    # numpy floats are not JSON serializable, so convert them with .item()
    return [(labels[node_id], predictions[node_id].item()) for node_id in top_k]

...

  with tf.Session() as sess:
    for line in sys.stdin:
      run_graph(load_image(line), labels, FLAGS.input_layer, FLAGS.output_layer,
          FLAGS.num_top_predictions, sess)

(see code at https://github.com/hiveml/simple-ml-serving/blob/master/label_image.py)

If you run this, you should find that it takes around 0.1 sec per image, quite fast enough for online use.

Converting one-shot inference to online inference (Other ML Frameworks)

Caffe uses its net.forward method, which is very easy to wrap in a callable framework: see http://nbviewer.jupyter.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb

MXNet is a bit unusual: it actually has ready-to-go inference server code publicly available: https://github.com/awslabs/mxnet-model-server.

Further details coming soon!

Deployment

The plan is to wrap this code in a Flask app. If you haven’t heard of it, Flask is a very lightweight Python web framework which allows you to spin up an HTTP API server with minimal work.

As a quick reference, here’s a Flask app that receives POST requests with multipart form data:

#!/usr/bin/env python
# usage: python echo.py to launch the server ; and then in another session, do
# curl -v -XPOST 127.0.0.1:12480 -F "data=@./image.jpg"
from flask import Flask, request
app = Flask(__name__)
@app.route('/', methods=['POST'])
def classify():
    try:
        data = request.files.get('data').read()
        print(repr(data)[:1000])
        return data, 200
    except Exception as e:
        return repr(e), 500
app.run(host='127.0.0.1',port=12480)

And here is the corresponding Flask app hooked up to run_graph above:


#!/usr/bin/env python
# usage: bash tf_classify_server.sh
from flask import Flask, request
import tensorflow as tf
import label_image as tf_classify
import json
app = Flask(__name__)
FLAGS, unparsed = tf_classify.parser.parse_known_args()
labels = tf_classify.load_labels(FLAGS.labels)
tf_classify.load_graph(FLAGS.graph)
sess = tf.Session()
@app.route('/', methods=['POST'])
def classify():
    try:
        data = request.files.get('data').read()
        result = tf_classify.run_graph(data, labels, FLAGS.input_layer, FLAGS.output_layer, FLAGS.num_top_predictions, sess)
        return json.dumps(result), 200
    except Exception as e:
        return repr(e), 500
app.run(host='127.0.0.1',port=12480)

This looks quite good, except for the fact that Flask and TensorFlow are both fully synchronous – Flask processes one request at a time in the order they are received, and TensorFlow fully occupies the thread when doing the image classification.

As it’s written, the speed bottleneck is probably still in the actual computation work, so there’s not much point upgrading the Flask wrapper code. And maybe this code is sufficient to handle your load, for now.

There are two obvious ways to scale up request throughput: scale up horizontally by increasing the number of workers, which is covered in the next section, or scale up vertically by utilizing a GPU and batching logic. Implementing the latter requires a webserver that is able to handle multiple pending requests at once and decide whether to keep waiting for a larger batch or send it off to the TensorFlow graph thread to be classified, for which this Flask app is horrendously unsuited. Two possibilities are using Twisted + Klein to keep the code in Python, or Node.js + ZeroMQ if you prefer first-class event loop support and the ability to hook into non-Python AI frameworks such as Torch.
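
As a rough sketch of just that “wait for a fuller batch or send what we have” decision (not a production server), here is one way it could look, reusing the run_graph, labels, FLAGS, and sess objects from above. Note that run_graph as written still classifies one image at a time, so a real batched setup would instead run a single sess.run over the whole batch:

import queue, threading, time

MAX_BATCH = 32
BATCH_TIMEOUT = 0.05                         # seconds to wait for a fuller batch

pending = queue.Queue()

class Job(object):
    def __init__(self, image_data):
        self.image_data = image_data
        self.done = threading.Event()
        self.result = None

def batch_worker():
    while True:
        batch = [pending.get()]              # block until at least one job arrives
        deadline = time.time() + BATCH_TIMEOUT
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                batch.append(pending.get(timeout=remaining))
            except queue.Empty:
                break
        for job in batch:
            job.result = run_graph(job.image_data, labels, FLAGS.input_layer,
                                   FLAGS.output_layer, FLAGS.num_top_predictions, sess)
            job.done.set()                   # wake the waiting request handler

threading.Thread(target=batch_worker, daemon=True).start()

# In the Flask handler: job = Job(data); pending.put(job); job.done.wait(); return job.result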

OK, so now we have a single server serving our model, but maybe it’s too slow or our load is getting too high. We’d like to spin up more of these servers – how can we distribute requests across each of them?

The ordinary method is to add a proxy layer, perhaps HAProxy or nginx, which balances the load between the backend servers while presenting a single uniform interface to the client. For use later in this section, here is some sample code that runs a rudimentary Node.js load-balancing HTTP proxy:

// Usage : node basic_proxy.js WORKER_PORT_0,WORKER_PORT_1,...
const worker_ports = process.argv[2].split(',')
if (worker_ports.length === 0) { console.error('missing worker ports') ; process.exit(1) }

const proxy = require('http-proxy').createProxyServer({})
proxy.on('error', () => console.log('proxy error'))

let i = 0
require('http').createServer((req, res) => {
  proxy.web(req,res, {target: 'http://localhost:' + worker_ports[ (i++) % worker_ports.length ]})
}).listen(12480)
console.log(`Proxying localhost:${12480} to [${worker_ports.toString()}]`)

// spin up the AI workers
const { exec } = require('child_process')
worker_ports.map(port => exec(`/bin/bash ./tf_classify_server.sh ${port}`))

To automatically detect how many backend servers are up and where they are located, people generally use a “service discovery” tool, which may be bundled with the load balancer or be separate. Some well-known ones are Consul and Zookeeper. Setting up and learning how to use one is beyond the scope of this article, so I’ve included a very rudimentary proxy using the node.js service discovery package seaport.

Proxy code:

// Usage : node seaport_proxy.js
const seaportServer = require('seaport').createServer()
seaportServer.listen(12481)
const proxy = require('http-proxy').createProxyServer({})
proxy.on('error', () => console.log('proxy error'))

let i = 0
require('http').createServer((req, res) => {
  seaportServer.get('tf_classify_server', worker_ports => {
    const this_port = worker_ports[ (i++) % worker_ports.length ].port
    proxy.web(req,res, {target: 'http://localhost:' + this_port })
  })
}).listen(12480)
console.log(`Seaport proxy listening on ${12480} to '${'tf_classify_server'}' servers registered to ${12481}`)

Worker code:

// Usage : node tf_classify_server.js
const port = require('seaport').connect(12481).register('tf_classify_server')
console.log(`Launching tf classify worker on ${port}`)
require('child_process').exec(`/bin/bash ./tf_classify_server.sh ${port}`)

However, as applied to AI, this setup runs into a bandwidth problem.

At anywhere from tens to hundreds of images a second, the system becomes bottlenecked on network bandwidth. In the current setup, all the data has to go through our single seaport master, which is the single endpoint presented to the client.

To solve this, we need our clients to not hit the single endpoint at http://127.0.0.1:12480, but instead to automatically rotate between backend servers to hit. If you know some networking, this sounds precisely like a job for DNS!

However, setting up a custom DNS server is again beyond the scope of this article. Instead, by changing the clients to follow a 2-step “manual DNS” protocol, we can reuse our rudimentary seaport proxy to implement a “peer-to-peer” protocol in which clients connect directly to their servers:

Proxy code:

// Usage : node p2p_proxy.js
const seaportServer = require('seaport').createServer()
seaportServer.listen(12481)

let i = 0
require('http').createServer((req, res) => {
  seaportServer.get('tf_classify_server', worker_ports => {
    const this_port = worker_ports[ (i++) % worker_ports.length ].port
    res.end(`${this_port}\n`)
  })
}).listen(12480)
console.log(`P2P seaport proxy listening on ${12480} to 'tf_classify_server' servers registered to ${12481}`)

(The worker code is the same as above.)

Client code:


curl -v -XPOST localhost:`curl localhost:12480` -F "data=@$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg"

RPC Deployment

Coming soon! A version of the above with Flask replaced by ZeroMQ.

Conclusion and further reading

At this point you should have something working in production, but it’s certainly not futureproof. There are several important topics that were not covered in this guide:

  • Automatically deploying and setting up on new hardware.
    • Notable tools include OpenStack/VMware if you’re on your own hardware, Chef/Puppet for installing Docker and handling networking routes, and Docker for installing TensorFlow, Python, and everything else.
    • Kubernetes or Marathon/Mesos are also great if you’re on the cloud
  • Model version management
    • Not too hard to handle this manually at first
    • TensorFlow Serving is a great tool that handles this, as well as batching and overall deployment, very thoroughly. The downsides are that it’s a bit hard to set up and to write client code for, and in addition it doesn’t support Caffe/PyTorch.
  • How to migrate your AI code off Matlab
    • Don’t use MATLAB in production.
  • GPU drivers, CUDA, cuDNN
    • Use nvidia-docker and try to find some Dockerfiles online.
  • Postprocessing layers. Once you get a few different AI models in production, you might start wanting to mix and match them for different use cases — run model A only if model B is inconclusive, run model C in Caffe and pass the results to model D in TensorFlow, etc.



Inside a Neural Network’s Mind

Why do neural networks make the decisions they do? Often, the truth is that we don’t know; it’s a black box. Fortunately, there are now some techniques that help us peek under the hood to help us understand how they make decisions.

What has the neural network learned to consider attractive? Where does it look to decide whether an image is safe for work? Using grad-CAM, we explore the predictions of our models: sport type, action / non-action, drugs, violence, attractiveness, race, age, etc.

Github repo: https://github.com/hiveml/tensorflow-grad-cam

Hey, my face is up here! Clearly, the attractiveness model focuses on body over face in the mid-range shots above. Interestingly, it has also learned to localize people without any bounding box information during training. The model is trained on 200k images, labeled by Hive into three classes: hot, neutral, and not. The scores for the three buckets are then combined to create a rating from 0 to 10. This classifier is available here [1].

The main idea is to apply the weights of the logit layer to the last convolutional layer before global pooling. This creates a map showing the importance of each pixel in the network’s decision.

Sports action, NSFW, violence

The pose of the football player tells the model that a play is in action. We can clearly locate the nudity and the guns in the NSFW and Violence images, too.

Snowboarding, TV show

A person in a suit, center frame, apparently indicates that it is a TV show instead of a commercial (right). The TV / commercial model is a great example of how grad-CAM can uncover unexpected reasons behind the decisions our models make. They can also confirm what we expect, as seen in the snowboarding example (left).

The Simpsons, Rick and Morty

This example uses our animated show classifier. Interestingly, the most important spot in the images above is the edge of Bart and Morty, including a substantial amount of the background in both cases.

CAM and Grad-CAM

First developed by Zhou [2], Class Activation Maps (CAM) show what the network is looking at. For each class, CAM illustrates the parts of the image most important for that class.

Ramprasaath [3] extended CAM to a wider range of architectures without requiring any changes to the network. Specifically, grad-CAM can handle fully connected layers and more complicated scenarios like question answering. However, almost all popular neural nets like ResNet, DenseNet, and even NASNet end with global average pooling, so the heatmap can be computed directly using CAM without the backward pass. This is especially important for speed-critical applications. Fortunately, with the ResNet used in this post we don’t have to modify the net at all to compute CAM or grad-CAM.
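
For such a global-average-pooling network, the plain-CAM computation is just a weighted sum of the last convolutional feature maps; here is a minimal sketch (the names are illustrative, not taken from the repo):

import numpy as np

def plain_cam(conv_features, logit_weights, class_idx):
    # conv_features: [H, W, C] activations before global average pooling
    # logit_weights: [C, num_classes] weights of the final fully connected layer
    cam = conv_features @ logit_weights[:, class_idx]   # [H, W]
    cam = np.maximum(cam, 0)                            # keep positive evidence only
    return cam / (cam.max() + 1e-8)                     # normalize to [0, 1]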

Recently, grad-CAM++ (Chattopadhyay [4]) further generalized the method to increase the precision of the output heat maps. Grad-CAM++ is better at dealing with multiple instances of a class and at highlighting the entire object rather than just its most salient parts. It achieves this using a weighted combination of positive partial derivatives.

Here’s how it’s implemented in TensorFlow:

one_hot = tf.sparse_to_dense(predicted_class, [num_classes], 1.0)
signal = tf.multiply(end_points['Logits'], one_hot)
loss = tf.reduce_mean(signal)

This returns an array of num_classes elements with only the logit of the predicted class non-zero. This defines the loss.

grads = tf.gradients(loss, conv_layer)[0]
norm_grads = tf.divide(grads, tf.sqrt(tf.reduce_mean(tf.square(grads)))
	+ tf.constant(1e-5))

Next, we run the graph to get the activations of the last convolutional layer and the normalized gradients for the input image:

output, grads_val = sess.run([conv_layer, norm_grads],
	feed_dict={imgs0: img})

The grad-CAM weight for each feature map is then the gradient averaged over the spatial dimensions:

weights = np.mean(grads_val, axis = (0, 1))             # [2048]
cam = np.ones(output.shape[0 : 2], dtype = np.float32)  # [10,10]

The class activation map is the weighted sum of the feature maps:

for i, w in enumerate(weights):
	cam += w * output[:, :, i]
cam = np.maximum(cam, 0)
cam = cam / np.max(cam)
cam = cv2.resize(cam, (eval_image_size, eval_image_size))

We pass the cam through a ReLU to keep only the positive evidence for that class, then resize the coarse cam output to the input size and blend it with the image for display.
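
A sketch of that resize-and-blend step (the repo’s own display code may differ slightly):

import cv2
import numpy as np

def overlay_cam(img, cam, alpha=0.5):
    # img: uint8 BGR image; cam: float heatmap in [0, 1] at any resolution
    cam = cv2.resize(cam, (img.shape[1], img.shape[0]))
    heatmap = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)
    return cv2.addWeighted(img, 1 - alpha, heatmap, alpha, 0)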

Finally, the main function grabs the TensorFlow-Slim model definition and preprocessing function. With these, it computes the grad-CAM output and blends it with the input photo. In the code below, we use the class with the greatest softmax probability as the input to grad_cam, but we could instead choose any class. For example:

The model predicted alcohol as the top choice with 99% and gambling with only 0.4%. By changing the predicted_class from alcohol to gambling, we can see how, despite the low class probability, the model can clearly pinpoint the gambling in the image.

References