BACK TO ALL BLOGS

Learning Hash Codes via Hamming Distance Targets

We recently submitted our paper Learning Hash Codes via Hamming Distance Targets to arXiv. This was a revamp and generalization of our previous work, CHASM (Convolutional Hashing for Automated Scene Matching). We achieved major recall and performance boosts against state-of-the-art methods for content-based image retrieval and approximate nearest neighbors tasks. Our method can train any differentiable model to hash for similarity search.

Similarity search with binary hash codes

Let’s start with everyone’s favorite example: ImageNet. A common information retrieval task selects 100 ImageNet classes and requires hashing “query” and “dataset” images to compare against each other. Methods seek to maximize the mean average precision (MAP) of the top 1000 dataset results by hash distance, such that most of the 1000 nearest dataset images to each query image come from the same ImageNet class.

This is an interesting challenge because it requires training a differentiable loss term, whereas the final hash is discrete. Trained models must either binarize their last layer into 0s and 1s (usually just taking its sign), or (like FaceNet) pair up with a nearest neighbors method such as k-d trees or Jegou et al.’s Product Quantization for Nearest Neighbor Search.

Insight 1: It’s not a classification task.

While information retrieval on ImageNet is reminiscent of classification, its optimization goal is actually quite different. Every image retrieval paper we looked at implicitly treated similarity search as if it were a classification task.

Some papers make this assumption by using cross entropy terms, asserting that the probability that two images with last layers x and y are similar is something like σ(x · y), the sigmoid of their dot product. The issue here is that the model uses hashes at inference time, not the asserted probabilities. An example of this is Cao et al.'s HashNet: Deep Learning to Hash by Continuation.

Other papers make this assumption by simply training a classification model with an encoding layer, then hoping that the binarized encoding is a good hash. The flaw here is that, while the floating-point encoding contains all information used to classify the image, its binarized version might not make a good hash. Bits may be highly imbalanced, and there is no guarantee that binarizing the encoding preserves much of the information. An example of this is Lin et al.’s Deep Learning of Binary Hash Codes for Fast Image Retrieval.

Finally, a few papers make this assumption by first choosing a target hash for each class, then trying to minimize the distance between each image and its class’s target hash. This is actually a pretty good idea for ImageNet, but leaves something to be desired: it only works naturally for classification, rather than more general similarity search tasks, where similarity can be non-transitive and asymmetric. An example of this is Lu et al.’s Deep Binary Representation for Efficient Image Retrieval. This seems to be the second best performing method after ours.

We instead choose a loss function that easily extends to non-transitive, asymmetric similarity search tasks without training a classification model. I’ll elaborate on this in the next section.

Insight 2: There is a natural way to compare floating-point embeddings to binarized hashes.

Previous papers have tried to wrestle floating-point embeddings into binarized hashes through a variety of means. Some add "binarization" loss terms, punishing the model for creating embeddings that are far from -1 or 1. Others learn "by continuation", producing an embedding by passing its inputs through a tanh function that sharpens during training. The result is that their floating-point embeddings always lie close to ±1, a finding that they boast. They do this in order to make Euclidean distance correspond more closely to Hamming distance (the number of bits that differ). If your embedding consists of only -1's and 1's, then Hamming distance is simply a quarter of the squared Euclidean distance.
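As a quick numerical sanity check of that relation (illustrative numpy only):

import numpy as np

# For +/-1 vectors, Hamming distance equals a quarter of the squared Euclidean distance.
x = np.random.choice([-1.0, 1.0], size=64)
y = np.random.choice([-1.0, 1.0], size=64)
hamming = np.sum(x != y)
assert hamming == np.sum((x - y) ** 2) / 4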

However, this is actually a disaster for training. First of all, forcing all outputs to ±1 does not even change their binarized values; 0.5 and 0.999 both binarize to 1. Also, it shrinks gradients for embedding components near ±1, forcing the model to learn from an ever-shrinking gray area of remaining values near 0.

We resolve this by avoiding the contrived usage of Euclidean distance altogether. Instead, we use a sound statistical model for Hamming distance based on embeddings, making two approximations that turn out to be very accurate. First, we assume that a model's last layer produces embeddings that consist of independent random unit normals (which we encourage with a gentle batch normalization). If we pick a random embedding x, this implies that x/||x|| is a random point on the unit hypersphere. For a pair of embeddings x and y, we can then simply evaluate the angle θ between x/||x|| and y/||y||. The probability that such vectors differ in sign on a particular component is θ/π, so we make our second approximation: that the probability for each component to differ in sign is independent. This implies that the probability for the Hamming distance between x and y to be d is a binomial distribution:

P(Hamming distance = d) = C(b, d) (θ/π)^d (1 − θ/π)^(b−d),

where b is the number of hash bits.

This allows us to use a very accurate loss function for the true optimization goal, the chance for an input to be within a target Hamming distance of an input it is similar to (and dissimilar inputs to be outside that Hamming distance). Using the natural geometry of the Hamming embedding, we achieve far better results than previous work.
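To make this concrete, here is a rough sketch (numpy and scipy assumed; an illustration of the statistical model, not the exact loss from the paper) of the probability that two embeddings land within a target Hamming distance t:

import numpy as np
from scipy.stats import binom

def p_within_hamming_target(x, y, t):
    # b hash bits; under the model, each bit differs independently with probability theta / pi
    b = len(x)
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))  # angle between the embeddings
    return binom.cdf(t, b, theta / np.pi)       # P(Hamming distance <= t)

x, y = np.random.randn(64), np.random.randn(64)
print(p_within_hamming_target(x, y, t=16))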

Insight 3: It’s important to structure your training batches right.

Imagine this: you train your model to hash, passing in 32 pairs of random images from your training set, and averaging the 32 pairwise loss terms. Since you have 100 ImageNet classes, each batch consists of 31.68 dissimilar pairs and 0.32 similar pairs on average. How accurate is the gradient?

The bottleneck is learning about similar images. If the random error from each similar pair's loss is σ, the expected random error in each batch is roughly σ/√0.32 ≈ 1.8σ, even greater than σ. This is a tremendous amount of noise, making it incredibly slow for the model to learn anything.

We can first improve by comparing every image in the batch to every other image. This takes C(64, 2) = 2016 comparisons, which will be only slightly slower than the original 32 comparisons since we still only need to run the model once on each input. This gives us 2016 pairwise comparisons, with 1995.84 dissimilar pairs and 20.16 similar pairs on average. Our random error drops to roughly σ/√20.16 ≈ 0.22σ (a little worse in practice, since the similar pairs share images and are not fully independent), a big improvement.

But we can do even better by constructing each batch around random similar images. By first choosing 32 random images, then for each one choosing a random image it is similar to, we get 51.84 similar pairs on average, 32 of which are independent. This reduces our random error to between σ/√51.84 ≈ 0.14σ and σ/√32 ≈ 0.18σ, another big improvement.

Under reasonable conditions, this improves training speed by a factor of 10.
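Here is a small sketch of that batch construction, under illustrative assumptions (similar_to is a hypothetical lookup from a training image to the images it is similar to):

import itertools
import random

def build_batch(num_images, similar_to, anchors_per_batch=32):
    anchors = random.sample(range(num_images), anchors_per_batch)
    batch = []
    for a in anchors:
        batch.append(a)
        batch.append(random.choice(similar_to[a]))  # guaranteed similar partner
    # compare every image in the batch to every other image
    pairs = list(itertools.combinations(range(len(batch)), 2))
    return batch, pairs

# Toy example: 100 "classes" of 10 images each; images in the same class are similar.
similar_to = {i: [j for j in range(1000) if j // 10 == i // 10 and j != i] for i in range(1000)}
batch, pairs = build_batch(1000, similar_to)
print(len(batch), len(pairs))  # 64 images, C(64, 2) = 2016 pairs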

Discussion

Read our paper for the full story! We boosted the ImageNet retrieval benchmark from 73.3% to 85.3% MAP for 16-bit hashes and performed straight-up approximate nearest neighbors with 2 to 8 times fewer distance comparisons than previously state-of-the-art methods at the same recall.


BACK TO ALL BLOGS

Multi-label Classification

Classification challenges like ImageNet changed the way we train models. Given enough data, neural networks can distinguish between thousands of classes with remarkable accuracy.

However, there are some circumstances where basic classification breaks down, and something called multi-label classification is necessary. Here are two examples:

  • You need to classify a large number of brand logos and what medium they appear on (sign, billboard, soda bottle, etc.)
  • You have plenty of image data on a lot of different animals, but none on the platypus – which you want to identify in images

In the first example, should you train a classifier with one class for each logo and medium combination? The number of such combinations could be enormous, and it might be impossible to get data on some of them. Another option would be to train a classifier for logos and a classifier for medium; however, this doubles the runtime to get your results. In the second example, it seems impossible to train a platypus model without data on it.

Multi-label models step in by doing multiple classifications at once. In the first example, we can train a single model that outputs both a logo classification and a medium classification without increasing runtime. In the second example, we can use common sense to label animal features (fur vs. feathers vs. scales, bill vs. no bill, tail vs. no tail) for each of the animals we know about, train a single model that identifies all features for an animal at once, then infer that any animal with fur, a bill, and a tail is a platypus.

A simple way to accomplish this in a neural network is to group the logit layer into multiple softmax predictions, one per label.

You can then train such a network by simply adding the cross entropy loss for each softmax where a ground truth label is present.
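As a rough illustration (plain numpy, not our training code), here is what grouping one logit layer into two softmax heads and summing cross entropy only where a ground truth label is present might look like; the head sizes here are hypothetical:

import numpy as np

HEAD_SIZES = [3, 2]  # e.g. body covering (3 classes) and spots / no spots (2 classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_label_loss(logits, labels):
    """logits: flat vector of size sum(HEAD_SIZES); labels: one int per head, or None if missing."""
    loss, start = 0.0, 0
    for size, label in zip(HEAD_SIZES, labels):
        head_logits = logits[start:start + size]
        if label is not None:  # only add cross entropy where a ground truth label is present
            loss += -np.log(softmax(head_logits)[label])
        start += size
    return loss

print(multi_label_loss(np.array([2.0, 0.5, -1.0, 0.3, 1.2]), [0, None]))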

To compare these approaches, let's consider a subset of six ImageNet animal classes and two features that distinguish them: body covering (scales, exoskeleton, or fur) and whether the animal has spots.

First, I trained two 50-layer ResNet v2 models on this balanced dataset: one on the single-label classification problem, and the other on the multi-label classification problem. In this example, every training image has both labels, but real applications may have only a subset of labels available for each image.

The single-label model trained specifically on the 6-animal classification performed slightly better when distinguishing all 6 animals:

  • Single-label model: 90% accuracy
  • Multi-label model: 88% accuracy

However, the multi-label model provides finer-grained information. Though it got only 88% accuracy on distinguishing all 6 animals, it achieved 92% accuracy at distinguishing scales/exoskeleton/fur and 95% accuracy at distinguishing spots/no spots. If we care about only one of these factors, we're already better off with the multi-label model.

But this toy example hardly touches on the regime where multi-label classification really thrives: large datasets with many possible combinations of independent labels. In this regime, we get the interesting benefit of transfer learning. Imagine if we had categorized hundreds of animals into a dozen binary criteria. Training a separate model for each binary criterion would yield acceptable results, but learning the other features can actually help in some cases by effectively pre-training the network on a larger dataset.

At Hive, we recently deployed a multi-label classification model that replaced 8 separate classification models. For each image, we usually had truth data available for 2 to 5 of the labels. Of the 8 labels, 2 performed better than their single-label counterparts (think 93% instead of 91%), and these were the labels with less data. This makes sense, since they benefit the most from domain-specific pretraining on the same images. But most importantly for this use case, we were able to run all the models together in 1/8th the time.

BACK TO ALL BLOGS

How to Use SSH Tunneling

This post is a step-by-step guide to solving common connectivity problems using SSH tunnels. For some useful comments and context, skip to the end.

Throughout this guide, assume a service is running on box B at port 8080, and you want to reach it from box A.

Step 0: If box A can already hit box B:8080, then good for you. Otherwise, follow the steps below to make that happen.

Step 1: If box A can ssh into box B, read the following section. If not, go to Step 2.

Here we will use ssh local port forwarding to PULL the service port through the ssh connection. The ssh tunnel command (to be run from A) is:

user@A > ssh -vCNnTL A:8080:B:8080 user@B

This pulls the service port over to box A at port 8080, so that anyone connecting to A:8080 will transparently get their requests forwarded over to the actual server on B:8080.

Step 2: If box B can ssh into box A, read the following section. Otherwise, skip to step 3.

Now we will use ssh remote port forwarding to PUSH the service port through the ssh connection. The command (to be run from B) is

user@B > ssh -vCNnTR localhost:8080:B:8080 user@A

Now users on A hitting localhost:8080 will be able to connect to B:8080. Users not on either A or B will still be unable to connect.

To enable listening on A:8080, you have 2 options:

A) If you have sudo, add the following line to /etc/ssh/sshd_config and reload the sshd service:

GatewayPorts clientspecified

Then rerun the above command with “localhost” replaced by “A”:

user@B > ssh -vCNnTR A:8080:B:8080 user@A

B) Pretend “localhost (A)” is another box, and apply step 1, since A can generally ssh into itself:

user@A > ssh -vCNnTL A:8080:localhost:8080 user@localhost

Now we come to the situation where neither A nor B can ssh into the other.

Step 3: If there are any TCP ports that allow A, B to connect, continue reading. Otherwise, move on to step 4.

Suppose that B is able to connect to A:4040. Then the way to allow B to ssh into A is to turn A:4040 into an ssh port. This is doable by applying Step 1 on A itself, to pull the ssh service on 22 over to listen on A:4040:

user@A > ssh -vCNnTL A:4040:A:22 user@A

And then you can apply Step 2, specifying port 4040 for ssh itself:

user@B > ssh -vCNnTR localhost:8080:B:8080 user@A -p 4040

Similarly, if A is able to connect to B:4040, you'll want to expose B's ssh service on B:4040 by applying Step 1 on B (pulling B:22 over to listen on B:4040), then apply Step 1 as usual with -p 4040.

Step 4: Find a box C which has some sort of connectivity to both A and B, and continue reading. Otherwise, skip to step 10.

If A and B have essentially no connectivity, then the way to proceed is to route through another box.

Step 5: If C is able to hit B:8080, continue reading. Otherwise, skip to step 9.

Step 6: If A is able to SSH to C, continue reading. Otherwise, skip to step 7.

This is very similar to step 1 — we will pull the connection through the SSH tunnel using LOCAL forwarding.

user@A > ssh -vCNnTL A:8080:B:8080 user@C

Step 7: If C is able to SSH to A, continue reading. Otherwise, skip to step 8.

Again, this is analogous to step 2 — we will push the connection through the SSH tunnel using REMOTE ssh forwarding.

user@C: ssh -vCNnTR localhost:8080:B:8080 user@A

Just as before, we will need to add an additional forwarding step to listen on a public A interface rather than localhost on A:

user@A: ssh -vCNnTL A:8080:localhost:8080 user@localhost

Step 8: If neither C nor A can ssh to each other, but there are TCP ports open between the 2:

Apply the technique in step 3. Otherwise, skip to step 10.

Now we are in the situation where C can’t hit B:8080 directly.

Step 9: The general idea is to first connect C:8080 to B:8080 using one of Steps 1 or 2, and then do the same to connect A:8080 to C:8080.

Note that it doesn’t matter which order you set up these connections.

9a) If C can ssh to B and C can ssh to A. This is a very common scenario – maybe C is your local laptop which is connected to 2 separate VPNs.

First pull B:8080 to C:1337 and then push C:1337 to A:8080:

user@C > ssh -vCNnTL localhost:1337:B:8080 user@B
user@C > ssh -vCNnTR localhost:8080:localhost:1337 user@A
user@C > ssh user@A
user@A > ssh -vCNnTL A:8080:localhost:8080 user@localhost

9b) If C can ssh to B and A can ssh to C: Again, a fairly common scenario if you have a “super-private” network accessible only from an already private network.

Do two pulls in succession:

user@C > ssh -vCNnTL localhost:1337:B:8080 user@B
user@A > ssh -vCNnTL A:8080:localhost:1337 user@C

You can actually combine these into a single command:

user@A > ssh -vC -A -L A:8080:localhost:1337 user@C 'ssh -vCNnTL localhost:1337:B:8080 user@B'

9c) If B can ssh to C and C can ssh to A: Double-push.

user@B > ssh -vCNnTR localhost:1337:B:8080 user@C
user@C > ssh -vCNnTR localhost:8080:localhost:1337 user@A
user@C > ssh user@A
user@A > ssh -vCNnTL A:8080:localhost:8080 user@A

Again these can be combined into a single command.

9d) If B can ssh to C and A can ssh to C: Push from B and then pull from A.

user@B > ssh -vCNnTR localhost:1337:B:8080 user@C
user@A > ssh -vCNnTL A:8080:localhost:1337 user@C

Any number of these may also need the trick from Step 3 to enable SSH access.

Step 10: If box C doesn't have any TCP connectivity to either B or A, then having box C doesn't really help the situation at all. You'll need to find a different box C which actually has connectivity to both and return to step 4, or find a chain (C, D, etc.) through which you could eventually patch a connection through to B. In this case you'll need a series of commands such as those in 9a)-d) to gradually patch B:8080 to C:1337, to D:42069, etc., until you finally end up at A:8080.

Addendum 1: If your service uses UDP rather than TCP (this includes dns, some video streaming protocols, and most video games), you may have to add a few steps to convert to TCP; see https://superuser.com/questions/53103/udp-traffic-through-ssh-tunnel for a guide.

Addendum 2: If your service host and port come with url signing (e.g. signed s3 urls), changing your application to hit A:8080 rather than B:8080 may cause the url signatures to fail. To remedy this, you can add a line in your /etc/hosts file to redirect B to A; since your computer checks this file before doing a DNS lookup, you can do completely transparent ssh tunneling while still respecting SSL and signed s3/gcs urls.

Addendum 3: My preferred tools for checking which TCP ports are open between 2 boxes are nc -l 4040 on the receiving side and curl B:4040 on the sending side. ping B, traceroute B, and route -n are also useful for diagnostic information but may not tell you the full story.

Addendum 4: SSH tunnels will not work if there is something already listening on that port, such as the SSH tunnel you created yesterday and forgot to remove. To easily check this, try ps -efjww | grep ssh or sudo netstat -nap | grep LISTEN.

Addendum 5: All the ssh flags are explained in the man page: man ssh. To give a brief overview: -v is verbose logging, -C is compression, -nNT together disable the interactive part of ssh and make it only tunnel, -A forwards over your ssh credentials, -f backgrounds ssh after connecting, and -L and -R are for local and remote forwarding respectively. -o StrictHostKeyChecking=no is also useful for disabling the known-hosts check.

FURTHER COMMENTS

SSH tunnels are useful as a quick fix for networking issues, but they are generally recognized as inferior to proper long-term networking fixes. They tend to be difficult to maintain for a number of reasons: the setup requires no configuration and leaves no trace other than a running process; they don't automatically come back up when a box restarts unless added to a startup daemon; and they can easily be killed by temporary network outages.

However, we've made good use of them here at Hive: most recently when we needed to keep production services up during a network migration, and occasionally when provisioning burst GPU resources from AWS and integrating them seamlessly into our hardware resource pool. They can also be very useful when developing locally, debugging production services, or getting Gmail access in China.

If you’re interested in different and more powerful ways to tunnel, I’m no networking expert — all I can do is point you in the direction of some interesting networking vocabulary.

References

https://en.wikipedia.org/wiki/OSI_model

https://help.ubuntu.com/community/SSH/OpenSSH/PortForwarding#Dynamic_Port_Forwarding

https://en.wikipedia.org/wiki/SOCKS

https://en.wikipedia.org/wiki/Network_address_translation

https://en.wikipedia.org/wiki/IPsec

https://en.wikipedia.org/wiki/Iptables

https://wiki.archlinux.org/index.php/VPN_over_SSH

https://en.wikipedia.org/wiki/Routing_table

https://linux.die.net/man/8/route

BACK TO ALL BLOGS

Back to Our Roots: Hive Data in Academia

Hive was started by two PhD students at Stanford who were frustrated with the difficulty of generating quality datasets for machine learning research. We found that solutions on the market were either too inaccurate or too expensive to conduct the typical research study. From those early days, we’ve now built one of the world’s largest marketplaces of human labor.

In keeping with our academic roots, we always intended Hive Data to be the perfect partner for academic labs. The best way to showcase Hive Data’s impact on academia is through a real case study. A machine learning researcher, who we’ll call Professor X, had an urgent conference submission deadline coming up. He still had quite a lot of work remaining, and much of his work required him to label a large corpus of videos. He had tried many other services without success, and was urgently searching for a solution that could address all of his needs. Here were the constraints he was under, and how we solved them:

  1. Professor X didn’t want to pay a large upfront cost.
    Given his limited budget and inability to risk project failure, Professor X needed a provider with a competitive price point and the flexibility he required. Other services on the market generally had fixed costs running upwards of hundreds of thousands of dollars in the first year. Even if he could afford a single engagement, if the service wasn't up to par, he didn't have the budget to try a different one. Hive doesn't impose any upfront fees, which made us a low-risk option.
  2. Professor X needed to make sure the data output quality would be high enough to publish research.
    While he did find some services whose rates were competitive with Hive’s, he noted quickly that all of them suffered from poor data quality, especially for tasks in data labeling for videos. This rendered the data unusable for his research project. Hive, on the other hand, offered a complex system of audits and a worker consensus model to ensure high data accuracy. Because the tasks passed through several rounds of worker auditing, Hive was able to offer the high-quality data that Professor X needed.
  3. Professor X needed a fast turnaround on his results.
    As we mentioned, Professor X was on a tight deadline to submit his paper for publication. Most other services have inflexible, week-long timelines for returning datasets. Hive, however, offered a much faster turnaround time. Due to our remarkably large global workforce, we were able to scale up to finish jobs as quickly as the Professor needed. He was able to get his job finished in less than a day, whereas other providers had quoted him as long as a month!
  4. Professor X was searching for a service provider that could provide technical insight during the process.
    Part of Hive Data's value proposition to its customers is in offering our own expertise in building machine learning models, as well as supplying the quality data to do so. We'd seen projects similar to the one Professor X was dealing with, and we understood the problems he would face in generating this dataset. Even before getting started, we helped Professor X optimize his project by structuring his tasks in a way that improved his results and helped him build an effective model off his data.

In addressing these needs, Professor X was able to submit his paper and get it published on time. He still continues to use Hive Data to power his AI research today.

Hive Data has already been used by top-tier university research labs all over the world, including at Stanford, MIT, Cornell, and Simon Fraser University. Projects have ranged from labeling datasets for vehicle detection in autonomous driving to object recognition for robotic arms and pedestrian identification from security cameras. The number of research verticals we cater to is constantly growing, as we pride ourselves on rapid engineering cycles to release data labeling capabilities as soon as we see a need emerging.

If you’re an academic researcher and you’re curious about how we can partner together, contact me at david@thehive.ai. We’re excited to support your research!

BACK TO ALL BLOGS

Hive Media: Revolutionizing the Way We Understand On-Screen Content and Viewership

Hive is a full-stack deep learning platform focused on solving visual intelligence problems. While we are working with companies in sectors ranging from autonomous driving to facial recognition, our flagship enterprise product is Hive Media.


As the name suggests, Hive Media is our complete enterprise solution utilizing deep learning to revolutionize traditional media analytics. However, it is far more than a simple collection of neural net models. What we’ve built with Hive Media is an end-to-end solution, beginning with data ingestion and extending all the way to real-time, device-level viewership metrics.

The Vision

Imagine you could watch 100 different channels at the same time and remember every key element of what was on screen – what brand was shown, what actor was present, what commercial was playing, etc. Now, suppose you could remember this forever and query the information instantly. This would be a massively valuable dataset, and a seemingly impossible feat for any human to achieve. That, however, is precisely what we set out to achieve with Hive Media. Essentially, we wanted to build a system that could "watch" all of broadcast television the same way a human would and then store this information in an easily accessible manner.

Data Ingestion

The first step in our pipeline is accessing TV streams. Today, we are processing 400 channels in the US, with 300 more in Europe to come later this year. See Figure 1 for a graphical display of our present and planned TV coverage.

Figure 1: Our present and planned TV coverage.

We are recording every second of every channel, totaling up to 10,000+ hours of footage per day! We expect this number to be well over 30,000+ hours a day by next year. In addition, all major channels are covered, as well as a wide range of local affiliates on the network side. As you can imagine, this is a lot of data and we are storing all of it in our own datacenter rollouts around the world. Ultimately, we are aiming to build the world’s largest repository of linear broadcast data.

Deep Learning Models

Having this much data is only useful if you can understand it. This is where our deep learning models come into play. Using Hive Data, we've built some of the world's largest celebrity, logo, and brand databases. These models, among several others, are applied to every second of our recorded footage, and the results are stored in a database optimized for easy retrieval. This means a query such as "How many times did a Nike logo appear on NBC in the month of September?" – previously impossible – can now be answered in a matter of seconds! Unlike some other products on the market, our models don't rely on any metadata associated with the programming; tags are generated purely from the video content. This is extremely powerful, because it means our system can handle a large variety of content without having to constantly hard-code parameters.

User Viewership

The final piece of the puzzle is understanding how our tags affect viewership – the holy grail of media analytics. Everything we've described up to now generates what I call "cause"; to measure "effect," we work with device partners who give us real-time data on viewers. Today, we have access to millions of devices that send us real viewership data, which we overlap with our tags to understand how on-screen content affects viewership. This means that every query we run tells us not only what aired, but also how it affected the viewership bottom line.

The easiest way to understand this system is to visually see some queries executed. In Figure 2, we show an example query for Chevrolet vs. Toyota commercials on NBC in a week’s period. You can see the tags our system found in the bottom right. The bottom left shows a video player illustrating the content corresponding to the tag, mainly to serve as video evidence that our tag is correct. What’s powerful about Hive Media is the fact that now, we can analyze viewership data at each of these tag occurrences to understand what their effect on viewership is. One important way to understand viewership, as shown in Figure 2, is the notion of tune-out, which is the percentage of users that change the channel in a time interval. This is often the strongest indicator of whether a viewer is enjoying content shown on screen. Interestingly enough, it seems that Chevy commercials generate almost twice as much tune-out as their Toyota counterparts in this case.

Figure 2: An example query for Chevrolet vs. Toyota commercials on NBC over one week.

Let's take another example query that looks for Nike logos, as shown in Figure 3. What we're demonstrating here in the highlighted tag is a snippet of content that shows a Nike logo prominently placed in the center of the screen, even though this isn't a true Nike commercial. Instead, this is Simone Biles, a Nike athlete, being featured in a Mattress Firm / Foster Kids commercial. As part of any Nike athlete contract, Simone is obliged to wear Nike clothing whenever she appears on TV, Nike commercial or not. Nike would probably be highly interested in knowing how many similar logo placements occurred for Simone, as well as for all of their other sponsored athletes.

Figure 3: An example query for Nike logos.

Today, we are only beginning our journey toward understanding the wealth of data we have at our disposal. Hive Media is pioneering a new way of thinking around media content, and we are eager to help both broadcasters and advertisers optimize their content to better retain viewers and inform advertising decisions.

BACK TO ALL BLOGS

Multi-Indexing for Fuzzy Lookup

Have you ever wondered how Google and Pinterest can find visually similar images so quickly? Or how apps like Shazam can identify the song stuck in your head from just a few hums?

This post is intended to provide a quick overview of the multi-indexing approach to fuzzy lookup, as well as some practical tips on how to optimize it.

Hashing

All the above services use some sort of hashing to reduce a complicated input into a much smaller amount of data that will be used for lookup. Usually this hash is encoded as binary, where each bit is meaningful. For instance, a simple audio hashing function might record whether the dominant frequency decreased (0) or increased (1) at each second, then concatenate all the 0’s and 1’s.
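As a toy version of that audio hash (numpy assumed, purely illustrative):

import numpy as np

def hum_hash(dominant_freqs):
    """One dominant frequency estimate per second; one bit per transition (1 = increased, 0 = decreased)."""
    bits = (np.diff(dominant_freqs) > 0).astype(int)
    return "".join(str(b) for b in bits)

print(hum_hash([220, 247, 262, 247, 220, 262]))  # -> "11001"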

The hashing function should express what data you care about (the song being hummed) and ignore everything else (who is humming it and how well). However, a different problem like speaker identification would need a totally different hashing function – one that expresses features about the person humming and not what they are humming. If your hashing function does not suit the problem, no fuzzy lookup method can save you.

Multi-indexing

Once you have a good binary hash with n bits, you might start by creating a big lookup table from hash to result. In this example, we use n=8 for simplicity, but in reality, most datasets require using at least 32 bits, or else there will be too many hash collisions. In this table and the rest of the blog post, columns that you can efficiently query by are in bold.

This works great as long as we don't need fuzzy matching. However, if we want hashes that are off by one bit, like 10111101, to match A as well, we need to start making more queries. Here, our only option is brute force, making 9 queries: the original 10111101 as well as 00111101, 11111101, etc. – each one with one of the 8 bits toggled. If we also want hashes that are off by 2 bits to match (a Hamming distance of 2), we would need to make every possible query with up to 2 bits toggled – that's 1 + 8 + 28 = 37 lookups to get one result!

This gets out of hand quickly as n and the Hamming distance k increase. Multi-indexing addresses this by splitting the hash into (k + 1) pieces. For example, with a Hamming distance of k = 1, the table above would become:

This allows you to look up each partial hash individually. Now, we can take the two halves of our query, 1011 and 1101, and query them against indices 0 and 1, respectively, yielding result A and its complete hash. Any hash within distance k is guaranteed to return a match, since at least one of its (k + 1) pieces must match exactly. This requires only (k + 1) queries, a huge improvement over the nearly exponential number of queries the brute force approach required!

Once you have a list of results, there are two steps left to do:

  • Filter out the duplicates. If your query were exactly A’s hash, you would have received two copies of A in your results – one for each index
  • Filter out hashes that are outside the lookup distance. Multi-indexing guarantees that you get all results within the lookup distance, but you might get a few more. Check that the Hamming distance between the query and each result’s hash is small enough
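Putting these pieces together, here is a minimal sketch of multi-index lookup under illustrative assumptions (8-bit hashes stored as strings, k = 1, and a made-up hash for result A):

from collections import defaultdict

K = 1           # fuzzy Hamming distance
PIECES = K + 1  # number of partial hashes / indices

def split(h):
    n = len(h) // PIECES
    return [h[i * n:(i + 1) * n] for i in range(PIECES)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_indices(db):
    indices = [defaultdict(list) for _ in range(PIECES)]
    for name, h in db.items():
        for i, piece in enumerate(split(h)):
            indices[i][piece].append((name, h))
    return indices

def fuzzy_lookup(query, indices):
    candidates = set()
    for i, piece in enumerate(split(query)):
        candidates.update(indices[i][piece])  # (k + 1) exact lookups
    # filter out duplicates (the set already did) and results beyond distance K
    return [name for name, h in candidates if hamming(query, h) <= K]

db = {"A": "10111100"}  # hypothetical dataset
indices = build_indices(db)
print(fuzzy_lookup("10111101", indices))  # -> ['A']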

Getting the most out of your indices

There are two main challenges you may encounter once you’ve got a multi-index table working.

1. My queries return so many results that I’m encountering performance issues, but almost all of those results are outside the fuzzy Hamming distance!

The first option you should consider is decreasing the number of indices, thereby increasing the number of bits per partial hash. For instance, if you have 2^30 results hashed and are using indices of 20 bits each, you should expect at least 2^10 spurious results per partial hash lookup. Using indices of 30 or more bits could bring that down to a much more manageable number.

It is also entirely possible that you get too many spurious results even though your indices are of a healthy size. This means that the entropy of each index is lower than it should be, either because the bits have low variance (too frequently 0 or too frequently 1) or because bits are highly correlated.

You can start to address the correlation between bits by splitting your hash into partial hashes more thoughtfully. For instance, if consecutive bits are correlated, you might choose one partial hash to be the even bits and the other to be the odd bits.
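A tiny sketch of splitting by bit parity (hypothetical 8-bit hash, two indices):

def split_by_parity(hash_bits):
    # even-position bits go to index 0, odd-position bits to index 1
    return hash_bits[0::2], hash_bits[1::2]

print(split_by_parity("10111101"))  # -> ('1110', '0111')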

For complicated hashing functions, it is worth a data exploration to see how you can optimally assign bits to partial hashes. By using a greedy algorithm to swap bits between indices, we were able to improve lookup speed of one of our tables by roughly a factor of 5. This table used 256 bits with 8 indices. We originally broke the hash into consecutive 32-bit sequences before realizing that consecutive bits were often highly correlated.

This hash had high correlation (red) between sequential bits, which one can see in its correlation matrix (left). Randomly shuffling bits greatly reduced these correlations (center). A further optimization avoided placing highly correlated bits in the same index (right).

Close-ups of the covariance matrix for the first partial hash show a huge decrease in the number of highly correlated bits.

If the entropy of the entire hash is fundamentally too low, you should focus on improving your hashing function.

If the above options don’t apply or suffice, you should rethink your design to minimize the number of queries you need to do. For instance, when hashing video frames, we were able to record a single row for a sequence of consecutive frames with identical hashes rather than a row per frame, reducing the number of repeats, especially on more frequent hashes. A more general approach that further reduces spurious results is to pair your multi-index table with an exact lookup table and store only unique hashes in the multi-index table:

In this way, you can get a smaller number of rows at first, then filter down to only the hashes within the right Hamming distance before querying the full results. This can greatly reduce the number of rows you receive, but makes the query a 2-step process.

2. My fuzzy lookup misses some things I want to match together, but already gets false positives for things I don’t want to match together!

The heart of the problem is that some matches have too large a Hamming distance, and that some non-matches have too small a Hamming distance. The only way to address this through multi-indexing is by removing or duplicating bits.

You may want to remove bits for two possible reasons:

  • The bit represents information you don’t care about for this lookup problem
  • The bit is redundant or highly correlated with another bit

By removing such a bit, you stop penalizing potential matches for disagreeing on a superfluous piece of data. If removing these bits is successful in bringing your true positive rate up, you may trade those gains for a reduction in false positives by reducing the fuzzy match distance.

In some circumstances, you may want to duplicate a bit by putting it in multiple indices. This is useful if one bit of the hashing function is especially important. If the bit is absolutely essential for a match, you should put it in each index.

If these approaches don’t work, your only hope is to improve your hashing function.

The moral of the story:

  • Choose indices of appropriate size; probably log2(m) or slightly greater, where m is the number of results you expect to hash
  • Split your hash in a way such that each partial hash gets bits that are as uncorrelated as possible
  • Avoid hashing redundant results, possibly by pairing your multi-index table with an exact lookup table
  • Remove redundant or uninformative bits from your partial hashes
  • Duplicate essential bits across all partial hashes
  • Reconsider your hashing function
BACK TO ALL BLOGS

Step-by-step Guide to Deploying Deep Learning Models

Or, I just trained a machine learning model – now what?

This post goes over a quick and dirty way to deploy a trained machine learning model to production.

ML in production

When we first entered the machine learning space here at Hive, we already had millions of ground truth labeled images, allowing us to train a state-of-the-art deep convolutional image classification model from scratch (i.e. randomized weights) in under a week, specialized for our use case. The more typical ML use case, though, is usually on the order of hundreds of images, for which I would recommend fine-tuning an existing model. For instance, https://www.tensorflow.org/tutorials/image_retraining has a great tutorial on how to fine-tune an Imagenet model (trained on 1.2M images, 1000 classes) to classify a sample dataset of flowers (3647 images, 5 classes).

For a quick tl;dr of the linked TensorFlow tutorial, after installing bazel and TensorFlow, you would need to run the following code, which takes around 30 mins to build and 5 minutes to train:

(
cd "$HOME" && \
curl -O http://download.tensorflow.org/example_images/flower_photos.tgz && \
tar xzf flower_photos.tgz ;
) && \
bazel build tensorflow/examples/image_retraining:retrain \
          tensorflow/examples/image_retraining:label_image \
&& \
bazel-bin/tensorflow/examples/image_retraining/retrain \
  --image_dir "$HOME"/flower_photos \
  --how_many_training_steps=200 \
&& \
bazel-bin/tensorflow/examples/image_retraining/label_image \
  --graph=/tmp/output_graph.pb \
  --labels=/tmp/output_labels.txt \
  --output_layer=final_result:0 \
  --image=$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg

Alternatively, if you have Docker installed, you can use this prebuilt Docker image like so:

sudo docker run -it --net=host liubowei/simple-ml-serving:latest /bin/bash

>>> cat test.sh && bash test.sh

which puts you into an interactive shell inside the container and runs the above command; you can also follow along with the rest of this post inside the container if you wish.

Now, TensorFlow has saved the model information into /tmp/output_graph.pb and /tmp/output_labels.txt, which are passed above as command-line parameters to the label_image.py script. Google's image_recognition tutorial also links to another inference script, but we will be sticking with label_image.py for now.

Converting one-shot inference to online inference (TensorFlow)

If we just want to accept file names from standard input, one per line, we can do “online” inference quite easily:

while read line ; do
bazel-bin/tensorflow/examples/image_retraining/label_image \
--graph=/tmp/output_graph.pb --labels=/tmp/output_labels.txt \
--output_layer=final_result:0 \
--image="$line" ;
done

From a performance standpoint, though, this is terrible – we are reloading the neural net, the weights, the entire TensorFlow framework, and python itself, for every input example!

We can do better. Let’s start by editing the label_image.py script — for me, this is located in bazel-bin/tensorflow/examples/image_retraining/label_image.runfiles/org_tensorflow/tensorflow/examples/image_retraining/label_image.py.

Let’s change the lines

141:  run_graph(image_data, labels, FLAGS.input_layer, FLAGS.output_layer,
142:        FLAGS.num_top_predictions)

TO

141:  for line in sys.stdin:
142:    run_graph(load_image(line), labels, FLAGS.input_layer,
143:        FLAGS.output_layer, FLAGS.num_top_predictions)

This is indeed a lot faster, but this is still not the best we can do!

The reason is the with tf.Session() as sess construction on line 100. TensorFlow is essentially loading all the computation into memory every time run_graph is called. This becomes apparent once you start trying to do inference on the GPU — you can see the GPU memory go up and down as TensorFlow loads and unloads the model parameters to and from the GPU. As far as I know, this construction is not present in other ML frameworks like Caffe or Pytorch.

The solution is then to pull the with statement out, and pass in a sess variable to run_graph:


def run_graph(image_data, labels, input_layer_name, output_layer_name,
              num_top_predictions, sess):
    # Feed the image_data as input to the graph.
    #   predictions will contain a two-dimensional array, where one
    #   dimension represents the input image count, and the other has
    #   predictions per class
    softmax_tensor = sess.graph.get_tensor_by_name(output_layer_name)
    predictions, = sess.run(softmax_tensor, {input_layer_name: image_data})
    # Sort to show labels in order of confidence
    top_k = predictions.argsort()[-num_top_predictions:][::-1]
    for node_id in top_k:
      human_string = labels[node_id]
      score = predictions[node_id]
      print('%s (score = %.5f)' % (human_string, score))
    return [ (labels[node_id], predictions[node_id].item()) for node_id in top_k ] # numpy floats are not json serializable, have to run item

...

  with tf.Session() as sess:
    for line in sys.stdin:
      run_graph(load_image(line), labels, FLAGS.input_layer, FLAGS.output_layer,
          FLAGS.num_top_predictions, sess)

(see code at https://github.com/hiveml/simple-ml-serving/blob/master/label_image.py)

If you run this, you should find that it takes around 0.1 sec per image, quite fast enough for online use.

Converting one-shot inference to online inference (Other ML Frameworks)

Caffe uses its net.forward code which is very easy to put into a callable framework: see http://nbviewer.jupyter.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb

MXNet is unique in that it has ready-to-go inference server code publicly available: https://github.com/awslabs/mxnet-model-server.

Further details coming soon!

Deployment

The plan is to wrap this code in a Flask app. If you haven’t heard of it, Flask is a very lightweight Python web framework which allows you to spin up an http api server with minimal work.

As a quick reference, here’s a Flask app that receives POST requests with multipart form data:

#!/usr/bin/env python
# usage: python echo.py to launch the server ; and then in another session, do
# curl -v -XPOST 127.0.0.1:12480 -F "data=@./image.jpg"
from flask import Flask, request
app = Flask(__name__)
@app.route('/', methods=['POST'])
def classify():
    try:
        data = request.files.get('data').read()
        print(repr(data)[:1000])
        return data, 200
    except Exception as e:
        return repr(e), 500
app.run(host='127.0.0.1',port=12480)

And here is the corresponding Flask app hooked up to run_graph above:


#!/usr/bin/env python
# usage: bash tf_classify_server.sh
from flask import Flask, request
import tensorflow as tf
import label_image as tf_classify
import json
app = Flask(__name__)
FLAGS, unparsed = tf_classify.parser.parse_known_args()
labels = tf_classify.load_labels(FLAGS.labels)
tf_classify.load_graph(FLAGS.graph)
sess = tf.Session()
@app.route('/', methods=['POST'])
def classify():
    try:
        data = request.files.get('data').read()
        result = tf_classify.run_graph(data, labels, FLAGS.input_layer, FLAGS.output_layer, FLAGS.num_top_predictions, sess)
        return json.dumps(result), 200
    except Exception as e:
        return repr(e), 500
app.run(host='127.0.0.1',port=12480)

This looks quite good, except for the fact that Flask and TensorFlow are both fully synchronous – Flask processes one request at a time in the order they are received, and TensorFlow fully occupies the thread when doing the image classification.

As it’s written, the speed bottleneck is probably still in the actual computation work, so there’s not much point upgrading the Flask wrapper code. And maybe this code is sufficient to handle your load, for now.

There are 2 obvious ways to scale up request throughput: scale up horizontally by increasing the number of workers, which is covered in the next section, or scale up vertically by utilizing a GPU and batching logic. Implementing the latter requires a webserver that is able to handle multiple pending requests at once and decide whether to keep waiting for a larger batch or to send it off to the TensorFlow graph thread to be classified, for which this Flask app is horrendously unsuited. Two possibilities are Twisted + Klein to keep the code in Python, or Node.js + ZeroMQ if you prefer first-class event loop support and the ability to hook into non-Python AI frameworks such as Torch.
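As a rough, framework-agnostic sketch of what that batching decision logic might look like (classify_batch and the constants here are placeholders, not our production setup):

import queue
import threading
import time

MAX_BATCH = 32     # flush when this many requests are pending
MAX_WAIT_S = 0.05  # or when the oldest pending request has waited this long

def classify_batch(images):
    # placeholder for a single batched sess.run() over all images
    return [{"label": "daisy", "score": 0.9} for _ in images]

class Batcher:
    def __init__(self):
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, image_bytes):
        """Called from request handlers; blocks until this request's result is ready."""
        done, holder = threading.Event(), {}
        self.requests.put((image_bytes, done, holder))
        done.wait()
        return holder["result"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until the first request arrives
            deadline = time.time() + MAX_WAIT_S
            while len(batch) < MAX_BATCH and time.time() < deadline:
                try:
                    batch.append(self.requests.get(timeout=max(deadline - time.time(), 0)))
                except queue.Empty:
                    break
            results = classify_batch([img for img, _, _ in batch])
            for (_, done, holder), result in zip(batch, results):
                holder["result"] = result
                done.set()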

OK, so now we have a single server serving our model, but maybe it’s too slow or our load is getting too high. We’d like to spin up more of these servers – how can we distribute requests across each of them?

The ordinary method is to add a proxy layer, perhaps haproxy or nginx, which balances the load between the backend servers while presenting a single uniform interface to the client. For use later in this section, here is some sample code that runs a rudimentary Node.js load balancer http proxy:

// Usage : node basic_proxy.js WORKER_PORT_0,WORKER_PORT_1,...
const worker_ports = process.argv[2].split(',')
if (worker_ports.length === 0) { console.error('missing worker ports') ; process.exit(1) }

const proxy = require('http-proxy').createProxyServer({})
proxy.on('error', () => console.log('proxy error'))

let i = 0
require('http').createServer((req, res) => {
  proxy.web(req,res, {target: 'http://localhost:' + worker_ports[ (i++) % worker_ports.length ]})
}).listen(12480)
console.log(`Proxying localhost:${12480} to [${worker_ports.toString()}]`)

// spin up the AI workers
const { exec } = require('child_process')
worker_ports.map(port => exec(`/bin/bash ./tf_classify_server.sh ${port}`))

To automatically detect how many backend servers are up and where they are located, people generally use a “service discovery” tool, which may be bundled with the load balancer or be separate. Some well-known ones are Consul and Zookeeper. Setting up and learning how to use one is beyond the scope of this article, so I’ve included a very rudimentary proxy using the node.js service discovery package seaport.

Proxy code:

// Usage : node seaport_proxy.js
const seaportServer = require('seaport').createServer()
seaportServer.listen(12481)
const proxy = require('http-proxy').createProxyServer({})
proxy.on('error', () => console.log('proxy error'))

let i = 0
require('http').createServer((req, res) => {
  seaportServer.get('tf_classify_server', worker_ports => {
    const this_port = worker_ports[ (i++) % worker_ports.length ].port
    proxy.web(req,res, {target: 'http://localhost:' + this_port })
  })
}).listen(12480)
console.log(`Seaport proxy listening on ${12480} to '${'tf_classify_server'}' servers registered to ${12481}`)

Worker code:

// Usage : node tf_classify_server.js
const port = require('seaport').connect(12481).register('tf_classify_server')
console.log(`Launching tf classify worker on ${port}`)
require('child_process').exec(`/bin/bash ./tf_classify_server.sh ${port}`)

However, as applied to AI, this setup runs into a bandwidth problem.

At anywhere from tens to hundreds of images a second, the system becomes bottlenecked on network bandwidth. In the current setup, all the data has to go through our single seaport master, which is the single endpoint presented to the client.

To solve this, we need our clients to not hit the single endpoint at http://127.0.0.1:12480, but instead to automatically rotate between backend servers to hit. If you know some networking, this sounds precisely like a job for DNS!

However, setting up a custom DNS server is again beyond the scope of this article. Instead, by changing the clients to follow a 2-step “manual DNS” protocol, we can reuse our rudimentary seaport proxy to implement a “peer-to-peer” protocol in which clients connect directly to their servers:

Proxy code:

// Usage : node p2p_proxy.js
const seaportServer = require('seaport').createServer()
seaportServer.listen(12481)

let i = 0
require('http').createServer((req, res) => {
  seaportServer.get('tf_classify_server', worker_ports => {
    const this_port = worker_ports[ (i++) % worker_ports.length ].port
    res.end(`${this_port}\n`)
  })
}).listen(12480)
console.log(`P2P seaport proxy listening on ${12480} to 'tf_classify_server' servers registered to ${12481}`)

(The worker code is the same as above.)

Client code:


curl -v -XPOST localhost:`curl localhost:12480` -F "data=@$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg"

RPC Deployment

Coming soon! A version of the above with Flask replaced by ZeroMQ.

Conclusion and further reading

At this point you should have something working in production, but it’s certainly not futureproof. There are several important topics that were not covered in this guide:

  • Automatically deploying and setting up on new hardware.
    • Notable tools include Openstack/VMware if you’re on your own hardware, Chef/Puppet for installing Docker and handling networking routes, and Docker for installing TensorFlow, Python, and everything else.
    • Kubernetes or Marathon/Mesos are also great if you’re on the cloud
  • Model version management
    • Not too hard to handle this manually at first
    • TensorFlow Serving is a great tool that handles this, as well as batching and overall deployment, very thoroughly. The downsides are that it's a bit hard to set up and to write client code for, and in addition it doesn't support Caffe/PyTorch.
  • How to migrate your AI code off Matlab
    • Don’t do matlab in production.
  • GPU drivers, Cuda, CUDNN
    • Use nvidia-docker and try to find some Dockerfiles online.
  • Postprocessing layers. Once you get a few different AI models in production, you might start wanting to mix and match them for different use cases — run model A only if model B is inconclusive, run model C in Caffe and pass the results to model D in TensorFlow, etc.


BACK TO ALL BLOGS

Inside a Neural Network’s Mind

Why do neural networks make the decisions they do? Often, the truth is that we don’t know; it’s a black box. Fortunately, there are now some techniques that help us peek under the hood to help us understand how they make decisions.

What has the neural network learned about attractiveness? Where does it look to decide if an image is safe for work? Using grad-CAM, we explore the predictions of our models: sport type, action / non-action, drugs, violence, attractiveness, race, age, etc.

Github repo: https://github.com/hiveml/tensorflow-grad-cam

Hey, my face is up here! Clearly, the attractiveness model focuses on body over face in the mid-range shots above. Interestingly, it has also learned to localize people without any bounding box information in training. The model is trained on 200k images, labeled by Hive into three classes: hot, neutral, and not. The scores for each bucket are then combined to create a rating from 0 to 10. This classifier is available here [1].

The main idea is to apply the weights of the logit layer to the last convolutional layer's activations before global pooling. This creates a map showing the importance of each pixel in the network's decision.

Sports action, NSFW, violence
Sports action, NSFW, violence

The pose of the football player tells the model that a play is in action. We can clearly locate the nudity and the guns in the NSFW and Violence images, too.

Snowboarding, TV show
Snowboarding, TV show

A person in a suit, center frame, apparently indicates that it is a TV show instead of a commercial (right). The TV / commercial model is a great example of how grad-CAM can uncover unexpected reasons behind the decisions our models make. They can also confirm what we expect, as seen in the snowboarding example (left).

The Simpsons, Rick and Morty
The Simpsons, Rick and Morty

This example uses our animated show classifier. Interestingly, the most important spot in the images above is the edge of Bart and Morty, including a substantial amount of the background in both cases.

CAM and GradCam

First developed by Zhou et al. [2], Class Activation Maps (CAM) show what the network is looking at. For each class, CAM illustrates the parts of the image most important for that class.

Ramprasaath et al. [3] extended CAM to apply to a wider range of architectures without any changes. Specifically, grad-CAM can handle fully connected layers and more complicated scenarios like question answering. However, almost all popular neural nets like ResNet, DenseNet, and even NASNet end with global average pooling, so the heatmap can be computed directly using CAM without the backward pass. This is especially important for speed-critical applications. Fortunately, with the ResNet used in this post, we don't have to modify the nets at all to compute CAM or grad-CAM.

Recently, grad-CAM++ (Chattopadhyay et al. [4]) further generalized the method to increase the precision of the output heat maps. Grad-CAM++ is better at dealing with multiple instances of a class and at highlighting the full extent of the class rather than just its most salient parts. It achieves this using a weighted combination of positive partial derivatives.

Here’s how it’s implemented in Tensorflow:

one_hot = tf.sparse_to_dense(predicted_class, [num_classes], 1.0)
signal = tf.multiply(end_points['Logits'], one_hot)
loss = tf.reduce_mean(signal)

This returns an array of num_classes elements with only the logit of the predicted class non-zero. This defines the loss.

grads = tf.gradients(loss, conv_layer)[0]
norm_grads = tf.divide(grads, tf.sqrt(tf.reduce_mean(tf.square(grads)))
	+ tf.constant(1e-5))

This takes the gradient of the loss with respect to the last convolutional layer and normalizes it.

output, grads_val = sess.run([conv_layer, norm_grads],
	feed_dict={imgs0: img})

Running the session evaluates the convolutional layer activations and the normalized gradients for the input image.

weights = np.mean(grads_val, axis = (0, 1))             # [2048]
cam = np.ones(output.shape[0 : 2], dtype = np.float32)  # [10,10]

The weights for each feature map are the gradients averaged over the spatial dimensions, and cam starts out as a coarse 10x10 map.

cam = np.ones(output.shape[0 : 2], dtype = np.float32)  # [10,10]
for i, w in enumerate(weights):
	cam += w * output[:, :, i]
cam = np.maximum(cam, 0)
cam = cam / np.max(cam)
cam = cv2.resize(cam, (eval_image_size, eval_image_size))

We pass cam through a ReLU to keep only the positive evidence for the class, normalize it, then resize the coarse output to the input size and blend it with the image for display.
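A minimal sketch of that blending step (OpenCV assumed; img is the preprocessed input image with values in [0, 1], and cam is the resized, normalized map from above):

import cv2
import numpy as np

def overlay_cam(img, cam, alpha=0.4):
    # colorize the cam and alpha-blend it over the input image
    heatmap = np.float32(cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)) / 255.0
    blended = alpha * heatmap + (1 - alpha) * np.float32(img)
    return np.uint8(255 * blended / np.max(blended))

# cv2.imwrite("grad_cam.jpg", overlay_cam(img, cam))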

Finally, the main function grabs the TensorFlow slim model definition and pre-processing function. With these, it computes the grad-CAM output and blends it with the input photo. We use the class with the greatest softmax probability as input to grad_cam, but we could instead choose any class. For example:

The model predicted alcohol as the top choice with 99% and gambling with only 0.4%. By changing predicted_class from alcohol to gambling, we can see how, despite the low class probability, the model can clearly pinpoint the gambling in the image.

References

BACK TO ALL BLOGS

Hive – A Full-Stack Approach to Deep Learning

Here at Hive, we build deep learning models dedicated to solving visual intelligence problems – we take in unstructured visual data like raw image and video, and produce a structured output that helps understand the meaning of this content.

The problems we've solved span numerous verticals, ranging from identifying winter Olympic sports to recognizing the model of a car. Our full collection of vision APIs can be found in Hive Predict, and we've embedded these APIs into enterprise applications, like in Hive Media.

We are often asked how we achieve such high accuracy and recall for our models, especially the ones for entity recognition such as our celebrity and logo models. The answer is simple: data quality.

Deep learning, the construction of convolutional neural nets to mimic the human brain's ability to recognize visual imagery, is a remarkably powerful tool, but any model is only as good as the data it is trained on.

What makes Hive unique is that we tend not to use public datasets that many other models are trained on, but instead opt to generate our own custom datasets. In doing so, we convert millions of raw, unlabeled items to an ever-growing collection of pristine data to improve our models every day. So how do we do it?

“Deep learning… is a remarkably powerful tool, but any model is only as good as the data it is trained on”

Hive Data

Unlike other deep learning startups, we took the unusual step early on in our company’s history to invest heavily in our own massively distributed data labeling platform, Hive Data.

Hive Data is a fully self-serve work platform where our workers are given a set of tools to complete a wide range of data labeling tasks, including categorization, bounding boxes, and pixel-level semantic segmentation (see Figure 1).

Figure 1: Workers are given a collection of tools to markup items in a variety of ways.
Figure 1: Workers are given a collection of tools to markup items in a variety of ways.

Unlike other data labeling platforms, Hive Data doesn’t have a set schedule for workers, and tasks are routed to workers in an ad-like fashion based on the average time it takes a worker to complete the task.

The result is a steady average hourly rate no matter how complex the task is, and workers get to fully define their own work schedule, working for as little as 1 minute or as much as 12 hours a day.

We think of Hive Data as the beating heart of the company, generating high-quality datasets for all of our machine learning initiatives. Today, we’ve labeled hundreds of millions of items through Hive Data, and since releasing Hive Data as a platform for external partners, we’ve helped institutions ranging from academic labs to large corporations in labeling their data as well.

Hive Data Workers

When we first started work on what would become Hive Data, our thesis was that there was a significant global workforce of untapped human labor that had access to the internet and would be willing to do data labeling work on demand.

What we didn’t expect was just how strong the response to our platform would be. Since our launch in August 2016, we’ve had over 70,000 workers sign up without having spent a single dollar on acquisition.

Part of what makes our service so remarkable is how global this workforce is, resulting in not only 24/7 coverage of tasks, but also a balanced human viewpoint on tasks that may carry cultural subjectivity.

Figure 2: Geographic distribution of our workers
Figure 2: Geographic distribution of our workers

Because of how our system is built, we can ensure a competitive wage for our workers while simultaneously cutting the net cost of data labeling down to a fraction of other services'.

Maintaining Data Quality

One of the questions we’re often asked is how we maintain such accuracy in a self-serve, distributed model like Hive Data. This was also something we focused on heavily when building out the platform, and our solution revolves around two key concepts: 1) Pre-labeled sampling, and 2) Consensus.

When a task is uploaded to Hive Data, we mandate that the task include a small set of pre-labeled items that we sprinkle into a worker’s feed (of course, the worker cannot distinguish between these and real task items).

A pre-labeled item is simply an item that has a pre-defined correct answer. Depending on the experience level of the worker, anywhere from 10% to 50% of their work will be these pre-labeled items. From these, we can accurately estimate each worker's accuracy. We usually require each worker to maintain >95% accuracy on a single task in order to be allowed to continue working on it.

For a result to be returned for a given task item, we further mandate a consensus, meaning a certain number of workers must agree on a task for an answer to be returned. When you have, say, 3 workers each at 95% accuracy agreeing on an answer, the final accuracy is in the ballpark of 99.7%! This is how we can maintain a superior level of accuracy to other services, while simultaneously operating at a price point that is an order of magnitude lower.

The Future of Hive Data

As hardware capabilities continue to improve at a remarkable clip, deep learning models will become increasingly complex and data hungry [1, 2, 3]. The bottleneck in improving these applications will be on the data side, and we believe Hive Data will evolve to be the de facto platform for any sort of data labeling needs.

Over the next few years, we intend to expand Hive Data’s capabilities to handle virtually any sort of data labeling need that a machine learning researcher might require, while holding to our mandate of having the highest accuracy at the lowest price.

Built by deep learning researchers for deep learning researchers, Hive Data is currently the only distributed work platform optimized for building enterprise grade deep learning applications, and we’re excited to help usher in a new era of AI.

References