
Hive – A Full-Stack Approach to Deep Learning

Here at Hive, we build deep learning models dedicated to solving visual intelligence problems – we take in unstructured visual data like raw images and video, and produce a structured output that captures the meaning of that content.

The problems we’ve solved span numerous verticals, ranging from identifying Winter Olympic sports to recognizing the make and model of a car. Our full collection of vision APIs can be found in Hive Predict, and we’ve embedded these APIs into enterprise applications like Hive Media.

We are often asked how we achieve such high accuracy and recall for our models, especially the ones for entity recognition such as our celebrity and logo models. The answer is simple: data quality.

Deep learning, or the construction of convolutional neural networks that mimic the human brain’s ability to recognize visual imagery, is a remarkably powerful tool, but any model is only as good as the data it is trained on.

What makes Hive unique is that we tend not to use public datasets that many other models are trained on, but instead opt to generate our own custom datasets. In doing so, we convert millions of raw, unlabeled items to an ever-growing collection of pristine data to improve our models every day. So how do we do it?

“Deep learning… is a remarkably powerful tool, but any model is only as good as the data it is trained on”

Hive Data

Unlike other deep learning startups, we took the unusual step, early in our company’s history, of investing heavily in our own massively distributed data labeling platform, Hive Data.

Hive Data is a fully self-serve work platform where our workers are given a set of tools to complete a wide range of data labeling tasks, including categorization, bounding boxes, and pixel-level semantic segmentation (see Figure 1).

Figure 1: Workers are given a collection of tools to mark up items in a variety of ways.
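To make these task types concrete, here’s a rough sketch of what the structured output of each might look like (the field names below are purely illustrative, not Hive Data’s actual schema):

```python
from dataclasses import dataclass
from typing import List

# Illustrative result shapes for the three task types above.
# Field names are hypothetical, not Hive Data's actual schema.

@dataclass
class Categorization:
    item_id: str
    label: str              # e.g. "ice hockey"

@dataclass
class BoundingBox:
    item_id: str
    label: str              # e.g. "sedan"
    x: float                # top-left corner, in pixels
    y: float
    width: float
    height: float

@dataclass
class SegmentationMask:
    item_id: str
    class_names: List[str]  # one entry per class index in the mask
    mask_path: str          # path to a per-pixel class-index image
```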

Unlike other data labeling platforms, Hive Data doesn’t impose a set schedule on workers; instead, tasks are routed to workers in an ad-like fashion based on the average time it takes a worker to complete the task.

The result is a steady average hourly rate no matter how complex the task is, and workers get to fully define their own work schedule, working for as little as 1 minute or as much as 12 hours a day.
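As a back-of-the-envelope illustration of how time-based pricing keeps pay steady, here’s a minimal sketch (the target rate and function below are hypothetical, not our actual routing or pricing logic):

```python
# Minimal sketch of time-based per-task pricing (illustrative only; the
# real routing and pricing logic is more involved).

TARGET_HOURLY_RATE_USD = 3.00          # hypothetical target wage

def payout_per_task(avg_seconds_per_task: float) -> float:
    """Pay per task so average earnings land near the target hourly rate."""
    tasks_per_hour = 3600.0 / avg_seconds_per_task
    return TARGET_HOURLY_RATE_USD / tasks_per_hour

# A quick categorization task averaging 4 s pays less per item than a
# segmentation task averaging 90 s, but both work out to the same hourly rate.
print(round(payout_per_task(4.0), 4))    # ~0.0033 USD per item
print(round(payout_per_task(90.0), 4))   # 0.075 USD per item
```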

We think of Hive Data as the beating heart of the company, generating high-quality datasets for all of our machine learning initiatives. Today, we’ve labeled hundreds of millions of items through Hive Data, and since releasing Hive Data as a platform for external partners, we’ve helped institutions ranging from academic labs to large corporations in labeling their data as well.

Hive Data Workers

When we first started work on what would become Hive Data, our thesis was that there was a significant global workforce of untapped human labor that had access to the internet and would be willing to do data labeling work on demand.

What we didn’t expect was just how strong the response to our platform would be. Since our launch in August 2016, we’ve had over 70,000 workers sign up without having spent a single dollar on acquisition.

Part of what makes our service so remarkable is how global this workforce is, resulting in not only 24/7 coverage of tasks, but also a balanced human viewpoint on tasks that may carry cultural subjectivity.

Figure 2: Geographic distribution of our workers

Because of how our system is built, we can ensure a competitive wage for our workers while simultaneously cutting the net cost of data labeling to a fraction of that of other services.

Maintaining Data Quality

One of the questions we’re often asked is how we maintain such accuracy in a self-serve, distributed model like Hive Data. This was also something we focused on heavily when building out the platform, and our solution revolves around two key concepts: 1) Pre-labeled sampling, and 2) Consensus.

When a task is uploaded to Hive Data, we mandate that the task include a small set of pre-labeled items that we sprinkle into a worker’s feed (of course, the worker cannot distinguish between these and real task items).

A pre-labeled item is simply an item with a pre-defined correct answer. Depending on a worker’s experience level, anywhere from 10% to 50% of their work will consist of these pre-labeled items, which gives us a reliable proxy for that worker’s true accuracy. We typically require a worker to maintain greater than 95% accuracy on a given task in order to continue working on it.
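Here’s a minimal sketch of that gating logic (the function names and data shapes are hypothetical, not our internal code): grade a worker against the pre-labeled items hidden in their feed, and only let them keep working above the threshold.

```python
# Illustrative sketch of quality gating with pre-labeled ("gold") items.
# Function names and data shapes are hypothetical, not Hive Data's API.

ACCURACY_THRESHOLD = 0.95

def estimate_worker_accuracy(answers: dict, gold_labels: dict) -> float:
    """Fraction of pre-labeled items the worker answered correctly.

    `answers` maps item_id -> the worker's answer for items in their feed;
    `gold_labels` maps item_id -> the known-correct answer, for the
    pre-labeled items only (the worker can't tell which items these are).
    """
    graded = [answers[item] == truth
              for item, truth in gold_labels.items() if item in answers]
    return sum(graded) / len(graded) if graded else 0.0

def may_keep_working(answers: dict, gold_labels: dict) -> bool:
    """Allow continued work on a task only above the accuracy threshold."""
    return estimate_worker_accuracy(answers, gold_labels) > ACCURACY_THRESHOLD
```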

For a result to be returned for a given task item, we further mandate consensus, meaning a certain number of workers must agree on an answer before it is returned. When, say, 3 workers who are each 95% accurate agree on an answer, the final accuracy is in the ballpark of 99.7%! This is how we maintain a level of accuracy superior to other services while simultaneously operating at a price point that is an order of magnitude lower.
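The intuition behind that number can be checked with a quick back-of-the-envelope calculation. The sketch below assumes independent workers and a binary choice (a simplification; real tasks have more answer options and correlated errors) and computes the accuracy of both a 3-worker majority vote and a unanimous 3-worker agreement. The exact figure depends on the consensus rule and error model, but either way, agreement pushes accuracy well past any single worker’s 95%.

```python
from math import comb

# Back-of-the-envelope consensus math. Assumes independent workers and a
# binary choice; real tasks have more answer options and correlated errors.

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent workers, each with
    accuracy p, picks the correct answer."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

def unanimous_agreement_accuracy(p: float, n: int) -> float:
    """Probability the answer is correct, given that all n workers agree
    (binary choice, so there is only one way to agree on a wrong answer)."""
    all_correct = p**n
    all_wrong = (1 - p)**n
    return all_correct / (all_correct + all_wrong)

print(majority_vote_accuracy(0.95, 3))        # ~0.993
print(unanimous_agreement_accuracy(0.95, 3))  # ~0.9999
```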

The Future of Hive Data

As hardware capabilities continue to improve at a remarkable clip, deep learning models will become increasingly complex and data hungry [1, 2, 3]. The bottleneck in improving these applications will be on the data side, and we believe Hive Data will evolve to be the de facto platform for any sort of data labeling need.

Over the next few years, we intend to expand Hive Data’s capabilities to handle virtually any sort of data labeling need that a machine learning researcher might require, while holding to our mandate of having the highest accuracy at the lowest price.

Built by deep learning researchers for deep learning researchers, Hive Data is currently the only distributed work platform optimized for building enterprise-grade deep learning applications, and we’re excited to help usher in a new era of AI.

References