
Back to Our Roots: Hive Data in Academia

Hive was started by two PhD students at Stanford who were frustrated with the difficulty of generating quality datasets for machine learning research. We found that solutions on the market were either too inaccurate or too expensive to conduct the typical research study. From those early days, we’ve now built one of the world’s largest marketplaces of human labor.

In keeping with our academic roots, we always intended Hive Data to be the perfect partner for academic labs. The best way to showcase Hive Data’s impact on academia is through a real case study. A machine learning researcher, who we’ll call Professor X, had an urgent conference submission deadline coming up. He still had quite a lot of work remaining, and much of his work required him to label a large corpus of videos. He had tried many other services without success, and was urgently searching for a solution that could address all of his needs. Here were the constraints he was under, and how we solved them:

  1. Professor X didn’t want to pay a large upfront cost.
    Given his limited budget and inability to risk project failure, Professor X needed a provider with a competitive price point that also offered him the flexibility he needed. Other services on the market generally had fixed costs that ran upwards of hundreds of thousands of dollars for the first year. Even if he could afford a single engagement, if the service wasn’t up to par, he didn’t have the budget to try a different one. Hive doesn’t impose any upfront fees, which made us a low-risk option.
  2. Professor X needed to make sure the data output quality would be high enough to publish research.
    While he did find some services whose rates were competitive with Hive’s, he quickly noted that all of them suffered from poor data quality, especially on video labeling tasks. This rendered the data unusable for his research project. Hive, on the other hand, offered a layered system of audits and a worker consensus model to ensure high data accuracy. Because tasks passed through several rounds of worker auditing, Hive was able to deliver the high-quality data that Professor X needed.
  3. Professor X needed a fast turnaround on his results.
    As we mentioned, Professor X was on a tight deadline to submit his paper for publication. Most other services have inflexible, week-long timelines for returning datasets. Hive, however, offered a much faster turnaround time. Due to our remarkably large global workforce, we were able to scale up to finish jobs as quickly as the Professor needed. He was able to get his job finished in less than a day, whereas other providers had quoted him as long as a month!
  4. Professor X was searching for a service provider that could provide technical insight during the process.
    Part of Hive Data’s value proposition to its customers is offering our own expertise in building machine learning models, alongside supplying the quality data to do so. We’d seen projects similar to the one Professor X was working on, and we understood the problems he would face in generating this dataset. Even before getting started, we helped Professor X optimize his project by structuring his tasks in a way that improved his results and helped him build an effective model from his data.
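The worker consensus model mentioned in point 2 can be illustrated with a minimal sketch. This is an assumption about how such a scheme might work (a simple majority vote with an agreement threshold), not Hive’s actual audit pipeline, and the function and parameter names are hypothetical:

```python
from collections import Counter

def consensus_label(worker_labels, min_agreement=0.7):
    """Return the majority label if enough workers agree, else None.

    A None result signals that the task should be escalated to
    another round of auditing rather than accepted as-is.
    """
    if not worker_labels:
        return None
    label, votes = Counter(worker_labels).most_common(1)[0]
    if votes / len(worker_labels) >= min_agreement:
        return label
    return None

# Three of four workers agree -> accepted at the default threshold.
print(consensus_label(["cat", "cat", "cat", "dog"]))  # cat
# A 50/50 split falls below the threshold -> escalate for more auditing.
print(consensus_label(["cat", "dog"]))  # None
```

Routing low-agreement tasks to additional reviewers is what lets a consensus scheme trade a little extra labor for much higher accuracy.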

With these needs addressed, Professor X was able to submit his paper and have it published on time. He continues to use Hive Data to power his AI research today.

Hive Data has already been used by top-tier university research labs around the world, including at Stanford, MIT, Cornell, and Simon Fraser University. Projects have ranged from labeling datasets for vehicle detection in autonomous driving to object recognition for robotic arms and pedestrian identification from security cameras. The number of research verticals we cater to is constantly growing, as we pride ourselves on rapid engineering cycles that release new data labeling capabilities as soon as we see a need emerging.

If you’re an academic researcher and you’re curious about how we can partner together, contact me at david@thehive.ai. We’re excited to support your research!


Hive Media: Revolutionizing the Way We Understand On-Screen Content and Viewership

Hive is a full-stack deep learning platform focused on solving visual intelligence problems. While we are working with companies in sectors ranging from autonomous driving to facial recognition, our flagship enterprise product is Hive Media.


As the name suggests, Hive Media is our complete enterprise solution utilizing deep learning to revolutionize traditional media analytics. However, it is far more than a simple collection of neural net models. What we’ve built with Hive Media is an end-to-end solution, beginning with data ingestion and extending all the way to real-time, device-level viewership metrics.

The Vision

Imagine you could watch 100 different channels at the same time and remember every key element of what was on screen – what brand was shown, what actor was present, what commercial was playing, and so on. Now, suppose you could remember this forever and query this information instantly. This would be a massively valuable dataset, precisely because it seems like an impossible feat for a human to achieve. This, however, is exactly what we set out to achieve with Hive Media. Essentially, we wanted to build a system that could “watch” all of broadcast television the way a human would, and then store this information in an easily accessible manner.

Data Ingestion

The first step in our pipeline is accessing TV streams. Today, we are processing 400 channels in the US, with 300 more in Europe to come later this year. See Figure 1 for a graphical display of our present and planned TV coverage.

Figure 1: Present and planned TV stream coverage.

We are recording every second of every channel, totaling more than 10,000 hours of footage per day! We expect this number to be well over 30,000 hours a day by next year. In addition, all major channels are covered, as well as a wide range of local affiliates on the network side. As you can imagine, this is a lot of data, and we are storing all of it in our own datacenter rollouts around the world. Ultimately, we are aiming to build the world’s largest repository of linear broadcast data.

Deep Learning Models

Having this much data is only useful if you can understand it. This is where our deep learning models come into play. Using Hive Data, we’ve built some of the world’s largest celebrity, logo, and brand databases. These models, among several others, are applied to every second of our recorded footage, and the results are stored in a database optimized for easy retrieval. This means a query such as “How many times did a Nike logo appear on NBC in the month of September?” – previously impossible – can now be answered in a matter of seconds! Unlike some other products on the market, our models don’t rely on any metadata associated with the programming – tags are generated purely from the video content. This is extremely powerful, because it means our system can handle a large variety of content without constantly hard-coding in parameters.
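To make the Nike-on-NBC example above concrete, here is a minimal sketch of how such a query could run against a per-second tag table. The schema, table name, and sample rows are purely illustrative assumptions, not Hive’s actual data model:

```python
import sqlite3

# Hypothetical schema: one row per detected tag occurrence.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (channel TEXT, tag TEXT, kind TEXT, aired_at TEXT)"
)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?, ?)",
    [
        ("NBC", "Nike", "logo", "2017-09-03T19:04:11"),
        ("NBC", "Nike", "logo", "2017-09-14T20:30:02"),
        ("CBS", "Nike", "logo", "2017-09-14T21:00:45"),
        ("NBC", "Toyota", "commercial", "2017-09-20T18:45:00"),
    ],
)

# "How many times did a Nike logo appear on NBC in September?"
(count,) = conn.execute(
    """SELECT COUNT(*) FROM tags
       WHERE channel = 'NBC' AND tag = 'Nike' AND kind = 'logo'
         AND aired_at >= '2017-09-01' AND aired_at < '2017-10-01'"""
).fetchone()
print(count)  # 2
```

Because the tags are precomputed as footage is ingested, answering the question reduces to an indexed lookup rather than re-scanning video, which is what makes second-scale answers possible.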

User Viewership

The final piece of the puzzle is understanding how our tags affect viewership – the holy grail of media analytics. Everything we’ve described up to now generates what I call the “cause”; to measure the “effect,” we work with device partners who give us real-time data on viewers. Today, we have access to millions of devices that send us real viewership data, which we overlay with our tags to understand how on-screen content affects viewership. This means that every query we run not only tells us what aired, but also how it affected the viewership bottom line.

The easiest way to understand this system is to see some queries executed. In Figure 2, we show an example query for Chevrolet vs. Toyota commercials on NBC over a one-week period. You can see the tags our system found in the bottom right. The bottom left shows a video player with the content corresponding to the selected tag, mainly to serve as video evidence that the tag is correct. What’s powerful about Hive Media is that we can now analyze viewership data at each of these tag occurrences to understand their effect on viewership. One important viewership measure, shown in Figure 2, is tune-out: the percentage of viewers who change the channel within a time interval. This is often the strongest indicator of whether a viewer is enjoying the content on screen. Interestingly enough, Chevy commercials generated almost twice as much tune-out as their Toyota counterparts in this case.
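Tune-out – the percentage of devices that leave a channel during an interval – is simple to compute once device-level data is available. A minimal sketch, assuming we know which device IDs were tuned to the channel at the start and end of the interval (the function name and data shapes are illustrative, not Hive’s actual pipeline):

```python
def tune_out_rate(tuned_at_start, tuned_at_end):
    """Percentage of devices on the channel at the start of the
    interval that are no longer on it at the end."""
    if not tuned_at_start:
        return 0.0
    stayed = tuned_at_start & tuned_at_end
    return 100.0 * (len(tuned_at_start) - len(stayed)) / len(tuned_at_start)

# Devices on-channel when a commercial starts vs. when it ends.
start = {"dev1", "dev2", "dev3", "dev4"}
end = {"dev1", "dev3", "dev5"}  # dev5 tuned in; it doesn't offset tune-out
print(tune_out_rate(start, end))  # 50.0
```

Note that devices tuning *in* during the interval are deliberately ignored here: tune-out measures departures from the starting audience, so a commercial that drives half its viewers away still scores 50% even if new viewers arrive.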

Figure 2: Chevrolet vs. Toyota commercial query on NBC, with tune-out comparison.

Let’s take another example query that looks for Nike logos, as shown in Figure 3. The highlighted tag is a snippet of content showing a Nike logo prominently placed in the center of the screen, even though this isn’t a Nike commercial at all. Instead, it is Simone Biles, a Nike athlete, being featured in a Mattress Firm / Foster Kids commercial. As part of her Nike athlete contract, Simone is obliged to wear Nike clothing whenever she appears on TV, Nike commercial or not. Nike would probably be highly interested in knowing how many similar logo placements occurred for Simone, as well as for all of their other sponsored athletes.

Figure 3: Nike logo query highlighting an in-content logo placement.

Today, we are only beginning our journey toward understanding the wealth of data we have at our disposal. Hive Media is pioneering a new way of thinking around media content, and we are eager to help both broadcasters and advertisers optimize their content to better retain viewers and inform advertising decisions.