Hive’s Presentation at Plug and Play’s Media & Advertising Innovation Summit

Dan Calpin, President of Hive Media, shares an overview of Hive and our media business at the 2019 Plug and Play Fall Innovation Summit in Sunnyvale, CA.

Bain & Company introduces Bain Media Lab; Announces partnership with Hive and launch of Mensio, an AI-powered analytics platform to analyze TV advertising and sponsorships

LOS ANGELES – April 30, 2019 – Bain & Company announced today the formation of Bain Media Lab, a business that will feature a portfolio of digital products and related services that combine breakthrough technologies with powerful datasets. Hive, a full-stack deep learning company based in San Francisco, will be the launch partner for Bain Media Lab.

Bain Media Lab is a new venture incubated in the Bain Innovation Exchange, a business unit that leverages Bain’s network of venture capitalists, startups, and tech leaders to help clients innovate through the ecosystem, as well as support Bain in creating new ventures.

“We are excited to introduce Bain Media Lab and to announce our partnership with Hive,” said Elizabeth Spaulding, the co-lead of Bain & Company’s Global Digital practice. “Today’s milestone launch exemplifies our strategy to deepen select Bain Innovation Exchange relationships through the formation of new businesses like Bain Media Lab, which will pair Bain’s expertise with best-in-class innovation to create disruptive solutions. It will also be a powerful vehicle to dramatically accelerate the visibility and growth of innovative technology companies like Hive.”

In partnership, Bain Media Lab and Hive have developed Mensio, an artificial intelligence-powered analytics platform focused on bringing “digital-like” measurement, intelligence, and attribution to traditional television advertising and sponsorships.

Mensio addresses a pain point shared by marketers and media companies – the lack of recent and granular data on the performance of traditional television advertising and sponsorships. As digital marketing has continued to grow its share of advertising dollars, marketers have become accustomed to seeing real-time campaign performance data with granular measurement of audience reach and outcomes. This dynamic has added pressure on television network owners to source comparable data to defend their share of marketers’ advertising budgets.

“Our partnership with Hive is the result of an extensive evaluation of the landscape and our resulting conviction that together we can uniquely create truly differentiated solutions,” said Dan Calpin, who leads Bain Media Lab. “Our launch product, Mensio, unlocks the speed and granularity of data for TV advertising and sponsorships that marketers have come to expect from their digital ad spend. Mensio arms marketers and their agencies to transition from post-mortem analysis of TV ad spend to real-time optimization, and gives network owners long-elusive data that can help them recast the narrative on advertising.”

“We are excited to partner with Bain & Company as the launch partner of Bain Media Lab,” said Kevin Guo, co-founder and CEO of Hive. “In jointly developing Mensio, we have blended the distinctive competencies of our two firms into a seamlessly integrated go-to-market offering. Hive’s ambition is to leverage artificial intelligence in practical applications to transform industries, and Mensio is our flagship product in the media space.”

Subscribers to the Mensio platform access a self-service, cloud-based dashboard that provides point-and-click reporting. Two tiers of the dashboard product are available: one for the buyers of TV advertising (marketers and their agencies) and one for the sellers (TV network owners). Selected features available in the Mensio dashboard and from related services include:

  1. Reach: Measurement of exposure to a brand’s TV advertisements for a given population, ranging from total population to specific behavior-defined segments like frequent guests at quick service restaurants
  2. Frequency: Reporting on the distribution of frequency for a given population (e.g., what percent of households were exposed to more than 20 TV ads for a given brand over the course of a month)
  3. Attribution: Evaluation of the impact of exposure to TV advertising and sponsorships on a broad set of outcomes, including online activity, store visitation, and purchases as well as qualitative brand metrics
  4. Competitive intelligence for brands: Insight into a brand’s relative share of voice versus peers, as well as the mix of networks, programs, genres, dayparts, and ad formats used by a given brand relative to its competitive set
  5. Competitive intelligence for TV network owners: Insights into trends in spending by industry vertical and brand, as well as relative share of a given TV network owner vs. competitors
  6. Sponsorship measurement and return on investment: Measurement of the volume, quality, and equivalent media value of sponsorship placements and earned media, with the ability to link to outcomes

The Mensio product suite uses Hive’s computer vision models – trained using data labeled by Hive’s distributed global workforce of over 1 million people – to enrich recorded television content with metadata including the identification of commercials and sponsorship placements as well as contextual elements like beach scenes. Second-by-second viewership of that content is derived using data from nearly 20 million U.S. households, inclusive of cable and satellite set-top boxes as well as Smart TVs, that is then scaled nationally and can be matched in a privacy-safe environment to a range of outcome behaviors. Outcome datasets enable household-level viewership of content to be matched to online activity (including search and website visits), retail store visits, and purchases (including retail purchases as well as several data sets specific to certain industries such as automotive and consumer packaged goods).

Mensio is currently in beta in the U.S. with a growing number of clients across industries. It will begin to expand into other geographies over the next year. For more information, visit: www.bainmedialab.com/mensio.

Bain & Company and Hive are additionally collaborating on other related products and services for television network owners addressing programming optimization and content tagging use cases.

Editor’s note: To arrange an interview with Mrs. Spaulding or Mr. Calpin, contact Dan Pinkney at dan.pinkney@bain.com or +1 646 562 8102. To arrange an interview with Mr. Guo, contact Kristy Yang at press@thehive.ai or +1 415 562 6943.

About Hive

Hive is a full-stack deep learning company based in San Francisco that focuses on solving visual intelligence challenges. Today, Hive works with many of the world’s biggest companies in media, retail, security, and autonomous driving in building best-in-class computer vision models. Through its flagship enterprise platform, Hive Media, the company is aiming to build the world’s largest database of structured media content. Hive has raised over $50M from a number of well-known venture investors and strategic partners, including General Catalyst, 8VC, and Founders Fund. For more information visit: www.thehive.ai. Follow us on Twitter @hive_ai.

About Bain & Company

Bain & Company is the management consulting firm that the world’s business leaders come to when they want results. Bain advises clients on private equity, mergers and acquisitions, operations excellence, consumer products and retail, marketing, digital transformation and strategy, technology, and advanced analytics, developing practical insights that clients act on and transferring skills that make change stick. The firm aligns its incentives with clients by linking its fees to their results. Bain clients have outperformed the stock market 4 to 1. Founded in 1973, Bain has 57 offices in 36 countries, and its deep expertise and client roster cross every industry and economic sector. For more information visit: www.bain.com. Follow us on Twitter @BainAlerts.

The Effect of Dirty Data on Deep Learning Systems

Introduction

Better training data can significantly boost the performance of a deep learning model, especially when it is deployed in production. In this blog post, we illustrate the impact of dirty data and show why correct labeling is important for increasing model accuracy.

Background

An adversarial attack fools an image classifier by adding an imperceptible amount of noise to an image. One possible defense is simply to train machine learning models on adversarial examples: we can collect hard examples through hard-example mining and add them to the dataset. Another interesting model architecture to explore is the generative adversarial network, which generally consists of two parts: a generator that produces fake examples in order to fool the discriminator, and a discriminator that distinguishes clean examples from fake ones.

Another possible type of attack, data poisoning, happens at training time. An attacker can identify the weak parts of a machine learning pipeline and modify the training data to confuse the model; even slight perturbations to the training data and labels can result in worse performance. There are several methods to defend against such data poisoning attacks. For example, it is possible to separate clean training examples from poisoned ones, so that the outliers are deleted from the dataset.

In this blog post, we investigate the impact of data poisoning (dirty data) using a simulation method: random labeling noise. We show that, with the same model architecture and dataset size, better data labeling yields a large accuracy increase.

Data

We experiment with the CIFAR-100 dataset, which has 100 classes and 600 32×32 coloured images per class.

We use the following steps to preprocess the images in the dataset:

  • Pad each image to 36×36, then randomly crop a 32×32 patch
  • Randomly flip horizontally
  • Randomly distort image brightness and contrast
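The three augmentation steps above can be sketched in pure Python. This is an illustrative sketch only (the original experiments would have used a library pipeline such as `tf.image` or torchvision, and the distortion ranges here are our own assumptions):

```python
import random

def augment(img, pad=2, out_size=32):
    """Pad, randomly crop, randomly flip, and jitter brightness/contrast.

    `img` is an H x W x C nested list of floats in [0, 1]. A pad of 2
    turns a 32x32 image into 36x36 before the random 32x32 crop,
    matching the steps above. Distortion ranges are illustrative.
    """
    h, w = len(img), len(img[0])
    c = len(img[0][0])
    zero = [0.0] * c
    # 1) Pad each image to (h + 2*pad) x (w + 2*pad) with zeros.
    padded = [[zero] * (w + 2 * pad) for _ in range(pad)]
    for row in img:
        padded.append([zero] * pad + row + [zero] * pad)
    padded += [[zero] * (w + 2 * pad) for _ in range(pad)]
    # 2) Randomly crop back to out_size x out_size.
    top = random.randint(0, h + 2 * pad - out_size)
    left = random.randint(0, w + 2 * pad - out_size)
    crop = [r[left:left + out_size] for r in padded[top:top + out_size]]
    # 3) Random horizontal flip.
    if random.random() < 0.5:
        crop = [list(reversed(r)) for r in crop]
    # 4) Random contrast (scale) and brightness (shift), clamped to [0, 1].
    scale = random.uniform(0.8, 1.2)
    shift = random.uniform(-0.1, 0.1)
    return [[[min(1.0, max(0.0, p * scale + shift)) for p in px]
             for px in r] for r in crop]
```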

The dataset is split into 50k training images and 10k evaluation images. Random labeling is the substitution of training-data labels with random labels drawn from the marginal distribution of the labels. We add different amounts of random labeling noise to the training data by shuffling a certain fraction of labels within each class; the images to be shuffled are chosen randomly from each class, so the generated dataset remains balanced. Note that evaluation labels are not changed.
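A minimal sketch of this label-noising scheme follows. The function name and exact sampling details are our own; the original implementation is not shown in the post. Because the classes are balanced, drawing uniformly over the 100 classes is equivalent to drawing from the marginal label distribution:

```python
import random

def poison_labels(labels, noise_frac, num_classes=100, seed=None):
    """Replace a fraction of each class's labels with random labels.

    For every class, `noise_frac` of its examples are chosen at random
    and reassigned a label drawn uniformly over all classes, so the
    resulting dataset stays balanced in expectation. (Illustrative
    sketch; evaluation labels should be left untouched.)
    """
    rng = random.Random(seed)
    labels = list(labels)
    # Group example indices by their current class.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for y, indices in by_class.items():
        n_noisy = int(len(indices) * noise_frac)
        # Corrupt a random subset of this class's examples.
        for idx in rng.sample(indices, n_noisy):
            labels[idx] = rng.randrange(num_classes)
    return labels
```

Note that a resampled label can coincide with the original by chance (probability 1/100 here), so the fraction of labels actually changed is slightly below `noise_frac`.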

We test the model with four different datasets: one clean and three noisy.

  • Clean: no random noise; we assume all labeling in CIFAR-100 is correct. Named ‘no_noise’.
  • Noisy: 20% random labeling noise. Named ‘noise_20’.
  • Noisy: 40% random labeling noise. Named ‘noise_40’.
  • Noisy: 60% random labeling noise. Named ‘noise_60’.

We choose aggressive data poisoning because the production models we build are robust to small amounts of random noise. The random labeling scheme allows us to simulate the effect of dirty data (data poisoning) in real-world scenarios.

Model

We investigate the impact of dirty data on one popular model, the ResNet-152 architecture. Normally it is a good idea to fine-tune pre-trained checkpoints to get better accuracy with fewer training steps. Here, however, the model is trained from scratch, because we want a general idea of how noisy data affects training and final results without any prior knowledge gained from pretraining.

We optimize the model with the SGD (stochastic gradient descent) optimizer using cosine learning-rate decay.
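The cosine decay schedule anneals the learning rate from its initial value down to a minimum over the course of training. A sketch (the base learning rate and step counts here are illustrative assumptions, not the values used in the experiments):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Cosine learning-rate decay.

    Starts at base_lr, follows half a cosine wave, and reaches min_lr
    at total_steps. Hyperparameter values are illustrative only.
    """
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

At each training step, the SGD optimizer would be given `cosine_lr(step, total_steps)` as its learning rate.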

Results

Quantitative results:

[Figure: validation accuracy for each dataset]

Cleaner datasets consistently perform better on the validation set. The model trained on the original CIFAR-100 dataset achieves 0.65 top-1 accuracy; using the top 5 predictions boosts accuracy to 0.87. Testing accuracy decreases as more noise is added: each time we add 20% more random noise to the training data, testing accuracy drops by about 10 percentage points. Notably, even with 60% random labeling noise, the model still reaches 0.24 accuracy on the validation set. The variance of the training data, the preprocessing methods, and the regularization terms all increase the robustness of the model, so even when learning from a very noisy dataset it still picks up useful features, although overall performance significantly degrades.
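For reference, top-k accuracy (the metric behind the 0.87 top-5 figure above) counts a prediction as correct when the true label appears among the model's k highest-scoring classes. A minimal sketch, with hypothetical inputs:

```python
def topk_accuracy(logits, labels, k=5):
    """Fraction of examples whose true label is among the k top scores.

    `logits` is a list of per-class score lists, one per example, and
    `labels` the matching true class indices. Illustrative sketch only.
    """
    correct = 0
    for scores, y in zip(logits, labels):
        # Indices of the k highest-scoring classes for this example.
        topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        correct += y in topk
    return correct / len(labels)
```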

Qualitative results:

[Figure: learning curves]
[Figure: losses]
[Figure: precision-recall curves]

Conclusion

In this post we investigated the impact of data poisoning attacks on performance, using image classification as an example task and random labeling as the simulation method. We showed that a popular model (ResNet-152) is somewhat robust to data poisoning, but that performance still significantly degrades after poisoning. High-quality labeling is thus crucial to modern deep learning systems.