
The Effect of Dirty Data on Deep Learning Systems

Introduction

Better training data can significantly boost the performance of a deep learning model, especially when it is deployed in production. In this blog post, we illustrate the impact of dirty data and explain why correct labeling is important for improving model accuracy.

Background

An adversarial attack fools an image classifier by adding an imperceptible amount of noise to an image. One possible defense is simply to train machine learning models on adversarial examples: we can collect hard examples via hard example mining and add them to the dataset. Another interesting model architecture to explore is the generative adversarial network (GAN), which generally consists of two parts: a generator that produces fake examples in order to fool the discriminator, and a discriminator that distinguishes real examples from fake ones.

Another possible type of attack, data poisoning, can happen at training time. The attacker identifies the weak parts of a machine learning architecture and modifies the training data to confuse the model. Even slight perturbations to the training data and labels can result in worse performance. There are several methods to defend against such data poisoning attacks; for example, it is possible to separate clean training examples from poisoned ones, so that the outliers are removed from the dataset.

In this blog post, we investigate the impact of data poisoning (dirty data) using a simple simulation method: random labeling noise. We will show that, with the same model architecture and dataset size, better data labeling yields a large increase in accuracy.

Data

We experiment with the CIFAR-100 dataset, which has 100 classes and 600 32×32 coloured images per class.

We use the following steps to preprocess the images in the dataset (a minimal code sketch follows the list):

  • Pad each image to 36×36, then randomly crop a 32×32 patch
  • Apply a random horizontal flip
  • Randomly distort image brightness and contrast
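
Below is a minimal sketch of this preprocessing pipeline using TensorFlow's tf.image ops. The exact distortion ranges (brightness delta, contrast bounds) are illustrative assumptions rather than the values used in our experiments.

```python
import tensorflow as tf

def preprocess_train(image):
    """Training-time augmentation for one 32x32x3 CIFAR-100 image."""
    # Pad to 36x36, then randomly crop back to a 32x32 patch.
    image = tf.image.resize_with_crop_or_pad(image, 36, 36)
    image = tf.image.random_crop(image, size=[32, 32, 3])
    # Random horizontal flip.
    image = tf.image.random_flip_left_right(image)
    # Randomly distort brightness and contrast (ranges are assumptions).
    image = tf.image.random_brightness(image, max_delta=63.0 / 255.0)
    image = tf.image.random_contrast(image, lower=0.2, upper=1.8)
    return image
```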

The dataset is randomly split into 50k training images and 10k evaluation images. Random labeling is the substitution of training-data labels with random labels drawn from the marginal distribution of the labels. Different amounts of random labeling noise are added to the training data: we simply shuffle a certain fraction of the labels for each class, choosing the images to be shuffled at random within each class. Because of this randomness, the generated dataset remains balanced. Note that evaluation labels are not changed.
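
A rough sketch of this label-corruption step is shown below. The helper name (corrupt_labels) and the uniform draw over classes are our own illustrative choices; since CIFAR-100 is balanced, drawing uniformly at random is equivalent to drawing from the marginal label distribution.

```python
import numpy as np

def corrupt_labels(labels, noise_fraction, num_classes=100, seed=0):
    """Replace a fraction of the labels in each class with random labels.

    Hypothetical helper for illustration; details may differ from the
    exact procedure used in our experiments.
    """
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)                  # images of class c
        n_corrupt = int(round(noise_fraction * len(idx)))  # how many to relabel
        chosen = rng.choice(idx, size=n_corrupt, replace=False)
        noisy[chosen] = rng.integers(0, num_classes, size=n_corrupt)
    return noisy
```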

We test the model with four different datasets, one clean and three noisy; a short construction sketch follows the list.

  • Clean: no random noise. We assume that all labels in the original CIFAR-100 dataset are correct. Named ‘no_noise’.
  • Noisy: 20% random labeling noise. Named ‘noise_20’.
  • Noisy: 40% random labeling noise. Named ‘noise_40’.
  • Noisy: 60% random labeling noise. Named ‘noise_60’.
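
Using the hypothetical corrupt_labels helper sketched above, the four training label sets could be generated roughly as follows (train_labels stands in for the real CIFAR-100 training labels):

```python
import numpy as np

# Placeholder labels; in practice these come from the CIFAR-100 training split.
train_labels = np.random.randint(0, 100, size=50_000)

noise_levels = {"no_noise": 0.0, "noise_20": 0.2, "noise_40": 0.4, "noise_60": 0.6}
noisy_label_sets = {
    name: corrupt_labels(train_labels, frac) for name, frac in noise_levels.items()
}
```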

Note that we chose aggressive data poisoning because the production models we build are robust to small amounts of random noise. The random labeling scheme allows us to simulate the effect of dirty data (data poisoning) in real-world scenarios.

Model

We investigate the impact of dirty data on a popular model architecture, ResNet-152. Normally it is a good idea to fine-tune from pre-trained checkpoints to reach better accuracy with fewer training steps. In this blog post, however, the model is trained from scratch, because we want a general sense of how noisy data affects training and final results without any prior knowledge gained from pretraining.
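
As a rough illustration, a ResNet-152 classifier for CIFAR-100 can be instantiated without pretrained weights using the stock Keras implementation; the production model may use a CIFAR-specific ResNet variant, so treat this as a sketch rather than our exact setup.

```python
import tensorflow as tf

# ResNet-152 built from scratch: no ImageNet checkpoint, 100 output classes.
model = tf.keras.applications.ResNet152(
    weights=None,              # train from scratch
    input_shape=(32, 32, 3),   # CIFAR-100 image size
    classes=100,
)
```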

We optimize the model with the SGD (stochastic gradient descent) optimizer and a cosine learning rate decay schedule.
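
A minimal Keras sketch of this optimizer setup is shown below; the initial learning rate, momentum, batch size, and number of epochs are illustrative assumptions rather than the exact values used in our runs.

```python
import tensorflow as tf

# Assumed schedule: 50k training images, batch size 128, 200 epochs.
steps_per_epoch = 50_000 // 128
total_steps = 200 * steps_per_epoch

lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,   # assumed starting learning rate
    decay_steps=total_steps,     # decay over the full training run
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
```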

Results

Quantitative results:

Accuracy

Cleaner datasets consistently perform better on the validation set. The model trained on the original CIFAR-100 dataset gives us 0.65 accuracy; using the top 5 predictions boosts the accuracy to 0.87. Testing accuracy decreases as more noise is added: each time we add 20% more random noise to the training data, testing accuracy drops by about 10%. Note that even with 60% random labeling noise, the model still manages to reach 0.24 accuracy on the validation set. The variance of the training data, the preprocessing methods, and the regularization terms help increase the robustness of the model, so even when learning from a very noisy dataset, the model is still able to learn useful features, although overall performance degrades significantly.
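
For reference, top-1 and top-5 accuracy can be computed with standard Keras metrics; the snippet below uses random stand-in predictions purely to show the mechanics.

```python
import numpy as np
import tensorflow as tf

# Stand-in data: 10 examples, 100 classes (replace with real labels/predictions).
y_true = np.random.randint(0, 100, size=(10,))
y_pred = tf.nn.softmax(np.random.randn(10, 100), axis=-1).numpy()

top1 = tf.keras.metrics.SparseCategoricalAccuracy()
top5 = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)
top1.update_state(y_true, y_pred)
top5.update_state(y_true, y_pred)
print(top1.result().numpy(), top5.result().numpy())
```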

Qualitative results:

Learning curve
Losses
Precision-recall curves


Conclusion

In this post we investigated the impact of data poisoning attacks on performance, using image classification as an example task and random labeling as the simulation method. We showed that a popular model (ResNet-152) is somewhat robust to data poisoning, but its performance still degrades significantly after poisoning. High-quality labeling is thus crucial to modern deep learning systems.


One Shining Moment for Brands

We’ve reached the end of the road. 1-seed Virginia was crowned NCAA Champion with its overtime win against 3-seed Texas Tech. Hive wrapped up the tournament with analysis of the Final Four and the NCAA Championship game. Here’s how March Madness played out for brands this year.

At a Glance:
  • Hive analyzed the NCAA Championship game to assess logo distribution by Brand Prominence*, earned media exposure, and viewership trends across games.
  • AT&T’s sponsored logos on digital overlays and during the halftime shows won the most screen time across all of March Madness.
  • Apparel sponsors placed their own March Madness bets by choosing which teams to sponsor as gear providers. Of the teams in the field, Nike backed 59%, followed by Under Armour with 25% and Adidas with 16%. Under Armour’s sponsorship bets paid off by the championship as Texas Tech went head to head with Nike-backed Virginia.
  • The NCAA Championship game viewership fell slightly from last year but still reached nearly twice as many households as the Duke vs. Virginia Tech game (the highest viewed non-finals game in the tournament).

Another March Madness, another CBS ‘One Shining Moment’ montage. Texas Tech and Virginia beat out Michigan State and Auburn in the Final Four to face each other in the championship game. Neither school had ever made it this far in NCAA tournament history, and their game was the first championship matchup between two first-time participants in 40 years. Hive used its best-in-class computer vision models in conjunction with viewership data (powered by 605, an independent TV data and analytics company) from more than 10M households to analyze brand logo distribution and viewership levels.

Brands

Figure 1. Results from the Hive Logo Model

Hive recapped all the rounds and mapped out logo placements and earned media for all sponsors. AT&T remained consistent throughout the entire course of the tournament and scored 50% more airtime than Nike, who had the second highest amount of screen time.

Figure 2. Results from the Hive Logo Model

One brand’s March Madness bets paid off big time this year. The tournament started with Nike sponsoring 40 teams, followed by Under Armour with 17 and Adidas with 11. Adidas got the boot in the earlier rounds, but Under Armour edged its way into the NCAA Championship game, backing the Texas Tech Red Raiders as they went head to head with the Nike-backed Virginia Cavaliers. Under Armour went from sponsoring 25% of teams in the First Four to nearly 50% by the finals, earning the brand the same amount of screen time as competitor Nike. Figure 2 shows their fight from the beginning of the tournament to the very last game. These sponsors gave brackets a whole new meaning.

Games

Two defensive-minded teams faced each other in the championship this year. Texas Tech was unranked when their season started and soon found themselves in their first ever National Championship game. After being the first 1-seed to lose to a 16-seed in NCAA history last year, Virginia proved everyone wrong this year and also made their National Championship debut. The game itself got off to a slow start before we saw Virginia take a 10-point lead, fall to a 3-point deficit, then tie the game 68-68 to force overtime. Texas Tech fought hard, but at the end of the day, Virginia had the last say.

Figure 3. Viewership data powered by 605

As the biggest night of the year for college basketball, the NCAA Championship game reached 12% of American households, with a peak of 15% – almost double the viewership of the Duke vs. Virginia Tech game, the highest-performing non-finals game of the tournament.

Conclusion

March Madness is a huge opportunity for brands. We’ve learned which brands performed the best, what elements drove viewership, and what aspects retained it. We also learned that you don’t need a Zion to reach the Final Four, but it helps to have a star player to drive up viewership levels. Hive is the premier solution for brands looking to improve their real-time reach and ROI measurement for commercials, earned media, and sponsorships during events like March Madness.

Kevin Guo is Co-founder and CEO of Hive, a full-stack deep learning company based in San Francisco building an AI-powered dashboard for media analytics. For inquiries, please contact him at kevin.guo@thehive.ai.

Viewership data was powered by 605, an independent TV data and analytics company.

*A Brand Prominence Score is defined as the combination of a logo’s clarity, size, and location on-screen, in addition to the presence of other brands or objects on screen.


Survive and Advance: Winners of the Sweet 16 and Elite Eight

March Madness lived up to its name last weekend in this year’s Sweet 16 and Elite Eight. The road to the Final Four has been exhilarating for some and heartbreaking for others. Only four teams remain in the NCAA tournament, and Hive followed the journeys of the teams, viewers, and advertisers. Here’s how everyone’s stories unfolded in the next two chapters of March Madness.

At a Glance:
  • Hive analyzed the Sweet 16 and Elite Eight to assess logo distribution by brand prominence, earned media exposure, and viewership trends across games.
  • Buffalo Wild Wings capitalized on its overtime commercial spots, with the highest average household reach (5.7% of households) on its placements.
  • AT&T’s logo placements showed consistency, maintaining its spot for the most screen time with a majority of logos scoring above average on Brand Prominence.*
  • The highest average household viewership occurred during the Sweet 16 where Duke vs. Virginia Tech had a 7.6% average household reach and a peak of 9.2%. Second place went to Duke vs. Michigan State in the Elite Eight with a 6.8% average household reach and a peak of 10.5% – the highest of any game.
  • Hive assessed how the point gap in the last minutes of the games drove increased viewership and found a strong correlation, with the closest games seeing up to a 200% bump in viewership in the last minutes.

Texas Tech, Virginia, Auburn, and Michigan State fought tough battles and earned themselves spots in the Final Four. Over the course of four days, six teams upset their opponents, three games went into overtime, and two teams found out that they would make their Final Four debuts. Hive used its best-in-class computer vision models in conjunction with viewership data (powered by 605, an independent TV data and analytics company) from more than 10M households to analyze brand logo distribution and sources of viewership fluctuation.

Brand winners

Figure 1. Viewership data powered by 605

Official NCAA Corporate Partner Buffalo Wild Wings snagged the most effective commercial spots in the Sweet 16 and Elite Eight, earning the top household reach per commercial airing. Their overtime-specific commercial was created and set to air only during overtime games, which paid off big time in these two rounds. With Purdue vs. Tennessee in the Sweet 16 and Purdue vs. Virginia and Auburn vs. Kentucky in the Elite Eight all going into overtime, the brand’s ad slots earned them a number of extra opportunities to get in front of fans with relevant content. Google earned the second-highest household reach per commercial airing, followed by Apple.

Figure 2. Results from the Hive Logo Model

The Hive Logo Model also scanned every second of the 12 games this week for logo placements and earned media. AT&T’s digital overlays and halftime sponsorship earned the most airtime again this week. Their logos were not only frequently on screen, but also quite prominent, with a majority of logos scoring more than 20 on Brand Prominence.* Apparel and gear sponsors Nike, Under Armour, and Spalding all received lots of screen time, but their logos were low prominence, usually appearing in the action on jerseys, shoes, or hoops. Courtside sponsors Lowe’s, Buick, Infiniti, and Coca-Cola were all consistently mid-scoring, with a few very strong placements when the camera caught the logo in the background of a close-up.

Top games

Figure 3. Viewership data powered by 605

The Sweet 16 and Elite Eight once again proved that the Zion effect is real. The top two games with the highest average viewership over the course of the two rounds were both Duke games. In the Sweet 16, fans held their breath as the Blue Devils narrowly escaped Virginia Tech 75-73. The game itself raked in the largest audience of the tournament yet. However, the Zion show came to an end after Michigan State shocked Duke in the Elite Eight. The final score read 68-67, a bracket-busting win for Michigan State.

Figure 4. Viewership data powered by 605

No. 3 seed Purdue put up a fight in this year’s tournament with two overtime games. Figure 4 shows a graph of their battle against No. 2 seed Tennessee in the Sweet 16 overlaid with the Florida State vs. Gonzaga game that started a few minutes before. The CBS game retained steady viewership as it approached halftime while the TBS game started just in time for the other game’s viewers to switch over. They flipped back to CBS during Purdue vs. Tennessee’s halftime show, but they did not return when it ended. This may be attributed to the fact that barely five minutes into the second half, Purdue took an 18-point lead over Tennessee. However, Tennessee began to make a comeback and viewership spiked to 7% as they forced OT. Purdue prevailed, securing their spot in the Elite Eight for the first time since 2000.

Interestingly, viewers in this round overwhelmingly followed the action on both channels. The loss in viewership during halftime on the CBS show was almost perfectly mirrored with a bump in viewership on the TBS game. When the CBS game returned, most switched back until Tennessee started to come back from their double-digit deficit, stealing a majority of the viewership as the CBS game tailed off in the last few minutes.

Purdue’s Elite Eight performance drew an even bigger crowd than the last round. An average of 6% of American households watched them play 1-seed Virginia, arguably one of the most exciting games in the entire tournament. Within the last two minutes of regulation, Carsen Edwards gave Purdue the lead, impressing America with his tenth three-pointer of the game. With only six seconds remaining, all of the stars were aligned as UVA’s Ty Jerome perfectly missed his second free throw, commencing the play that allowed Mamadi Diakite to tie up the game and force OT. Ultimately, Virginia edged out Purdue 80-75, preventing what could have been their first ever Final Four appearance.

Two teams, however, anticipated their Final Four debuts. After defeating Kansas and North Carolina, 5-seed Auburn beat 2-seed Kentucky in the Elite Eight, proving that the Tigers can hang with the blue bloods. Texas Tech will also be showing up to the Final Four for the first time in program history after upsetting No. 1 Gonzaga. This game had the highest average household viewership on TBS during these two rounds.

Figure 5. Viewership data powered by 605

Given the last-minute shifts in viewership during the Purdue vs. Tennessee nail-biter, Hive decided to analyze how the point gap in the last 10 minutes of a game drives viewership. As would be expected, increases in game viewership in the last ten minutes were strongly driven by how close the scores were. As the average point differential near the end of the game decreased, viewership grew substantially, with the closest games seeing up to a 250% bump. Auburn vs. North Carolina was an exception, seeing viewership rise 100% during the last ten minutes despite a double-digit point gap. This was likely due to its interest relative to the competing game, LSU vs. Michigan State, which had a similarly wide point gap but in favor of the higher-seeded team. Auburn’s upset, coupled with Chuma Okeke’s unfortunate injury, increased attention to the game despite Auburn’s substantial lead.

Conclusion

Heading into the Final Four, all but one 1-seed team have packed their bags and gone home. If your bracket wasn’t busted before, it most likely is now. We’ve almost reached the end of the road, but there is still more madness to come. Next week, we’ll find out who will cut the nets in Minneapolis and which team and brand will be crowned NCAA Champions.

Kevin Guo is Co-founder and CEO of Hive, a full-stack deep learning company based in San Francisco building an AI-powered dashboard for media analytics. For inquiries, please contact him at kevin.guo@thehive.ai.

Viewership data was powered by 605, an independent TV data and analytics company.

*A Brand Prominence Score is defined as the combination of a logo’s clarity, size, and location on-screen, in addition to the presence of other brands or objects on screen.