Bain & Company Introduces Bain Media Lab; Announces Partnership with Hive and Launch of Mensio, an AI-Powered Analytics Platform to Analyze TV Advertising and Sponsorships

Hive | May 3, 2019

LOS ANGELES – April 30, 2019 – Bain & Company announced today the formation of Bain Media Lab, a business that will feature a portfolio of digital products and related services that combine breakthrough technologies with powerful datasets. Hive, a full-stack deep learning company based in San Francisco, will be the launch partner for Bain Media Lab.

Bain Media Lab is a new venture incubated in the Bain Innovation Exchange, a business unit that leverages Bain's network of venture capitalists, startups, and tech leaders to help clients innovate through the ecosystem, as well as support Bain in creating new ventures.

"We are excited to introduce Bain Media Lab and to announce our partnership with Hive," said Elizabeth Spaulding, the co-lead of Bain & Company's Global Digital practice. "Today's milestone launch exemplifies our strategy to deepen select Bain Innovation Exchange relationships through the formation of new businesses like Bain Media Lab, which will pair Bain's expertise with best-in-class innovation to create disruptive solutions. It will also be a powerful vehicle to dramatically accelerate the visibility and growth of innovative technology companies like Hive."

In partnership, Bain Media Lab and Hive have developed Mensio, an artificial intelligence-powered analytics platform focused on bringing "digital-like" measurement, intelligence, and attribution to traditional television advertising and sponsorships.

Mensio addresses a pain point shared by marketers and media companies: the lack of recent and granular data on the performance of traditional television advertising and sponsorships. As digital marketing has continued to grow its share of advertising dollars, marketers have become accustomed to seeing real-time campaign performance data with granular measurement of audience reach and outcomes. This dynamic has added pressure on television network owners to source comparable data to defend their share of marketers' advertising budgets.

"Our partnership with Hive is the result of an extensive evaluation of the landscape and our resulting conviction that together we can uniquely create truly differentiated solutions," said Dan Calpin, who leads Bain Media Lab. "Our launch product, Mensio, unlocks the speed and granularity of data for TV advertising and sponsorships that marketers have come to expect from their digital ad spend. Mensio arms marketers and their agencies to transition from post-mortem analysis of TV ad spend to real-time optimization, and gives network owners long-elusive data that can help them recast the narrative on advertising."

"We are excited to partner with Bain & Company as the launch partner of Bain Media Lab," said Kevin Guo, co-founder and CEO of Hive. "In jointly developing Mensio, we have blended the distinctive competencies of our two firms into a seamlessly integrated go-to-market offering. Hive's ambition is to leverage artificial intelligence in practical applications to transform industries, and Mensio is our flagship product in the media space."

Subscribers to the Mensio platform access a self-service, cloud-based dashboard that provides point-and-click reporting. Two tiers of the dashboard product are available: one for the buyers of TV advertising (marketers and their agencies) and one for the sellers (TV network owners).
Selected features available in the Mensio dashboard and from related services include:

- Reach: Measurement of exposure to a brand's TV advertisements for a given population, ranging from the total population to specific behavior-defined segments like frequent guests at quick service restaurants
- Frequency: Reporting on the distribution of frequency for a given population (e.g., what percent of households were exposed to more than 20 TV ads for a given brand over the course of a month)
- Attribution: Evaluation of the impact of exposure to TV advertising and sponsorships on a broad set of outcomes, including online activity, store visitation, and purchases, as well as qualitative brand metrics
- Competitive intelligence for brands: Insight into a brand's relative share of voice versus peers, as well as the mix of networks, programs, genres, dayparts, and ad formats used by a given brand relative to its competitive set
- Competitive intelligence for TV network owners: Insight into trends in spending by industry vertical and brand, as well as the relative share of a given TV network owner versus competitors
- Sponsorship measurement and return on investment: Measurement of the volume, quality, and equivalent media value of sponsorship placements and earned media, with the ability to link to outcomes

The Mensio product suite uses Hive's computer vision models – trained using data labeled by Hive's distributed global workforce of over 1 million people – to enrich recorded television content with metadata, including the identification of commercials and sponsorship placements as well as contextual elements like beach scenes. Second-by-second viewership of that content is derived using data from nearly 20 million U.S. households, inclusive of cable and satellite set-top boxes as well as Smart TVs, that is then scaled nationally and can be matched in a privacy-safe environment to a range of outcome behaviors. Outcome datasets enable household-level viewership of content to be matched to online activity (including search and website visits), retail store visits, and purchases (including retail purchases as well as several datasets specific to certain industries such as automotive and consumer packaged goods).

Mensio is currently in beta in the U.S. with a growing number of clients across industries. It will begin to expand into other geographies over the next year. For more information, visit: www.bainmedialab.com/mensio.

Bain & Company and Hive are additionally collaborating on other related products and services for television network owners, addressing programming optimization and content tagging use cases.

Editor's note: To arrange an interview with Mrs. Spaulding or Mr. Calpin, contact Dan Pinkney at dan.pinkney@bain.com or +1 646 562 8102. To arrange an interview with Mr. Guo, contact Kristy Yang at press@thehive.ai or +1 415 562 6943.

About Hive

Hive is a full-stack deep learning company based in San Francisco that focuses on solving visual intelligence challenges. Today, Hive works with many of the world's biggest companies in media, retail, security, and autonomous driving in building best-in-class computer vision models. Through its flagship enterprise platform, Hive Media, the company is aiming to build the world's largest database of structured media content. Hive has raised over $50M from a number of well-known venture investors and strategic partners, including General Catalyst, 8VC, and Founders Fund. For more information visit: www.thehive.ai. Follow us on Twitter @hive_ai.
About Bain & Company

Bain & Company is the management consulting firm that the world's business leaders come to when they want results. Bain advises clients on private equity, mergers and acquisitions, operations excellence, consumer products and retail, marketing, digital transformation and strategy, technology, and advanced analytics, developing practical insights that clients act on and transferring skills that make change stick. The firm aligns its incentives with clients by linking its fees to their results. Bain clients have outperformed the stock market 4 to 1. Founded in 1973, Bain has 57 offices in 36 countries, and its deep expertise and client roster cross every industry and economic sector. For more information visit: www.bain.com. Follow us on Twitter @BainAlerts.
The Effect of Dirty Data on Deep Learning Systems

Hive | April 23, 2019

Introduction

Better training data can significantly boost the performance of a deep learning model, especially when deployed in production. In this blog post, we illustrate the impact of dirty data and explain why correct labeling is important for increasing model accuracy.

Background

An adversarial attack fools an image classifier by adding an imperceptible amount of noise to an image. One possible defense is simply to train machine learning models on adversarial examples: we can mine hard examples and add them to the dataset. Another interesting model architecture to explore is the generative adversarial network (GAN), which generally consists of two parts: a generator that produces fake examples in order to fool the discriminator, and a discriminator that learns to distinguish real examples from fake ones.

Another type of attack, data poisoning, can happen at training time. An attacker can identify the weak parts of a machine learning architecture and modify the training data to confuse the model; even slight perturbations to the training data and labels can result in worse performance. There are several methods to defend against such data poisoning attacks. For example, it is possible to separate clean training examples from poisoned ones, so that the outliers are deleted from the dataset.

In this blog post, we investigate the impact of data poisoning (dirty data) using a simulation method: random labeling noise. We will show that with the same model architecture and dataset size, better data labeling yields a large increase in accuracy.

Data

We experiment with the CIFAR-100 dataset, which has 100 classes and 600 32×32 color images per class. We use the following steps to preprocess the images in the dataset:

- Pad each image to 36×36, then randomly crop a 32×32 patch
- Apply random horizontal flips
- Randomly distort image brightness and contrast

The dataset is randomly split into 50k training images and 10k evaluation images.

Random labeling is the substitution of training data labels with random labels drawn from the marginal distribution of the labels. We add different amounts of random labeling noise to the training data by shuffling a certain fraction of the labels in each class (a code sketch of this procedure appears below). The images to be shuffled are chosen randomly from each class, so the generated dataset remains balanced. Note that the evaluation labels are not changed.

We test the model on 4 different datasets, 1 clean and 3 noisy:

- Clean: no random noise. We assume that all labeling in the CIFAR-100 dataset is correct. Named 'no_noise'.
- Noisy: 20% random labeling noise. Named 'noise_20'.
- Noisy: 40% random labeling noise. Named 'noise_40'.
- Noisy: 60% random labeling noise. Named 'noise_60'.

We choose aggressive data poisoning because the production models we build are robust to small amounts of random noise. The random labeling scheme lets us simulate the effect of dirty data (data poisoning) in real-world scenarios.
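To make the noise-injection step concrete, here is a minimal, framework-agnostic sketch. The LabelNoise object and its addNoise helper are hypothetical names for illustration, not our production pipeline; this variant re-labels a fixed fraction of each class's images with a uniformly random other class, which keeps the dataset balanced:

    import scala.util.Random

    // Sketch of the random-labeling simulation described above (illustrative
    // helper, not the actual pipeline): for each class, pick a fixed fraction
    // of that class's examples at random and re-label them with a uniformly
    // random *other* class.
    object LabelNoise {
      val NumClasses = 100 // CIFAR-100

      def addNoise(labels: Array[Int], noiseFraction: Double, seed: Long = 0L): Array[Int] = {
        require(noiseFraction >= 0.0 && noiseFraction <= 1.0)
        val rng   = new Random(seed)
        val noisy = labels.clone()
        // Group example indices by class so every class is corrupted equally,
        // keeping the generated dataset balanced.
        labels.indices.groupBy(i => labels(i)).foreach { case (cls, indices) =>
          val numToFlip = math.round(indices.length * noiseFraction).toInt
          rng.shuffle(indices.toList).take(numToFlip).foreach { i =>
            // Resample until the new label differs from the true class.
            var newLabel = rng.nextInt(NumClasses)
            while (newLabel == cls) newLabel = rng.nextInt(NumClasses)
            noisy(i) = newLabel
          }
        }
        noisy
      }
    }

Applying addNoise with fractions 0.2, 0.4, and 0.6 to one clean label array would yield the noise_20, noise_40, and noise_60 training sets; the evaluation labels are left untouched.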
Model

We investigate the impact of dirty data on a popular model architecture, ResNet-152. Normally it is a good idea to fine-tune from pre-trained checkpoints to get better accuracy with fewer training steps. In this post, however, the model is trained from scratch, because we want a general picture of how noisy data affects training and final results without any prior knowledge gained from pretraining. We optimize the model with SGD (stochastic gradient descent) and cosine learning rate decay.

Results

Quantitative results: accuracy

Cleaner datasets consistently perform better on the validation set. The model trained on the original CIFAR-100 dataset gives us 0.65 accuracy; using the top 5 predictions boosts the accuracy to 0.87. Testing accuracy decreases as more noise is added: each time we add 20% more random noise to the training data, testing accuracy drops by about 10 percentage points. Note that even with 60% random labeling noise, our model still manages 0.24 accuracy on the validation set. The variance of the training data, the preprocessing methods, and the regularization terms all help increase the robustness of the model. So even when learning from a very noisy dataset, the model is still able to learn some useful features, although overall performance degrades significantly.

Qualitative results: learning curves

[Figures: training losses and precision-recall curves for each noise level]

Conclusion

In this post we investigated the impact of data poisoning attacks on performance, using image classification as an example task and random labeling as the simulation method. We showed that a popular model (ResNet-152) is somewhat robust to data poisoning, but performance still degrades significantly after poisoning. High-quality labeling is thus crucial to modern deep learning systems.
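As a footnote on the metrics above: the top-5 figure is an instance of top-k accuracy, where a prediction counts as correct if the true label appears among the k highest-scoring classes. A minimal sketch (topKAccuracy is a hypothetical helper, not our evaluation code):

    // Top-k accuracy: a prediction is correct if the true label is among the
    // k classes with the highest scores. scores(i) holds the per-class scores
    // for example i; labels(i) is its true class index.
    def topKAccuracy(scores: Array[Array[Double]], labels: Array[Int], k: Int): Double = {
      val hits = scores.zip(labels).count { case (row, label) =>
        row.zipWithIndex                    // (score, classIndex) pairs
           .sortBy { case (s, _) => -s }    // descending by score
           .take(k)
           .exists { case (_, cls) => cls == label }
      }
      hits.toDouble / labels.length
    }

On the clean model above, topKAccuracy(valScores, valLabels, 1) would correspond to the 0.65 figure and topKAccuracy(valScores, valLabels, 5) to the 0.87 figure.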
One Shining Moment for Brands

We've reached the end of the road. 1-seed Virginia was crowned NCAA Champion with an overtime win against 3-seed Texas Tech. Hive wrapped up the tournament with analysis of the Final Four and NCAA Championship game. Here's how March Madness played out for brands this year.

Hive | April 17, 2019

At a Glance:

- Hive analyzed the NCAA Championship game to assess logo distribution by Brand Prominence*, earned media exposure, and viewership trends across games.
- AT&T's sponsored logos on digital overlays and during the halftime shows won the most screen time across all of March Madness.
- Apparel sponsors placed their own March Madness bets by choosing which teams to sponsor as gear providers. Of the 68 teams in the field, Nike backed 59%, followed by Under Armour with 25% and Adidas with 16%. Under Armour's sponsorship bets paid off by the championship, as Texas Tech went head to head with Nike-backed Virginia.
- NCAA Championship game viewership fell slightly from last year but still reached nearly twice as many households as the Duke vs. Virginia Tech game (the most-viewed non-finals game in the tournament).

Another March Madness, another CBS 'One Shining Moment' montage. Texas Tech and Virginia beat out Michigan State and Auburn in the Final Four to face each other in the championship game. Neither school had ever made it this far in NCAA history, and their game was the first time two first-time participants went head to head in 40 years. Hive used its best-in-class computer vision models in conjunction with viewership data (powered by 605, an independent TV data and analytics company) from more than 10M households to analyze brand logo distribution and viewership levels.

Brands

Figure 1. Results from the Hive Logo Model

Hive recapped all the rounds and mapped out logo placements and earned media for all sponsors. AT&T remained consistent throughout the entire tournament and scored 50% more airtime than Nike, which had the second highest amount of screen time.

Figure 2. Results from the Hive Logo Model

One brand's March Madness bets paid off big time this year. The tournament started with Nike sponsoring 40 teams, followed by Under Armour with 17 and Adidas with 11. Adidas got the boot in the earlier rounds, but Under Armour edged its way into the NCAA Championship game backing the Texas Tech Red Raiders as they went head to head with the Nike-backed Virginia Cavaliers. Under Armour went from sponsoring 25% of teams in the First Four to nearly 50% by the finals, earning it the same amount of screen time as competitor Nike. Figure 2 shows the fight from the beginning of the tournament to the very last game. These sponsors gave brackets a whole new meaning.

Games

Two defensive-minded teams faced each other in the championship this year. Texas Tech was unranked when the season started and soon found itself in its first ever National Championship game. After becoming the first 1-seed to lose to a 16-seed in NCAA history last year, Virginia proved everyone wrong this year and also made its National Championship debut. The game itself got off to a slow start before we saw Virginia take a 10-point lead, fall to a 3-point deficit, then tie the game 68-68 to force overtime. Texas Tech fought hard, but at the end of the day, Virginia had the last say.

Figure 3.
Viewership data powered by 605

As the biggest night of the year for college basketball, the NCAA Championship game reached 12% of American households, with a peak of 15% – almost double the viewership of the Duke vs. Virginia Tech game, the highest performing non-finals game of the tournament.

Conclusion

March Madness is a huge opportunity for brands. We've learned which brands performed the best, which elements drove viewership, and which retained it. We also learned that you don't need a Zion to go to the Final Four, but it helps to have a star player to hike up viewership levels. Hive is the premier solution for brands looking to improve their real-time reach and ROI measurement for commercials, earned media, and sponsorships during events like March Madness.

Kevin Guo is Co-founder and CEO of Hive, a full-stack deep learning company based in San Francisco building an AI-powered dashboard for media analytics. For inquiries, please contact him at kevin.guo@thehive.ai.

Viewership data was powered by 605, an independent TV data and analytics company.

*A Brand Prominence Score is defined as the combination of a logo's clarity, size, and location on-screen, in addition to the presence of other brands or objects on screen.
Survive and Advance: Winners of the Sweet 16 and Elite Eight

March Madness lived up to its name last weekend in this year's Sweet 16 and Elite Eight. The road to the Final Four has been exhilarating for some and heartbreaking for others. Only four teams remain in the NCAA tournament, and Hive followed the journeys of the teams, viewers, and advertisers. Here's how everyone's stories unfolded in the next two chapters of March Madness.

Hive | April 8, 2019

At a Glance:

- Hive analyzed the Sweet 16 and Elite Eight to assess logo distribution by Brand Prominence*, earned media exposure, and viewership trends across games.
- Buffalo Wild Wings capitalized on its overtime commercial spots, with the highest average household reach (5.7% of households) for its placements.
- AT&T's logo placements showed consistency, maintaining the top spot for screen time, with a majority of logos scoring above average on Brand Prominence.*
- The highest average household viewership occurred during the Sweet 16, where Duke vs. Virginia Tech had a 7.6% average household reach and a peak of 9.2%. Second place went to Duke vs. Michigan State in the Elite Eight, with a 6.8% average household reach and a peak of 10.5% – the highest of any game.
- Hive assessed how the point gap in the last minutes of games drove increased viewership and found a strong correlation, with the closest games seeing up to a 200% bump in viewership in the final minutes.

Texas Tech, Virginia, Auburn, and Michigan State fought tough battles and earned themselves spots in the Final Four. Over the course of four days, six teams upset their opponents, three games went into overtime, and two teams found out that they would make their Final Four debuts. Hive used its best-in-class computer vision models in conjunction with viewership data (powered by 605, an independent TV data and analytics company) from more than 10M households to analyze brand logo distribution and sources of viewership fluctuation.

Brand winners

Figure 1. Viewership data powered by 605

Official NCAA Corporate Partner Buffalo Wild Wings snagged the most effective commercial spots in the Sweet 16 and Elite Eight, earning the top household reach per commercial airing. Its overtime-specific commercial was created to air only during overtime games, which paid off big time in these two rounds. With Purdue vs. Tennessee in the Sweet 16 and Purdue vs. Virginia and Auburn vs. Kentucky in the Elite Eight all going into overtime, the brand's ad slots earned it a number of extra opportunities to get in front of fans with relevant content. Google reached the second highest household reach per commercial airing, followed by Apple.

Figure 2. Results from the Hive Logo Model

The Hive Logo Model also scanned every second of the 12 games this week for logo placements and earned media. AT&T's digital overlays and halftime sponsorship earned the most airtime again this week. Its logos were not only frequently on screen, but also quite prominent, with a majority scoring more than 20 on Brand Prominence.* Apparel and gear sponsors Nike, Under Armour, and Spalding all received lots of screen time, but their logos were low prominence, usually appearing in the action on jerseys, shoes, or hoops. Courtside sponsors Lowe's, Buick, Infiniti, and Coca-Cola were all consistently mid-scoring, with a few very strong placements when the camera caught the logo in the background of a close-up.

Top games

Figure 3.
Viewership data powered by 605

The Sweet 16 and Elite Eight once again proved that the Zion effect is real. The top two games by average viewership over the course of the two rounds were both Duke games. In the Sweet 16, fans held their breath as the Blue Devils narrowly escaped Virginia Tech 75-73. The game raked in the largest audience of the tournament yet. However, the Zion show came to an end after Michigan State shocked Duke in the Elite Eight. The final score read 68-67, a bracket-busting win for Michigan State.

Figure 4. Viewership data powered by 605

No. 3 seed Purdue put up a fight in this year's tournament with two overtime games. Figure 4 shows a graph of its battle against No. 2 seed Tennessee in the Sweet 16, overlaid with the Florida State vs. Gonzaga game that started a few minutes before. The CBS game retained steady viewership as it approached halftime, while the TBS game started just in time for the other game's viewers to switch over. They flipped back to CBS during Purdue vs. Tennessee's halftime show, but they did not return when it ended. This may be attributed to the fact that barely five minutes into the second half, Purdue took an 18-point lead over Tennessee. However, Tennessee began to make a comeback, and viewership spiked to 7% as they forced OT. Purdue prevailed, securing its spot in the Elite Eight for the first time since 2000.

Interestingly, viewers in this round overwhelmingly followed the action on both channels. The loss in viewership during halftime of the CBS game was almost perfectly mirrored by a bump in viewership for the TBS game. When the CBS game returned, most switched back, until Tennessee started to come back from its double-digit deficit, stealing a majority of the viewership as the CBS game tailed off in its last few minutes.

Purdue's Elite Eight performance drew an even bigger crowd than the last round. An average of 6% of American households watched the team play 1-seed Virginia in arguably one of the most exciting games of the entire tournament. Within the last two minutes of regulation, Carsen Edwards gave Purdue the lead, impressing America with his tenth three-pointer of the game. With only six seconds remaining, all of the stars were aligned as UVA's Ty Jerome perfectly missed his second free throw, commencing the play that allowed Mamadi Diakite to tie up the game and force OT. Ultimately, Virginia edged out Purdue 80-75, preventing what could have been Purdue's first ever Final Four appearance.

Two teams, however, could anticipate their Final Four debuts. After defeating Kansas and North Carolina, 5-seed Auburn beat 2-seed Kentucky in the Elite Eight, proving that the Tigers can hang with the blue bloods. Texas Tech will also be showing up to the Final Four for the first time in program history after upsetting No. 1 Gonzaga. This game had the highest average household viewership on TBS during these two rounds.

Figure 5. Viewership data powered by 605

Given the last-minute shifts in viewership during the Purdue vs. Tennessee nail-biter, Hive decided to analyze how the point gap in the last 10 minutes of a game drives viewership. As would be expected, increases in game viewership in the last ten minutes were strongly driven by how close the scores were. As the average point differential near the end of the game decreased, viewership grew substantially, with the closest games seeing up to a 250% bump. Auburn vs.
North Carolina was an exception to this trend, seeing viewership rise 100% during the last ten minutes despite a double-digit point gap. This was likely due to its interest relative to the competing game, LSU vs. Michigan State, which had a similarly wide point gap but in favor of the higher-seeded team. Auburn's upset, coupled with Chuma Okeke's unfortunate injury, increased attention to the game despite Auburn's substantial lead.

Conclusion

Heading into the Final Four, all but one 1-seed have packed their bags and gone home. If your bracket wasn't busted before, it most likely is now. We've almost reached the end of the road, but there is still more madness to come. Next week, we'll find out who will cut the nets in Minneapolis and which team and brand will be crowned NCAA Champions.

Kevin Guo is Co-founder and CEO of Hive, a full-stack deep learning company based in San Francisco building an AI-powered dashboard for media analytics. For inquiries, please contact him at kevin.guo@thehive.ai.

Viewership data was powered by 605, an independent TV data and analytics company.

*A Brand Prominence Score is defined as the combination of a logo's clarity, size, and location on-screen, in addition to the presence of other brands or objects on screen.
Brand Madness: The Road to the Final Four

March Madness is one of the most popular sports showcases in the nation, and Hive is following along all month. Here's how the First and Second Round advertisers did and which games elicited the most commotion.

Hive | March 28, 2019

At a Glance:

- March Madness is the most anticipated tournament of the year for college basketball fans, bracket-holders, and advertisers alike.
- Hive analyzed the First and Second Rounds to assess viewership trends across games, earned media exposure, and sponsorship winners.
- The highest average viewership occurred during the Round of 32, where Auburn vs. Kansas had a 0.8% average household reach. However, First Round Cinderella story UC Irvine's Second Round game against Oregon achieved the highest peak viewership, reaching 1.1% of households.
- Although UC Irvine's fairy tale ending wasn't meant to be, the potential of another upset in the Second Round bumped its viewership 108% from the previous round. The victor of that game, Oregon, was the only double-digit seed to survive through to the Sweet 16.
- AT&T optimized its sponsorship spot and earned the most screen time while maintaining a high Brand Prominence Score.
- Progressive had the most effective airings, with over 2% average reach for its spots, but GMC and AT&T generated the most viewership, with over 100 airings each and an average household reach of just under 2%.

March Madness is a live TV experience like no other. Millions of brackets are filled out every year, and unlike other one-time sporting events such as Super Bowl Sunday, March Madness is an extended series of games with an all-day TV schedule. This results in more data points and, in turn, more opportunities to assess patterns and trends. The elongated showcase gives marketers some madness of their own – TV advertisers and sponsors receive a unique chance to hold their audience's attention and craft a story.

In the First and Second Rounds, Hive used its best-in-class computer vision models in conjunction with viewership data (powered by 605, an independent TV data and analytics company) from more than 10M households to analyze which brands made an appearance, which games viewers were watching, and which advertisers optimized their NCAA sponsorship real estate.

Brands that stole the show

Because of the diversity in tournament viewership, brands have the opportunity to market content to both fans and non-fans. Acquiring March Madness real estate means unlocking millions of impressions from one of the largest audiences in the nation. Hive is able to measure earned media and sponsorship exposure by using computer vision AI to identify brand logos in content during regular programming, creating a holistic "media fingerprint." This visual contextual metadata is overlaid with the most robust viewership dataset available to give brands an unparalleled level of data on their earned media and sponsorships. Hive Media helps brands understand how a dollar of advertising spend may translate to real-life consumer purchases.

Hive's AI models capture every second that a brand's logo is on screen and assign that logo a Brand Prominence Score.* AT&T won big in the first two rounds of the tournament. As an Official NCAA Corporate Champion and halftime sponsor, it had prominent digital overlays, one of the highest brand prominences, and the most seconds on screen with almost six hours. Its earned media was equivalent to 260 30-second spots.
In second place was Capital One, another Corporate Champion, with 3 hours and 22 minutes, followed by Nike with 3 hours and 20 minutes of screen time.

Figure 1. Results from the Hive Logo Model

Apparel and gear sponsors such as Nike, Under Armour, and Spalding earned a significant number of minutes on screen because they appeared in locations such as jerseys and backboards. However, as a result of those locations, the logos appeared with much less prominence.

In addition to on-site logos, Hive also tracked the top brands by commercial airings and average household reach. Progressive, Coca-Cola, State Farm, and Apple all earned higher than 2% average household reach with a selective placement strategy; however, GMC and AT&T were the big winners in terms of volume, as each earned almost 2% average reach with significantly more airings.

Figure 2. Results from the Hive Commercial AI Model, viewership powered by 605

Top games

Here are the most viewed games on broadcast networks.

Figure 3. Viewership data powered by 605

And here are the top five cable games from each round.

Figure 5. Viewership data powered by 605

Figure 6. Viewership data powered by 605

In the First Round, the Florida vs. Nevada game on TNT had the highest average viewership on cable TV. Figure 5 shows the household reach of Florida vs. Nevada on TNT alongside that of St. Mary's vs. Villanova on TBS, games that were broadcast at the same time. Household tune-in remained steady in the first half, with dips in viewership during commercial breaks. By halftime, Florida had secured a healthy lead, and viewership dropped as viewers switched over to the TBS game. However, viewership recovered strongly during the second half as Florida began to squander a double-digit lead just as the TBS game went to halftime. Viewers were retained even as the TBS game returned, with viewership continuing to rise until Florida narrowly edged out a win over Nevada. The head-to-head comparison illustrates powerful correlations between viewership and excitement during live games.

Figure 7. Viewership data powered by 605

March Madness always has the entire nation buzzing about Duke, a national brand with millions of people following the blue blood powerhouse. This year, fan loyalty, coupled with the intrigue of Zion Williamson, unsurprisingly earned the team one of the largest overall audiences of the first two rounds. To top it off, the Blue Devils narrowly escaped what would have been the biggest upset of the tournament with a one-point victory.

Despite CBS's broadcast games driving the most viewership, many households switched to cable to follow the Round of 64's most exciting underdog story, UC Irvine. After the First Round, it seemed as if Cinderella had moved to sunny California as UC Irvine ran away with its first tournament victory in NCAA history. Although Irvine's average viewership did not make the top 5 in the First Round, audience tune-in continued to soar throughout the game. After this exciting upset, UCI reached a peak of 1.1% of U.S. households in the Second Round, making the top five most-viewed games of the round as it went head to head with Oregon. The dream of a fairy tale ending came to an end as Oregon defeated Irvine to become the only double-digit seed to secure a spot in the Sweet 16.

Figure 8. Viewership data powered by 605

Conclusion

The first two rounds may be over, but we've only just begun the games. With no true Cinderella run this year, March Madness continues on with only the top programs.
Keep an eye out to learn how this week’s bracket busters will affect audience retention in the Sweet 16 and Elite Eight as Hive continues to track viewership and advertising trends. Kevin Guo is Co-founder and CEO of Hive, a full-stack deep learning company based in San Francisco building an AI-powered dashboard for media analytics. For inquiries, please contact him at kevin.guo@thehive.ai. Viewership data was powered by 605, an independent TV data and analytics company. *A Brand Prominence Score is defined as the combination of a logo’s clarity, size, and location on-screen, in addition to the presence of other brands or objects on screen.
Spark on Mesos Part 2: The Great Disk Leak

Hive | March 8, 2019

After ramping up our usage of Spark, we found that our Mesos agents were running out of disk space. It was happening rapidly on some of our agents with small disks.

The issue turned out to be that Spark was leaving behind binaries and jars in both driver and executor sandbox directories. Each uncompressed Spark binary directory contains 248MB, so summing it all up, a small pipeline with one driver and one executor leaves behind 957MB. At our level of usage, this was 100GB of dead weight added every day.

I looked into ways to at least avoid storing the compressed Spark binaries, since Spark only really needs the uncompressed version. It turns out that Spark uses the Mesos fetcher to copy and extract files. By enabling caching on the Mesos fetcher, Mesos will store only one cached copy of the compressed Spark binaries, then extract it directly into each sandbox directory. According to the Spark documentation, this should be solved by setting the spark.mesos.fetcherCache.enable option to true: "If set to true, all URIs (example: spark.executor.uri, spark.mesos.uris) will be cached by the Mesos Fetcher Cache."

Adding this to our Spark application confs, we found that the cache option was turned on for the executor, but not the driver. This brought our disk leak down to 740MB per Spark application.

Reading through the Spark code, I found that the driver's fetch configuration is defined by the MesosClusterScheduler, whereas the executors' is defined by the MesosCoarseGrainedSchedulerBackend. There were two oddities about the MesosClusterScheduler:

- It reads options from the dispatcher's configuration instead of the submitted application's configuration
- It uses the spark.mesos.fetchCache.enable option instead of spark.mesos.fetcherCache.enable

So bizarre! Finding no documentation for either of these issues online, I filed two bugs. By now, my PRs to fix them have been merged in and should show up in upcoming releases. In the meantime, I implemented a workaround by adding the spark.mesos.fetchCache.enable=true option to the dispatcher. Now the driver also used caching, reducing the disk leak to 523MB per Spark application.

Finally, I took advantage of Spark's shutdown hook functionality to manually clean up the driver's uberjar and uncompressed Spark binaries:

    import java.io.File
    import org.apache.commons.io.FileUtils
    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

    // shutdown hook to clean driver Spark binaries after the application finishes
    sys.env.get("MESOS_SANDBOX").foreach { sandboxDirectory =>
      sparkSession.sparkContext.addSparkListener(new SparkListener {
        override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
          val sandboxItems = new File(sandboxDirectory).listFiles()
          // match the uncompressed Spark binary directory and our application uberjar
          val regexes = Array(
            "^spark-\\d+\\.\\d+\\.\\d+-bin".r,
            "^hive-spark_.*\\.jar".r
          )
          sandboxItems
            .filter(item => regexes.exists(_.findFirstIn(item.getName).isDefined))
            .foreach(FileUtils.forceDelete)
        }
      })
    }

This reduced the disk leak to just 248MB per application: the uncompressed Spark binaries left in the executor sandboxes. This still isn't perfect, but I don't think there will be a way to delete the uncompressed Spark binaries from your Mesos executor sandbox directories until Spark adds more complete Mesos functionality. For now, it's a 74% reduction in the disk leak.

Last, and perhaps most importantly, we reduced the time-to-live for our completed Mesos frameworks and sandboxes from one month to one day. This effectively cut our equilibrium disk usage by 97%. Our Mesos agents' disk usage now stays at a healthy level.
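For reference, the two-part configuration workaround described above can be summarized as follows. This is a sketch: the option keys are the ones named in this post, and the surrounding code is illustrative rather than our exact setup.

    import org.apache.spark.SparkConf

    // Per-application configuration: enables Mesos fetcher caching for the
    // *executors* (the driver-side MesosClusterScheduler ignores this key).
    val conf = new SparkConf()
      .set("spark.mesos.fetcherCache.enable", "true")

    // Driver-side workaround: because MesosClusterScheduler reads the
    // *dispatcher's* configuration and uses the differently spelled key,
    // add the following line to the spark-defaults.conf used to launch the
    // MesosClusterDispatcher:
    //
    //   spark.mesos.fetchCache.enable  true

Once the upstream fixes land, the per-application fetcherCache key alone should cover both the driver and the executors, and the dispatcher-side line can be dropped.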