Spark on Mesos Part 1: Setting Up
Hive, February 12, 2019

At Hive, we've built a data platform that includes Apache Spark applications using Mesos as their resource manager. While both Spark and Mesos are popular frameworks used by many top tech companies, using the two together is a relatively new idea with incomplete documentation.

Why choose Mesos as Spark's resource manager?

Spark needs a resource manager to tell it what machines are available and how much CPU and memory each one has. Spark uses this information to request that the resource manager launch tasks for the executors it needs. There are currently four resource manager options: standalone, YARN, Kubernetes, and Mesos. We wanted our Spark applications to use the same in-house pool of resources that our other, non-Hadoop workloads do, so only Kubernetes and Mesos were options for us. There are great posts out there contrasting the two, but for us the deciding factor was that we already use Mesos: Spark applications can share resources with your other Mesos frameworks.

Learnings from Spark on Mesos

Spark's guide on running Spark on Mesos is the best place to start setting this up. However, we ran into a few notable quirks it does not mention.

A word on Spark versions

While Spark has technically supported Mesos since version 1.0, it wasn't very functional until recently. We strongly recommend using Spark 2.4.0 or later. Even in Spark 2.3.2, there were some pretty major bugs:

- The obsolete MESOS_DIRECTORY environment variable was used instead of MESOS_SANDBOX, which caused an error during sparkSession.stop in certain applications.
- Spark-submit to Mesos would not properly escape your application's configurations. To run the equivalent of spark-submit --master local[4] --conf1 "a b c" --class package.Main my.jar on Mesos, you would need to run spark-submit --master mesos://url --deploy-mode cluster --conf1 "a\ b\ c" --class package.Main my.jar. Spark 2.4.0 still has this issue for application arguments.
- Basically everything under the version 2.4.0 Jira page with Mesos in the name.

Still, Spark's claim that it "does not require any special patches of Mesos" is usually wrong on one count: accessing jars and Spark binaries in S3. To access these, your Mesos agents will need Hadoop libraries; otherwise, you will only be able to access files stored locally on the agents or reachable over HTTP. To use S3 or HDFS links, you must configure every Mesos agent with a local path to the Hadoop client. This allows the Mesos fetcher to grab the Spark binaries and begin executing the job.
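To make the client side of this concrete, here is a minimal PySpark sketch of a session pointed at a Mesos master with its Spark binaries in S3. Every hostname, bucket, and credential below is a placeholder rather than our actual setup, and the exact s3a settings depend on how your agents' Hadoop client is configured.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a Spark-on-Mesos session whose binaries live in S3.
# All hostnames, buckets, and credentials are placeholders.
spark = (
    SparkSession.builder
    .appName("mesos-s3-example")
    # Mesos master, discovered through ZooKeeper
    .master("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos")
    # Spark distribution each Mesos agent's fetcher downloads; this only
    # works if the agents have a Hadoop client configured for S3 access
    .config("spark.executor.uri", "s3a://my-bucket/spark-2.4.0-bin-hadoop2.7.tgz")
    # Credentials for the Hadoop s3a filesystem used by the job itself
    .config("spark.hadoop.fs.s3a.access.key", "AKIA...")
    .config("spark.hadoop.fs.s3a.secret.key", "...")
    .getOrCreate()
)

print(spark.range(1000).count())  # quick smoke test
spark.stop()
```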
Spark Mesos dispatcher

The Spark dispatcher is a very simple Mesos framework for running Spark jobs inside a Mesos cluster. The dispatcher does not manage resource allocation or the application lifecycle of the jobs. Instead, for each new job it receives, it launches a Spark driver within the cluster. The driver is itself a Mesos framework with its own UI and is responsible for provisioning resources and executing its specific job. The dispatcher is solely responsible for launching and keeping track of Spark drivers.

(Figure: how a Spark driver runs jobs in any clustered configuration.)

While setting up the dispatcher is as simple as running the provided startup script, one operational challenge to consider is where to place it. The two pragmatic locations for us were running the dispatcher on a separate instance outside the cluster, or as an application inside the Marathon Mesos framework. Both had their trade-offs, but we decided to run the dispatcher on a small dedicated instance, as that was an easy way to give the service a persistent endpoint. One small concern worth mentioning is the lack of high availability (HA) for the dispatcher. While Spark drivers continue to run when the dispatcher is down, and state recovery is available with Apache ZooKeeper, multiple dispatchers cannot be coordinated together. If HA is important to you, it may be worthwhile to run the service on Marathon and set up some form of service discovery so you still have a persistent endpoint for the dispatcher.

Dependency management

There are at least three ways to manage dependencies for your Spark repo:

1. Copying dependency jars to the Spark driver yourself and specifying spark.driver.extraClassPath and spark.executor.extraClassPath.
2. Specifying spark.jars.packages and optionally spark.jars.repositories.
3. Creating an uberjar that includes both your code and all necessary dependencies' jars.

Option 1 gives you total control over which jars you use and where they come from, in case there are items in the dependency tree you know you don't need. This can save some application startup time, but it is very tedious. Option 2 streamlines option 1 by listing the required jars only once and pulling them from the listed repositories automatically, but gives up that fine control by pulling the full dependency tree of each dependency. Option 3 gives back that fine control and is the simplest, but duplicates the dependencies in every uberjar you build. Overall, we found option 3 most appealing. Compared to option 2, it saved 5 seconds of startup time on every Spark application and removed the worry that the Maven repository might become unavailable. Better automating option 1 might be the most ideal solution of all, but for now it isn't worth our effort.

What next?

Together with Spark's guide to running on Mesos, this should address many of the hiccups you'll encounter. But join us next time as we tackle one more: the great disk leak.
Learning Hash Codes via Hamming Distance Targets
Hive, January 18, 2019

We recently submitted our paper Learning Hash Codes via Hamming Distance Targets to arXiv. It is a revamp and generalization of our previous work, CHASM (Convolutional Hashing for Automated Scene Matching). We achieved major recall and performance boosts over state-of-the-art methods on content-based image retrieval and approximate nearest neighbor tasks. Our method can train any differentiable model to hash for similarity search.

Similarity search with binary hash codes

Let's start with everyone's favorite example: ImageNet. A common information retrieval task selects 100 ImageNet classes and requires hashing "query" and "dataset" images to compare against each other. Methods seek to maximize the mean average precision (MAP) of the top 1000 dataset results by hash distance, such that most of the 1000 nearest dataset images to each query image come from the same ImageNet class. This is an interesting challenge because training requires a differentiable loss term, whereas the final hash is discrete. Trained models must either binarize their last layer into 0s and 1s (usually by just taking its sign), or (like FaceNet) pair up with a nearest neighbor method such as k-d trees or Jegou et al.'s Product Quantization for Nearest Neighbor Search.

Insight 1: It's not a classification task.

While information retrieval on ImageNet is reminiscent of classification, its optimization goal is actually quite different. Every image retrieval paper we looked at implicitly treated similarity search as if it were a classification task.

Some papers make this assumption by using cross entropy terms, asserting that the probability that two images with last layers $x_1$ and $x_2$ are similar is something like $\sigma(\langle x_1, x_2 \rangle)$. The issue here is that the model uses hashes at inference time, not the asserted probabilities. An example of this is Cao et al.'s HashNet: Deep Learning to Hash by Continuation.

Other papers make this assumption by simply training a classification model with an encoding layer, then hoping that the binarized encoding makes a good hash. The flaw here is that, while the floating-point encoding contains all the information used to classify the image, its binarized version might not make a good hash: bits may be highly imbalanced, and there is no guarantee that binarizing the encoding preserves much of the information. An example of this is Lin et al.'s Deep Learning of Binary Hash Codes for Fast Image Retrieval.

Finally, a few papers make this assumption by first choosing a target hash for each class, then trying to minimize the distance between each image's hash and its class's target hash. This is actually a pretty good idea for ImageNet, but it leaves something to be desired: it only works naturally for classification, not for more general similarity search tasks, where similarity can be non-transitive and asymmetric. An example of this is Lu et al.'s Deep Binary Representation for Efficient Image Retrieval, which seems to be the second-best-performing method after ours.

We instead choose a loss function that easily extends to non-transitive, asymmetric similarity search tasks without training a classification model. I'll elaborate on this in the next section.

Insight 2: There is a natural way to compare floating-point embeddings to binarized hashes.

Previous papers have tried to wrestle floating-point embeddings into binarized hashes through a variety of means.
Some add "binarization" loss terms, punishing the model for producing embedding components far from -1 or 1. Others learn "by continuation", producing an embedding by passing its inputs through a tanh function that sharpens during training. The result is that their floating-point embeddings always lie close to $\pm 1$, a finding that they boast about. They do this to make Euclidean distance or inner product correspond more closely to Hamming distance (the number of bits that differ): if an embedding consists only of -1s and 1s, its Hamming distance to another such embedding is simply a quarter of their squared Euclidean distance.

However, this is actually a disaster for training. First of all, forcing all outputs toward $\pm 1$ does not even change their binarized values; 0.5 and 0.999 both binarize to 1. It also shrinks gradients for embedding components near $\pm 1$, forcing the model to learn from an ever-shrinking gray area of remaining values near 0.

We resolve this by avoiding the contrived use of Euclidean distance altogether. Instead, we use a sound statistical model for Hamming distance based on embeddings, making two approximations that turn out to be very accurate. First, we assume that a model's last layer produces embeddings whose components are independent unit normals (which we encourage with a gentle batch normalization). If we pick a random embedding $x$, this implies that $x / \lVert x \rVert$ is approximately a random point on the unit hypersphere. We can then simply evaluate the angle $\theta$ between two embeddings $x$ and $y$. The probability that such vectors differ in sign on a particular component is $\theta / \pi$, so we make our second approximation: that each component differs in sign independently. This implies that the probability for the Hamming distance between the binarized $x$ and $y$ to be $k$ follows a binomial distribution:

$$P\big(d_H(x, y) = k\big) = \binom{b}{k} \left(\frac{\theta}{\pi}\right)^{k} \left(1 - \frac{\theta}{\pi}\right)^{b - k},$$

where $b$ is the number of bits in the hash. This allows us to use a very accurate loss function for the true optimization goal: the chance for an input to be within a target Hamming distance of inputs it is similar to (and outside that Hamming distance of inputs it is dissimilar to). Using the natural geometry of the Hamming embedding, we achieve far better results than previous work.

Insight 3: It's important to structure your training batches right.

Imagine this: you train your model to hash, passing in 32 pairs of random images from your training set and averaging the 32 pairwise loss terms. Since you have 100 ImageNet classes, each batch consists of 31.68 dissimilar pairs and 0.32 similar pairs on average. How accurate is the gradient? The bottleneck is learning about similar images. If the random error for each similar pair is $\sigma$, the expected random error contributed per batch is roughly $\sigma / \sqrt{0.32} \approx 1.8\sigma$, even greater than $\sigma$ itself. This is a tremendous amount of noise, making it incredibly slow for the model to learn anything.

We can first improve this by comparing every image in the batch to every other image. This takes $\binom{64}{2} = 2016$ comparisons, which is only slightly slower than the original 32 comparisons, since we still only need to run the model once on each input. This gives us 2016 pairwise comparisons, with 1995.84 dissimilar pairs and 20.16 similar pairs on average. Our random error is now somewhere between $\sigma / \sqrt{20.16}$ (if every similar pair were independent) and something larger, since similar pairs share images (probably closer to the latter); still a big improvement.

But we can do even better by constructing a batch that deliberately contains similar images. By first choosing 32 random images, then for each one choosing a random image it is similar to, we get 51.84 similar pairs on average, 32 of which are independent. This reduces our random error to somewhere between $\sigma / \sqrt{51.84}$ and $\sigma / \sqrt{32}$, another big improvement. Under reasonable conditions, this improves training speed by a factor of 10.
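To make that batch construction concrete, here is a rough sketch of how such a batch and its pairwise similarity mask could be assembled. The dataset lookup, the "same class means similar" rule, and the batch size are illustrative placeholders, not the exact pipeline from the paper.

```python
import numpy as np

def build_batch(labels_by_image, images_by_label, n_anchors=32, rng=np.random):
    """Pick n_anchors random images, then one random similar image for each.

    labels_by_image: dict image_id -> class label (a stand-in for a real
        similarity lookup; here "similar" just means "same class").
    images_by_label: dict class label -> list of image_ids.
    Returns the batch's image ids and a (2n x 2n) boolean similarity mask
    covering every pairwise comparison in the batch.
    """
    image_ids = list(labels_by_image)
    anchors = rng.choice(image_ids, size=n_anchors, replace=False)
    positives = [rng.choice(images_by_label[labels_by_image[a]]) for a in anchors]
    batch = list(anchors) + list(positives)

    labels = np.array([labels_by_image[i] for i in batch])
    similar = labels[:, None] == labels[None, :]   # all 64 x 64 comparisons (for 32 anchors)
    np.fill_diagonal(similar, False)               # ignore self-pairs
    return batch, similar
```

A single forward pass over the 64 images then yields all C(64, 2) = 2016 pairwise loss terms, with the anchor-positive construction guaranteeing at least 32 similar pairs; the batching logic is independent of the particular pairwise loss used.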
Discussion

Read our paper for the full story! We boosted the ImageNet retrieval benchmark from 73.3% to 85.3% MAP for 16-bit hashes, and performed straight-up approximate nearest neighbor search with 2 to 8 times fewer distance comparisons than previous state-of-the-art methods at the same recall.
Multi-label Classification
Hive, May 31, 2018

Classification challenges like ImageNet changed the way we train models. Given enough data, neural networks can distinguish between thousands of classes with remarkable accuracy. However, there are some circumstances where basic classification breaks down, and something called multi-label classification is necessary. Here are two examples:

- You need to classify a large number of brand logos and the medium they appear on (sign, billboard, soda bottle, etc.).
- You have plenty of image data on a lot of different animals, but none on the platypus, which you want to identify in images.

In the first example, should you train a classifier with one class for each logo and medium combination? The number of such combinations could be enormous, and it might be impossible to get data on some of them. Another option would be to train one classifier for logos and another for the medium; however, this doubles the runtime needed to get your results. In the second example, it seems impossible to train a platypus model without data on it.

Multi-label models step in by doing multiple classifications at once. In the first example, we can train a single model that outputs both a logo classification and a medium classification without increasing runtime. In the second example, we can use common sense to label animal features (fur vs. feathers vs. scales, bill vs. no bill, tail vs. no tail) for each of the animals we know about, train a single model that identifies all features of an animal at once, and then infer that any animal with fur, a bill, and a tail is a platypus.

A simple way to accomplish this in a neural network is to group a logit layer into multiple softmax predictions. You can then train such a network by simply adding the cross entropy loss for each softmax where a ground truth label is present (see the sketch later in this post).

To compare these approaches, let's consider a subset of ImageNet classes and two features that distinguish them: scales vs. exoskeleton vs. fur, and spots vs. no spots. First, I trained two 50-layer ResNet V2s on this balanced dataset: one on the single-label 6-animal classification problem, and the other on the multi-label classification problem. In this example, every training image has both labels, but real applications may have only a subset of labels available for each image.

The single-label model, trained specifically on the 6-animal classification, performed slightly better when distinguishing all 6 animals:

- Single-label model: 90% accuracy
- Multi-label model: 88% accuracy

However, the multi-label model provides finer information granularity. Though it got only 88% accuracy at distinguishing all 6 animals, it achieved 92% accuracy at distinguishing scales/exoskeleton/fur and 95% accuracy at distinguishing spots/no spots. If we care about only one of these factors, we're already better off with the multi-label model.

But this toy example hardly touches on the regime where multi-label classification really thrives: large datasets with many possible combinations of independent labels. In this regime, we get the interesting benefit of transfer learning. Imagine we had categorized hundreds of animals into a dozen binary criteria. Training a separate model for each binary criterion would yield acceptable results, but learning the other features can actually help in some cases by effectively pre-training the network on a larger dataset.

At Hive, we recently deployed a multi-label classification model that replaced 8 separate classification models.
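Here is the sketch promised above: a minimal multi-head classifier with a per-head cross entropy loss that is simply skipped when a head's label is missing. It is written in PyTorch for illustration; the head names and sizes are made up and are not the heads of our production model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadClassifier(nn.Module):
    """One shared backbone with one softmax head per label group."""

    # Illustrative heads: 6 animals, 3 body coverings, 2 spot patterns.
    HEADS = {"animal": 6, "covering": 3, "spots": 2}

    def __init__(self, backbone: nn.Module, feature_dim: int):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feature_dim, n) for name, n in self.HEADS.items()}
        )

    def forward(self, images):
        features = self.backbone(images)
        return {name: head(features) for name, head in self.heads.items()}

def multi_label_loss(logits, labels):
    """Sum cross entropy over the heads that actually have ground truth.

    `labels` maps head name -> LongTensor of class ids, with -1 meaning
    "no label available for this image".
    """
    total = torch.zeros(())
    for name, head_logits in logits.items():
        mask = labels[name] >= 0  # images labeled for this head
        if mask.any():
            total = total + F.cross_entropy(head_logits[mask], labels[name][mask])
    return total
```

All heads come out of a single forward pass through the shared backbone, which is what keeps inference time at roughly that of one model.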
For each image, we usually had truth data available for 2 to 5 of the 8 labels. Of the 8 labels, 2 came out more accurate than their standalone models had been (think 93% instead of 91%), and those were the labels with the least data. This makes sense, since they benefit the most from domain-specific pretraining on the same images. But most importantly for this use case, we were able to run all the models together in 1/8th of the time it took before.
How to Use SSH Tunneling
Hive, March 15, 2018

This is a step-by-step guide to solving common connectivity problems with SSH tunnels. For some useful comments and context, skip to the end. Throughout, assume there is a service listening on box B at port 8080, and you want box A to be able to reach it.

Step 0: If box A can already hit box B:8080, then good for you. Otherwise, follow the steps below to make that happen.

Step 1: If box A can ssh into box B, read the following section. If not, go to Step 2.

Here we will use ssh local port forwarding to PULL the service port through the ssh connection. The ssh tunnel command (to be run from A) is:

user@A > ssh -vCNnTL A:8080:B:8080 user@B

This pulls the service port over to box A at port 8080, so that anyone connecting to A:8080 will transparently get their requests forwarded to the actual server on B:8080.

Step 2: If box B can ssh into box A, read the following section. Otherwise, skip to Step 3.

Now we will use ssh remote port forwarding to PUSH the service port through the ssh connection. The command (to be run from B) is:

user@B > ssh -vCNnTR localhost:8080:B:8080 user@A

Now users on A hitting localhost:8080 will be able to connect to B:8080. Users not on either A or B will still be unable to connect. To enable listening on A:8080, you have two options:

A) If you have sudo, add the following line to /etc/ssh/sshd_config and reload the sshd service:

GatewayPorts clientspecified

Then rerun the above command with "localhost" replaced by "A":

user@B > ssh -vCNnTR A:8080:B:8080 user@A

B) Pretend "localhost (A)" is another box and apply Step 1, since A can generally ssh into itself:

user@A > ssh -vCNnTL A:8080:localhost:8080 user@localhost

Now we come to the situation where neither A nor B can ssh into the other.

Step 3: If there are any TCP ports on which A and B can connect, continue reading. Otherwise, move on to Step 4.

Suppose that B is able to connect to A:4040. Then the way to allow B to ssh into A is to turn A:4040 into an ssh port. This is doable by applying Step 1 on A itself, to pull the ssh service on port 22 over to listen on A:4040:

user@A > ssh -vCNnTL A:4040:A:22 user@A

Then you can apply Step 2, specifying port 4040 for ssh itself:

user@B > ssh -vCNnTR localhost:8080:B:8080 user@A -p 4040

Similarly, if A is able to connect to B:4040, you'll want to forward B:22 to B:4040 using Step 1, then apply Step 1 again as usual.

Step 4: Find a box C which has some sort of connectivity to both A and B, and continue reading. If no such box exists, skip to Step 10.

If A and B have essentially no connectivity, then the way to proceed is to route through another box.

Step 5: If C is able to hit B:8080, continue reading. Otherwise, skip to Step 9.

Step 6: If A is able to SSH to C, continue reading. Otherwise, skip to Step 7.

This is very similar to Step 1: we will pull the connection through the SSH tunnel using LOCAL forwarding.

user@A > ssh -vCNnTL A:8080:B:8080 user@C

Step 7: If C is able to SSH to A, continue reading. Otherwise, skip to Step 8.

Again, this is analogous to Step 2: we will push the connection through the SSH tunnel using REMOTE forwarding.
user@C > ssh -vCNnTR localhost:8080:B:8080 user@A

Just as before, we will need an additional forwarding step to listen on a public interface of A rather than only on localhost:

user@A > ssh -vCNnTL A:8080:localhost:8080 user@localhost

Step 8: If neither C nor A can ssh into the other, but there are TCP ports open between the two, apply the technique from Step 3. Otherwise, skip to Step 10.

Now we are in the situation where C can't hit B:8080 directly.

Step 9: The general idea is to first connect C:8080 to B:8080 using Step 1 or Step 2, and then do the same to connect A:8080 to C:8080. Note that it doesn't matter in which order you set up these connections.

9a) If C can ssh to B and C can ssh to A. This is a very common scenario; maybe C is your local laptop connected to two separate VPNs. First pull B:8080 to C:1337, then push C:1337 to A:8080:

user@C > ssh -vCNnTL localhost:1337:B:8080 user@B
user@C > ssh -vCNnTR localhost:8080:localhost:1337 user@A
user@C > ssh user@A
user@A > ssh -vCNnTL A:8080:localhost:8080 user@localhost

9b) If C can ssh to B and A can ssh to C. Again, a fairly common scenario if you have a "super-private" network accessible only from an already private network. Do two pulls in succession:

user@C > ssh -vCNnTL localhost:1337:B:8080 user@B
user@A > ssh -vCNnTL A:8080:localhost:1337 user@C

You can actually combine these into a single command:

user@A > ssh -vC -A -L A:8080:localhost:1337 user@C 'ssh -vCNnTL localhost:1337:B:8080 user@B'

9c) If B can ssh to C and C can ssh to A: double-push.

user@B > ssh -vCNnTR localhost:1337:B:8080 user@C
user@C > ssh -vCNnTR localhost:8080:localhost:1337 user@A
user@C > ssh user@A
user@A > ssh -vCNnTL A:8080:localhost:8080 user@A

Again, these can be combined into a single command.

9d) If B can ssh to C and A can ssh to C: push from B, then pull from A.

user@B > ssh -vCNnTR localhost:1337:B:8080 user@C
user@A > ssh -vCNnTL A:8080:localhost:1337 user@C

Any of these may also need the trick from Step 3 to enable SSH access in the first place.

Step 10: If box C doesn't have any TCP connectivity to either B or A, then having box C doesn't really help the situation at all. You'll need to find a different box C which actually has connectivity to both and return to Step 4, or find a chain (C, D, etc.) through which you could eventually patch a connection through to B. In that case you'll need a series of commands such as those in 9a)-9d) to gradually patch B:8080 to C:1337, to D:42069, and so on, until you finally end up at A:8080.

Addendum 1: If your service uses UDP rather than TCP (this includes DNS, some video streaming protocols, and most video games), you may have to add a few steps to convert to TCP; see https://superuser.com/questions/53103/udp-traffic-through-ssh-tunnel for a guide.

Addendum 2: If your service host and port come with URL signing (e.g. signed S3 URLs), changing your application to hit A:8080 rather than B:8080 may cause the URL signatures to fail. To remedy this, you can add a line to your /etc/hosts file redirecting B to A; since your computer checks this file before doing a DNS lookup, you can do completely transparent ssh tunneling while still respecting SSL and signed s3/gcs URLs.

Addendum 3: My preferred tools for checking which TCP ports are open between two boxes are nc -l 4040 on the receiving side and curl B:4040 on the sending side. ping B, traceroute B, and route -n are also useful for diagnostic information but may not tell you the full story.
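If nc and curl aren't available on a box, a few lines of Python can do the same reachability check. This is just a convenience sketch; the host and port below are placeholders for whatever pair you are testing.

```python
import socket

def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: can this box reach B on port 4040?
print(port_is_reachable("B", 4040))
```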
Addendum 4: SSH tunnels will not work if something is already listening on the port, such as the SSH tunnel you created yesterday and forgot to remove. To check for this, try ps -efjww | grep ssh or sudo netstat -nap | grep LISTEN.

Addendum 5: All the ssh flags are explained in the man page: man ssh. To give a brief overview: -v is verbose logging, -C is compression, -nNT together disable the interactive part of ssh and make it tunnel only, -A forwards your ssh credentials, -f backgrounds ssh after connecting, and -L and -R are for local and remote forwarding respectively. -o StrictHostKeyChecking=no is also useful for disabling the known-hosts check.

FURTHER COMMENTS

SSH tunnels are useful as a quick fix for connectivity issues, but they are generally recognized as inferior to proper long-term networking fixes. They tend to be difficult to maintain for a number of reasons: the setup requires no configuration and leaves no trace other than a running process; the tunnels don't come back up automatically when a box restarts unless added to a startup daemon; and they can easily be killed by temporary network outages. Still, we've made good use of them here at Hive, most recently when we needed to keep production services up during a network migration, and also occasionally when provisioning burst GPU resources from AWS and integrating them seamlessly into our hardware resource pool. They can also be very useful when developing locally or debugging production services, or for getting Gmail access in China. If you're interested in different and more powerful ways to tunnel, I'm no networking expert; all I can do is point you toward some interesting networking vocabulary.

References

https://en.wikipedia.org/wiki/OSI_model
https://help.ubuntu.com/community/SSH/OpenSSH/PortForwarding#Dynamic_Port_Forwarding
https://en.wikipedia.org/wiki/SOCKS
https://en.wikipedia.org/wiki/Network_address_translation
https://en.wikipedia.org/wiki/IPsec
https://en.wikipedia.org/wiki/Iptables
https://wiki.archlinux.org/index.php/VPN_over_SSH
https://en.wikipedia.org/wiki/Routing_table
https://linux.die.net/man/8/route
Back to Our Roots: Hive Data in Academia
Hive, February 26, 2018

Hive was started by two PhD students at Stanford who were frustrated with how hard it was to generate quality datasets for machine learning research. We found that solutions on the market were either too inaccurate or too expensive for the typical research study. From those early days, we've now built one of the world's largest marketplaces of human labor. In keeping with our academic roots, we always intended Hive Data to be the perfect partner for academic labs.

The best way to showcase Hive Data's impact on academia is through a real case study. A machine learning researcher, who we'll call Professor X, had an urgent conference submission deadline coming up. He still had quite a lot of work remaining, and much of it required labeling a large corpus of videos. He had tried many other services without success and was urgently searching for a solution that could address all of his needs. Here were the constraints he was under, and how we solved them:

Professor X didn't want to pay a large upfront cost. Given his limited budget and inability to risk project failure, Professor X needed a provider with a competitive price point and real flexibility. Other services on the market generally had fixed costs running upwards of hundreds of thousands of dollars in the first year. Even if he could afford a single engagement, he didn't have the budget to try a different provider if the first one wasn't up to par. Hive doesn't impose any upfront fees, which made us a low-risk option.

Professor X needed the data quality to be high enough to publish research. While he did find some services whose rates were competitive with Hive's, he quickly found that all of them suffered from poor data quality, especially on video labeling tasks, which rendered the data unusable for his research project. Hive, on the other hand, offered a system of audits and a worker consensus model to ensure high data accuracy. Because tasks pass through several rounds of worker auditing, Hive was able to deliver the high-quality data Professor X needed.

Professor X needed a fast turnaround. As mentioned, Professor X was on a tight deadline to submit his paper for publication. Most other services have inflexible, week-long timelines for returning datasets. Hive, however, offered a much faster turnaround: thanks to our remarkably large global workforce, we were able to scale up and finish the job as quickly as the Professor needed. His job was finished in less than a day, whereas other providers had quoted him as long as a month!

Professor X wanted a provider that could offer technical insight along the way. Part of Hive Data's value proposition is offering our own expertise in building machine learning models, alongside supplying the quality data to do so. We had seen projects similar to Professor X's, and we understood the problems he would face in generating this dataset. Even before getting started, we helped him structure his tasks in a way that improved his results and helped him build an effective model from his data.

In addressing these needs, Professor X was able to submit his paper and get it published on time. He still uses Hive Data to power his AI research today.
Hive Data has already been used by top-tier university research labs all over the world, including at Stanford, MIT, Cornell, and Simon Fraser University. We've seen projects ranging from labeling datasets for vehicle detection in autonomous driving, to object recognition for robotic arms, to pedestrian identification from security cameras. The number of research verticals we cater to is constantly growing, as we pride ourselves on rapid engineering cycles that let us release new data labeling capabilities as soon as we see a need emerging. If you're an academic researcher and you're curious about how we can partner together, contact me at david@thehive.ai. We're excited to support your research!
Hive Media: Revolutionizing the Way We Understand On-Screen Content and Viewership
Hive, February 16, 2018

Hive is a full-stack deep learning platform focused on solving visual intelligence problems. While we work with companies in sectors ranging from autonomous driving to facial recognition, our flagship enterprise product is Hive Media.

As the name suggests, Hive Media is our complete enterprise solution that uses deep learning to revolutionize traditional media analytics. However, it is far more than a simple collection of neural net models. What we've built with Hive Media is an end-to-end solution, beginning with data ingestion and extending all the way to real-time, device-level viewership metrics.

The Vision

Imagine you could watch 100 different channels at the same time and remember every key element of what was on screen: which brand was shown, which actor was present, which commercial was playing, and so on. Now suppose you could remember this forever and query the information instantly. This would be a massively valuable dataset, because it seems like an impossible feat for a human to achieve. It is, however, precisely what we set out to achieve with Hive Media. Essentially, we wanted to build a system that could "watch" all of broadcast television the way a human would, and then store that information in an easily accessible form.

Data Ingestion

The first step in our pipeline is accessing TV streams. Today, we are processing 400 channels in the US, with 300 more in Europe to come later this year. Figure 1 shows our present and planned TV coverage.

Figure 1

We are recording every second of every channel, totaling more than 10,000 hours of footage per day, and we expect that number to be well over 30,000 hours a day by next year. All major channels are covered, as well as a wide range of local affiliates on the network side. As you can imagine, this is a lot of data, and we are storing all of it in our own datacenter rollouts around the world. Ultimately, we are aiming to build the world's largest repository of linear broadcast data.

Deep Learning Models

Having this much data is only useful if you can understand it. This is where our deep learning models come into play. Using Hive Data, we've built some of the largest celebrity, logo, and brand databases in the world. These models, among several others, are applied to every second of our recorded footage, and the resulting tags are stored in a database optimized for easy retrieval. This means a query such as "How many times did a Nike logo appear on NBC in the month of September?", previously impossible to answer, can now be answered in a matter of seconds! Unlike some other products on the market, our models don't rely on any metadata associated with the programming; tags are generated purely from the video content. This is extremely powerful, because it means our system can handle a large variety of content without constantly hard-coding parameters.

User Viewership

The final piece of the puzzle is understanding how our tags relate to viewership, the holy grail of media analytics. Everything we've described up to now generates what I call "cause"; to measure "effect", we are currently working with device partners who give us real-time data on viewers.
Today, we have access to millions of devices that send us real viewership data, which we overlap with our tags to understand how on-screen content affects viewership. This means that every query we run not only tells us what aired, but also how it affected the viewership bottom line. The easiest way to understand the system is to see some queries executed visually. In Figure 2, we show an example query for Chevrolet vs. Toyota commercials on NBC over a one-week period. You can see the tags our system found in the bottom right. The bottom left shows a video player with the content corresponding to the selected tag, mainly to serve as video evidence that the tag is correct. What's powerful about Hive Media is that we can now analyze viewership data at each of these tag occurrences to understand its effect on viewership. One important way to understand viewership, shown in Figure 2, is the notion of tune-out: the percentage of viewers who change the channel in a given time interval. This is often the strongest indicator of whether a viewer is enjoying the content on screen. Interestingly enough, Chevy commercials generated almost twice as much tune-out as their Toyota counterparts in this case.

Figure 2

Let's take another example query, this time looking for Nike logos, as shown in Figure 3. The highlighted tag is a snippet of content with a Nike logo prominently placed in the center of the screen, even though it isn't actually a Nike commercial. It is Simone Biles, a Nike athlete, being featured in a Mattress Firm / Foster Kids commercial. As part of any Nike athlete contract, Simone is obliged to wear Nike clothing whenever she appears on TV, Nike commercial or not. Nike would probably be highly interested in knowing how many similar logo placements occurred for Simone, as well as for all of their other sponsored athletes.

Figure 3

Today, we are only beginning our journey toward understanding the wealth of data we have at our disposal. Hive Media is pioneering a new way of thinking about media content, and we are eager to help both broadcasters and advertisers optimize their content to better retain viewers and inform advertising decisions.