Classification challenges like ImageNet changed the way we train models. Given enough data, neural networks can distinguish between thousands of classes with remarkable accuracy.
However, there are circumstances where basic single-label classification breaks down and a technique called multi-label classification becomes necessary. Here are two examples:
- You need to classify a large number of brand logos and what medium they appear on (sign, billboard, soda bottle, etc.)
- You have plenty of image data on a lot of different animals, but none on the platypus – which you want to identify in images
In the first example, should you train a classifier with one class for each logo and medium combination? The number of such combinations could be enormous, and it might be impossible to get data on some of them. Another option would be to train a classifier for logos and a classifier for medium; however, this doubles the runtime to get your results. In the second example, it seems impossible to train a platypus model without data on it.
Multi-label models step in by doing multiple classifications at once. In the first example, we can train a single model that outputs both a logo classification and a medium classification without increasing runtime. In the second example, we can use common sense to label animal features (fur vs. feathers vs. scales, bill vs. no bill, tail vs. no tail) for each of the animals we know about, train a single model that identifies all features for an animal at once, then infer that any animal with fur, a bill, and a tail is a platypus.
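To make that last inference step concrete, here is a tiny sketch in Python (the feature names and the species table are purely illustrative, not from any real system): the model predicts the features, and a lookup table maps feature combinations to species, including species we have no images of.

```python
# Hypothetical attribute-to-species table. The classifier predicts the
# features; species with no training images (like the platypus) are
# identified purely by their feature combination.
SPECIES_BY_FEATURES = {
    ("fur", "bill", "tail"): "platypus",
    ("fur", "no bill", "tail"): "beaver",
    ("feathers", "bill", "tail"): "duck",
}

def identify(predicted_features):
    """Map a tuple of predicted features to a species name."""
    return SPECIES_BY_FEATURES.get(tuple(predicted_features), "unknown")

print(identify(["fur", "bill", "tail"]))  # -> platypus
```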
A simple way to accomplish this in a neural network is to split the final logit layer into groups, each producing its own softmax prediction.
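Here is a rough sketch of that idea in PyTorch (my choice of framework; the head names and class counts are illustrative assumptions, not the exact model described here). A shared backbone produces a feature vector, and each label group gets its own linear logit layer:

```python
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    """A shared feature extractor with one softmax head per label group."""

    def __init__(self, backbone, feature_dim, head_sizes):
        super().__init__()
        self.backbone = backbone  # e.g. a ResNet with its final fc layer removed
        # One logit layer per label group, e.g. {"covering": 3, "spots": 2}
        # for scales/exoskeleton/fur and spots/no-spots.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feature_dim, n) for name, n in head_sizes.items()}
        )

    def forward(self, x):
        features = self.backbone(x)
        # Raw logits per head; softmax (or cross entropy) is applied downstream.
        return {name: head(features) for name, head in self.heads.items()}
```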
You can then train such a network by summing the cross-entropy loss over every softmax head for which a ground-truth label is present.
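A minimal sketch of that loss, continuing the PyTorch example above (encoding a missing label as -1 is my convention here, not necessarily how the original models were trained):

```python
import torch.nn.functional as F

def multi_label_loss(logits_by_head, targets_by_head):
    """Sum the cross-entropy loss over all heads, counting each example
    only toward the heads where it has a ground-truth label."""
    total = 0.0
    for name, logits in logits_by_head.items():
        targets = targets_by_head[name]
        present = targets >= 0  # skip examples missing this label (-1)
        if present.any():
            total = total + F.cross_entropy(logits[present], targets[present])
    return total
```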
To compare these approaches, let's consider a subset of ImageNet classes: six animals distinguished by two features, body covering (scales vs. exoskeleton vs. fur) and pattern (spots vs. no spots).
First, I trained two 50-layer ResNet v2 models on this balanced dataset: one on the single-label classification problem, and the other on the multi-label classification problem. In this example, every training image has both labels, but real applications may have only a subset of labels available for each image.
The single-label model trained specifically on the 6-animal classification performed slightly better when distinguishing all 6 animals:
- Single-label model: 90% accuracy
- Multi-label model: 88% accuracy
However, the multi-head model provides finer-grained information. Though it reached only 88% accuracy at distinguishing all 6 animals, it achieved 92% accuracy at distinguishing scales/exoskeleton/fur and 95% accuracy at distinguishing spots/no spots. If we care about only one of these factors, we're already better off with the multi-label model.
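One natural way to score the multi-label model on the joint 6-animal task (an assumption about the evaluation, continuing the sketch above) is to take the argmax of each head and count a prediction as correct only when every head matches; with 3 covering classes and 2 spot classes, each of the 6 animals is a unique feature combination.

```python
def joint_accuracy(logits_by_head, targets_by_head):
    """Fraction of examples where every head's argmax matches its target."""
    correct = None
    for name, logits in logits_by_head.items():
        head_correct = logits.argmax(dim=1) == targets_by_head[name]
        correct = head_correct if correct is None else correct & head_correct
    return correct.float().mean().item()
```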
But this toy example hardly touches the regime where multi-label classification really thrives: large datasets with many possible combinations of independent labels. In that regime we also get the benefit of transfer learning. Imagine we had categorized hundreds of animals along a dozen binary criteria. Training a separate model for each criterion would yield acceptable results, but learning all the criteria jointly can actually do better in some cases, because the other labels effectively pre-train the network on a larger dataset.
At Hive, we recently deployed a multi-label classification model that replaced 8 separate classification models. For each image, we usually had ground truth available for 2 to 5 of the 8 labels. Two of the 8 labels saw improved accuracy (think 93% instead of 91%); these were the labels with less data. This makes sense, since they would benefit most from domain-specific pretraining on the same images. But most importantly for this use case, we were able to run all the models together in 1/8th of the time it took before.