Many machine learning classifiers are required to distinguish specific classes from a “background” or “other” class.
The biggest challenge with such a classifier is achieving a high true positive rate for the background class; it has no identifiable features of its own, and the available training data might not cover the full distribution of inputs. We offer a few suggestions to address these problems, along with the intuition behind them.
For this post, we’ll consider an example image classification problem: distinguishing 9 dog breeds from any other image (10 classes total):
We’ll address 3 conundrums:
- whether or not to make your classifier predict a probability for “background”
- how to get the most out of background training data that doesn’t match the true background distribution
- how to compensate when background images vastly outnumber the other classes at test time
Use Data for the Background Class?
There are two natural approaches:

1. Get data for all 10 classes (include your background as a recognizable class)
2. Make a 9-class model for the dog breeds, and predict “background” if none of the 9 classes has high enough confidence
In my experience, the second approach fares poorly. My visual intuition is that such a model learns an embedding (the last layer before the logit layer and softmax) centered around dogs, and the threshold generally maps embeddings near the boundaries between the dog classes to “background”:
But in reality, dogs are just a subspace of all images, probably near an extremum (in terms of furriness and some other features the classifier will need to learn):
Controlling What Your Background Class Learns
Let’s say we’ve collected some data for the background class. Inevitably, this background data covers only an incomplete slice of the true distribution of images – in this case, a few types of household appliances and kitchen equipment:
If we simply train a model on these images and treat background like any other class, the embedding diagram looks far from perfect:
I have a suggestion that can improve this situation a surprising amount: prevent the background class from learning features. Allow it only one trainable variable, its logit. This way, the background class won’t be embedded in a very specific region, leaving more open space in the embedding diagram for truly “background” images:
Of course, this is still not perfect. The dog classes can, to some extent, learn to activate especially weakly on appliance images. But it certainly fares better in practice.
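To make this concrete, here is a minimal sketch of such a classification head in PyTorch. This is my own illustrative code, not the code from these experiments; the names (`BackgroundAwareHead`, `bg_logit`, and so on) are hypothetical:

```python
import torch
import torch.nn as nn

class BackgroundAwareHead(nn.Module):
    """Head where the background class cannot learn features: the 9 dog
    classes get ordinary feature-dependent logits, while the background
    class gets a single trainable scalar that ignores the input."""

    def __init__(self, embed_dim: int, num_foreground: int = 9):
        super().__init__()
        self.fg_logits = nn.Linear(embed_dim, num_foreground)
        # The background class's only trainable variable: its logit.
        self.bg_logit = nn.Parameter(torch.zeros(1))

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        fg = self.fg_logits(embedding)                   # (batch, 9)
        bg = self.bg_logit.expand(embedding.size(0), 1)  # (batch, 1), constant
        return torch.cat([fg, bg], dim=1)                # (batch, 10)
```

Because the background logit is a constant, gradients can only raise or lower it globally; all feature learning is forced into the dog classes and the shared embedding.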
I trained models on data for these 9 dog and 6 appliance ImageNet classes, mapped into the 10 classes for this problem, and evaluated them on the full ImageNet validation set. I applied weights in training and validation to compensate for class imbalance (see the next section for specifics on handling the case where the true distribution is imbalanced). In the validation diagram below, the green line is the model that treats all classes the same, and the blue line is the model that prevents the background class from learning features:
Validation accuracy versus number of training steps. The green line is a model that allows the background class to learn features; the blue line is a model that does not. Both models have converged to their best accuracies; I use a cosine learning rate schedule, so learning never quite flattens out.
The model that does not learn background features stays consistently ahead for most of training. Preventing the background class from learning features increased accuracy on both the 15-class dataset (9 dog classes, plus 1 background class built from the 6 appliance classes) and the 1000-class dataset (9 dog classes, plus 1 background class built from the other 991 classes):
Again, note that these results weight each of the 10 classes to be equally important in validation.
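For reference, here is one simple way to compute such per-example weights, as a hedged sketch (the helper name `balanced_weights` is my own, not from the post):

```python
import numpy as np

def balanced_weights(labels: np.ndarray, num_classes: int = 10) -> np.ndarray:
    """Per-example weights that make each class contribute equally to a
    weighted loss or accuracy, regardless of how many examples it has."""
    counts = np.bincount(labels, minlength=num_classes)
    class_weights = 1.0 / (num_classes * np.maximum(counts, 1))  # guard empty classes
    return class_weights[labels]
```

A class-balanced accuracy is then simply `(weights * (preds == labels)).sum() / weights.sum()`.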
Background Class Imbalance
Finally, we’ll consider the case where background images vastly outnumber actual dog images in testing. Let’s say the test set’s distribution is all of ImageNet. Our classifier performs much worse under these circumstances, since it was trained to believe that all 10 classes are equally likely on average.
The simplest way to compensate is to apply a confidence threshold to each positive prediction, so that we predict a specific dog class only if its probability is both the highest and exceeds the threshold. This gives the following precision-recall curve, where an example is treated as “positive” if the classifier predicts one of our 9 dog classes, and “negative” if it predicts “background”:
For example, if we require confidence of 90% for our dog predictions, we’ll correctly label about 85% of dog images (recall), but only about 7% of our dog predictions will be correct (precision).
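In code, the thresholding rule looks something like this sketch (assuming the background class is index 9; `predict_with_threshold` is an illustrative name):

```python
import numpy as np

def predict_with_threshold(probs: np.ndarray, threshold: float = 0.9,
                           background: int = 9) -> np.ndarray:
    """Predict a dog class only if it is both the argmax and at least
    `threshold` confident; otherwise fall back to "background".

    probs: (batch, 10) array of softmax outputs."""
    preds = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    preds[(preds != background) & ~confident] = background
    return preds
```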
But we can do slightly better. We can shift our probability distribution from “uniform” to “shifted” by using a prior and the following Bayesian formula, which holds under the assumption that

$$P(\text{image} \mid x, \text{shifted}) = P(\text{image} \mid x, \text{uniform})$$

for each class $x$:

$$P(x \mid \text{image}, \text{shifted}) = \frac{P(x \mid \text{image}, \text{uniform}) \, P(x \mid \text{shifted})}{P(x \mid \text{uniform})}$$
For our ImageNet use case, this just means that we run our model as normal, multiply each dog probability by 0.001, multiply the background probability by 0.991, and then renormalize by scaling all probabilities so they sum to 1 again. The assumption lets us shift the distribution from 10% probability for each class to 0.1% for each dog class and 99.1% for “background”. (Notably, I make no pretense that this prior can compensate for the sort of background shift we handled in the previous section; in fact, that sort of shift directly contradicts the assumption behind this prior.) If we apply this shift first, we get a better PR curve:
One of the perks of this trick is that it does not involve collecting more data or even retraining the model. It gives slightly better predictions for free, backed by a simple stats formula.
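As a sketch, the whole trick fits in a few lines of NumPy (the function name `shift_prior` is my own; dividing by the uniform training prior is a constant factor, so this is equivalent to the multiply-and-renormalize recipe above):

```python
import numpy as np

def shift_prior(probs: np.ndarray, new_prior: np.ndarray,
                train_prior: np.ndarray) -> np.ndarray:
    """Reweight softmax outputs by the ratio of test-time to training-time
    class priors, then renormalize each row to sum to 1."""
    reweighted = probs * (new_prior / train_prior)
    return reweighted / reweighted.sum(axis=1, keepdims=True)

# ImageNet example: 9 dog classes at 0.1% each, background at 99.1%,
# versus the uniform 10%-per-class prior the model was trained with.
new_prior = np.array([0.001] * 9 + [0.991])
train_prior = np.full(10, 0.1)
```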
Conclusion
To sum up, I’ve proposed 3 techniques that can improve your classifier’s accuracy when it has a background class:
- Have some data on your background class and make the model predict a probability for it (rather than just training the non-background classes and mapping low probabilities to the background class)
- Prevent your model from using any features to inform the background class’s activation (to improve generalization to background images not represented in your training set)
- Use a prior to adjust for class imbalance (this can help even if you don’t have a background class)
Using these techniques can change your model’s predictions on background images from nearly random to informative.