Hive was thrilled to have our CTO Dmitriy present at the Workshop on Multimodal Content Moderation during CVPR last week, where we provided an overview of a few important considerations when building machine learning models for classification tasks. What are the effects of data quantity and quality on model performance? Can we use synthetic data in the absence of real data? And after model training is done, how do we spot and address bias in the model’s performance?
Read on to learn some of the research that has made our models truly best-in-class.
The Importance of Quality Data
Data is, of course, a crucial component in machine learning. Without data, models would have no examples to learn from. It is widely accepted in the field that the more data you train a machine learning model with, the better. Similarly, the cleaner that data is, the better. This is fairly intuitive — the basic principle is true for human learners, too. The more examples to learn from, the easier it is to learn. And if those examples aren’t very good? Learning becomes more difficult.
But how important is good, clean data to building a good machine learning model? Good data is not always easy to come by. Is it better to use more data at the expense of having more noise?
To investigate this, we trained a binary image classifier to detect NSFW content, varying the amount of data between 10 images and 100k images. We also varied the noise by flipping a percentage of the labels on between 0% and 50% of the data. We then plotted the balanced accuracy of the resulting models using the same test set.
The result? It turns out that data quality is more important than we may think. It was clear that, as expected, accuracy was the best when the data was both as large as possible (100k examples) and as clean as possible (0% noise). From there, however, the table gets more interesting.
As seen above, the model trained with only 10k data and no noise performs better than the model trained with ten times as much data at 100k and 10% noise. The general trend appears to be similar — clean data matters very much, and it can quickly tank performance even when using the maximum amount of data. In other words, less data is sometimes preferable to more data if it is cleaner.
We wondered how this would change with a more detailed classification problem, so we built a new binary image classifier. This time, we trained the model to detect images of smoking, which is detecting signal from a small part of an image.
The outcome, shown below, echoes the results from the NSFW model — clean data has a great impact on performance even with a very large dataset. But the quantity of data appears to be more important than it was in the NSFW model. While 5000 examples with no noise got around 90% balanced accuracy for the NSFW model, that same amount of noiseless data only got around 77% for the smoking classifier. The increase in performance, while still strongly tied to data quantity, was noticeably slower and only the largest datasets produced well-performing models.
It makes sense that quantity of data would be more important with a more difficult classification task. Data noise also remained a crucial factor for the models trained with more data — the 50k model with 10% noise performed about the same as the 100k model with 10% noise, illustrating once more that more data is not always better if it is still noisy.
Our general takeaways here are that while both data quality and quantity matter quite a bit, clean data is more important beyond a certain quantity threshold. This threshold is where performance increases begin to plateau as the data grows larger, yet noisy data continues to have significant effects on model quality. And as we saw by comparing the NSFW model and the smoking one, this quality threshold also changes depending on the difficulty of the classification task itself.
Training on Synthetic Data: Does it Help or Hurt?
So having lots of clean data is important, what can be done when good data is hard to find or costly to acquire? With the rise of AI image generation over the past few years, more and more companies have been experimenting with generated images to supplement visual datasets. Can this kind of synthetic data be used to train visual classification models that will eventually classify real data?
In order to try this out, we trained five different binary classification models to detect smoking. Three of the models were trained exclusively with real data (10k, 20k, and 40k examples respectively), one was a mix of real and synthetic images (10k real and 30k synthetic), and one was trained entirely on synthetic data (40k). Each datatest had an even split of 50% smoking and 50% nonsmoking examples. To evaluate the models, we used two balanced test sets: one with 4k real images and one with 4k of synthetic images. All synetic images were created using Stable Diffusion.
Looking at the precision and recall curves for the various models, we made an interesting discovery. Unsurprisingly, the largest of the entirely real datasets performed the best (40k). The one trained on 10k real images and 30k synthetic images performed significantly better than the one trained only on 10k real images.
These results suggest that while large amounts of real data are best, a mixture of synthetic and real data could in fact boost model performance when little data is available.
Keeping an Eye Out For Bias
After model training is finished, extensive testing must be done in order to make sure there aren’t any biases in the model results. Biases can come in the form of biases that exist in the real world and are thus often ingrained in real-world data, such as racial bias or gender bias, but can also come in the form of biases that occur in the data by coincidence.
A great example of how unpredictable certain biases can be came recently during a model training for NSFW detection, where the model started flagging many pictures of computer keyboards as false positives. Upon closer investigation, this occurred because many of the NSFW pictures in our training data were photos of computers whose screens were displaying explicit content. Since the computer screens were the focus of these images, keyboards were also often included, leading to the false association that keyboards are an indicator of NSFW imagery.
In order to correct this bias, we added more non-NSFW keyboard examples to the training data. Improving this bias in this way not only helps the model by addressing the bias itself, but also boosts general model performance. Of course, addressing bias is even more critical when dealing with data that carries current or historical biases against minority groups, thereby perpetuating them by ingraining them into future technology. The importance of detecting and correcting these biases cannot be overstated, since leaving them unaddressed carries a significant amount of risk beyond simply calling a keyboard NSFW.
Regardless of the type of bias, it’s important to note that biases aren’t always readily apparent. The original model prior to addressing the bias had a balanced accuracy of 80%, which is high enough that the bias may not have been immediately noticeable since errors weren’t extremely frequent. The takeaway here is thus not just that bias correction matters, but that looking into potential biases is necessary even when you might not think they’re there.
Visual classification models are in many ways the heart of Hive — they were our main launching point into the space of content moderation and AI-powered APIs more broadly. We’re continuously searching for ways to keep improving these models as the research surrounding them grows and evolves. Conclusions like those discussed here — the importance of clean data, particularly when you have lots of it, the possible use of synthetic data when real data is lacking, and the need to find and correct all biases (don’t forget about the unexpected ones!) — greatly inform the way we build and maintain our products.