Flag AI-Generated Text with Hive’s New Classifier

Hive is excited to announce our new classifier to differentiate between AI-generated and human-written text. This model is hosted on our website as a free demo, and we encourage users to test out its performance.

The recent release of OpenAI’s ChatGPT model has raised questions about how public access to these kinds of large language models will impact the field of education. Certain school districts have already banned access to ChatGPT, and teachers have been adjusting their teaching methods to account for the fact that generative AI has made academic dishonesty a whole lot easier. Since the rise of internet plagiarism, plagiarism detectors have become commonplace at academic institutions. Now a need arises for a new kind of detection: AI-generated text.

Our AI-Generated Text Detector outperforms key competitors, including OpenAI itself. We compared our model to their detector, as well as two other popular AI-generated text detection tools: GPTZero and Writer’s AI Content Detector. Our model was the clear frontrunner, not just in terms of balanced accuracy but also in terms of false positive rate — a critical factor when these tools are deployed in an educational setting.

Our test dataset consisted of 242 text passages, including both ChatGPT-generated and human-written text. To ensure that our model behaves correctly on all genres of content, we included everything from casual writing to more technical and academic writing. We took special care to include texts written by people learning English as a second language, to make sure their writing is not incorrectly flagged due to differences in tone or wording. On these test examples, our balanced accuracy stands at an impressive 99%, while the closest competitor, GPTZero, reaches 83%. OpenAI's own detector scored lowest, at only 73%.

Others have tried our model against OpenAI’s in particular, and they have echoed our findings. Following OpenAI’s classifier release, Mark Hachman at PCWorld published an article that suggested that those disappointed with OpenAI’s model should turn to Hive’s instead. In his own informal testing of our model, he praised our results for their accuracy as well as our inclusion of clear confidence scores for every result.

A major fear about using these sorts of detector tools in an educational setting is the potentially catastrophic impact of false positives: cases in which human-written text is classified as AI-generated. While building our model, we were mindful that the risk of such high-cost false positives is one that many educators may not want to take, so we prioritized lowering our false positive rate. On the test set above, our false positive rate is just 1%, compared to OpenAI's at 12.5%, Writer's at 46%, and GPTZero's at 30%.
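For readers unfamiliar with these metrics: balanced accuracy averages the detector's accuracy on AI-generated passages and on human-written passages, so a skewed test set cannot inflate the score, while the false positive rate measures how often human writing is mislabeled as AI-generated. Here is a minimal sketch of how both are computed from confusion-matrix counts (the function and variable names are purely illustrative):

```python
def detector_metrics(tp: int, fp: int, tn: int, fn: int):
    """Compute balanced accuracy and false positive rate from confusion-matrix counts.

    tp/fn count AI-generated passages that were caught/missed;
    tn/fp count human-written passages that were passed/mislabeled.
    """
    tpr = tp / (tp + fn)                    # recall on AI-generated text
    tnr = tn / (tn + fp)                    # recall on human-written text
    balanced_accuracy = (tpr + tnr) / 2
    false_positive_rate = fp / (fp + tn)    # human text mislabeled as AI-generated
    return balanced_accuracy, false_positive_rate
```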

Even with our low false positive rate, we encourage using this tool as part of a broader process when investigating academic dishonesty, not as the sole decision maker. Just like plagiarism checkers, it is designed to be a helpful screening tool, not a final judge. We are continuously working to improve our model, and any feedback is greatly appreciated. Large language models like ChatGPT are here to stay, and it is crucial to provide educators with tools they can use as they decide how to navigate these changes in their classrooms.

Spot Deepfakes With Hive’s New Deepfake Detection API

The Danger of Deepfakes

When generative AI models first gained popularity in the late 2010s, they brought with them the ability to create deepfakes. Deepfakes are synthetic media, typically video, in which one person’s likeness is replaced by another’s using deep learning. They are powerful tools for fraud and misinformation, allowing for the creation of synthetic videos of political leaders and letting scammers easily take on new identities.

The primary use of deepfake technology, however, is the fabrication of nonconsensual pornography. The term “deepfake” itself was coined in 2017 by a Reddit user of the same name who made fake pornographic videos featuring popular female celebrities. In 2019, the company Sensity AI catalogued deepfakes across the web and reported that a whopping 96% of them were pornographic, all of which targeted women. In the years since, more of this sort of deepfake pornography has become readily available online, with countless forums and even entire porn sites dedicated to it. The targets are not just celebrities: everyday women are also superimposed into adult content by request, creating on-demand revenge porn for anyone with an internet connection.

Many sites have banned deepfakes entirely, since they are far more often used for harm than for good. At Hive, we’re committed to providing API-accessible solutions for challenging moderation problems like this one. We’ve built our new Deepfake Detection API to empower enterprise customers to easily identify and moderate deepfake content hosted on their platforms.

This blog post explains how our model identifies deepfakes and the new API that makes this functionality accessible.

A Look Into Our Model

Hive’s Deepfake Detection model is essentially a version of our Demographic API that is optimized to identify deepfakes as opposed to demographic attributes. When a query is submitted, this visual detection model locates any faces present in the input. It then performs an additional classification step that determines whether or not each detected face is a deepfake. In its response, it provides a bounding-box location and classification (with confidence scores) for each face.

While the face detection aspect of this process is the same as the one used for our industry-leading Demographic API, the classification step was fine-tuned for deepfake identification by training on a vast repository of synthetic and real video data. Many of these examples were pulled from genres commonly associated with deepfakes, such as pornography, celebrity interviews, and movie clips. We also included other types of examples in order to create a classifier that identifies deepfakes across many different content genres.

Putting It All Together: Example Input and Response

With only one head, the response of our Deepfake Detection model is easily interpretable. When an image or video query is submitted, it is first split into frames. Each frame is then analyzed by our visual detection model in order to find any faces present in the image. Every face then receives a deepfake classification — either yes_deepfake or no_deepfake. Confidence scores for these classifications range from 0.0 to 1.0, with a higher score indicating higher confidence in the model’s results.

Example Deepfake Detection input and API response

Here we see the deepfaked image and, to its left, the two original images used to create it. This input image doesn’t appear to be fake at first glance, especially when the image is displayed at a small size. Even with a close examination, a human reviewer could fail to realize that it is actually a deepfake. As the example illustrates, the model correctly identifies this realistic deepfake with a high confidence score of more than 0.99. Since there is only one face present in this image, we see one corresponding “bounding poly” in the response. This “bounding poly” contains all model response information for that face. Vertices and dimensions are also provided, though those fields are truncated here for clarity.
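To give a sense of how a response like this might be consumed programmatically, here is a rough sketch that submits an image and collects any faces whose yes_deepfake confidence clears a threshold. The endpoint URL, field names, and threshold below are illustrative assumptions rather than the exact production schema; please refer to our API documentation for the authoritative format:

```python
import requests

API_URL = "https://api.thehive.ai/api/v2/task/sync"  # illustrative endpoint

def flag_deepfake_faces(image_url: str, api_key: str, threshold: float = 0.9) -> list:
    """Return detected faces whose yes_deepfake confidence meets the threshold."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Token {api_key}"},
        data={"url": image_url},
    )
    resp.raise_for_status()
    result = resp.json()

    flagged = []
    # Assumed response shape: one entry per detected face, each carrying a bounding
    # poly plus yes_deepfake / no_deepfake classes with confidence scores.
    for face in result.get("faces", []):  # hypothetical field name
        scores = {c["class"]: c["score"] for c in face.get("classes", [])}
        if scores.get("yes_deepfake", 0.0) >= threshold:
            flagged.append({"bounding_poly": face.get("bounding_poly"),
                            "confidence": scores["yes_deepfake"]})
    return flagged
```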

Because deepfakes like this one can be very convincing, they are difficult to moderate with manual flagging alone. Automating this task is not only ideal to accelerate moderation processes, but also to spot realistic deepfakes that human reviewers might miss.

Digital platforms, particularly those that host NSFW media, can integrate this Deepfake Detection API into their workflows by automatically screening all content as it is posted. Video communication platforms and applications that use any kind of visual identity verification can also utilize our model to counter deepfake fraud.

Final Thoughts

Hive’s Deepfake Detection API joins our recently released AI-Generated Media Recognition API in our effort to expand content moderation to keep up with the fast-growing domain of generative AI. Moving forward, we plan to continually update both models to keep pace with new generative techniques, popular content genres, and emerging customer needs.

The recent popularity of diffusion models like Stable Diffusion, Midjourney, and DALL-E 2 has brought deepfakes back into the spotlight and sparked conversation on whether these newer generative techniques can be used to develop brand-new ways of making them. Whether or not this happens, deepfakes aren’t going away any time soon and are only growing in number, popularity, and quality. Identifying and removing them across online platforms is crucial to limit the fraud, misinformation, and digital sexual abuse that they enable.

If you’d like to learn more about our Deepfake Detection API and other solutions we’re building, please feel free to reach out to sales@thehive.ai or contact us here.

Detect and Moderate AI-Generated Artwork Using Hive’s New API

Try Our Demo

To try our AI-Generated Image Detection model out for yourself, check out our demo.

A New Need for Content Moderation

In the past few months, AI-generated art has experienced rapid growth in both popularity and accessibility. Engines like DALL-E, Midjourney, and Stable Diffusion have spurred an influx of AI-generated artworks across online platforms, prompting an intense debate around their legality, artistic value, and potential for enabling the propagation of deepfake-like content. As a result, certain digital platforms such as Getty Images, InkBlot Art, Fur Affinity, and Newgrounds have announced bans on AI-generated content entirely, with more likely to follow in the coming weeks and months.

Platforms are enacting these bans for a variety of reasons. Online communities built for artists to share their work, such as Newgrounds, Fur Affinity, and Purpleport, have stated that they put their AI-artwork bans in place to keep their sites focused exclusively on human-created art. Other platforms have taken action against AI-generated artwork due to copyright concerns. Image synthesis models often include copyrighted images in their training data, which consists of massive amounts of photos and artwork scraped from across the web, typically without the artists’ consent. It is an open question whether this type of scraping and the resulting AI-generated artwork amount to copyright violations, particularly in the case of commercial use, and platforms like Getty and InkBlot Art don’t want to take that risk.

As part of Hive’s commitment to providing enterprise customers with API-accessible solutions to moderation problems, we have created a classification model made specifically to assist digital platforms in enacting these bans. Our AI-Generated Media Recognition API is built with the same type of robust classification model as our industry-leading visual moderation products, and it enables enterprise customers to moderate AI-generated artwork without relying on users to flag images manually.

This post explains how our model works and the new API that makes this functionality accessible.

Using AI to Identify AI: Building Our Classifier

Hive’s AI-Generated Media Recognition model is optimized for use with the kind of media generated by popular AI generative engines such as DALL-E, Midjourney, and Stable Diffusion. It was trained on a large dataset comprising millions of artificially generated images and human-created images such as photographs, digital and traditional art, and memes sourced from across the web.

The resulting model is able to identify AI-created images among many different types and styles of artwork, even correctly identifying AI artwork that could be misidentified by manual flagging. Our model returns not only whether or not a given image is AI-generated, but also the likely source engine it was generated from. Each classification is accompanied by a confidence score that ranges from 0.0 to 1.0, allowing customers to set a confidence threshold to guide their moderation.

How it Works: An Example Input and Response

When receiving an input image, our AI-Generated Media Recognition model returns classifications under two separate heads. The first provides a binary classification as to whether or not the image is AI-generated. The second, which is only relevant when the image is classified as an AI-made image, identifies the source of that artificial image from among the most popular generation engines that are currently in use.

To get a sense of the capabilities of our AI-Generated Media Recognition model, here’s a look at an example classification:

This input image was created with the AI model Midjourney, though it is so realistic that it may be missed by manual flagging. As shown in the response above, our model correctly classifies this image as AI-generated with a high confidence score of 0.968. The model also correctly identifies the source of the image, with a similarly high confidence score. Other sources like DALL-E are also returned along with their respective confidence scores, and the scores under each of the two model heads sum to 1.
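As a rough illustration of how the two heads might be used downstream, the sketch below applies a confidence threshold to the binary head and, when it fires, reads off the most likely source engine. The field and class names are assumptions chosen for readability, not the exact response schema:

```python
def moderate_ai_artwork(response: dict, threshold: float = 0.9) -> dict:
    """Decide whether an image should be treated as AI-generated and, if so, report its likely source.

    `response` is assumed to map each model head to {class_name: confidence}, e.g.
    {"ai_generated": {"ai_generated": 0.968, "not_ai_generated": 0.032},
     "source_engine": {"midjourney": 0.94, "dalle": 0.04, "stable_diffusion": 0.02}}
    """
    binary_head = response.get("ai_generated", {})
    if binary_head.get("ai_generated", 0.0) < threshold:
        return {"action": "allow"}

    sources = response.get("source_engine", {})
    likely_source = max(sources, key=sources.get) if sources else "unknown"
    return {"action": "flag_for_removal", "source": likely_source}
```

Raising or lowering the threshold trades recall for precision, which is exactly the lever the confidence scores are meant to provide.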

Platforms that host artwork of any kind can integrate this AI-Generated Media Recognition API into their workflows by automatically screening all content as it is being posted. This method of moderating AI artwork works far more quickly than manual flagging and can catch realistic artificial artworks that even human reviewers might miss.

Final Thoughts and Future Directions

Digital platforms are now being flooded with AI-generated content, and that influx will only increase as these generative models continue to grow and spread. On top of this, the tools for creating this kind of artwork are fast and easy to access online, which allows large quantities of it to be produced quickly. Moderating artificially created artworks is crucial for many sites to maintain their platform’s mission and protect themselves and their customers from potential legal issues further down the line.

We created our AI-Generated Media Recognition API to solve this problem, but our model will need to continue to evolve along with image generation models as existing ones improve and new ones are released. We plan on adding new generative engines to our sources as well as continually updating our model to keep up with the current capabilities of these models. Since some newer generative models can create video in addition to still images, we are working to add support for video formats within our API in order to best prevent all types of AI-generated artwork from dominating online communities where it is unwelcome.

If you’d like to learn more about this and other solutions we’re building, please feel free to reach out to sales@thehive.ai or contact us here.

Updates to Hive’s Best-in-Class Visual Moderation Model

Hive’s visual classifier is a cornerstone of our content moderation suite. Our visual moderation API has consistently been the best solution on the market for moderating key types of image-based content, and some of the world’s largest content platforms continue to trust Hive’s Visual Moderation model for effective automated enforcement on NSFW images, violence, hate, and more. 

As content moderation needs have evolved and grown, our visual classifier has also expanded to include 75 moderation classes across 31 different model heads. This is usually an iterative process – as our partners continue to send high volumes of content for analysis, we uncover ways to refine our classification schemes and identify new, useful types of content.

Recently, we’ve worked on broadening our visual model by defining new classes with input from our customers. And today, we’re shipping the general release of our latest visual moderation model, including three new classes to bolster our existing model capabilities:

  • Undressed to target racier suggestive images that may not be explicit enough to label NSFW
  • Gambling to capture betting in casinos or on games and sporting events
  • Confederate to capture imagery of the Confederate flag and graphics based on its design

All Hive customers can now upgrade to our new model to access predictions in these new classes at no additional cost. In this post, we’ll take a closer look at how these classes can be used and our process behind creating them.

New Visual Moderation Classes For Greater Content Understanding

Deep learning classifiers are most effective when given training data that illustrates a clear definition of what does and does not belong in the class. For this release, we used our distributed data labeling workforce – with over 5 million contributors – to efficiently source instructive labels on millions of training images relevant to our class definitions. 

Below, we’ll take a closer look at some visual examples to illustrate our ground truth definitions for each new class. 

Undressed

In previous versions, Hive’s visual classifier separated adult content into two umbrella classes: “NSFW,” which includes nudity and other explicit sexual classes, and “Suggestive,” which captures milder classes that might still be considered inappropriate. 

Our “Suggestive” class is a bit broad by design, and some customers have expressed interest in a simple way to identify the racier cases without also flagging more benign images (e.g., swimwear in beach photos). So, for this release, we trained a new class to refine this distinction: undressed.

We wanted this class to capture images where a subject is clearly nude, even if their privates aren’t visible due to their pose, are temporarily covered by their hands or an object, or are occluded by digital overlays like emojis, scribbles, or shapes. To construct our training set, we added new annotations to existing training images for our NSFW and Suggestive classes and sourced additional targeted examples. Overall, this gave us a labeled set of 2.6M images to teach this ground truth to our new classifier. 

Here’s a mild example to help illustrate the difference between our undressed and NSFW definitions (you can find a full definition for undressed and other relevant classes in our documentation): 

Confidence scores for unedited version (left): undressed 1.00; general_nsfw 1.00; general_suggestive 0.00. Confidence scores for edited version (right): undressed 0.99; general_nsfw 0.35; general_suggestive 0.61

The first image showing explicit nudity is classified as both undressed and NSFW with maximum confidence. When we add a simple overlay over relevant parts of the image, however, the NSFW score drops far below threshold confidence while the undressed score remains very high. 

Platforms can use undressed to flag both nudity and more obviously suggestive images in a single class. For content policies where milder images are allowed but undressed-type images are not, we expect this class to significantly reduce any need for human moderator review to enforce this distinction. 
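As an illustration, an automated rule built on these classes might look like the sketch below. The class names mirror the scores shown above, but the thresholds and routing decisions are placeholders that each platform would tune to its own content policy:

```python
def review_image(scores: dict, undressed_cutoff: float = 0.9) -> str:
    """Route an image based on visual moderation scores, e.g.
    {"general_nsfw": 0.35, "general_suggestive": 0.61, "undressed": 0.99}."""
    if scores.get("undressed", 0.0) >= undressed_cutoff:
        return "remove"          # nudity, even when privates are covered or overlaid
    if scores.get("general_suggestive", 0.0) >= 0.9:
        return "human_review"    # milder suggestive content, if policy calls for a second look
    return "allow"               # swimwear and other benign images pass through
```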

Gambling

Gambling was another type of content that frequently came up in customer feedback. This was a new undertaking for Hive, and building our ground truth and training set for this class was an interesting exercise in definitions and evaluating context in images. 

Technically, gambling involves a wager staked on an uncertain outcome with the intent of winning a prize. For practical purposes, though, we decided to consider evidence of betting as the key factor. Certain behavior – like playing a slot machine or buying a lottery ticket – is always gambling since it requires a bet. But cards, dice, and competitive games don’t necessarily involve betting. We found the most accurate approach to be requiring visible money, chips or other tokens in these cases in order to flag an image as gambling. Similarly, we don’t consider photos at races or sporting events to be gambling unless receipts from a betting exchange or website are also shown.  

To train our new class on this ground truth definition, we sourced and labeled a custom set of over 1.1M images. The new visual classifier can now distinguish between gambling activity and similar non-gambling behavior, even if the images are visually similar:

For more detailed information, you can see a full description of our gambling class here. Platforms that wish to moderate or identify gambling can access predictions from this model head by default after upgrading to this model release. 

Confederate Symbolism

Separately, many of our customers also expressed interest in more complete monitoring of visual hate and white nationalism, especially Confederate symbolism. For this release, we sourced and labeled over 1M images to train a new class for identifying imagery of the commonly used version of the Confederate flag. 

In addition to identifying photos of the flag itself, this new model head will also capture the Confederate “stars and bars” shown in graphics, tattoos, clothing, and the like. We also trained the model to ignore visually similar flags and historical variants that are not easily recognizable:

Along with our other hate classes, customers can now use predictions from our Confederate class to keep their online environments safe.

Improvements to Established Visual Moderation Classes

Beyond these new classes, we also focused on improving the model’s understanding of niche edge cases in existing model heads. For example, we leveraged active learning and additional training examples to address biases we occasionally found in our NSFW and Gun classifiers. This corrected some interesting failure modes in which the model sometimes identified studio microphones as guns, or mistook acne creams for other, less safe-for-work liquids.

Final Thoughts

This release delivers our most comprehensive and capable Visual Moderation model yet to help platforms develop proactive, cost-effective protection for their online communities. As moderation needs become more sophisticated, we’ll continue to incorporate feedback from our partners and refine our content moderation models to keep up. Stay tuned for our next release with additional classes and improvements later this year.

If you have any questions about this release, please get in touch at support@thehive.ai or api@thehive.ai. You can also find a video tutorial for upgrading to the latest model configuration here. For more general information on Visual Moderation, feel free to reach out to sales@thehive.ai or check out our documentation.

Deep Learning Methods for Moderating Harmful Viral Content

Content Moderation Challenges in the Aftermath of Buffalo

The racially-motivated shooting in a Buffalo supermarket – live streamed by the perpetrator and shared across social media – is tragic on many levels.  Above all else, lives were lost and families are forever broken as a result of this horrific attack.  Making matters worse, copies of the violent recording are spreading on major social platforms, amplifying extremist messages and providing a blueprint for future attacks.

Unfortunately, this is not a new problem: extremist videos and other graphic content have been widely shared for shock value in the past, with little regard for the negative impacts. And bad actors are more sophisticated than ever, uploading altered or manipulated versions to thwart moderation systems.

As the world grapples with broader questions of racism and violence, we’ve been working with our partners behind the scenes to help control the spread of this and other harmful video content in their online communities.  This post covers the concerns these partners have raised with legacy moderation approaches, and how newer technology can be more effective in keeping communities safe. 

Conventional Moderation and Copy Detection Approaches

Historically, platforms relied on a combination of user reporting and human moderation to identify and react to harmful content. Once the flagged content reaches a human moderator, enforcement is usually quick and highly accurate. 

But this approach does not scale for platforms with millions (or billions) of users.  It can take hours to identify and act on an issue, especially in the aftermath of a major news event when post activity is highest.  And it isn’t always the case that users will catch bad content quickly: when the Christchurch massacre was live streamed in 2019, it was not reported until 12 minutes after the stream ended, allowing the full video to spread widely across the web.

More recently, platforms have found success using cryptographic hashes of the original video to automatically compare against newly posted videos.  These filters can quickly and proactively screen high volumes of content, but are generally limited to detecting copies of the same video. Hashing checks often miss content if there are changes to file formats, resolutions, and codecs. And even the most advanced “perceptual” hashing comparisons – which preprocess image data in order to consider more abstract features – can be defeated by adversarial augmentations.  

Deep Learning To Advance Video Moderation and Contain Viral Content

Deep learning models can close the moderation capability gap for platforms in multiple ways. 

First, visual classifier models can proactively monitor live or prerecorded video for indicators of violence. These model predictions enable platforms to shut down or remove content in real time, preventing the publishing and distribution of videos that break policies in the first place. The visual classifiers can look for combinations of factors, such as someone holding a gun, bodily injury, blood, and other object or scene information, to create automated and nuanced enforcement mechanisms. Specialized training techniques can also teach visual classifiers to accurately identify the difference between real violence and photorealistic violence depicted in video games, so that something like a first-person shooter game walkthrough is not mistaken for a real violent event.

In addition to screening using visual classifiers, platforms can harness new types of similarity models to stop reposts of videos confirmed to be harmful, even if those videos are adversarially altered or manipulated. If modified versions somehow bypass visual classification filters, these models can catch them based on visual similarity to the original version.

In these cases, self-supervised training techniques expose the models to a range of image augmentation and manipulation methods, enabling them to accurately assess human perceptual similarity between image-based content. These visual similarity models can detect duplicates and close copies of the original image or video, including more heavily modified versions that would otherwise go undetected by hashing comparisons.

Unlike visual classifiers, these models do not look for specific visual subject matter in their analysis.  Instead, they quantify visual similarity on a spectrum based on overlap between abstract structural features. This means there’s no need to produce training data to optimize the model for every possible scenario or type of harmful content; detecting copies and modified versions of known content simply requires that the model accurately assess whether images or video come from the same source.

How it works: Deep Learning Models in Automated Content Moderation Systems

Using predictions from these deep learning models as a real-time signal offers a powerful way to proactively screen video content at scale. These model results can inform automated enforcement decisions or triage potentially harmful videos for human review. 

Advanced visual classification models can accurately distinguish between real and photorealistic animated weapons. Here are results from video frames containing both animated and real guns. 

To flag real graphic violence, automated moderation logic could combine confidence scores in actively held weapons, blood, and/or corpse classes but exclude more benign images like these examples. 
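A rule along those lines might look like the following sketch; the class names and thresholds are illustrative placeholders rather than a production configuration:

```python
def is_real_graphic_violence(scores: dict) -> bool:
    """Combine class confidences to flag real-world graphic violence.

    `scores` is assumed to map class names (e.g. "gun_in_hand", "blood",
    "corpse", "animated_gun") to confidences between 0.0 and 1.0.
    """
    weapon = scores.get("gun_in_hand", 0.0)
    gore = max(scores.get("blood", 0.0), scores.get("corpse", 0.0))
    animated = scores.get("animated_gun", 0.0)

    # Require both an actively held weapon and gore, and exclude frames the model
    # attributes to photorealistic video-game footage.
    return weapon >= 0.9 and gore >= 0.9 and animated < 0.5
```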

As a second line of defense, platforms need to keep reposts or modified versions of known harmful videos from spreading. To do this, platforms can use predictions from pre-trained visual similarity models in the same way they use hash comparisons today. With an original version stored as a reference, automated moderation systems can perform a frame-wise comparison against any newly posted videos, flagging or removing new content that scores above a certain similarity threshold.

In these examples, visual similarity models accurately predict that frame(s) in the query video are derived from the original reference, even under heavy augmentation. By screening new uploads against video content known to be graphic, violent, or otherwise harmful, these moderation systems can replace incomplete tools like hashing and audio comparison to more comprehensively solve the harmful content detection problem.
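To make the frame-wise comparison concrete, here is a minimal sketch of the approach, assuming frame embeddings have already been produced by a visual similarity model; it illustrates the idea rather than any production implementation:

```python
import numpy as np

def video_matches_reference(query_embeddings, reference_embeddings, threshold: float = 0.85) -> bool:
    """Flag a newly posted video if any of its frames closely matches a known harmful reference.

    Both arguments are assumed to be lists of L2-normalized frame embeddings; the
    0.85 similarity threshold is illustrative, not a recommended setting.
    """
    reference = np.asarray(reference_embeddings)
    if reference.size == 0:
        return False
    for frame in np.asarray(query_embeddings):
        similarities = reference @ frame      # cosine similarity, since embeddings are normalized
        if float(similarities.max()) >= threshold:
            return True                       # one near-duplicate frame is enough to flag
    return False
```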

Final Thoughts: How Hive Can Help

No amount of technology can undo the harm caused by violent extremism in Buffalo or elsewhere.  We can, however, use new technology to mitigate the immediate and future harms of allowing hate-based violence to be spread in our online communities. 

Hive is proud to support the world’s largest and most diverse platforms in fulfilling their obligation to keep online communities safe, vibrant, and hopeful. We will continue to contribute towards state-of-the-art moderation solutions, and can answer questions or offer guidance to Trust & Safety teams who share our mission at support@thehive.ai.

Introducing Moderation Dashboard: a streamlined interface for content moderation

Over the past few years, Hive’s cloud-based APIs for moderating image, video, text, and audio content have been adopted by hundreds of content platforms, from small communities to the world’s largest and most well-known platforms like Reddit.

However, not every platform has the resources or interest in building their own software on top of Hive’s APIs to manage their internal moderation workflows.  And since the need for software like this is shared by many platforms, it made sense to build a robust, accessible solution to fill the gap.

Today, we’re announcing the Moderation Dashboard, a no-code interface for your Trust & Safety team to design and execute custom-built moderation workflows on top of Hive’s best-in-class AI models.  For the first time, platforms can access a full-stack, turnkey content moderation solution that’s deployable in hours and accessible via an all-in-one flexible seat-based subscription model.

We’ve spent the last month beta testing the Moderation Dashboard and have received overwhelmingly positive feedback.  Here are a few highlights:

  • “Super simple integration”: customizable actions define how the Moderation Dashboard communicates with your platform
  • “Effortless enforcement”: automating moderation rules in the Moderation Dashboard UI requires zero internal development effort
  • “Streamlined human reviews”: granular policy enforcement settings for borderline content significantly reduced the need for human intervention
  • “Flexible” and “Scalable”: easy to add seat licenses as your content or team needs grow, with a stable monthly fee you can plan for

We’re excited by the Moderation Dashboard’s potential to bring industry-leading moderation to more platforms that need it, and look forward to continuing to improve it with updates and new features based on your feedback.

If you want to learn more, the post below highlights how our favorite features work.  You can also read additional technical documentation here.

Easily Connect Moderation Dashboard to Your Application

Moderation Dashboard connects seamlessly to your application’s APIs, allowing you to create custom enforcement actions that can be triggered on posts or users – either manually by a moderator or automatically if content matches your defined rules.

You can create actions within the Moderation Dashboard interface specifying callback URLs that tell the Dashboard API how to communicate with your platform.  When an action triggers, the Moderation Dashboard will ping your callback server with the required metadata so that you can successfully execute the action on the correct user or post within your platform.
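For illustration, the receiving end on your platform could be as simple as the sketch below; the route and payload fields (action, post_id, user_id) are hypothetical stand-ins for whatever metadata you configure in your actions:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/moderation/actions", methods=["POST"])
def handle_moderation_action():
    """Receive an action triggered in Moderation Dashboard and apply it on our platform."""
    payload = request.get_json(force=True)
    action = payload.get("action")       # e.g. "remove_post" or "ban_user" (hypothetical values)
    post_id = payload.get("post_id")
    user_id = payload.get("user_id")

    if action == "remove_post":
        print(f"removing post {post_id}")   # replace with your platform's own logic
    elif action == "ban_user":
        print(f"banning user {user_id}")
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=8080)
```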

Implement Custom Content Moderation Rules

At Hive, we understand that platforms have different content policies and community guidelines. Moderation Dashboard enables you to set up custom rules according to your particular content policies in order to automatically take action on problematic content using Hive model results. 

Moderation Dashboard currently supports access to both our visual moderation model and our text moderation model – you can configure which of over 50 model classes to use for moderation and at what level directly through the dashboard interface. You can easily define sets of classification conditions and specify which of your actions – such as removing a post or banning a user – to take in response, all from within the Moderation Dashboard UI. 

Once configured, Moderation Dashboard can communicate directly with your platform to implement the moderation policy laid out in your rule set. The Dashboard API will automatically trigger the enforcement actions you’ve specified on any submitted content that violates these rules.

Another feature unique to Moderation Dashboard: we keep track of (anonymized) user identifiers to give you insight into high-risk users. You can design rules that account for a user’s post history to take automatic action on problematic users. For example, platforms can identify and ban users with a certain number of flagged posts in a set time period, or with a certain proportion of flagged posts relative to clean content – all according to rules you set in the interface.
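The logic behind a user-level rule like that is straightforward; here is an illustrative sketch with placeholder thresholds (the real values are whatever you configure in the interface):

```python
def should_ban_user(flagged_posts: int, total_posts: int,
                    max_flagged: int = 5, max_flagged_ratio: float = 0.5) -> bool:
    """Ban a user who accumulates too many flagged posts, in absolute or relative terms."""
    if flagged_posts >= max_flagged:
        return True
    return total_posts > 0 and flagged_posts / total_posts >= max_flagged_ratio
```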

Intuitive Adjustment of Model Classification Thresholds

Moderation Dashboard allows you to configure model classification thresholds directly within the interface. You can easily set confidence score cutoffs (for visual) and severity score cutoffs (for text) that tell Hive how to classify content according to your sensitivity around precision and recall.

Streamline Human Review

Hive’s API solutions were generally designed with an eye towards automated content moderation. Historically, this has required our customers to expend some internal development effort to build tools that also allow for human review. Moderation Dashboard closes this loop by allowing custom rules that route certain content to a Review Feed accessible by your human moderation team.

One workflow we expect to see frequently: automating moderation of content that our models classify as clearly harmful, while sending posts with less confident model results to human review. By limiting human review to borderline content and edge cases, platforms can significantly reduce the burden on moderators while also protecting them from viewing the worst content.

Setting Human Review Thresholds

To do this, Moderation Dashboard administrators can set custom score ranges that trigger human review for both visual and text moderation. Content scoring in these ranges will be automatically diverted to the Review Feed for human confirmation. This way, you can focus review from your moderation team on trickier cases, while leaving content that is clearly allowable and clearly harmful to your automated rules. Here’s an example rule that sends text content scored as “controversial” (severity scores of 1 or 2) to the review feed but auto-moderates the most severe cases.
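The routing behind a rule like that boils down to a few lines; this sketch mirrors the example, with the cutoffs left for each platform to tune:

```python
def route_text_post(severity: int) -> str:
    """Route a post using Hive's 0 (benign) to 3 (most severe) text severity score."""
    if severity >= 3:
        return "auto_moderate"   # clearly harmful: act automatically
    if severity in (1, 2):
        return "human_review"    # controversial or borderline: send to the Review Feed
    return "allow"               # benign content passes straight through
```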

Review Feed Interface for Human Moderators

When your human review rules trigger, Moderation Dashboard will route the post to the Review Feed of one of your moderators, where they can quickly visualize the post and see Hive model predictions to inform a final decision.

For each post, your moderators can select from the moderation actions you’ve set up to implement your content policy. Moderation Dashboard will then ping your callback server with the required information to execute that action, enabling your moderators to take quick action directly within the interface.

Additionally, Moderation Dashboard makes it simple for your Trust & Safety team administrators to onboard and grant review access to additional moderators. Platforms can easily scale their content moderation capabilities to keep up with growth.

Access Clear Intel on Your Content and Users

Beyond individual posts, Moderation Dashboard includes a User Feed that allows your moderators to see detailed post histories of each user that has submitted unsafe content. 

Here, your moderators can access an overview of each user, including their total number of posts and the proportion of those posts that triggered your moderation rules. The User Feed also shows each of that user’s posts along with its moderation categories and any corresponding action taken.

Similarly, Moderation Dashboard makes quality control easy with a Content Feed that displays all posts moderated automatically or through human review. The Content Feed allows you to see your moderation rules in action, including detailed metrics on how Hive models classified each post. From here, administrators can supervise human moderation teams for simple QA or further refine thresholds for automated moderation rules.

Effortless Moderation of Spam and Promotions

In addition to model classifications, Moderation Dashboard will also filter incoming text for spam entities – including URLs and personal information such as emails and phone numbers. The Spam Manager interface will aggregate all posts containing the same spam text into a single action item that can be allowed or denied with one click.

With Spam Manager, administrators can also define custom whitelists and blacklists for specific domains and URLs and then set up rules to automatically moderate spam entities in these lists. Finally, Spam Manager provides detailed histories of users that post spam entities for quick identification of bots and promotional accounts, making it easy to keep your platform free of junk content. 

Final Thoughts: The Future of Content Moderation

We’re optimistic that Moderation Dashboard can help platforms of all sizes meet their obligations to keep online environments safe and inclusive. With Moderation Dashboard as a supplement to (or replacement for) internal moderation infrastructure, it’s never been easier for our customers to leverage our top-performing AI models to automate their content policies and increase efficiency of human review. 

Moderation Dashboard is an exciting shift in how we deliver our AI solutions, and this is just the beginning. We’ll be quickly adding additional features and functionality based on customer feedback, so please stay tuned for future announcements.

If you’d like to learn more about Moderation Dashboard or schedule a personal demo, please feel free to contact sales@thehive.ai

OCR Moderation with Hive: New Approaches to Online Content Moderation

Recently, image-based content featuring embedded text – such as memes, captioned images and GIFs, and screenshots of text – has exploded in popularity across many social platforms. These types of content can present unique challenges for automated moderation tools. Not only does embedded text need to be detected and ordered accurately, it also must be analyzed with contextual awareness and attention to semantic nuance.

Emojis have historically been another obstacle for automated moderation. Thanks to native support across many devices and platforms, these characters have evolved into a new online lexicon for accentuating or replacing text. Many emojis have also developed connotations that are well-understood by humans but not directly related to the image itself, which can make it difficult for automated solutions to identify harmful or inappropriate text content.

To help platforms tackle these challenges, Hive offers optical character recognition (OCR)-based moderation as part of our content moderation suite. Our OCR models are optimized for the types of digitally generated content that commonly appear on social platforms, enabling robust AI moderation on content forms that are widespread yet overlooked by other solutions. Our OCR moderation API combines competitive text detection and transcription capabilities with our best-in-class text moderation model (including emoji support) into a single response, making it easy for platforms to take real-time enforcement actions across these popular content formats.

OCR Model for Text Recognition

Effective OCR moderation starts with training for accurate text detection and extraction. Hive’s OCR model is trained on a large, proprietary set of examples that optimizes for how text commonly appears within user-generated digital content. Hive has the largest distributed workforce for data labeling in the world, and we leaned on this capability to provide tens of millions of human annotations on these examples to build our model’s understanding. 

We recently conducted a head-to-head comparison of our OCR model against top public cloud solutions using a custom evaluation dataset sourced from social platforms. We were particularly interested in test examples that featured digitally-generated text – such as memes and captioned images – to capture how content commonly appears on social platforms and selected evaluation data accordingly. 

In this evaluation, we looked at end-to-end text recognition, which includes both text detection and text transcription. Here, Hive’s OCR model outperformed or was competitive with other models on both exact transcription and transcription allowing character-level errors. At 90% recall, Hive’s OCR model achieved a precision of 98%, while public cloud models ranged from ~88% to 97%, implying a similar or lower end-to-end error rate.

OCR Moderation: Language Support

We recognize that many platforms’ moderation needs extend beyond English-speaking users. Hive’s OCR model supports text recognition and transcription for many widely spoken languages with comparable performance, many of which are also supported by our text moderation solutions. Here’s an overview of our current language support:

Language     OCR Support?   Text Moderation Support?
English      Yes            Yes (Model)
Spanish      Yes            Yes (Model)
French       Yes            Yes (Model)
German       Yes            Yes (Model)
Mandarin     Yes            Yes (Pattern Match)
Russian      Yes            Yes (Pattern Match)
Portuguese   Yes            Yes (Model)
Arabic       Yes            Yes (Model)
Korean       Yes            Yes (Pattern Match)
Japanese     Yes            Yes (Pattern Match)
Hindi        Yes            Yes (Model)
Italian      Yes            Yes (Pattern Match)

Moderation of Detected Text

Hive’s OCR moderation solution goes beyond producing a transcript – we then apply our best-in-class text moderation model to understand the meaning of that text in context (including any detected emojis). Our backend will automatically feed text detected in an image as an input to our text moderation model, making our model classifications on image-based text accessible with a single API call. Our text model is generally robust to misspellings and character substitutions, enabling high classification accuracy on text extracted via OCR even if errors occur in transcription.
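For illustration, a single call might look like the sketch below; the endpoint and response fields are assumptions made for readability rather than the exact schema in our documentation:

```python
import requests

OCR_MODERATION_URL = "https://api.thehive.ai/api/v2/task/sync"  # illustrative endpoint

def moderate_image_text(image_url: str, api_key: str) -> dict:
    """Submit an image and return the extracted text alongside text moderation scores."""
    resp = requests.post(
        OCR_MODERATION_URL,
        headers={"Authorization": f"Token {api_key}"},
        data={"url": image_url},
    )
    resp.raise_for_status()
    result = resp.json()
    return {
        "extracted_text": result.get("text", ""),         # assumed field name
        "scores": result.get("moderation_scores", {}),    # e.g. {"sexual": 2, "hate": 0, ...}
    }
```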

Hive’s text moderation model can classify extracted text across several sensitive or inappropriate categories, including sexuality, threats or descriptions of violence, bullying, and racism. 

Another critical use case is moderating spam and doxxing: OCR moderation will quickly and accurately flag images containing emails, phone numbers, addresses, and other personally identifiable information. Finally, our text moderation model can also identify promotions such as soliciting services, asking for shares and follows, soliciting donations, or linking to external sites. This gives platforms new tools to curate user experience and remove junk content.

We understand that verbal communication is rarely black and white – context and linguistic nuance can have profound effects on how meaning and intent of words are perceived. To help navigate these gray areas, our text model responses supplement classifications with a score from benign (score = 0) to severe (score = 3), which can be used to adapt any necessary moderation actions to platforms’ individual needs and sensitivities. You can read more about our text models in previous blog posts or in our documentation.

Our currently supported moderation classes in each language are as follows:

Language     Classes
English      Sexual, Hate, Violence, Bullying
Spanish      Sexual, Hate
Portuguese   Sexual, Hate
French       Sexual
German       Sexual
Hindi        Sexual
Arabic       Sexual

Emoji Classification for Text Moderation

Emoji recognition is a unique feature of Hive’s OCR moderation model that opens up new possibilities for identifying harmful or harassing text-based content. Emojis can be particularly useful in moderation contexts because they can subtly (or not-so-subtly) alter how accompanying text is interpreted by the reader. Text that is otherwise innocuous can easily become inappropriate when accompanied by a particular emoji and vice-versa.

Hive OCR is able to detect and classify any emojis supported by Apple, Samsung, or Google devices. Our OCR model currently achieves a weighted accuracy of over 97% when classifying emojis. This enables our text moderation model to account for contextual meaning and connotations of emojis used in input text. 

To get a sense of our model’s understanding, let’s take a look at some examples of how use of emojis (or inclusion of text around emojis) changes our model predictions to align with human understanding. Each of these examples is from a real classification task submitted to our latest model release.

Here’s a basic example of how adding an emoji changes our model response from classifying as clean to classifying as sensitive. Our models understand not only the verbal concept represented by the emoji, but also what the emoji means semantically based on where it is located in the text. In this case, the bullying connotation of the “garbage” or “trash” emoji would be completely missed by an analysis of the text alone.

Our model is similarly sensitive to changes in semantic meaning caused by substitutions of emojis for text.

In this case, our model catches the sexual connotation added by the eggplant emoji in place of the word “eggplant.” Again, the text alone without an emoji – “lemme see that !” – is completely clean.

In addition to understanding how emojis can alter the meaning of text, our model is also sensitive to how text can change implications of emojis themselves.

Here, adding the phrase “hey hotty” transforms an emoji usually used innocuously into a message with suggestive intent, and our model prediction changes accordingly.  

Finally, Hive’s OCR and text moderation models are trained to differentiate between each skin tone option for emojis in the “People” category and understand their implications in the context of accompanying text. We are currently exploring how the ability to differentiate between light and darker skin tones can enable new tools to identify hateful, racist, or exclusionary text content.

OCR Moderation: Final Thoughts

User preferences for online communication are constantly evolving in both medium and content, which can make it challenging for platforms to keep up with abusive users. Hive prides itself on identifying blindspots in existing moderation tools and developing robust AI solutions using high-quality training data tailored to these use-cases. We hope that this post has showcased what’s possible with our OCR moderation capabilities and given some insight into our future directions. 

Feel free to contact sales@thehive.ai if you are interested in adding OCR capabilities to your moderation suite, and please stay tuned as we announce new features and updates!


New and Improved AI Models for Audio Moderation

Live streaming, online voice chat, and teleconferencing have all exploded in popularity in recent years. A wider variety of appealing content, shifting user preferences, and unique pressures of the coronavirus pandemic have all been major drivers of this growth. Daily consumption of video and audio content has steadily increased year-over-year, with a recent survey indicating that a whopping 90% of young people watch video content daily across a variety of platforms. 

As the popularity of user-generated audio and video increases, so too does the difficulty of moderating this content efficiently and effectively. While images and text can usually be analyzed and acted on quickly by human moderators, audio/video content – whether live or pre-recorded – is lengthy and linear, requiring significantly more review time for human moderation teams. 

Platforms owe it to their users to provide a safe and inclusive online environment. Unfortunately, the difficulties of moderating audio and video – in addition to the sheer volume of content – have led to passive moderation approaches that rely on after-the-fact user reporting. 

At Hive, we offer access to robust AI audio moderation models to help platforms meet these challenges at scale. With Hive APIs, platforms can access nuanced model classifications of their audio content in near-real time, allowing them to automate enforcement actions or quickly pass flagged content to human moderators for review. By automating audio moderation, platforms can cast a wider net when analyzing their content and take action more quickly to protect their users. 

How Hive Can Help: Speech Moderation

We built our audio solutions to identify harmful or inappropriate speech with attention to context and linguistic subtleties. By natively combining real-time speech-to-text transcription with our best-in-class text moderation model, Hive’s audio moderation API makes our model classifications and a full transcript of any detected speech available with a single API call.  Our API can also analyze audio clips sampled from live content and produce results in 10 seconds or less, providing real-time content intelligence that lets platforms act quickly.
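As a rough sketch of how a platform might wire this into a live stream, the loop below samples short clips, submits them, and routes results by severity. The endpoint, field names, and thresholds are illustrative assumptions, and `sample_clip` is a hypothetical callable supplied by the platform:

```python
import time
import requests

AUDIO_API_URL = "https://api.thehive.ai/api/v2/task/sync"  # illustrative endpoint

def moderate_live_audio(sample_clip, api_key: str, interval_seconds: int = 10) -> None:
    """Continuously sample audio chunks from a live stream and act on moderation results."""
    while True:
        clip = sample_clip()  # hypothetical: returns the latest audio chunk as WAV bytes
        resp = requests.post(
            AUDIO_API_URL,
            headers={"Authorization": f"Token {api_key}"},
            files={"media": ("clip.wav", clip, "audio/wav")},
        )
        resp.raise_for_status()
        result = resp.json()

        transcript = result.get("transcript", "")       # assumed field names
        severities = result.get("severity_scores", {})  # e.g. {"hate": 3, "sexual": 0}
        worst = max(severities.values(), default=0)
        if worst >= 3:
            print("severe violation detected:", transcript)   # e.g. cut the stream or mute the user
        elif worst >= 1:
            print("borderline clip routed to human review")
        time.sleep(interval_seconds)
```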

Speech Transcription

Effective speech moderation needs to start with effective speech transcription, and we’ve been working hard to improve our transcription performance. Our transcription model is trained on moderation-relevant domains such as video game streams, game lobbies, and argumentative conversations.

In a recent head-to-head comparison, Hive’s transcription model outperformed or was competitive with top public cloud providers on several publicly available datasets (the evaluation data for each set was withheld from training). 

Each evaluation dataset consisted of about 10 hours of recorded English speech with varying accents and audio quality. As shown, Hive’s transcription model achieved lower word error rates than top public cloud models. This measures the ratio of incorrect words, missed words, and inserted words to the total number of words in the reference, implying Hive’s accuracy was 10-20% higher than competing solutions. 
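For reference, the word error rate used in this comparison is the standard definition:

```python
def word_error_rate(substitutions: int, deletions: int, insertions: int, reference_words: int) -> float:
    """WER = (S + D + I) / N: the share of reference words that were substituted,
    dropped, or spuriously inserted by the transcription model."""
    return (substitutions + deletions + insertions) / reference_words
```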

Audio Moderation

Hive’s audio moderation tools go beyond producing a transcript – we then apply our best-in-class text moderation model to understand the meaning of that speech in context. Here, Hive’s advantage starts with our data. We operate the largest distributed data-labeling workforce in the world, with over five million Hive annotators providing accurate and consensus-driven training labels on diverse example text sourced from relevant domains. For our text models, we leaned on this capability to produce a vast, proprietary training set with millions of examples annotated with human classifications. 

Our models classify speech across five main moderation categories: sexual content, bullying, hate speech, violence, and spam. With ample training data at our disposal, our models achieve high accuracy in identifying these types of sensitive speech, especially at the most severe level. Our hate speech model, for example, achieved a balanced accuracy of ~95% in identifying the most severe cases with a 3% false positive rate (based on a recent evaluation using our validation data). 

Thoughtfully-chosen and accurately labeled training data is only part of our solution here. We also designed our verbal models to provide multi-leveled classifications in each moderation category. Specifically, our model will return a severity score ranging from 0 to 3 (most severe) in each major moderation class based on its understanding of full sentences and phrases in context. This gives our customers more granular intelligence on their audio content and the ability to tailor moderation actions to community guidelines and user expectations. Alternatively,  borderline/controversial cases can be quickly routed to human moderators for review.  

In addition to model classifications, our model response object includes a punctuated transcript with confidence scores for each word to allow more insight into your content and enable quicker review by human moderators if desired. 

Language Support

We recognize that many platforms’ moderation needs extend beyond English-speaking users. At the time of writing, we support audio moderation for English, Spanish, Portuguese, French, German, Hindi, and Arabic. We train each model separately with an eye towards capturing subtleties that vary across cultures and regions. Our currently supported moderation classes in each language are as follows: 

We frequently update our models to add support for our moderation classes in each language, and are currently working to add more support for these and other widely spoken languages. 

Beyond Words: Sound Classification

Hive’s audio moderation model also offers the unique ability to detect and classify undesirable sounds. This opens up new insights into audio content that may not be captured by speech transcription alone. For example, our audio model can detect explicit or inappropriate noises, shouting, and repetitive or abrasive noises to enable new modalities for audio filtering and moderation. We hope that these sound classifications can help platforms identify toxic behaviors beyond bad speech and take action to improve user experience. 

Final Thoughts: Audio Moderation

Hive audio moderation makes it simple to access accurate, real-time intelligence on audio and video content and take informed moderation actions to enforce community guidelines. Our solution is nimble and scalable, helping platforms of all sizes grow with peace of mind. We believe our tools can have a significant impact in curbing toxic or abusive behavior online and lead to better experiences for users.

At Hive, we pride ourselves on continuous improvement. We are frequently optimizing and adding features to our models to increase their understanding and cover more use cases based on client input. We’d love to hear any feedback or suggestions you may have, and please stay tuned for updates!
