
Expanding our Moderation APIs with Hive’s New Vision Language Model


Hive is thrilled to announce that we’re releasing the Moderation 11B Vision Language Model. Fine-tuned on top of Llama 3.2 11B Vision Instruct, Moderation 11B is a new vision language model (VLM) that expands our established suite of text and visual moderation models and offers a powerful way to handle flexible, context-dependent moderation scenarios.

An Introduction to VLMs and Moderation 11B

Vision language models (VLMs) are models that can learn from both image and text inputs. The ability to process inputs across multiple modalities (e.g., images and text) simultaneously is known as multimodality. VLMs share many functions with large language models (LLMs), but traditional LLMs cannot process image inputs.

With Moderation 11B VLM, we leverage these multimodal capabilities to extend our existing moderation tool suite. Beyond its multimodality, Moderation 11B VLM can incorporate additional contextual information, which our traditional classifiers cannot. The model’s baked-in knowledge, combined with what it learned from our classifier datasets, enables a more comprehensive approach to moderation.

Moderation 11B VLM is trained on all 53 public heads of our Visual Moderation system, recognizing content across distinct categories such as sexual content, violence, drugs, hate, and more. These enhancements make it a valuable addition to our existing Enterprise moderation classifiers, helping capture the wide range of flexible and alternative cases that can arise in dynamic workflows.

Potential Use Cases

Moderation 11B VLM applies to a broad range of use cases, notably surpassing Llama 3.2 11B Vision Instruct in identifying contextual violations and handling unseen data. Below are some potential use cases where our model performs well:

  1. Contextual violations: Cases where individual inputs alone may not be flagged as violations, but the inputs taken together in context constitute one. For example, a text message could appear harmless on its own, yet the preceding conversation reveals it to be a violation (see the sketch following this list).
  2. Multimodal violations: Situations where both text and image inputs matter. For instance, analyzing a product image alongside its description can uncover violations that single-modality models would miss.
  3. Unseen data: Inputs that the model has not previously encountered. For example, customers may use Moderation 11B VLM to ensure that user content aligns with newly introduced company policies.
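To make the first case concrete, here is a rough sketch of how a request might supply conversation context alongside the message being moderated. The endpoint URL, authorization header, and payload field names below are illustrative assumptions rather than the documented API schema; refer to our documentation for the exact request format.

```python
import requests

# Illustrative only: endpoint, auth header, and field names are assumptions,
# not the documented Moderation 11B VLM schema.
API_URL = "https://api.thehive.ai/api/v3/hive/moderation-11b-vlm"  # assumed
API_KEY = "YOUR_API_KEY"

payload = {
    # Earlier messages give the model the context needed to judge the final one.
    "context": [
        {"role": "user", "text": "Tell me where you live. I know you're home alone."},
        {"role": "user", "text": "You can't ignore me forever."},
    ],
    # On its own this message looks harmless; with the context above it reads as a threat.
    "text_input": "See you tonight.",
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Token {API_KEY}"},  # assumed auth scheme
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```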

Below are graphical representations of our fine-tuned Moderation 11B model’s performance compared to the Llama 3.2 11B Vision Instruct model. We assessed their respective F1 scores, a metric that combines both precision and recall.
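For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall), so a model scores well only when it both catches most true violations (high recall) and avoids false flags (high precision).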

Expanding Moderation

With Moderation 11B VLM’s release, we hope to meaningfully and flexibly broaden the range of use cases our moderation tools can handle. We’re excited to see how this model assists with your moderation workflows, especially when navigating complex scenarios. Anyone with a Hive account can access our API playground here to try Moderation 11B VLM directly from the user interface.

Below are two examples of Moderation 11B VLM requests and responses.
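At the code level, such a request might look like the following sketch, which pairs a product image with its description and an optional policy string. The endpoint URL, field names, auth scheme, and response handling are assumptions for illustration rather than the documented schema, which is described in the documentation linked below.

```python
import requests

# Sketch of a multimodal (image + text) moderation request.
# Endpoint, auth header, and field names are assumptions for illustration;
# see Hive's API documentation for the real schema.
API_URL = "https://api.thehive.ai/api/v3/hive/moderation-11b-vlm"  # assumed
API_KEY = "YOUR_API_KEY"

payload = {
    # A product listing: the image and description are judged together, since
    # a violation may only be apparent from their combination.
    "image_url": "https://example.com/listing-photo.jpg",
    "text_input": "Discreet packaging, no questions asked.",
    # Optional policy text the model can take into account (illustrative field).
    "policy": "Listings may not offer regulated goods or imply evasion of checks.",
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Token {API_KEY}"},  # assumed auth scheme
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # per-category decisions or scores, depending on the response format
```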

For more details, please refer to the documentation here. If you’re interested in learning more about what we do, please reach out to our sales team (sales@thehive.ai) or contact us here with any further questions.