Hive AI
Multimodal Language Models
Seamlessly integrate powerful multimodal models, including Hive’s Moderation 11B Vision Language Model and popular open-source options like Llama 3.2 11B Vision Instruct.
Explore All Multimodal Language Models
Integrate popular open-source multimodal models like Llama 3.2 11B Vision Instruct, hosted by Hive, into production workflows with just a few lines of code.
Llama 3.2 11B Vision Instruct
Llama 3.2 11B Vision Instruct is an instruction-tuned model optimized for a variety of vision-based use cases. These include but are not limited to: visual recognition, image reasoning and captioning, and answering questions about images.
Moderation 11B Vision Language Model
Built on top of Llama 3.2 11B Vision Instruct and Hive’s proprietary dataset, this model expands our existing moderation tools to handle more comprehensive contexts and cases. With advanced multimodal capabilities, it excels at detecting NSFW, violence, and other harmful content across text and images.
How customers use our Multimodal Language Models
Content Moderation at Scale
Platforms detect harmful content in complex image and text cases to ensure safer user experiences while maintaining compliance.
Enhance Accessibility
Generate multilingual, context-rich descriptions for images and videos, making visual content more accessible and improving inclusivity across platforms.
Improve Advertising and Insights
Advertisers and platforms analyze visuals to understand ad content, context, and placement opportunities, while gaining deeper insights for data-driven strategies.
What makes our Moderation 11B Vision Language Model unique
Accurate responses for a wide range of multimodal use cases
Explore everything you can achieve with our API in the documentation. From generating detailed captions to answering contextual questions, our models deliver reliable results for text, image, and video inputs.
Input: an image (gif, jpg, png, webp) or video (mp4, webm, avi, mkv, wmv, mov), plus a prompt
Response: clear, accurate captions, direct answers to your questions, or moderation scoring, powered by our advanced Vision models.
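As a minimal sketch, a request pairing an image URL with a prompt might look like the Python snippet below. The endpoint path, model identifier, and payload fields shown here are illustrative assumptions rather than the documented API; consult the documentation for the exact request format.

import requests

# NOTE: the endpoint, model name, and payload shape below are assumptions for
# illustration only; see Hive's API documentation for the actual format.
API_URL = "https://api.thehive.ai/api/v3/chat/completions"  # assumed endpoint
API_KEY = "YOUR_HIVE_API_KEY"

payload = {
    "model": "Llama-3.2-11B-Vision-Instruct",  # assumed model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
}

# Send the prompt and image reference, then print the model's JSON response.
response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())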
Why choose our Multimodal Language Models
Speed at scale
We handle high volume with ease and efficiency, serving real-time responses to billions of API calls per month.
Proactive updates
Our Multimodal Language Model is regularly upgraded to improve performance and keep up with evolving customer needs.
Simple integration
Get accurate image descriptions on demand. Integrate our Multimodal Language Model into any application with just a few clicks.
Simple usage-based pricing so you only pay for what you use
Multimodal Language Model Pricing Details
Model                                 | Pricing | Unit
Llama 3.2 11B Vision Instruct         | $0.20   | 1M Tokens (Input + Output)
Moderation 11B Vision Language Model  | $0.20   | 1M Tokens (Input + Output)
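For example, at $0.20 per 1M combined tokens, a workload that uses 400,000 input tokens and 100,000 output tokens (0.5M tokens total) would cost 0.5 x $0.20 = $0.10 with either model.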