{"id":38,"date":"2023-09-14T12:42:00","date_gmt":"2023-09-14T12:42:00","guid":{"rendered":"http:\/\/54.151.72.21\/?p=38"},"modified":"2025-02-14T21:06:48","modified_gmt":"2025-02-14T21:06:48","slug":"best-in-class-hive-model-benchmarks","status":"publish","type":"post","link":"https:\/\/thehive.ai\/blog\/best-in-class-hive-model-benchmarks","title":{"rendered":"Best-in-Class: Hive Model Benchmarks"},"content":{"rendered":"\n<h2>What does it mean to be \u201cbest-in-class\u201d?<\/h2>\n\n\n\n<p>We often refer to our models as \u201cindustry-leading\u201d or \u201cbest-in-class,\u201d but what does this actually mean in practice?&nbsp;<em>How<\/em>&nbsp;are we better than our competitors, and by how much? It is easy to throw these terms around, but we mean it \u2014 and we have the evidence to back it up. In this blog post, we\u2019ll be walking through some of the benchmarks that we have run against similar products to show how our models outperform the competition.<\/p>\n\n\n\n<h2>Visual Moderation<\/h2>\n\n\n\n<p>First, let\u2019s take a look at one of our oldest and most popular models: visual moderation. To compare our model to its major competitors, we ran a test set of NSFW, suggestive, and clean images through all models.<\/p>\n\n\n\n<p>Visual moderation is a classification task \u2014 in other words, the model\u2019s job is to classify each submitted image into one of several categories (in this case, NSFW or Clean). A popular and effective way to measure the performance of classification models is to look at their precision and recall. Precision is the number of true positives (i.e., correctly identified NSFW images) over the number of predicted positives (images predicted to be NSFW). Recall is the number of true positives (correctly identified NSFW images) over the number of ground-truth positives (actual NSFW images).&nbsp;<\/p>\n\n\n\n<p>There is a tradeoff between the two. 
If you predict all images to be NSFW, you will have perfect recall \u2014 you caught all the NSFW images! \u2014 but horrible precision because you incorrectly classified many clean images as NSFW. The goal is to have both high recall&nbsp;<em>and<\/em>&nbsp;high precision, no matter what confidence threshold is used.<\/p>\n\n\n\n<p>With our visual moderation models, we\u2019ve achieved this. We plotted the results of our test as a precision\/recall curve, showing that even at high recall we maintain high precision and vice versa while our competitors fall behind us.<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"737\" data-id=\"122\"  src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-1024x737.png\" alt=\"\" class=\"wp-image-122\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-1024x737.png 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-300x216.png 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-768x552.png 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-1536x1105.png 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1.png 1600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/figure>\n\n\n\n<p>The above plot is for NSFW content detection. Our precision at 90% recall is nearly perfect at 99.6%, which makes our error rate a whopping 45 times lower than Public Cloud C. Even Public Clouds A and B, which are closer to us in performance, have error rates 12.5 times higher and 22.5 times higher than ours respectively.<\/p>\n\n\n\n<p>We also benchmarked our model for suggestive content detection, or content that is inappropriate but not as explicit as our NSFW category. Hive\u2019s error rate remains far below the other models, resting at 6 times lower than Public Cloud A and 12 times lower than Public Cloud C. 
Public Cloud B did not offer a similar category and thus could not be compared.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"737\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-1024x737.png\" alt=\"\" class=\"wp-image-124\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-1024x737.png 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-300x216.png 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-768x552.png 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-1536x1105.png 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2.png 1600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We only ran our test on NSFW\/explicit imagery more broadly because our competitors do not have equivalent classes to ours for other visual moderation classes such as drugs, gore, and terrorism. This makes comparisons difficult, though it also in itself speaks to the fact that we offer far more classes than many of our competitors. With more than 90 subclasses, our visual moderation model far exceeds its peers in terms of the granularity of our results \u2014 we don\u2019t just have classes for NSFW, but also for nudity, underwear, cleavage, and other smaller categories that offer our customers a more in-depth understanding of their content.<\/p>\n\n\n\n<h2>Text Moderation<\/h2>\n\n\n\n<p>We used precision\/recall curves to compare our text moderation model as well. For this comparison, we charted our performance across eight different classes. 
Hive outperforms all peer models on every single one.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"551\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-12-1024x551.png\" alt=\"\" class=\"wp-image-841\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-12-1024x551.png 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-12-300x161.png 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-12-768x413.png 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-12.png 1250w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Hive\u2019s error rate on sexual content is 4 times lower than its closest competitor, Public Cloud B. Our other two competitors for that class both have error rates 6 times higher. The threat class boasts similar metrics, with Hive\u2019s error rate between 2 and 4 times lower than all its peers.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"553\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-8-1024x553.png\" alt=\"\" class=\"wp-image-842\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-8-1024x553.png 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-8-300x162.png 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-8-768x415.png 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-8-1536x830.png 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-8.png 1910w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Hive\u2019s model for hateful content detection is on par with our competitors, remaining slightly ahead on all thresholds. 
Our model for bullying content does the same, with an error rate 2 times lower than all comparable models.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"552\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.05.55-PM-1024x552.png\" alt=\"\" class=\"wp-image-844\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.05.55-PM-1024x552.png 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.05.55-PM-300x162.png 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.05.55-PM-768x414.png 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.05.55-PM-1536x828.png 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.05.55-PM.png 1914w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Hive is one of few companies to offer text moderation for drugs and weapons, and our error rates here are also worth noting \u2014 our only competitor has an error rate 4 and 8 times higher than ours for drugs and weapons respectively.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"552\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.06.17-PM-1024x552.png\" alt=\"\" class=\"wp-image-845\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.06.17-PM-1024x552.png 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.06.17-PM-300x162.png 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.06.17-PM-768x414.png 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.06.17-PM-1536x829.png 1536w, 
https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.06.17-PM-2048x1105.png 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Hive also offers the child exploitation class, one that few others provide. With this class, we achieve an error rate 8 times lower than our only other major competitor.<\/p>\n\n\n\n<h2>Audio Moderation<\/h2>\n\n\n\n<p>For Audio Moderation, we evaluate our model using word error rate (WER), which is the gold-standard metric for a speech recognition system. Word error rate is the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference transcript, and a perfect word error rate is 0. As you can see, we achieve the best or near-best performance across a variety of languages.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"425\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-1-1024x425.jpg\" alt=\"\" class=\"wp-image-128\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-1-1024x425.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-1-300x124.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-1-768x318.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-1-1536x637.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/1-1.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"425\" data-id=\"130\"  src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-1-1024x425.jpg\" alt=\"\" class=\"wp-image-130\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-1-1024x425.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-1-300x124.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-1-768x318.jpg 768w, 
https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-1-1536x637.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2-1.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/figure>\n\n\n\n<p><br>We excel across the board, with the lowest word error rate on the majority of the languages offered. On Spanish in particular, our word error rate is more than 4 times lower than Public Cloud B.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"425\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/3-1-1024x425.jpg\" alt=\"\" class=\"wp-image-131\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/3-1-1024x425.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/3-1-300x124.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/3-1-768x318.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/3-1-1536x637.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/3-1.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>For German and Italian we are very close behind Public Cloud C and remain better than all other competitors.<\/p>\n\n\n\n<h2>Optical Character Recognition (OCR)<\/h2>\n\n\n\n<p>To benchmark our OCR model, we calculated the F-score for our model as well as several of our competitors. F-score is the harmonic mean of a model\u2019s precision and recall, combining both of them into one measurement. A perfect F-score is 1. 
When comparing general F-scores, Hive excels as shown below.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"601\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.07.42-PM-1024x601.png\" alt=\"\" class=\"wp-image-848\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.07.42-PM-1024x601.png 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.07.42-PM-300x176.png 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.07.42-PM-768x451.png 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.07.42-PM-1536x902.png 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.07.42-PM.png 1908w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We also achieve best-in-class or near-best performance when comparing by language, as shown in the graphs below. With some languages, we excel by quite a large margin. For Chinese and Korean in particular, Hive\u2019s F-score is more than twice that of each of its competitors. 
We fall slightly behind in Hindi, yet still perform significantly better than Public Cloud A.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"381\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/4-1-1024x381.jpg\" alt=\"\" class=\"wp-image-133\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/4-1-1024x381.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/4-1-300x111.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/4-1-768x285.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/4-1-1536x571.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/4-1.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"381\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/5-1024x381.jpg\" alt=\"\" class=\"wp-image-134\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/5-1024x381.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/5-300x111.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/5-768x285.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/5-1536x571.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/5.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2>Demographics<\/h2>\n\n\n\n<p>We evaluated our age prediction model by calculating mean error, or how far off our age predictions were from the truth. Since the test dataset we used is labeled using age ranges and not individual numbers, mean error is defined as the distance in years from the closest end of the correct age range (e.g., guessing 22 for someone in the range 25-30 is an error of 3 years). 
A perfect mean error is 0.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"759\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.26-PM-1024x759.png\" alt=\"\" class=\"wp-image-849\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.26-PM-1024x759.png 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.26-PM-300x222.png 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.26-PM-768x569.png 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.26-PM-1536x1138.png 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.26-PM.png 1862w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>As you can see from this distribution, Hive has a significantly lower mean error in the three lowest age buckets (0-2, 3-9, and 10-19). In the age range 0-2, our mean error is 11 times lower than Public Cloud A\u2019s. For the ranges 3-9 and 10-19, our mean error is 5 times and 3 times lower respectively, still quite a large margin. 
Hive also excels notably at the oldest age bucket (70+), where our mean error is nearly 7 times lower than Public Cloud A\u2019s.<\/p>\n\n\n\n<p>For a broader analysis, we compared our overall mean error across all age buckets, as well as the accuracy of our gender predictions.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.54-PM-1024x272.png\" alt=\"\" class=\"wp-image-850\" width=\"840\" height=\"223\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.54-PM-1024x272.png 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.54-PM-300x80.png 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.54-PM-768x204.png 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.54-PM-1536x408.png 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/Screenshot-2024-07-04-at-10.08.54-PM.png 1850w\" sizes=\"(max-width: 840px) 100vw, 840px\" \/><\/figure>\n\n\n\n<h2>AutoML<\/h2>\n\n\n\n<p>One of the newest additions to our product suite, our AutoML platform allows you to train image classification and text classification models and to fine-tune large language models with your own custom datasets. To evaluate the effectiveness of this tool, we used the same datasets to train models both on our platform and on competitors\u2019 platforms and measured the performance of the resulting models.&nbsp;<\/p>\n\n\n\n<p>For image classification, we used three different classification tasks to account for the fact that different tasks have different levels of inherent difficulty and thus may yield higher or lower performing models. 
We also used three different dataset sizes for each classification task in order to measure how well the AutoML platform is able to work with limited amounts of examples.<\/p>\n\n\n\n<p>We compared the resulting models using balanced accuracy, which is the arithmetic mean of a model\u2019s true positive rate and true negative rate. A perfect balanced accuracy is 100%.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"213\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/6-1024x213.jpg\" alt=\"\" class=\"wp-image-135\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/6-1024x213.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/6-300x62.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/6-768x160.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/6-1536x320.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/6.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"212\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/7-1024x212.jpg\" alt=\"\" class=\"wp-image-137\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/7-1024x212.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/7-300x62.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/7-768x159.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/7-1536x317.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/7.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"213\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/8-1-1024x213.jpg\" alt=\"\" class=\"wp-image-138\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/8-1-1024x213.jpg 1024w, 
https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/8-1-300x62.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/8-1-768x160.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/8-1-1536x320.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/8-1.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>As shown in the above tables, Hive achieves best or near-best accuracy across all sets. Our results are quite similar to Public Cloud B\u2019s, pulling ahead on the product dataset. We fell to near-best performance on the smoking dataset, which is the most difficult of the three classification tasks. Even then, we remained within a few percentage points of the winner, Public Cloud B.<\/p>\n\n\n\n<p>For text classification, we trained models for three different categories: sexual content, drugs, and bullying. The results are in the table below. Hive outperforms all competitors on all three categories using all dataset sizes.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"251\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/9-1-1024x251.jpg\" alt=\"\" class=\"wp-image-139\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/9-1-1024x251.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/9-1-300x74.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/9-1-768x188.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/9-1-1536x377.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/9-1.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/10-1024x248.jpg\" alt=\"\" class=\"wp-image-140\" width=\"840\" height=\"203\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/10-1024x248.jpg 1024w, 
https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/10-300x73.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/10-768x186.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/10-1536x371.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/10.jpg 2048w\" sizes=\"(max-width: 840px) 100vw, 840px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"248\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/11-1024x248.jpg\" alt=\"\" class=\"wp-image-141\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/11-1024x248.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/11-300x73.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/11-768x186.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/11-1536x371.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/11.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Another important consideration when it comes to AutoML is training time. An AutoML tool could build accurate models, but if it takes an entire day to do so it still may not be a great solution. We compared the time it took to train Hive\u2019s text classification tool for the drugs category, and found that our platform was able to train the model 10 times as fast as Private Company A and 32 times as fast as Public Cloud B. And for the smallest dataset size of 100 examples, we trained the model 18 times faster than Private Company A and 268 times faster than Public Cloud B. 
That\u2019s a pretty significant speedup.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/12-1024x234.jpg\" alt=\"\" class=\"wp-image-143\" width=\"840\" height=\"191\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/12-1024x234.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/12-300x69.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/12-768x176.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/12-1536x351.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/12.jpg 2048w\" sizes=\"(max-width: 840px) 100vw, 840px\" \/><\/figure>\n\n\n\n<p>Measuring the performance of fine-tuned LLMs on our foundation model is a bit more complicated. Here we evaluate two different tasks: question answering and closed-domain classification.&nbsp;<\/p>\n\n\n\n<p>To measure performance on the question answering task, we used a metric called token accuracy. Token accuracy indicates how many tokens are the same between the model\u2019s response and the expected response from the test set. A perfect token accuracy is 100%. As shown below, our token accuracy is higher than or about the same as our competitors\u2019 for all dataset sizes.<\/p>\n\n\n\n<p>This is also true for the classification task, where we maintained roughly the same performance as Public Cloud A across the various dataset sizes. 
Below are the full results of our comparison.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"249\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/13-1024x249.jpg\" alt=\"\" class=\"wp-image-145\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/13-1024x249.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/13-300x73.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/13-768x187.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/13-1536x374.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/13.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"249\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/14-1024x249.jpg\" alt=\"\" class=\"wp-image-146\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/14-1024x249.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/14-300x73.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/14-768x187.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/14-1536x374.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/14.jpg 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2>Final Thoughts<\/h2>\n\n\n\n<p>As illustrated throughout this in-depth look into the performance of our models, we truly earn the title \u201cbest-in-class.\u201d We conduct these benchmarks not just to justify that title, but more so as part of our constant effort to make our models the best that they can be. 
Reviewing these analyses helps us to identify our strengths, yes, but also our weaknesses and where we can improve.<\/p>\n\n\n\n<p>If you have any questions about any of the benchmarks we\u2019ve discussed here or any other questions about our models, please don\u2019t hesitate to reach out to us at&nbsp;<a href=\"mailto:sales@thehive.ai\" target=\"_blank\" rel=\"noreferrer noopener\">sales@thehive.ai<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hive provides an in-depth comparison of all of our products against top competitors.<\/p>\n","protected":false},"author":1,"featured_media":1925,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"kia_subtitle":""},"categories":[8,4,2],"tags":[],"_links":{"self":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts\/38"}],"collection":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/comments?post=38"}],"version-history":[{"count":7,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts\/38\/revisions"}],"predecessor-version":[{"id":1488,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts\/38\/revisions\/1488"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/media\/1925"}],"wp:attachment":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/media?parent=38"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/categories?post=38"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/tags?post=38"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}