Humata Outperforms GPT-4, Claude 3, Mistral & More in RepLiQA Document Benchmark

15 Sep 2025 • 8 mins read


Humata AI

Research Team


Humata Leads the Pack in RepLiQA Benchmark

At Humata, our mission goes beyond answering questions on your data. We focus on understanding complex, unseen content with unmatched accuracy. In the latest RepLiQA benchmark, the industry’s toughest test of reading comprehension on new documents, Humata achieved a recall score of 0.7429, ranking first among 20 leading models, including OpenAI’s GPT-4o, Mistral Large, Claude 3 Sonnet, and more.

Why does this matter? In real business scenarios such as mission-critical compliance reviews, legal contracts, and research reports, you don’t have the luxury of pre-training a model on your data. You need a system that can read brand-new documents immediately and extract the right information precisely, with granular citations so that you can check every answer on the spot. That’s where Humata stands apart: higher accuracy, fewer missed insights, and answers you can verify with confidence.

Leading the Benchmark

Direct comparison of all 20 models evaluated in the RepLiQA benchmark study. Results show Humata's competitive advantage in reading comprehension and information extraction from unseen documents compared to leading industry models.

| Rank | Model | Recall |
|------|-------|--------|
| #1 | Humata | 0.7429 |
| #2 | Mistral Large | 0.7229 |
| #3 | Claude 3 Sonnet | 0.6654 |
| #4 | Claude 3 Haiku | 0.6580 |
| #5 | WizardLM 2 7B | 0.6576 |
| #6 | Mixtral 8x22B | 0.6544 |
| #7 | Mistral Small | 0.6442 |
| #8 | Mixtral 8x7B | 0.6365 |
| #9 | WizardLM 2 8x22B | 0.6359 |
| #10 | Snowflake Arctic | 0.6231 |
| #11 | GPT-4o | 0.6085 |
| #12 | Gemini Flash 1.5 | 0.6043 |
| #13 | Mistral 7B | 0.6006 |
| #14 | GPT-3.5 Turbo | 0.5898 |
| #15 | Gemini Pro | 0.5834 |
| #16 | Llama 3 70B | 0.5639 |
| #17 | Llama 3 8B | 0.5482 |
| #18 | Command R | 0.5016 |
| #19 | Command R Plus | 0.4640 |

* Recall scores from RepLiQA benchmark study (2024). All models evaluated under identical conditions. Higher scores indicate better performance at extracting relevant information from provided documents.
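For readers unfamiliar with the metric, here is a minimal sketch of how a recall score of this kind could be computed. It assumes token-level recall, that is, the share of reference-answer tokens that also appear in the model's answer. The tokenizer and the function names (`tokenize`, `token_recall`, `mean_recall`) are illustrative assumptions, not the benchmark's actual scoring code.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the prediction."""
    ref_counts = Counter(tokenize(reference))
    if not ref_counts:
        return 0.0
    pred_counts = Counter(tokenize(prediction))
    # Multiset intersection: each reference token is matched at most once
    # per occurrence in the prediction.
    overlap = sum((ref_counts & pred_counts).values())
    return overlap / sum(ref_counts.values())


def mean_recall(pairs: list[tuple[str, str]]) -> float:
    """Average token recall over (prediction, reference) pairs."""
    scores = [token_recall(pred, ref) for pred, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    examples = [
        ("The contract ends on 5 May 2024.", "5 May 2024"),
        ("It ends in May.", "5 May 2024"),
    ]
    print(f"mean recall: {mean_recall(examples):.4f}")
```

Under this definition, a longer answer is not penalized for extra words. A score only drops when part of the reference answer is missing, which is why recall suits use cases where missed information is the main risk.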

About RepLiQA Dataset

RepLiQA is a groundbreaking question-answering dataset specifically designed to benchmark large language models on truly unseen reference content, addressing critical issues in current AI evaluation methods.

17,955 Questions

3,591 Reference Documents

17 Document Categories

Why RepLiQA Matters

RepLiQA represents a breakthrough in AI evaluation, addressing critical limitations in current benchmarking methods and providing a more reliable measure of true reading comprehension capabilities.

🛡️

Eliminates Data Contamination

Uses entirely novel, human-created content that was never part of any model's training data, ensuring accurate evaluation of true comprehension abilities rather than memorization.

🎯

Tests Real Reading Skills

Focuses on genuine reading comprehension by requiring models to extract information from provided contexts, closely mimicking real-world RAG scenarios.

🔍

Selective Question Answering

Includes unanswerable questions (20%) to test models' ability to recognize when information is insufficient, a crucial skill for reliable AI systems.

📊

Reveals True Performance

Exposes surprising performance patterns where smaller models sometimes outperform larger ones, providing insights into model capabilities beyond parameter count.

🌍

Comprehensive Coverage

Spans 17 diverse document categories from cybersecurity to regional folklore, ensuring robust evaluation across various domain-specific content types.

🚀

Enterprise-Ready Evaluation

Perfectly suited for evaluating AI systems in enterprise environments where models must handle proprietary, previously unseen documents with high accuracy.
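The selective question answering point above can be made concrete with a small scoring sketch. It assumes each example carries a gold answer and a predicted answer, and that unanswerable questions are marked with a sentinel string. The `UNANSWERABLE` marker, the `Example` class, and `abstention_report` are hypothetical names chosen for illustration and do not come from the RepLiQA dataset or from Humata.

```python
from dataclasses import dataclass

# Hypothetical sentinel for "the document does not contain the answer".
UNANSWERABLE = "UNANSWERABLE"


@dataclass
class Example:
    gold: str       # reference answer, or UNANSWERABLE
    predicted: str  # model answer, or UNANSWERABLE when the model abstains


def abstention_report(examples: list[Example]) -> dict[str, float]:
    """Summarize how well a model recognizes unanswerable questions."""
    unanswerable = [e for e in examples if e.gold == UNANSWERABLE]
    answerable = [e for e in examples if e.gold != UNANSWERABLE]

    # Correct abstentions: the model declined on a truly unanswerable question.
    correct_abstain = sum(e.predicted == UNANSWERABLE for e in unanswerable)
    # False abstentions: the model declined although an answer existed.
    false_abstain = sum(e.predicted == UNANSWERABLE for e in answerable)

    return {
        "unanswerable_share": len(unanswerable) / len(examples) if examples else 0.0,
        "abstention_recall": correct_abstain / len(unanswerable) if unanswerable else 0.0,
        "false_abstention_rate": false_abstain / len(answerable) if answerable else 0.0,
    }


if __name__ == "__main__":
    sample = [
        Example(gold=UNANSWERABLE, predicted=UNANSWERABLE),
        Example(gold="5 May 2024", predicted="5 May 2024"),
        Example(gold="Net 30", predicted=UNANSWERABLE),
        Example(gold="Delaware", predicted="Delaware"),
        Example(gold="12 months", predicted="12 months"),
    ]
    for name, value in abstention_report(sample).items():
        print(f"{name}: {value:.2f}")
```

Reporting the two rates separately is a deliberate choice in this sketch: a model that abstains on everything would score perfectly on unanswerable questions while being useless on the rest.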

Real Business Impact

Legal teams using Humata almost never miss critical clauses in contracts. Researchers can trust that literature reviews and patent searches include all relevant findings. Financial analysts gain cleaner, more reliable data extraction, cutting down on reconciliation work.

Efficiency and Cost Savings

Humata is not only more accurate but also more efficient. With lower compute required per document, it’s faster and more cost-effective than many large models with weaker recall.

See It for Yourself

Don’t settle for models that look good on paper but miss the mark on your actual documents. Try Humata today with your own files and experience the difference in accuracy, speed, and confidence.

Schedule a 30-Minute Demo
