How Much Do LLMs Hallucinate in Document Q&A? Key Lessons from a 172B-Token Study

If you are building a RAG system, internal knowledge assistant, or document search chatbot, one question matters more than almost anything else:

When the answer is supposed to come from the provided documents, how often does the model still make things up?

That is exactly what the March 9, 2026 paper “How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms” tries to measure.

The short answer is uncomfortable:

Even the best model in the study still fabricated answers 1.19% of the time at 32K context.
Strong models often landed in the 5% to 7% range.
The median model was closer to 25%.
At 200K context, no tested model stayed below 10% fabrication.

The paper is useful not because it shows that hallucinations exist, which we already know, but because it puts numbers on the problem at a much larger scale than most benchmark discussions.

What the Paper Studied

The author evaluated 35 open-weight models across:

3 context lengths: 32K, 128K, and 200K
4 temperatures: 0.0, 0.4, 0.7, and 1.0
3 hardware platforms: NVIDIA H200, AMD MI300X, and Intel Gaudi3

In total, the study used more than 172 billion tokens across more than 4,000 runs.

The focus was not open-ended generation. It was a narrower and more practical setting:

Give the model documents
Ask questions grounded in those documents
Measure whether the answer is correct
Measure whether the model invents facts that are not present

That last part is important. Many evaluations only test whether a model can retrieve or summarize information that exists. This paper also tested whether the model would confidently answer questions about things that do not exist in the documents.

Why This Benchmark Is Interesting

The paper uses a method called RIKER.

Instead of starting with real-world documents and paying humans to annotate the correct answers, the evaluation starts with structured ground truth first. Documents are then generated from that known ground truth. That means the benchmark already knows exactly:

What facts exist
What facts do not exist
Which answers should be refused

This design helps the paper avoid three common benchmark problems:

Benchmark contamination: models may already have seen static benchmark data during training
LLM-as-judge bias: another model is often used to grade the output, which introduces its own errors
Small sample sizes: many evaluations are too small to be statistically convincing
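To see why sample size matters for a fabrication rate, here is a minimal sketch using the Wilson score interval, a standard confidence interval for a proportion. This is an illustration and not something the paper is stated to use; the function name `wilson_interval` and the sample counts are assumptions made up for the example.

```python
import math

def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for an observed proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Same observed 5% fabrication rate, two hypothetical sample sizes.
small = wilson_interval(5, 100)
large = wilson_interval(500, 10_000)
print(f"n=100:    {small[0]:.3f} to {small[1]:.3f}")
print(f"n=10000:  {large[0]:.3f} to {large[1]:.3f}")
```

With 100 questions the plausible range around a 5% observation is several times wider than with 10,000 questions, which is the reason small evaluations struggle to separate one model from another.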

You can debate whether one framework captures every real-world behavior, but the general setup is strong for measuring fabrication in document-grounded QA.
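To make the ground-truth-first idea concrete, here is a minimal sketch. It is not the RIKER implementation: the record fields, the `render_document` and `build_questions` helpers, and the example names are all hypothetical. It only shows why starting from structured facts tells you in advance which questions have answers and which should be refused.

```python
import random

# Hypothetical structured ground truth: the benchmark knows these facts up front.
FACTS = [
    {"employee": "Ana Ruiz", "department": "Finance", "office": "Lisbon"},
    {"employee": "Ben Cho", "department": "Legal", "office": "Seoul"},
]

def render_document(facts):
    """Turn structured facts into prose that the model will be asked about."""
    return " ".join(
        f"{f['employee']} works in {f['department']} at the {f['office']} office."
        for f in facts
    )

def build_questions(facts, absent_names, seed=0):
    """Create answerable questions plus questions about people not in the document."""
    questions = [
        {"question": f"Which department does {f['employee']} work in?",
         "expected": f["department"], "answerable": True}
        for f in facts
    ]
    questions += [
        {"question": f"Which department does {name} work in?",
         "expected": None, "answerable": False}
        for name in absent_names
    ]
    random.Random(seed).shuffle(questions)  # fixed seed keeps the set reproducible
    return questions

document = render_document(FACTS)
questions = build_questions(FACTS, absent_names=["Cara Diaz"])
```

Because the documents are derived from `FACTS`, grading needs no human annotation and no second model: an answerable question is checked against `expected`, and an unanswerable one is checked for a refusal.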

The Most Important Findings

1. Hallucination Does Not Go to Zero

This is the headline result.

Under the best conditions tested, the best model still fabricated answers 1.19% of the time at 32K context. That may sound small, but in production it is not.

If your application handles:

10,000 document-grounded questions per day

Then a 1.19% fabrication rate would still imply roughly:

119 fabricated answers per day

And that is the best-case result from the paper, not the average one.

The more realistic takeaway is that many supposedly strong models still hallucinate often enough that you cannot treat their answers as automatically trustworthy.
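The arithmetic above is easy to rerun for your own traffic. This is a minimal sketch that assumes, as the estimate above does, that the fabrication rate applies uniformly to every question; the function name `expected_fabrications` is hypothetical, and the rates are the ones quoted earlier in this article.

```python
def expected_fabrications(questions_per_day, fabrication_rate):
    """Expected fabricated answers per day, given a rate between 0 and 1."""
    return questions_per_day * fabrication_rate

QUESTIONS_PER_DAY = 10_000

# Rates quoted in this article: best model, a strong model, the median model.
for label, rate in [("best (1.19%)", 0.0119), ("strong (5%)", 0.05), ("median (25%)", 0.25)]:
    per_day = round(expected_fabrications(QUESTIONS_PER_DAY, rate))
    print(f"{label}: about {per_day} fabricated answers per day")
```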

2. Longer Context Windows Make Things Worse

One of the clearest patterns in the paper is that hallucination gets worse as context length increases.

At 32K, a handful of models stayed below 10% fabrication.

At 128K, only 5 of 26 tested models remained below 10% fabrication.

At 200K, none of the tested models did.

This matters because many teams assume that if a model advertises a very large context window, then it can reliably reason over that entire context. The paper argues that this is the wrong mental model.

Advertised context length is not the same as usable context length.

In other words:

A model may technically accept 200K tokens
That does not mean it can answer reliably across 200K tokens
It may degrade badly or fabricate far more often

That is a direct warning for “just stuff more documents into the prompt” style RAG systems.
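One way to turn "usable context length" into a number is to pick a fabrication budget and find the longest tested context that stays within it. The sketch below is an assumption about how you might operationalize the idea, not a definition from the paper. The rates are copied from the comparison table later in this article; `usable_context` and the 10% budget are hypothetical choices.

```python
CONTEXT_ORDER = ["32K", "128K", "200K"]

# Fabrication rates (%) from this article's comparison table. None means not tested.
FABRICATION = {
    "GLM 4.5": {"32K": 1.19, "128K": 3.19, "200K": None},
    "Qwen3 Next 80B-A3B": {"32K": 7.04, "128K": 7.99, "200K": 10.25},
    "GLM 4.6": {"32K": 7.04, "128K": 13.75, "200K": 69.53},
}

def usable_context(rates, max_fabrication_pct):
    """Longest tested context where this and every shorter context stays within budget."""
    usable = None
    for context in CONTEXT_ORDER:
        rate = rates.get(context)
        if rate is None or rate > max_fabrication_pct:
            break  # untested or over budget: do not claim anything beyond this point
        usable = context
    return usable

for model, rates in FABRICATION.items():
    print(model, "->", usable_context(rates, max_fabrication_pct=10.0))
```

Under a 10% budget, both GLM 4.6 and Qwen3 Next accept 200K tokens, but only 32K and 128K respectively count as usable by this rule.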

3. Model Family Matters More Than Raw Size

A useful result from the paper is that bigger is not automatically safer.

Some model families consistently fabricated less than others, and that pattern held better than simple parameter count comparisons.

Here is a compact comparison table based on the paper’s reported best-case numbers:

Model	32K Overall Accuracy	32K Fabrication	128K Overall Accuracy	128K Fabrication	200K Overall Accuracy	200K Fabrication
GLM 4.5	97.40%	1.19%	87.43%	3.19%	Not tested	Not tested
MiniMax M2.1	95.96%	5.06%	85.59%	9.72%	Not tested	Not tested
DeepSeek V3.1	95.49%	6.35%	90.45%	7.36%	Not tested	Not tested
Qwen3 Next 80B-A3B	93.87%	7.04%	87.85%	7.99%	82.68%	10.25%
GLM 4.6	93.26%	7.04%	85.81%	13.75%	37.65%	69.53%
Llama 4 Maverick	86.52%	28.08%	63.90%	38.82%	61.56%	43.29%
Llama 3.1 405B	84.75%	26.51%	58.29%	30.62%	Not tested	Not tested
Llama 3.1 70B	69.76%	49.50%	42.08%	56.67%	Not tested	Not tested

Three patterns stand out in this table:

GLM 4.5 is the standout at 32K
Qwen3 Next 80B-A3B is the most resilient model among those tested all the way to 200K
GLM 4.6 shows how a strong 32K model can collapse badly at very long context lengths

More broadly, the paper reports that:

Some GLM and MiniMax models had relatively low fabrication rates
Several Llama-family models showed much higher fabrication even when they had strong grounding scores

That suggests hallucination resistance is not simply an emergent property of scale. It looks more like a capability that depends heavily on training choices, alignment, and calibration.
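As a quick check of the size claim, the sketch below compares a size ranking with a fabrication ranking. It uses only the three models from the table above whose names state a parameter count, so it is an illustration and far too small to prove a trend. The fabrication figures are the 32K values from the table.

```python
# (model, total parameters in billions as stated in its name, 32K fabrication %)
MODELS = [
    ("Llama 3.1 405B", 405, 26.51),
    ("Qwen3 Next 80B-A3B", 80, 7.04),
    ("Llama 3.1 70B", 70, 49.50),
]

# Largest model first.
by_size = [name for name, _, _ in sorted(MODELS, key=lambda m: m[1], reverse=True)]
# Lowest fabrication first.
by_fabrication = [name for name, _, _ in sorted(MODELS, key=lambda m: m[2])]

print("Largest first:           ", by_size)
print("Lowest fabrication first:", by_fabrication)
print("Same order:", by_size == by_fabrication)
```

If size alone drove fabrication resistance, the two orderings would match. Here the 80B model fabricates far less than the 405B one.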

For practitioners, the implication is simple:

Do not pick a model just because it is bigger
Do not pick a model just because it performs well on general benchmarks
Test the exact document-QA behavior you care about

4. Temperature 0 Is Not Always the Best Choice

A lot of teams default to temperature = 0 for factual tasks because it feels safer and more deterministic.

This paper shows that the rule is not that simple.

According to the results:

T=0.0 gave the best overall accuracy in about 60% of cases
Higher temperatures reduced fabrication for the majority of model-context combinations
T=0.0 also increased coherence failures, including infinite generation loops, especially at long context lengths

One especially striking result in the paper is that some models had dramatically higher loop or truncation rates at T=0.0 than at T=1.0, with extreme cases showing tens of times more failures.

The practical message is not “always use temperature 1.0.” It is:

Do not blindly assume temperature 0 is optimal

For a real system, you may need to tune for a balance between:

accuracy
fabrication rate
response stability
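One way to tune for that balance is to run a temperature sweep on your own evaluation set, set budgets for fabrication and stability, and then take the most accurate setting that stays within both. The sketch below assumes that selection rule; `SweepResult` and `pick_temperature` are hypothetical names, and the numbers in `sweep` are made-up placeholders, not results from the paper.

```python
from dataclasses import dataclass

@dataclass
class SweepResult:
    temperature: float
    accuracy: float          # fraction of answerable questions answered correctly
    fabrication_rate: float  # fraction of unanswerable questions answered anyway
    failure_rate: float      # fraction of responses that looped or were truncated

def pick_temperature(results, max_fabrication, max_failure):
    """Among settings within both budgets, return the most accurate one (or None)."""
    eligible = [
        r for r in results
        if r.fabrication_rate <= max_fabrication and r.failure_rate <= max_failure
    ]
    return max(eligible, key=lambda r: r.accuracy) if eligible else None

# Placeholder measurements for illustration only. Replace with your own sweep.
sweep = [
    SweepResult(0.0, 0.92, 0.09, 0.040),
    SweepResult(0.4, 0.91, 0.07, 0.010),
    SweepResult(0.7, 0.90, 0.06, 0.005),
    SweepResult(1.0, 0.87, 0.05, 0.005),
]

best = pick_temperature(sweep, max_fabrication=0.08, max_failure=0.02)
print(best.temperature if best else "no setting meets the budgets")
```

In this invented example the most accurate setting, T=0.0, is rejected because it exceeds both budgets, which mirrors the kind of tradeoff the paper describes.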

5. Hardware Did Not Meaningfully Change Fidelity

The paper also compared the same models across:

NVIDIA H200
AMD MI300X
Intel Gaudi3

The main conclusion was that hardware platform did not meaningfully change the models’ fidelity behavior.

That is good news for deployment planning. If the same serving stack is used, hardware choice appears to be more about:

cost
throughput
availability

and less about answer quality.

A Very Important Distinction: Grounding vs Fabrication

One of the best ideas in the paper is that grounding ability and fabrication resistance are not the same thing.

A model can be good at finding facts that really exist in the provided documents and still be bad at refusing to answer when the requested fact is not there.

That means a model can look good on retrieval-heavy benchmarks but still be risky in production.

This is a major point for anyone evaluating RAG systems. If your benchmark only asks:

“Can the model find the right answer when the answer exists?”

then you are missing the harder and more dangerous question:

“What does the model do when the answer does not exist in the retrieved context?”

That is where real trustworthiness is tested.
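In an evaluation harness, this distinction means reporting two numbers instead of one. The sketch below assumes simple definitions: grounding accuracy is the share of answerable questions answered correctly, and fabrication rate is the share of unanswerable questions that were not refused. The paper's exact metric definitions may differ, and the record fields and `score` function are hypothetical.

```python
def score(results):
    """Return grounding accuracy and fabrication rate as separate metrics.

    Each result is a dict with:
      answerable: True if the fact exists in the provided documents
      correct:    True if the answer matched ground truth (answerable questions)
      refused:    True if the model declined to answer
    """
    answerable = [r for r in results if r["answerable"]]
    unanswerable = [r for r in results if not r["answerable"]]
    grounding = (
        sum(r["correct"] for r in answerable) / len(answerable) if answerable else None
    )
    fabrication = (
        sum(not r["refused"] for r in unanswerable) / len(unanswerable)
        if unanswerable else None
    )
    return {"grounding_accuracy": grounding, "fabrication_rate": fabrication}

# A model that finds every real fact but answers half of the impossible questions.
results = (
    [{"answerable": True, "correct": True, "refused": False}] * 4
    + [{"answerable": False, "correct": False, "refused": True}] * 2
    + [{"answerable": False, "correct": False, "refused": False}] * 2
)
print(score(results))
```

A single blended accuracy figure would hide this profile: perfect grounding alongside a 50% fabrication rate.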

What This Means for RAG and Enterprise AI

If you deploy document-grounded LLM systems, this paper points to a few practical rules.

Treat Hallucination as a Product Constraint, Not a Rare Bug

The paper’s numbers are too high to dismiss as edge cases. Even top-tier models produce fabricated answers often enough that you need system-level defenses.

That can include:

answer citation requirements
refusal behavior when evidence is weak
retrieval quality checks
confidence thresholds
human review for high-stakes workflows
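As one example of such a defense, a citation requirement can be enforced mechanically: ask the model to return a supporting quote with its answer, then refuse if that quote does not appear in the provided documents. This is a minimal sketch under that assumption; `is_supported`, `guard`, and the refusal message are hypothetical, and a verbatim match only proves the quote exists, not that the answer follows from it.

```python
def _normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())

def is_supported(quote, documents):
    """True if the quoted evidence appears verbatim in at least one document."""
    needle = _normalize(quote)
    if not needle:
        return False  # an empty quote is not evidence
    return any(needle in _normalize(doc) for doc in documents)

def guard(answer, quote, documents,
          refusal="I could not find that in the provided documents."):
    """Pass the answer through only when its cited quote is found in the documents."""
    return answer if is_supported(quote, documents) else refusal

docs = ["The contract renews on   1 March 2027 unless either party gives notice."]
print(guard("It renews on 1 March 2027.", "renews on 1 March 2027", docs))
print(guard("It renews on 2 March 2027.", "renews on 2 March 2027", docs))
```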

Test at Your Real Context Length

If your production system regularly sends 80K, 120K, or 200K tokens, then a 32K benchmark is not enough. The paper shows that performance at shorter context lengths can give false confidence.
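A simple first step is to map your production prompt lengths onto the context lengths you actually benchmark, so you can see which tier your results need to come from. The sketch below assumes tiers of 32,000, 128,000, and 200,000 tokens as round approximations of 32K, 128K, and 200K; the helper names and the sample prompt lengths are hypothetical.

```python
import bisect
from collections import Counter

# Approximate benchmark tiers in tokens, sorted ascending.
BENCHMARK_TIERS = [32_000, 128_000, 200_000]

def tier_for(prompt_tokens):
    """Smallest benchmark tier at least as long as the prompt, or None if none is."""
    i = bisect.bisect_left(BENCHMARK_TIERS, prompt_tokens)
    return BENCHMARK_TIERS[i] if i < len(BENCHMARK_TIERS) else None

def tier_mix(prompt_token_counts):
    """Count how many production prompts fall into each benchmark tier."""
    return Counter(tier_for(n) for n in prompt_token_counts)

# Hypothetical production prompt lengths.
mix = tier_mix([12_000, 80_000, 120_000, 200_000])
for tier, count in sorted(mix.items()):
    print(f"{tier:>7} tokens: {count} prompt(s)")
```

If most of your traffic lands in the 128K or 200K tier, those are the tiers where fabrication needs to be measured.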

Measure Refusal Quality Explicitly

A good evaluation set should include questions where:

the answer is absent
the entity is missing
the relationship is fake

If you do not test those cases, you are mostly measuring retrieval and summarization, not hallucination resistance.
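The three cases above can be written down as explicit negative test items and validated against your ground truth before you run any model. This is a minimal sketch that assumes the ground truth is stored as (subject, relation, object) triples; the company names, relation names, and the `is_truly_unanswerable` helper are all hypothetical.

```python
# Hypothetical ground truth: every fact that the documents actually state.
TRIPLES = {
    ("Acme Corp", "headquartered_in", "Denver"),
    ("Acme Corp", "acquired", "Beta Labs"),
    ("Beta Labs", "headquartered_in", "Austin"),
}

NEGATIVE_CASES = [
    # Answer absent: the entity exists, but this attribute is never stated.
    {"kind": "answer_absent", "subject": "Beta Labs", "relation": "founded_in",
     "question": "What year was Beta Labs founded?"},
    # Entity missing: the documents never mention this company.
    {"kind": "entity_missing", "subject": "Gamma Systems", "relation": "headquartered_in",
     "question": "Where is Gamma Systems headquartered?"},
    # Fake relationship: both entities exist, but the relation is reversed.
    {"kind": "fake_relationship", "subject": "Beta Labs", "relation": "acquired",
     "question": "When did Beta Labs acquire Acme Corp?"},
]

def is_truly_unanswerable(case, triples):
    """A negative case is valid only if no stated fact has its subject and relation."""
    return not any(
        s == case["subject"] and r == case["relation"] for s, r, _ in triples
    )

invalid = [c["kind"] for c in NEGATIVE_CASES if not is_truly_unanswerable(c, TRIPLES)]
print("invalid negative cases:", invalid)
```

Validating negatives this way guards against a subtle evaluation bug: a "missing" fact that is actually present would penalize a model for answering correctly.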

Stop Using “Bigger Model” as a Shortcut for Safety

The paper makes it clear that some smaller or mid-sized models can be better calibrated than much larger ones for document-grounded QA.

Limitations to Keep in Mind

This is a useful paper, but it is still one study and it has clear limits:

It evaluates open-weight models, not proprietary systems like GPT, Claude, or Gemini
It focuses on English
It measures one framework, RIKER
It is specifically about document Q&A, not every type of LLM task

So the exact rankings should not be treated as universal truth. But the broader patterns are hard to ignore:

hallucination floors are real
long context makes reliability harder
temperature tuning is more nuanced than people assume
retrieval success does not guarantee refusal quality

Final Takeaway

This paper answers an important practical question with unusually concrete numbers:

LLMs hallucinate in document Q&A more often than most teams would be comfortable admitting, and the problem gets worse as context grows.

If you are building RAG or enterprise knowledge systems, the lesson is not to abandon LLMs. The lesson is to stop evaluating them with shallow metrics.

You need to test:

whether the model finds the right answer
whether it refuses when the answer is missing
how it behaves at your actual production context length
whether decoding choices make reliability better or worse

That is a much higher bar than “it looked good in a demo,” but this paper makes a strong case that the higher bar is necessary.

Source

Paper: How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
