Why Your RAG Pipeline Still Hallucinates (And How to Fix It)
TL;DR
RAG hallucinations come from five sources: bad chunking, missing re-ranking, no query decomposition, absent citation verification, and zero evaluation. Fix all five to go from 60% to 85%+ accuracy.
The Pattern I See in Every RAG Audit
After auditing 15+ RAG systems in production, the pattern is consistent: teams achieve 60-65% accuracy, declare it "good enough," and wonder why users don't trust the system.
The root cause is almost never the LLM. It's the retrieval pipeline.
Here are the five failure modes I see repeatedly, ordered by impact.
1. Naive Chunking Destroys Context
The problem: Most teams chunk documents by fixed character count (512 or 1024 tokens). This splits sentences mid-thought, separates headers from their content, and fragments tables.
The fix: Use semantic chunking that respects document structure:
- Split on paragraph and section boundaries
- Keep headers attached to their body text
- Preserve tables and code blocks as atomic units
- Add overlap only at semantic boundaries, not character offsets
This alone typically improves retrieval precision by 10-15%.
2. Vector Search Alone Isn't Enough
The problem: Dense embeddings are great for semantic similarity but terrible at exact matching. Ask "What's our policy on SOC 2 compliance?" and you'll get results about "security standards," semantically related but not the specific document you need.
The fix: Hybrid retrieval combining:
- Dense search (embeddings) for semantic understanding
- Sparse search (BM25) for exact keyword matching
- Metadata filtering for scoping (date ranges, document types, teams)
Fuse results with Reciprocal Rank Fusion (RRF). It's simple and works well.
3. No Re-Ranking = Low Precision
The problem: Top-k retrieval returns 10-20 candidates, but only 2-3 are actually relevant. The LLM sees all of them and treats irrelevant context as authoritative.
The fix: Add a cross-encoder re-ranker between retrieval and generation. Cross-encoders evaluate query-document pairs jointly, producing much better relevance scores than bi-encoder similarity.
This is typically the single highest-impact improvement, delivering a 15-20 point accuracy lift.
4. Complex Questions Need Decomposition
The problem: "Compare our Q3 and Q4 compliance requirements" is one question that requires two retrievals. Single-query retrieval returns a muddle of Q3 and Q4 content.
The fix: Decompose complex queries into sub-queries:
- Detect multi-part or comparative questions
- Generate focused sub-queries
- Retrieve independently for each sub-query
- Synthesize answers with proper attribution
This handles the long tail of complex queries that naive RAG consistently fails.
5. No Citation Verification = Silent Hallucinations
The problem: The LLM generates a confident answer with [Source: Doc A], but Doc A doesn't actually say what the answer claims. This is the most dangerous failure mode because it looks correct but isn't.
The fix: Post-generation verification:
- Extract claims from the generated answer
- For each claim, verify it's supported by the cited source
- Flag unsupported claims and either regenerate or escalate
- Track citation accuracy as a production metric
Putting It All Together
Each fix is additive. In a recent audit, implementing all five moved accuracy from 62% to 89%:
| Fix | Accuracy Improvement | |-----|---------------------| | Semantic chunking | 62% → 72% | | Hybrid retrieval | 72% → 76% | | Cross-encoder re-ranking | 76% → 84% | | Query decomposition | 84% → 87% | | Citation verification | 87% → 89% |
The total effort was 3 weeks of engineering work. The ROI was immediate: user trust scores jumped from 2.8/5 to 4.3/5 within a month.
Amin Mirlohi, PhD
AI Agent Systems Architect