amin.mirlohi_
2025-01-20 · 8 min read

Why Your RAG Pipeline Still Hallucinates (And How to Fix It)

RAG · Retrieval · Evaluation · Production

TL;DR

RAG hallucinations come from five sources: bad chunking, missing re-ranking, no query decomposition, absent citation verification, and zero evaluation. Fix all five to go from 60% to 85%+ accuracy.

The Pattern I See in Every RAG Audit

After auditing 15+ RAG systems in production, the pattern is consistent: teams achieve 60-65% accuracy, declare it "good enough," and wonder why users don't trust the system.

The root cause is almost never the LLM. It's the retrieval pipeline.

Here are the five failure modes I see repeatedly, ordered by impact.

1. Naive Chunking Destroys Context

The problem: Most teams chunk documents by fixed character count (512 or 1024 tokens). This splits sentences mid-thought, separates headers from their content, and fragments tables.

The fix: Use semantic chunking that respects document structure:

  • Split on paragraph and section boundaries
  • Keep headers attached to their body text
  • Preserve tables and code blocks as atomic units
  • Add overlap only at semantic boundaries, not character offsets

This alone typically improves retrieval precision by 10-15%.
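A minimal sketch of structure-aware chunking, assuming markdown-style `#` headers; the function name `semantic_chunks` and the `max_chars` cap are illustrative, and a production version would also special-case tables and code fences:

```python
def semantic_chunks(text, max_chars=1500):
    """Split on paragraph boundaries; keep a header with the body below it.

    max_chars is a soft cap checked only at paragraph boundaries,
    so no paragraph is ever split mid-thought.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        is_header = para.startswith("#")
        # Flush the current chunk when the cap is hit or a new section
        # begins, so a header is never left dangling at a chunk's tail.
        if current and (is_header or len(current) + len(para) > max_chars):
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "# Policy\n\nBody text.\n\n# Security\n\nMore body."
chunks = semantic_chunks(doc)
```

Each chunk here starts at a section boundary with its header attached, which is exactly what fixed-size splitting destroys.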

2. Vector Search Alone Isn't Enough

The problem: Dense embeddings are great for semantic similarity but terrible at exact matching. Ask "What's our policy on SOC 2 compliance?" and you'll get results about "security standards," semantically related but not the specific document you need.

The fix: Hybrid retrieval combining:

  • Dense search (embeddings) for semantic understanding
  • Sparse search (BM25) for exact keyword matching
  • Metadata filtering for scoping (date ranges, document types, teams)

Fuse results with Reciprocal Rank Fusion (RRF). It's simple and works well.
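RRF itself is a few lines: each document's fused score is the sum of `1 / (k + rank)` over every ranked list it appears in, with `k = 60` as the conventional constant. A self-contained sketch (doc IDs here are placeholders):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse multiple ranked lists of doc IDs with Reciprocal Rank Fusion.

    A document appearing near the top of several lists outscores one
    that ranks first in only a single list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # embedding search results
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

Note that `doc_b` wins the fused ranking: ranking high in both lists beats ranking first in one.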

3. No Re-Ranking = Low Precision

The problem: Top-k retrieval returns 10-20 candidates, but only 2-3 are actually relevant. The LLM sees all of them and treats irrelevant context as authoritative.

The fix: Add a cross-encoder re-ranker between retrieval and generation. Cross-encoders evaluate query-document pairs jointly, producing much better relevance scores than bi-encoder similarity.

This is typically the single highest-impact improvement, delivering a 15-20 point accuracy lift.
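The re-ranking stage is just a scored sort over query-document pairs. A minimal sketch with a pluggable scorer; in production `score_fn` would be a trained cross-encoder model, and the token-overlap scorer below is only a toy stand-in so the example is self-contained:

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-rank retrieval candidates with a cross-encoder style scorer.

    score_fn(query, doc) -> float scores each pair jointly; only the
    top_n survivors are passed on to the LLM as context.
    """
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Toy scorer: fraction of query tokens present in the document.
# A real pipeline swaps in a learned cross-encoder here.
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

top = rerank(
    "soc 2 compliance policy",
    ["pricing page", "our soc 2 compliance policy doc", "security overview"],
    overlap_score,
    top_n=2,
)
```

Cutting the context from 10-20 candidates down to the top 2-3 is what stops the LLM from treating irrelevant passages as authoritative.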

4. Complex Questions Need Decomposition

The problem: "Compare our Q3 and Q4 compliance requirements" is one question that requires two retrievals. Single-query retrieval returns a muddle of Q3 and Q4 content.

The fix: Decompose complex queries into sub-queries:

  1. Detect multi-part or comparative questions
  2. Generate focused sub-queries
  3. Retrieve independently for each sub-query
  4. Synthesize answers with proper attribution

This handles the long tail of complex queries that naive single-query RAG consistently fails on.
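The detection and sub-query steps can be sketched with a simple heuristic; production systems usually ask an LLM to do the decomposition, so the regex below is only an illustrative stand-in for comparative questions of the "compare X and Y" shape:

```python
import re

# Heuristic detector for multi-part / comparative questions.
COMPARATIVE = re.compile(r"\b(compare|versus|vs\.?|difference between)\b", re.I)

def decompose(query):
    """Split a comparative question into one focused sub-query per entity.

    Returns the query unchanged when no comparative pattern is found;
    each sub-query is then retrieved for independently.
    """
    if not COMPARATIVE.search(query):
        return [query]
    match = re.search(
        r"(?:compare|difference between)\s+(.*?)\s+(\w+)\s+and\s+(\w+)\s+(.*)",
        query, re.I,
    )
    if not match:
        return [query]  # detected but unparseable: fall back to single query
    prefix, a, b, suffix = match.groups()
    return [f"{prefix} {a} {suffix}".strip(), f"{prefix} {b} {suffix}".strip()]

subs = decompose("Compare our Q3 and Q4 compliance requirements")
```

Each sub-query now targets one quarter, so retrieval returns clean Q3 and Q4 contexts instead of a muddle of both.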

5. No Citation Verification = Silent Hallucinations

The problem: The LLM generates a confident answer with [Source: Doc A], but Doc A doesn't actually say what the answer claims. This is the most dangerous failure mode because it looks correct but isn't.

The fix: Post-generation verification:

  • Extract claims from the generated answer
  • For each claim, verify it's supported by the cited source
  • Flag unsupported claims and either regenerate or escalate
  • Track citation accuracy as a production metric
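The verification loop can be sketched as follows. The token-overlap support score is a deliberately crude stand-in so the example runs standalone; a production verifier would use an NLI/entailment model for the claim-vs-source check, and the 0.5 threshold is an illustrative assumption:

```python
def verify_citations(answer_claims, sources, threshold=0.5):
    """Flag claims whose cited source does not appear to support them.

    answer_claims: list of (claim_text, cited_source_id) pairs
    sources: dict mapping source_id -> retrieved source text
    Returns (claim, source_id, support_score) for each unsupported claim.
    """
    flagged = []
    for claim, source_id in answer_claims:
        source_text = sources.get(source_id, "")
        claim_tokens = set(claim.lower().split())
        overlap = claim_tokens & set(source_text.lower().split())
        # Crude lexical support score; swap in an entailment model here.
        support = len(overlap) / max(len(claim_tokens), 1)
        if support < threshold:
            flagged.append((claim, source_id, round(support, 2)))
    return flagged

sources = {"doc_a": "refunds are processed within 30 days"}
claims = [
    ("refunds are processed within 30 days", "doc_a"),  # supported
    ("all refunds happen instantly", "doc_a"),          # not supported
]
flagged = verify_citations(claims, sources)
```

Anything in `flagged` triggers regeneration or human escalation, and the flag rate itself is the citation-accuracy metric worth tracking in production.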

Putting It All Together

Each fix is additive. In a recent audit, implementing all five moved accuracy from 62% to 89%:

| Fix | Accuracy Improvement |
|-----|---------------------|
| Semantic chunking | 62% → 72% |
| Hybrid retrieval | 72% → 76% |
| Cross-encoder re-ranking | 76% → 84% |
| Query decomposition | 84% → 87% |
| Citation verification | 87% → 89% |

The total effort was 3 weeks of engineering work. The ROI was immediate: user trust scores jumped from 2.8/5 to 4.3/5 within a month.

Amin Mirlohi, PhD

AI Agent Systems Architect
