amin.mirlohi_
2025-01-20 · 8 min read

Why Your RAG Pipeline Still Hallucinates (And How to Fix It)

RAG · Retrieval · Evaluation · Production

TL;DR

RAG hallucinations come from five sources: bad chunking, missing re-ranking, no query decomposition, absent citation verification, and zero evaluation. Fix all five to go from 60% to 85%+ accuracy.

The Pattern I See in Every RAG Audit

After auditing 15+ RAG systems in production, the pattern is consistent: teams achieve 60-65% accuracy, declare it "good enough," and wonder why users don't trust the system.

The root cause is almost never the LLM. It's the retrieval pipeline.

Here are the five failure modes I see repeatedly, ordered by impact.

1. Naive Chunking Destroys Context

The problem: Most teams chunk documents by fixed character count (512 or 1024 tokens). This splits sentences mid-thought, separates headers from their content, and fragments tables.

The fix: Use semantic chunking that respects document structure:

  • Split on paragraph and section boundaries
  • Keep headers attached to their body text
  • Preserve tables and code blocks as atomic units
  • Add overlap only at semantic boundaries, not character offsets

This alone typically improves retrieval precision by 10-15%.
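A minimal sketch of structure-aware chunking, assuming markdown-style `#` headers; the function name `semantic_chunks` and the `max_chars` cap are illustrative, and a production version would also special-case tables and code fences:

```python
def semantic_chunks(text, max_chars=1500):
    """Split on paragraph boundaries; keep a header with the body below it.

    max_chars is a soft cap checked only at paragraph boundaries,
    so no paragraph is ever split mid-thought.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        is_header = para.startswith("#")
        # Flush the current chunk when the cap is hit or a new section
        # begins, so a header is never left dangling at a chunk's tail.
        if current and (is_header or len(current) + len(para) > max_chars):
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "# Policy\n\nBody text.\n\n# Security\n\nMore body."
chunks = semantic_chunks(doc)
```

Each chunk here starts at a section boundary with its header attached, which is exactly what fixed-size splitting destroys.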

2. Vector Search Alone Isn't Enough

The problem: Dense embeddings are great for semantic similarity but terrible at exact matching. Ask "What's our policy on SOC 2 compliance?" and you'll get results about "security standards," semantically related but not the specific document you need.

The fix: Hybrid retrieval combining:

  • Dense search (embeddings) for semantic understanding
  • Sparse search (BM25) for exact keyword matching
  • Metadata filtering for scoping (date ranges, document types, teams)

Fuse results with Reciprocal Rank Fusion (RRF). It's simple and works well.
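RRF itself is a few lines: each document's fused score is the sum of `1 / (k + rank)` over every ranked list it appears in, with `k = 60` as the conventional constant. A self-contained sketch (doc IDs here are placeholders):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse multiple ranked lists of doc IDs with Reciprocal Rank Fusion.

    A document appearing near the top of several lists outscores one
    that ranks first in only a single list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # embedding search results
sparse = ["doc_b", "doc_d", "doc_a"]  # BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

Note that `doc_b` wins the fused ranking: ranking high in both lists beats ranking first in one.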

3. No Re-Ranking = Low Precision

The problem: Top-k retrieval returns 10-20 candidates, but only 2-3 are actually relevant. The LLM sees all of them and treats irrelevant context as authoritative.

The fix: Add a cross-encoder re-ranker between retrieval and generation. Cross-encoders evaluate query-document pairs jointly, producing much better relevance scores than bi-encoder similarity.

This is typically the single highest-impact improvement, delivering a 15-20 point accuracy lift.
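The re-ranking stage is just a scored sort over query-document pairs. A minimal sketch with a pluggable scorer; in production `score_fn` would be a trained cross-encoder model, and the token-overlap scorer below is only a toy stand-in so the example is self-contained:

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-rank retrieval candidates with a cross-encoder style scorer.

    score_fn(query, doc) -> float scores each pair jointly; only the
    top_n survivors are passed on to the LLM as context.
    """
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Toy scorer: fraction of query tokens present in the document.
# A real pipeline swaps in a learned cross-encoder here.
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

top = rerank(
    "soc 2 compliance policy",
    ["pricing page", "our soc 2 compliance policy doc", "security overview"],
    overlap_score,
    top_n=2,
)
```

Cutting the context from 10-20 candidates down to the top 2-3 is what stops the LLM from treating irrelevant passages as authoritative.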

4. Complex Questions Need Decomposition

The problem: "Compare our Q3 and Q4 compliance requirements" is one question that requires two retrievals. Single-query retrieval returns a muddle of Q3 and Q4 content.

The fix: Decompose complex queries into sub-queries:

  1. Detect multi-part or comparative questions
  2. Generate focused sub-queries
  3. Retrieve independently for each sub-query
  4. Synthesize answers with proper attribution

This handles the long tail of complex queries that naive single-query RAG consistently fails on.
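The detection and sub-query steps can be sketched with a simple heuristic; production systems usually ask an LLM to do the decomposition, so the regex below is only an illustrative stand-in for comparative questions of the "compare X and Y" shape:

```python
import re

# Heuristic detector for multi-part / comparative questions.
COMPARATIVE = re.compile(r"\b(compare|versus|vs\.?|difference between)\b", re.I)

def decompose(query):
    """Split a comparative question into one focused sub-query per entity.

    Returns the query unchanged when no comparative pattern is found;
    each sub-query is then retrieved for independently.
    """
    if not COMPARATIVE.search(query):
        return [query]
    match = re.search(
        r"(?:compare|difference between)\s+(.*?)\s+(\w+)\s+and\s+(\w+)\s+(.*)",
        query, re.I,
    )
    if not match:
        return [query]  # detected but unparseable: fall back to single query
    prefix, a, b, suffix = match.groups()
    return [f"{prefix} {a} {suffix}".strip(), f"{prefix} {b} {suffix}".strip()]

subs = decompose("Compare our Q3 and Q4 compliance requirements")
```

Each sub-query now targets one quarter, so retrieval returns clean Q3 and Q4 contexts instead of a muddle of both.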

5. No Citation Verification = Silent Hallucinations

The problem: The LLM generates a confident answer with [Source: Doc A], but Doc A doesn't actually say what the answer claims. This is the most dangerous failure mode because it looks correct but isn't.

The fix: Post-generation verification:

  • Extract claims from the generated answer
  • For each claim, verify it's supported by the cited source
  • Flag unsupported claims and either regenerate or escalate
  • Track citation accuracy as a production metric
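The verification loop can be sketched as follows. The token-overlap support score is a deliberately crude stand-in so the example runs standalone; a production verifier would use an NLI/entailment model for the claim-vs-source check, and the 0.5 threshold is an illustrative assumption:

```python
def verify_citations(answer_claims, sources, threshold=0.5):
    """Flag claims whose cited source does not appear to support them.

    answer_claims: list of (claim_text, cited_source_id) pairs
    sources: dict mapping source_id -> retrieved source text
    Returns (claim, source_id, support_score) for each unsupported claim.
    """
    flagged = []
    for claim, source_id in answer_claims:
        source_text = sources.get(source_id, "")
        claim_tokens = set(claim.lower().split())
        overlap = claim_tokens & set(source_text.lower().split())
        # Crude lexical support score; swap in an entailment model here.
        support = len(overlap) / max(len(claim_tokens), 1)
        if support < threshold:
            flagged.append((claim, source_id, round(support, 2)))
    return flagged

sources = {"doc_a": "refunds are processed within 30 days"}
claims = [
    ("refunds are processed within 30 days", "doc_a"),  # supported
    ("all refunds happen instantly", "doc_a"),          # not supported
]
flagged = verify_citations(claims, sources)
```

Anything in `flagged` triggers regeneration or human escalation, and the flag rate itself is the citation-accuracy metric worth tracking in production.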

Putting It All Together

Each fix is additive. In a recent audit, implementing all five moved accuracy from 62% to 89%:

| Fix | Accuracy Improvement |
|-----|---------------------|
| Semantic chunking | 62% → 72% |
| Hybrid retrieval | 72% → 76% |
| Cross-encoder re-ranking | 76% → 84% |
| Query decomposition | 84% → 87% |
| Citation verification | 87% → 89% |

The total effort was 3 weeks of engineering work. The ROI was immediate: user trust scores jumped from 2.8/5 to 4.3/5 within a month.

Amin Mirlohi, PhD

AI Agent Systems Architect
