
How to Add Verification Layers to RAG Output

A verification layer checks whether the generated answer is actually supported by the retrieved context before returning it to the user. Without verification, RAG systems hallucinate by generating confident answers that go beyond what the retrieved chunks say, or by presenting outdated information as current fact. Adding citation checking, confidence scoring, and a refusal path when evidence is insufficient reduces hallucination rates by 40 to 60% in production systems.

Why Generation Without Verification Fails

LLMs are trained to generate fluent, helpful responses. When given retrieved context that partially answers a question, the model fills in the gaps from its training data rather than admitting the context is insufficient. This gap-filling is invisible to the user because the generated text reads naturally. A question about "current API rate limits" might be answered with a mix of retrieved rate limit documentation (from six months ago) and the model's training data (which may reflect different limits), producing an answer that is partially right but specifically wrong.

Verification layers catch this by comparing the generated answer against the retrieved context and flagging claims that cannot be traced back to a specific source passage. This does not eliminate errors entirely, but it shifts the failure mode from "wrong answer presented confidently" to "uncertain answer flagged for review," which is dramatically safer for production applications.

Step-by-Step Implementation

Step 1: Add citation requirements to generation.
The simplest verification technique is requiring citations during generation rather than checking after. Modify your generation prompt to instruct the LLM to cite the specific retrieved chunk that supports each claim. Use numbered references that map to the retrieved chunks. This forces the model to ground its answer in the context rather than filling gaps from training data.
GROUNDED_GENERATION_PROMPT = """Answer the question using ONLY the provided context.
For each claim, cite the source using [1], [2], etc.
If the context does not contain enough information to answer fully, say "Based on available information..." and note what is missing.
Do NOT include information from your training data. Only use what appears in the provided sources.

Sources:
{numbered_sources}

Question: {query}"""


def format_sources(chunks):
    numbered = ""
    for i, chunk in enumerate(chunks, 1):
        numbered += f"[{i}] {chunk.text}\n\n"
    return numbered
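Step 5 later calls a generate_grounded helper that this guide does not otherwise define. Here is a minimal sketch of what it might look like, assuming the Anthropic Python SDK; the client set up here is reused by the later snippets, and the model name and token limit are illustrative choices:

from anthropic import Anthropic

client = Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment


def generate_grounded(query, chunks):
    # Fill the grounded prompt with the numbered sources and return the cited answer.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1500,
        messages=[{"role": "user", "content": GROUNDED_GENERATION_PROMPT
                   .replace("{numbered_sources}", format_sources(chunks))
                   .replace("{query}", query)}],
    )
    return response.content[0].text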

This approach catches roughly 30% of hallucinations because the model must point to a source for each claim. Claims that it cannot cite often get omitted or explicitly marked as uncertain. The limitation is that the model can still hallucinate citations, pointing to a source that does not actually support the claim.
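One cheap structural guard against hallucinated citations is to check that every bracketed reference in the answer maps to a chunk that was actually retrieved. The sketch below (the function name and regex are illustrative) only catches dangling references; confirming that a cited chunk really supports its claim is what Step 2 handles.

import re


def validate_citation_indices(answer, chunks):
    # Collect every [n] reference in the answer and flag any that point to
    # a source index that was never retrieved.
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    valid = set(range(1, len(chunks) + 1))
    dangling = sorted(cited - valid)
    return {"cited": sorted(cited), "dangling": dangling, "ok": not dangling}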

Step 2: Implement post-generation verification.
Add a second LLM call that takes the generated answer and the original retrieved chunks, then verifies that each claim in the answer is supported by the context. This is more robust than in-generation citation because the verification model can focus entirely on fact-checking rather than balancing generation quality against grounding.
import json  # the Anthropic client instantiated earlier is reused below

VERIFY_PROMPT = """Check each claim in the answer against the sources. For each claim, determine:
- SUPPORTED: the claim is directly stated in or clearly implied by a source
- UNSUPPORTED: the claim cannot be verified from the sources
- CONTRADICTED: the claim conflicts with information in the sources

Answer: {answer}

Sources: {sources}

Return a JSON array of objects with "claim", "status", and "source_ref"."""


def verify_answer(answer, chunks):
    # Fact-check the generated answer against the chunks it was built from.
    sources = format_sources(chunks)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        messages=[{"role": "user", "content": VERIFY_PROMPT
                   .replace("{answer}", answer)
                   .replace("{sources}", sources)}],
    )
    claims = json.loads(response.content[0].text)
    unsupported = [c for c in claims if c["status"] != "SUPPORTED"]
    support_ratio = (1 - len(unsupported) / len(claims)) if claims else 0
    return {
        "claims": claims,
        "support_ratio": support_ratio,
        "unsupported_claims": unsupported,
    }
Step 3: Add confidence scoring to retrieved chunks.
Before generation, score each retrieved chunk by how well it answers the specific question, not just how similar it is. A cross-encoder reranker provides this naturally, but you can also use an LLM evaluator. The confidence score feeds into the generation prompt so the model knows which sources are most trustworthy, and into the verification layer as a threshold for accepting the answer.
RELEVANCE_PROMPT = """Rate how well this passage answers the question.
Score from 0.0 (completely irrelevant) to 1.0 (directly answers it).

Question: {query}

Passage: {chunk}

Return just the numeric score."""


def score_chunk_relevance(query, chunk):
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user", "content": RELEVANCE_PROMPT
                   .replace("{query}", query)
                   .replace("{chunk}", chunk.text)}],
    )
    return float(response.content[0].text.strip())
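One way to feed these scores into the rest of the pipeline is to drop chunks below a floor and annotate the surviving sources with their score, so the generation prompt can weigh them. A rough sketch; the 0.3 floor, function names, and labeling format are illustrative choices, not recommendations from this guide:

def filter_and_label_chunks(query, chunks, floor=0.3):
    # Score each chunk, discard low scorers, and keep the scores around so they
    # can appear in the numbered sources and inform the acceptance threshold.
    scored = [(chunk, score_chunk_relevance(query, chunk)) for chunk in chunks]
    kept = [(chunk, score) for chunk, score in scored if score >= floor]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


def format_sources_with_confidence(scored_chunks):
    # Variant of format_sources that labels each source with its relevance score.
    numbered = ""
    for i, (chunk, score) in enumerate(scored_chunks, 1):
        numbered += f"[{i}] (relevance {score:.2f}) {chunk.text}\n\n"
    return numbered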
Step 4: Implement contradiction detection.
When retrieved chunks contain conflicting information (an older document says the timeout is 30 seconds, a newer one says 60 seconds), the LLM may use either value or blend them. Check for contradictions across retrieved chunks before generation so conflicting information can be flagged and the most current version preferred.
CONTRADICTION_PROMPT = """Do these two passages contain any contradictory information?
If yes, describe the contradiction and which passage appears more current or authoritative.

Passage A: {chunk_a}

Passage B: {chunk_b}

Return JSON: {"contradicts": true/false, "description": "...", "preferred": "A" or "B"}"""


def check_contradictions(chunks):
    # Compare every pair of retrieved chunks (O(n^2) model calls) and record conflicts.
    contradictions = []
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            response = client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=200,
                messages=[{"role": "user", "content": CONTRADICTION_PROMPT
                           .replace("{chunk_a}", chunks[i].text)
                           .replace("{chunk_b}", chunks[j].text)}],
            )
            result = json.loads(response.content[0].text)
            if result["contradicts"]:
                contradictions.append({
                    "chunk_a": i,
                    "chunk_b": j,
                    "description": result["description"],
                    "preferred": result["preferred"],
                })
    return contradictions
Step 5: Add the refusal path.
When the verification confidence is below your threshold (typically 0.6 to 0.7 support ratio), return a response that acknowledges the uncertainty rather than presenting a low-confidence answer as fact. The refusal message should explain what the system found and what it could not verify, so the user can decide how to proceed.
def generate_with_verification(query, retriever, threshold=0.7):
    chunks = retriever.search(query, top_k=5)

    # Drop the less-authoritative side of any contradicting pair before generation.
    contradictions = check_contradictions(chunks)
    if contradictions:
        for c in contradictions:
            remove_idx = c["chunk_a"] if c["preferred"] == "B" else c["chunk_b"]
            chunks[remove_idx] = None
        chunks = [c for c in chunks if c is not None]

    answer = generate_grounded(query, chunks)
    verification = verify_answer(answer, chunks)

    # Refusal path: below the support threshold, surface what was found rather
    # than presenting a low-confidence answer as fact. summarize_partial is
    # assumed to condense the verified claims into a short partial answer.
    if verification["support_ratio"] < threshold:
        return {
            "answer": None,
            "message": "I found relevant information but cannot "
                       "verify a complete answer. Here is what I found: "
                       + summarize_partial(verification),
            "confidence": verification["support_ratio"],
        }

    return {
        "answer": answer,
        "confidence": verification["support_ratio"],
        "citations": verification["claims"],
    }
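A caller then checks whether the refusal path fired before showing anything to the user. The query and retriever object below are placeholders:

result = generate_with_verification("What is the current API rate limit?", retriever)

if result["answer"] is None:
    # Refusal path: show the partial findings instead of an unverified answer.
    print(result["message"])
else:
    print(result["answer"])
    print(f"Confidence: {result['confidence']:.2f}")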

Cost and Latency Considerations

Full verification adds 2 to 4 LLM calls per query: relevance scoring, generation with citations, post-generation verification, and optionally contradiction detection. This roughly triples the cost and doubles the latency compared to naive RAG. For high-stakes applications (medical, legal, financial), the accuracy improvement justifies this cost. For casual chat applications, it may not.

A practical middle ground is to run verification only when the initial confidence is ambiguous. If the top retrieved chunk scores above 0.9 relevance, skip verification and generate directly. If it scores below 0.5, decline immediately. Only run the full verification pipeline for the middle range (0.5 to 0.9) where the outcome is uncertain. This reduces the average verification cost by 50 to 70% while catching most of the errors.
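A sketch of that gating logic, reusing the functions from the steps above; the 0.5 and 0.9 cutoffs come from the paragraph, while the function name and decline message are illustrative:

def answer_with_adaptive_verification(query, retriever, low=0.5, high=0.9):
    # Only pay for the full verification pipeline when the best chunk's
    # relevance falls into the ambiguous middle range.
    chunks = retriever.search(query, top_k=5)
    top_score = max((score_chunk_relevance(query, c) for c in chunks), default=0.0)

    if top_score < low:
        # Weak retrieval: decline immediately.
        return {"answer": None, "message": "No sufficiently relevant sources found.",
                "confidence": top_score}
    if top_score > high:
        # Strong retrieval: generate directly and skip verification.
        return {"answer": generate_grounded(query, chunks), "confidence": top_score}
    # Ambiguous range: run the full verification pipeline.
    return generate_with_verification(query, retriever)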

Adaptive Recall builds verification into the retrieval layer itself. Every memory carries a confidence score that reflects how well-corroborated it is. Contradiction detection runs during memory consolidation rather than at query time. Evidence-gated learning prevents unverified information from being stored in the first place. This means the memories that reach the LLM have already been through a verification process, reducing the need for expensive post-generation checking.

Build on pre-verified memories. Adaptive Recall's evidence-gated learning and confidence scoring mean the context you retrieve has already been checked.

Try It Free