State of AI 2026

From Search to Answers: Building RAG Systems That Don’t Hallucinate








Article II — Semantic Search Series
A Practical Guide for Programmers


Connect your semantic search engine to an LLM and build a question-answering system with citations, confidence scoring, and graceful “I don’t know” responses.

Chapter One

The Most Dangerous Sentence in Software

Your semantic search engine works. You built the pipeline from Article I — chunking, embeddings, vector storage, hybrid retrieval — and now when a user types “how do I request time off?” the system returns the three most relevant paragraphs from your company handbook. Impressive. But the user doesn’t want paragraphs. They want an answer. A clear, direct, trustworthy answer. So you do the obvious thing: you take those retrieved paragraphs, hand them to a large language model, and ask it to synthesize a response in natural language.

And the model says: “To request time off, submit a PTO request in Workday at least two business days in advance. Your manager will receive an automatic notification and must approve within 48 hours.”

Crisp. Professional. Exactly the kind of answer your users want. There’s just one problem: your company uses BambooHR, not Workday. The approval window is 72 hours, not 48. And there is no automatic notification — managers check a dashboard manually every morning. The model read your documents, absorbed the general theme, and then confidently fabricated the specifics.

This is hallucination. Not a bug in the traditional sense — no stack trace, no error code, no crash. Something far worse: a system that is wrong with the calm authority of something that is right. And in production, it is the most dangerous sentence your software will ever generate.

The Core Risk

Hallucination in RAG systems is not random. It follows predictable patterns: the model fills gaps when context is insufficient, blends retrieved facts with parametric knowledge when boundaries are unclear, and invents plausible details when the prompt doesn’t constrain it. Every one of these failure modes is preventable with engineering discipline.

This article is about that discipline. We are going to build a complete RAG system that retrieves relevant context with the semantic search engine from Article I, generates answers grounded exclusively in that context, attaches citations to every claim, scores its own confidence, and — crucially — knows when to say “I don’t have enough information to answer that.”

Chapter Two

What RAG Actually Is (And What It Isn’t)

Retrieval-Augmented Generation is a deceptively simple idea: instead of asking a language model to answer questions from its training data alone, you first retrieve relevant documents, then pass those documents as context alongside the question to the model. The model generates its answer based on the provided context rather than its parametric memory.

Think of it this way

A language model answering questions without retrieval is like a brilliant expert taking an exam from memory six months after studying. They’ll get the general concepts right but hallucinate specifics. RAG is that same expert taking an open-book exam — they can reference the source material for every answer they give.

The architecture has five stages, and every design decision you make maps to one of them:

RAG Pipeline Architecture

  1. User Query: “How do I request PTO?”
  2. Retrieve: semantic search
  3. Augment: build prompt + context
  4. Generate: LLM synthesizes answer
  5. Validate: cite, score, or abstain

The fifth stage — validation — is what separates production RAG from demo RAG. Most tutorials stop at stage four.

What RAG is not: fine-tuning. Fine-tuning changes the model’s weights. RAG changes the model’s input. Fine-tuning is expensive, slow, and makes the model’s knowledge static at training time. RAG is cheap, fast to update (just re-index your documents), and always reflects the latest version of your source material. For the vast majority of enterprise question-answering use cases, RAG is the correct architecture.

RAG

Dynamic Context

Retrieves fresh documents at query time. Update your knowledge base and answers change immediately. No retraining needed.

Fine-Tuning

Baked-In Knowledge

Modifies model weights with your data. Expensive to update. Best for teaching the model a new skill or domain-specific language — not for factual Q&A.

Chapter Three

The Retrieval Layer: Getting the Right Context

Your RAG system is only as good as the context it retrieves. Feed the model irrelevant paragraphs and it will either hallucinate to fill the gap or produce a confused, hedge-everything response. Feed it the right paragraphs and the model’s job becomes almost trivial: summarize what’s in front of it.

If you built the semantic search engine from Article I, you already have the core retrieval pipeline: chunked documents, vector embeddings, a similarity index. But RAG demands more from retrieval than a standalone search feature does. Here are the three upgrades that matter.

Upgrade 1: Context-Aware Chunking

In Article I, we chunked by character count with overlap. For RAG, we need chunks that are self-contained enough for an LLM to reason about in isolation. A chunk that says “As discussed above, the policy requires…” is useless without the preceding chunk. The fix is to prepend contextual metadata to each chunk before embedding:

Python

def create_contextual_chunks(title: str, text: str, splitter) -> list[dict]:
    """Create chunks that carry their own context."""
    raw_chunks = splitter.split_text(text)
    contextual_chunks = []

    for i, chunk in enumerate(raw_chunks):
        # Prepend document title and position for self-contained context
        enriched = f"Document: {title}\nSection {i+1} of {len(raw_chunks)}\n\n{chunk}"

        contextual_chunks.append({
            "content": chunk,               # Original text for display
            "content_for_embedding": enriched, # Enriched text for search
            "source_title": title,
            "chunk_index": i,
            "total_chunks": len(raw_chunks),
        })

    return contextual_chunks

Upgrade 2: Multi-Query Retrieval

A single user query often under-represents what they actually need. The question “What’s our refund policy?” might need context about return windows, exceptions for digital products, and the approval workflow. Multi-query retrieval generates several reformulations of the original question and retrieves documents for each:

Python

from anthropic import Anthropic

client = Anthropic()

def generate_sub_queries(query: str, n: int = 3) -> list[str]:
    """Generate alternative phrasings to improve retrieval coverage."""
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Generate {n} alternative phrasings of this question
that would help retrieve relevant documents. Return only the
questions, one per line, no numbering.

Original question: {query}"""
        }]
    )
    # Drop any blank lines the model may emit between questions
    return [q.strip() for q in response.content[0].text.strip().split("\n") if q.strip()]

def multi_query_retrieve(query: str, search_fn, top_k: int = 5) -> list:
    """Retrieve using original query + sub-queries, deduplicate."""
    all_queries = [query] + generate_sub_queries(query)
    seen_ids = set()
    results = []

    for q in all_queries:
        for doc in search_fn(q, top_k=top_k):
            if doc["id"] not in seen_ids:
                seen_ids.add(doc["id"])
                results.append(doc)

    return results

Upgrade 3: Retrieval with a Relevance Threshold

This is the upgrade most teams skip, and it’s the one that prevents the largest category of hallucinations. When every retrieved document scores below a similarity threshold, it means you don’t have the answer in your knowledge base. Without a threshold, the system passes low-relevance context to the model, and the model does what it does best: generate a plausible-sounding answer from nothing.

Python

def retrieve_with_threshold(
    query: str,
    search_fn,
    top_k: int = 5,
    min_similarity: float = 0.35
) -> tuple[list, bool]:
    """Retrieve documents, but flag when confidence is low."""
    results = search_fn(query, top_k=top_k)

    # Filter by similarity threshold
    relevant = [r for r in results if r["similarity"] >= min_similarity]

    has_sufficient_context = len(relevant) >= 1
    return relevant, has_sufficient_context

The threshold value depends on your embedding model and domain. For all-MiniLM-L6-v2, 0.35 is a reasonable starting point. For OpenAI’s text-embedding-3-small, start at 0.40. Calibrate by testing against a set of queries you know should and shouldn’t have answers in your knowledge base.
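That calibration step can be sketched as a simple threshold sweep over a labeled set of queries. This is a minimal sketch, not part of the pipeline above: `labeled_queries` (pairs of query and a boolean for "the knowledge base has an answer") and the `search_fn` signature are assumptions carried over from the retrieval code.

```python
def calibrate_threshold(labeled_queries, search_fn, candidates=None):
    """Pick the similarity threshold that best separates answerable
    from unanswerable queries on a labeled calibration set.

    labeled_queries: list of (query, should_have_answer) pairs.
    """
    if candidates is None:
        # Sweep 0.20 .. 0.60 in steps of 0.05
        candidates = [round(0.20 + 0.05 * i, 2) for i in range(9)]

    best_threshold, best_accuracy = candidates[0], -1.0
    for threshold in candidates:
        correct = 0
        for query, should_have_answer in labeled_queries:
            top = search_fn(query, top_k=1)
            top_sim = top[0]["similarity"] if top else 0.0
            # Predict "answerable" when the best hit clears the threshold
            predicted_answerable = top_sim >= threshold
            correct += (predicted_answerable == should_have_answer)
        accuracy = correct / len(labeled_queries)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy

    return best_threshold, best_accuracy
```

Run it once against your calibration set, then pin the winning value as `min_similarity` and re-check it whenever you change embedding models.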

The single most effective anti-hallucination technique is not in the prompt, the model, or the post-processing. It’s in the retrieval layer: knowing when you don’t have enough context to answer.

Chapter Four

The Generation Layer: Prompts That Prevent Hallucination

You’ve retrieved relevant context. Now you need to turn it into a trustworthy answer. The prompt you write for this step is the most consequential piece of text in your entire system. A vague prompt produces hallucinations. A precise prompt produces grounded, citable answers.

Here is the prompt architecture that works in production. Every line exists for a reason.

Python

SYSTEM_PROMPT = """You are a precise question-answering assistant. Your ONLY
job is to answer the user's question based on the provided context documents.

STRICT RULES:
1. Base your answer EXCLUSIVELY on the provided context documents.
2. For every factual claim in your answer, include a citation in the format
   [Source N] where N corresponds to the document number.
3. If the context documents do not contain enough information to fully answer
   the question, say exactly: "Based on the available documents, I cannot
   fully answer this question." Then explain what information IS available
   and what is missing.
4. NEVER use information from your training data to fill gaps. If a specific
   detail (a date, a number, a name, a process) is not in the context,
   do not guess or infer it.
5. If the context documents contradict each other, flag the contradiction
   explicitly and present both versions with their sources.
6. Keep your answer concise and direct. Prefer short, accurate responses
   over long, padded ones."""

def build_rag_prompt(query: str, context_docs: list[dict]) -> list[dict]:
    """Build the complete message array for the RAG call."""
    # Format context documents with source numbers
    context_block = ""
    for i, doc in enumerate(context_docs, 1):
        context_block += f"""
--- Source {i}: {doc['source_title']} ---
{doc['content']}

"""

    return [
        {"role": "user", "content": f"""Context documents:
{context_block}
Question: {query}

Answer the question using ONLY the context documents above. Cite every
factual claim with [Source N]. If you cannot answer from the provided
context, say so explicitly."""}
    ]

Let’s break down why each rule matters:

  • Rule 1 (exclusive grounding) prevents the model from blending retrieved facts with its training data — the primary cause of subtle hallucinations where the answer is mostly right but wrong in key details.
  • Rule 2 (mandatory citations) forces the model to mentally trace each claim back to a source. If it can’t cite something, it either omits it or flags the gap. Citations are not just for the user — they constrain the model’s generation process.
  • Rule 3 (explicit abstention) gives the model a concrete alternative to hallucinating. Without this instruction, models default to being helpful, which means making up an answer. With it, they default to honesty.
  • Rule 4 (no gap-filling) is the sharpest constraint. Models naturally interpolate. This rule tells them: gaps are features, not bugs. Leave them visible.
  • Rule 5 (contradiction handling) prevents the model from silently choosing one version when sources disagree — a common and dangerous failure mode in legal and compliance contexts.

The Complete Generation Function

Python

from anthropic import Anthropic

client = Anthropic()

def generate_answer(
    query: str,
    context_docs: list[dict],
    has_sufficient_context: bool
) -> dict:
    """Generate a grounded answer with citations."""

    # Early exit: no relevant context found
    if not has_sufficient_context:
        return {
            "answer": "I don't have enough information in the knowledge base "
                     "to answer this question. This topic may not be covered "
                     "in the current documentation.",
            "citations": [],
            "confidence": 0.0,
            "status": "no_context"
        }

    messages = build_rag_prompt(query, context_docs)

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=messages,
        temperature=0.0  # Minimize creative generation
    )

    answer_text = response.content[0].text

    # Extract citations from the answer
    citations = extract_citations(answer_text, context_docs)

    # Score confidence based on retrieval + generation signals
    confidence = compute_confidence(context_docs, answer_text, citations)

    return {
        "answer": answer_text,
        "citations": citations,
        "confidence": confidence,
        "status": "answered" if confidence > 0.4 else "low_confidence"
    }

Notice temperature=0.0. For factual question-answering, you want the model to be as deterministic as possible. Creativity is the enemy of accuracy when the task is to faithfully summarize retrieved documents.

Chapter Five

Citation Extraction: Every Claim Gets a Source

Citations aren’t just a nice-to-have feature for user trust. They are a structural defense against hallucination. When you require the model to cite every factual claim, you create a verifiable chain from answer to source. If a citation doesn’t check out, you know exactly which part of the answer to distrust.

Here is a robust citation extraction system that parses [Source N] markers from the generated text and maps them back to the original documents:

Python

import re

def extract_citations(
    answer: str,
    context_docs: list[dict]
) -> list[dict]:
    """Extract [Source N] citations and map them to documents."""
    citation_pattern = re.compile(r'\[Source\s+(\d+)\]')
    matches = citation_pattern.findall(answer)

    citations = []
    for source_num in set(matches):
        idx = int(source_num) - 1
        if 0 <= idx < len(context_docs):
            citations.append({
                "source_number": int(source_num),
                "title": context_docs[idx]["source_title"],
                "content_preview": context_docs[idx]["content"][:200],
                "similarity_score": context_docs[idx].get("similarity", None),
            })

    return citations


def verify_citations(
    answer: str,
    citations: list[dict],
    context_docs: list[dict]
) -> dict:
    """Verify that cited claims are actually supported by sources."""
    # Split answer into cited sentences
    sentences = re.split(r'(?<=[.!?])\s+', answer)
    cited_sentences = []
    uncited_claims = []

    for sentence in sentences:
        if re.search(r'\[Source\s+\d+\]', sentence):
            cited_sentences.append(sentence)
        elif contains_factual_claim(sentence):
            uncited_claims.append(sentence)

    return {
        "total_sentences": len(sentences),
        "cited_count": len(cited_sentences),
        "uncited_claims": uncited_claims,
        "citation_coverage": len(cited_sentences) / max(len(sentences), 1),
    }


def contains_factual_claim(sentence: str) -> bool:
    """Heuristic: does this sentence make a factual assertion?"""
    # Skip transitional/connective phrases
    non_factual_starts = [
        "based on", "according to", "in summary",
        "however", "additionally", "note that",
        "i don't", "i cannot", "the available",
    ]
    lower = sentence.lower().strip()
    if any(lower.startswith(p) for p in non_factual_starts):
        return False

    # Contains numbers, dates, proper nouns, or specific details
    return bool(re.search(r'\d+|must|requires?|shall|policy|procedure', lower))

The verify_citations function is your audit trail. If citation_coverage drops below 0.6 for an answer, it means a significant portion of the response isn’t grounded in sources — a strong signal that the model is drawing from its parametric memory rather than the retrieved context.
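Wiring that audit into the response path can be as simple as a gate that downgrades under-cited answers. A minimal sketch, assuming the response dict shape from generate_answer; the 0.5 confidence multiplier is an illustrative choice, and the 0.6 cutoff mirrors the guideline above:

```python
def apply_coverage_gate(response: dict, verification: dict,
                        min_coverage: float = 0.6) -> dict:
    """Downgrade an answer whose citation coverage is too low."""
    if verification["citation_coverage"] < min_coverage:
        response = dict(response)  # don't mutate the caller's dict
        response["status"] = "low_confidence"
        # Halve confidence so downstream routing treats the answer cautiously
        response["confidence"] = round(response["confidence"] * 0.5, 2)
    return response
```

Call it after verify_citations, before returning the response to the user-facing layer.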

Chapter Six

Confidence Scoring: Knowing What You Don’t Know

A binary “answered” or “not answered” isn’t granular enough for production systems. Users and downstream systems need a confidence score — a number that communicates how much you should trust the answer. This isn’t a model’s self-reported probability (models are notoriously overconfident). It’s an engineered signal derived from multiple independent indicators.

Retrieval Signal

How relevant were the documents?

Average similarity score of retrieved chunks. High retrieval scores mean the knowledge base has strong coverage of the topic.

Coverage Signal

How well do citations cover the answer?

Ratio of cited claims to total claims. Low coverage suggests the model is filling gaps with its own knowledge.

Abstention Signal

Did the model hedge or qualify?

Presence of phrases like “I cannot determine” or “the documents don’t specify” indicates the model recognized its own uncertainty.

Python

def compute_confidence(
    context_docs: list[dict],
    answer: str,
    citations: list[dict]
) -> float:
    """
    Compute a confidence score from 0.0 to 1.0 using three independent
    signals: retrieval quality, citation coverage, and abstention detection.
    """
    # ── Signal 1: Retrieval quality (0.0 - 1.0)
    similarities = [d["similarity"] for d in context_docs if "similarity" in d]
    retrieval_score = sum(similarities) / len(similarities) if similarities else 0.0

    # ── Signal 2: Citation coverage (0.0 - 1.0)
    verification = verify_citations(answer, citations, context_docs)
    coverage_score = verification["citation_coverage"]

    # ── Signal 3: Abstention detection (0.0 or 1.0)
    abstention_phrases = [
        "i cannot", "i don't have", "not enough information",
        "the documents don't", "not covered", "no information",
        "cannot fully answer", "not specified", "unable to determine",
    ]
    has_abstention = any(p in answer.lower() for p in abstention_phrases)
    abstention_penalty = 0.5 if has_abstention else 1.0

    # ── Weighted combination
    raw_score = (
        0.45 * retrieval_score +    # Retrieval quality matters most
        0.35 * coverage_score +     # Citation coverage is next
        0.20 * abstention_penalty    # Abstention reduces confidence
    )

    return round(min(max(raw_score, 0.0), 1.0), 2)

Here’s how confidence scores translate to user-facing behavior:

Confidence Score Thresholds (illustrative scores: high 0.85, medium 0.55, low 0.25)

High confidence → show answer directly. Medium → show answer with “verify with sources” caveat. Low → show “I’m not sure” with source links instead.

Confidence Range | Behavior | User Experience
0.70 – 1.00 | Display answer directly with citation links | Clean, authoritative response with inline references
0.40 – 0.69 | Display answer with a caveat | “Here’s what I found, but you may want to verify: …” + source links
0.00 – 0.39 | Decline to answer | “I couldn’t find a reliable answer. Here are the closest documents: …” + links
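The routing in the table above can be expressed as a small helper. A sketch only; the tier names are illustrative, and the cutoffs are the ones from the table:

```python
def route_by_confidence(confidence: float) -> str:
    """Map a confidence score to a user-facing presentation tier."""
    if confidence >= 0.70:
        return "direct"      # show the answer with citation links
    if confidence >= 0.40:
        return "caveated"    # show the answer plus a "verify with sources" note
    return "declined"        # decline; show the closest documents instead
```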

Chapter Seven

The Complete System: From Question to Trustworthy Answer

We’ve built each layer individually. Now let’s compose them into a single, production-ready RAG function that handles the full lifecycle: retrieve, validate context, generate, cite, score confidence, and decide whether to answer or abstain.

Python — rag_engine.py

from anthropic import Anthropic
from dataclasses import dataclass

client = Anthropic()


@dataclass
class RAGResponse:
    answer: str
    citations: list[dict]
    confidence: float
    status: str              # "answered" | "low_confidence" | "no_context"
    source_documents: list[dict]
    debug: dict


def ask(
    query: str,
    search_fn,
    top_k: int = 5,
    min_similarity: float = 0.35,
    confidence_threshold: float = 0.4
) -> RAGResponse:
    """
    Complete RAG pipeline: retrieve → validate → generate → cite → score.

    Returns a RAGResponse with the answer, citations, confidence score,
    and a status indicating whether the system is confident enough to
    present the answer directly.
    """
    # ── Stage 1: Retrieve with threshold ──
    context_docs, has_context = retrieve_with_threshold(
        query, search_fn, top_k=top_k, min_similarity=min_similarity
    )

    # ── Stage 2: Early exit if no relevant context ──
    if not has_context:
        return RAGResponse(
            answer=("I don't have enough information in the knowledge "
                    "base to answer this question."),
            citations=[],
            confidence=0.0,
            status="no_context",
            source_documents=[],
            debug={"retrieval_count": 0, "query": query}
        )

    # ── Stage 3: Build prompt and generate ──
    messages = build_rag_prompt(query, context_docs)

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=messages,
        temperature=0.0
    )

    answer_text = response.content[0].text

    # ── Stage 4: Extract and verify citations ──
    citations = extract_citations(answer_text, context_docs)
    verification = verify_citations(answer_text, citations, context_docs)

    # ── Stage 5: Compute confidence ──
    confidence = compute_confidence(context_docs, answer_text, citations)

    # ── Stage 6: Determine response status ──
    if confidence >= 0.7:
        status = "answered"
    elif confidence >= confidence_threshold:
        status = "low_confidence"
    else:
        status = "no_context"
        answer_text = ("I found some related information, but I'm not confident "
                       "enough to give a reliable answer. Here are the most "
                       "relevant documents I found:")

    return RAGResponse(
        answer=answer_text,
        citations=citations,
        confidence=confidence,
        status=status,
        source_documents=context_docs,
        debug={
            "query": query,
            "retrieval_count": len(context_docs),
            "citation_coverage": verification["citation_coverage"],
            "uncited_claims": verification["uncited_claims"],
            "model": "claude-sonnet-4-5-20250929",
        }
    )

Usage is clean and expressive:

Python

# Ask a question
result = ask("How do I request time off?", search_fn=semantic_search)

# Route based on confidence
if result.status == "answered":
    print(result.answer)
    print("\nSources:")
    for c in result.citations:
        print(f"  [{c['source_number']}] {c['title']}")

elif result.status == "low_confidence":
    print("⚠ I found a possible answer, but please verify:")
    print(result.answer)

else:
    print("I couldn't find a reliable answer to that question.")
    print("You might want to check these related documents:")
    for doc in result.source_documents:
        print(f"  → {doc['source_title']}")

Chapter Eight

The Five Failure Modes (And How to Defend Against Each)

Every production RAG system encounters these failure patterns. Understanding them is the difference between a demo that impresses and a product that people trust with real decisions.

  • Context Poisoning. Irrelevant documents are retrieved, and the model hallucinates to bridge the gap between the question and unrelated context. Defense: retrieval threshold + relevance re-ranking (Article I); never pass low-similarity documents to the model.
  • Parametric Bleed. The model blends retrieved facts with its training data; the answer is “mostly right” but wrong in specifics. Defense: explicit grounding rules in the system prompt, temperature = 0, and a citation requirement that forces traceability.
  • Chunk Boundary Loss. The answer spans two chunks but only one was retrieved, so the model fills the gap with inferred content. Defense: overlap chunking (Article I) + contextual chunk enrichment + retrieving adjacent chunks.
  • Confident Abstention Failure. The model doesn’t realize it’s missing information and generates a complete answer that happens to be wrong. Defense: multi-signal confidence scoring + a citation coverage audit; if coverage < 0.6, flag as low confidence.
  • Stale Context. The knowledge base is outdated, so the model returns an answer based on old policy or deprecated information. Defense: metadata filtering by date + document versioning + a freshness score in retrieval ranking.

Defending Against Chunk Boundary Loss

This failure mode deserves special attention because it’s both common and invisible. When the correct answer spans two consecutive chunks, a naive retrieval system might return only one, leaving the model to guess about the rest. The fix is to retrieve neighboring chunks automatically:

Python

def retrieve_with_neighbors(
    query: str,
    search_fn,
    fetch_by_id_fn,
    top_k: int = 5,
    neighbor_window: int = 1
) -> list[dict]:
    """Retrieve top chunks + their neighbors for context continuity."""
    results = search_fn(query, top_k=top_k)
    expanded = []
    seen_ids = set()

    for doc in results:
        # Fetch the chunk and its neighbors
        source = doc["source_title"]
        idx = doc["chunk_index"]

        for offset in range(-neighbor_window, neighbor_window + 1):
            neighbor_idx = idx + offset
            key = (source, neighbor_idx)
            if key not in seen_ids and neighbor_idx >= 0:
                neighbor = fetch_by_id_fn(source, neighbor_idx)
                if neighbor:
                    seen_ids.add(key)
                    expanded.append(neighbor)

    return expanded

This technique increases the context window for each retrieved chunk without retrieving irrelevant documents. The model now sees the paragraph before and after each hit, preserving the narrative flow of the original document.

Chapter Nine

Evaluating Your RAG System: Metrics That Matter

You cannot improve what you don’t measure. RAG evaluation requires testing both the retrieval layer and the generation layer independently, then testing them together. Here is a practical evaluation framework you can implement in an afternoon.

Python

from dataclasses import dataclass

@dataclass
class EvalCase:
    query: str
    expected_answer: str           # Ground truth answer
    expected_source_title: str     # Which doc should be retrieved
    should_abstain: bool = False  # True if answer isn't in knowledge base


def evaluate_rag(
    eval_cases: list[EvalCase],
    ask_fn
) -> dict:
    """Run evaluation suite and compute aggregate metrics."""
    results = {
        "retrieval_recall": [],      # Did we retrieve the right doc?
        "answer_correctness": [],    # Is the answer factually right?
        "abstention_accuracy": [],   # Did we abstain when we should?
        "hallucination_count": 0,    # Times we answered but shouldn't
        "false_abstention_count": 0, # Times we abstained but had answer
    }

    for case in eval_cases:
        response = ask_fn(case.query)

        # Check retrieval
        retrieved_titles = [d["source_title"] for d in response.source_documents]
        retrieval_hit = case.expected_source_title in retrieved_titles
        results["retrieval_recall"].append(retrieval_hit)

        # Check abstention behavior
        did_abstain = response.status == "no_context"

        if case.should_abstain and not did_abstain:
            results["hallucination_count"] += 1
        elif not case.should_abstain and did_abstain:
            results["false_abstention_count"] += 1

        results["abstention_accuracy"].append(
            case.should_abstain == did_abstain
        )

    # Aggregate
    n = len(eval_cases)
    return {
        "retrieval_recall": sum(results["retrieval_recall"]) / n,
        "abstention_accuracy": sum(results["abstention_accuracy"]) / n,
        "hallucination_rate": results["hallucination_count"] / n,
        "false_abstention_rate": results["false_abstention_count"] / n,
        "total_cases": n,
    }

Build your evaluation set with 50–100 cases. Include three categories in roughly equal proportion:

  1. Answerable questions — questions whose answers are clearly present in your knowledge base. Measure retrieval recall and answer quality.
  2. Unanswerable questions — questions that are in-domain but not covered by your documents. The system should abstain. Failures here are hallucinations.
  3. Edge cases — questions that are partially answerable, or whose answers span multiple documents, or that use different terminology than your source material. These expose the weaknesses in your chunking and retrieval strategy.
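A starter set covering the three categories might look like the sketch below. The document titles, queries, and expected answers are placeholders for your own corpus, and EvalCase is redeclared here only so the snippet stands alone:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:  # same shape as the EvalCase in the evaluation framework above
    query: str
    expected_answer: str
    expected_source_title: str
    should_abstain: bool = False

eval_cases = [
    # 1. Answerable: the PTO policy is clearly in the knowledge base
    EvalCase(
        query="How do I request time off?",
        expected_answer="Submit a PTO request in BambooHR at least two days ahead.",
        expected_source_title="Employee Handbook - Time Off",
    ),
    # 2. Unanswerable: in-domain, but not covered by any document
    EvalCase(
        query="What is the policy for bringing pets to the office?",
        expected_answer="",
        expected_source_title="",
        should_abstain=True,
    ),
    # 3. Edge case: different terminology than the source material
    EvalCase(
        query="What's the procedure for taking annual leave?",
        expected_answer="Submit a PTO request in BambooHR at least two days ahead.",
        expected_source_title="Employee Handbook - Time Off",
    ),
]
```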

The metric you care about most is hallucination rate: the percentage of unanswerable questions where the system generated an answer anyway. In production, this should be below 5%. If it isn’t, your retrieval threshold is too low or your system prompt isn’t constraining the model enough.

Chapter Ten

Production Patterns: The Details That Ship

The difference between a RAG prototype and a production RAG system is not the model or the retrieval quality. It’s the twenty small engineering decisions that handle real-world messiness. Here are the four that matter most.

Pattern 1: Streaming Responses with Live Citations

Users don’t want to wait 8 seconds staring at a spinner. Stream the answer token by token, then append citations when the stream completes:

Python

def ask_streaming(query: str, search_fn):
    """Stream the RAG response for responsive UX."""
    context_docs, has_context = retrieve_with_threshold(query, search_fn)

    if not has_context:
        yield {"type": "status", "status": "no_context"}
        return

    messages = build_rag_prompt(query, context_docs)
    full_answer = ""

    # Stream the generation
    with client.messages.stream(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=messages,
        temperature=0.0
    ) as stream:
        for text in stream.text_stream:
            full_answer += text
            yield {"type": "token", "text": text}

    # After stream completes, emit citations and confidence
    citations = extract_citations(full_answer, context_docs)
    confidence = compute_confidence(context_docs, full_answer, citations)

    yield {
        "type": "metadata",
        "citations": citations,
        "confidence": confidence,
    }

Pattern 2: Conversation Memory with Context Windowing

Real users ask follow-up questions. “What’s the refund policy?” followed by “What about for digital products?” The second question is meaningless without the first. But you can’t simply prepend the entire conversation history to every retrieval query — it dilutes the embedding. Instead, use the LLM to rewrite follow-ups as standalone questions:

Python

def rewrite_with_context(
    current_query: str,
    conversation_history: list[dict]
) -> str:
    """Rewrite a follow-up question as a standalone query."""
    if not conversation_history:
        return current_query

    history_text = "\n".join(
        f"{msg['role']}: {msg['content']}"
        for msg in conversation_history[-4:]  # Last 4 turns only
    )

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Fast, cheap model for rewriting
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"""Given this conversation:
{history_text}

Rewrite this follow-up question as a standalone question that captures
the full intent. Return ONLY the rewritten question.

Follow-up: {current_query}"""
        }]
    )

    return response.content[0].text.strip()
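To feed this rewriter you need something holding the recent turns. A minimal sketch of such a holder (the class name and shape are illustrative, not a prescribed API):

```python
class ConversationMemory:
    """Rolling window of the most recent conversation turns."""

    def __init__(self, max_turns: int = 4):
        self.max_turns = max_turns
        self.turns: list[dict] = []

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})
        # Keep only the most recent turns so rewrite prompts stay small
        self.turns = self.turns[-self.max_turns:]
```

Pass `memory.turns` as `conversation_history` to `rewrite_with_context` before each retrieval, and append both the user query and the final answer after each exchange.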

Pattern 3: Source Document Freshness

When your knowledge base contains multiple versions of a document — a policy updated three times over two years, for example — the retrieval system might surface all three versions. The model then has contradictory context. Use metadata-aware ranking to prefer the most recent version:

SQL — PostgreSQL + pgvector

-- Retrieve relevant chunks, preferring recent documents
SELECT title, content, created_at,
       1 - (embedding <=> $query_vector::vector) AS similarity
FROM   documents
WHERE  1 - (embedding <=> $query_vector::vector) > 0.35
ORDER BY
    -- Blend similarity with freshness: older docs are penalized
    -- logarithmically, so at equal similarity the newer version wins
    (1 - (embedding <=> $query_vector::vector))
    / (1.0 + 0.1 * LN(GREATEST(
        EXTRACT(EPOCH FROM AGE(NOW(), created_at)) / 86400,
        1
    ))) DESC
LIMIT  5;
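The intent of the blend is that, at equal similarity, the newer document ranks first. The same scoring in Python, if you rank client-side instead of in SQL (the 0.1 weight is an illustrative constant you should tune):

```python
import math

def freshness_score(similarity: float, age_days: float) -> float:
    """Cosine similarity discounted by a logarithmic age penalty."""
    return similarity / (1.0 + 0.1 * math.log(max(age_days, 1.0)))
```

A logarithmic penalty is deliberately gentle: a document from last week barely loses ground, while a highly relevant two-year-old document can still outrank a mediocre fresh one.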

Pattern 4: Logging for Continuous Improvement

Every RAG interaction is a data point for improvement. Log everything: the query, the retrieved documents, the generated answer, the confidence score, and — crucially — user feedback. This becomes your evaluation dataset for the next iteration.

Python

import json
from datetime import datetime, timezone

def log_interaction(query: str, response: RAGResponse):
    """Log RAG interactions for evaluation and improvement."""
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "answer": response.answer,
        "status": response.status,
        "confidence": response.confidence,
        "num_sources": len(response.source_documents),
        "num_citations": len(response.citations),
        "feedback": None,  # updated later when the user rates the answer
        "debug": response.debug,
    }

    # Append to JSONL file (one JSON object per line)
    with open("rag_interactions.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")

Review your logs weekly. Sort by low confidence scores to find questions your knowledge base can’t answer — these are documentation gaps. Sort by high confidence scores where users gave negative feedback — these are hallucinations your scoring system missed. Both categories directly improve your system.
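The weekly review can be partially automated. A triage sketch over the JSONL log (the 0.5 cutoff and the `feedback` field values are assumptions to tune against your own data):

```python
import json

def triage_logs(path: str = "rag_interactions.jsonl"):
    """Bucket logged interactions for weekly review."""
    gaps, suspects = [], []
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            if entry["confidence"] < 0.5:
                gaps.append(entry)        # likely documentation gaps
            elif entry.get("feedback") == "negative":
                suspects.append(entry)    # possible missed hallucinations
    gaps.sort(key=lambda e: e["confidence"])  # worst first
    return gaps, suspects
```

The `gaps` bucket tells your documentation team what to write next; the `suspects` bucket tells you where your confidence scoring is miscalibrated.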

Key Takeaways

  • RAG retrieves context, then generates. Instead of answering from memory, the model answers from your documents. This makes responses current, auditable, and controllable.
  • The retrieval threshold is your first line of defense. When no document scores above the similarity threshold, the system should abstain rather than pass low-quality context to the model.
  • Prompt engineering is structural, not cosmetic. Explicit grounding rules, mandatory citations, and a concrete abstention instruction prevent the three most common hallucination patterns.
  • Confidence scoring uses three independent signals: retrieval similarity, citation coverage, and abstention detection. No single signal is sufficient. Combined, they give reliable trust estimates.
  • Citations are a defense mechanism, not just a feature. Requiring the model to cite every factual claim creates a verifiable chain from answer to source and constrains the generation process itself.
  • The five failure modes are predictable and preventable: context poisoning, parametric bleed, chunk boundary loss, confident abstention failure, and stale context each have specific, implementable defenses.
  • Measure hallucination rate explicitly. Build an evaluation set that includes unanswerable questions. The percentage of those that receive confident answers is your hallucination rate. Keep it below 5%.
  • Log everything. Every RAG interaction is training data for your next improvement cycle. Low-confidence queries reveal documentation gaps. High-confidence failures reveal scoring blind spots.
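The hallucination-rate measurement from the takeaways reduces to a few lines, assuming an evaluation set where each question is labeled answerable or not and each run records whether the system abstained (field names are illustrative):

```python
def hallucination_rate(results: list[dict]) -> float:
    """Fraction of unanswerable questions that got a confident answer."""
    unanswerable = [r for r in results if not r["answerable"]]
    if not unanswerable:
        return 0.0
    confident = [r for r in unanswerable if not r["abstained"]]
    return len(confident) / len(unanswerable)
```

Run this against your evaluation set after every prompt or threshold change; a regression here should block the release just as a failing unit test would.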

Continue the Series

I
The Quiet Revolution: How Semantic Search Is Rewriting the Rules of Data Retrieval

The foundation: embeddings, vector databases, hybrid search, and the pipeline that powers everything in this article.

III
Multimodal Search: When Your Data Isn’t Just Text

Extend semantic search to images, PDFs, code repositories, and audio transcripts using CLIP, ColPali, and unified embedding spaces.

IV
Scaling to 100 Million Vectors: Architecture Decisions That Matter

Sharding strategies, quantization techniques, HNSW vs. IVF index selection, and the real-world cost of vector search at enterprise scale.

◆ ◆ ◆

The goal was never to build a system that always has an answer. It was to build one that never lies about what it knows.
