RAG chatbots that do not make things up

A plain language model is confident even when it is wrong. For a customer-facing chatbot, that is a liability. Retrieval-augmented generation (RAG) fixes the root cause: instead of answering from memory, the model first retrieves the relevant passages from your own documents, then answers using only that context, with sources. The payoff is measurable. Properly built and continuously evaluated RAG cuts hallucination rates by up to roughly 71 percent, and on grounded tasks where the answer must come from supplied text, the best 2026 models hold hallucination rates under 2 percent. That is the difference between a demo and something you can put in front of paying customers.

What changes with RAG

The model stops guessing about your product, your policies and your prices. It reads your knowledge base at answer time and cites where each claim came from. When the documents do not cover a question, a well-built assistant says so instead of inventing an answer.

This matters because the base failure mode is severe. Independent trackers in 2026 still report hallucination rates climbing well past 50 percent on hard, open-ended prompts for some models. RAG does not make a model smarter; it changes the question it is answering. Instead of “what do you remember about returns,” it becomes “given these three retrieved passages from our returns policy, answer and cite them.” That reframing is where the reliability comes from.
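To make that reframing concrete, here is a minimal sketch of how a grounded prompt might be assembled. The Passage type, the build_grounded_prompt function and the exact instruction wording are illustrative assumptions, not a prescribed template; tune the wording against your own evaluation set.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical label, e.g. document title plus section
    text: str


def build_grounded_prompt(question: str, passages: list[Passage]) -> str:
    """Assemble a prompt that restricts the model to the retrieved passages."""
    # Number each passage so the model can cite it as [1], [2], ...
    numbered = "\n\n".join(
        f"[{i}] ({p.source})\n{p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the passages below.\n"
        "Cite the passage number in square brackets after each claim.\n"
        "If the passages do not contain the answer, say that you do not know.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\n"
    )
```

The explicit instruction to admit ignorance is what gives the assistant a way to abstain instead of inventing an answer.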

What a trustworthy build actually needs

Retrieval alone is not enough. The parts that decide whether you can put it in front of customers are the unglamorous ones:

Clean ingestion and chunking of your real documents
A retrieval step tuned to your content, not defaults
A reranking stage that rescores the candidates so only the best reach the model
Guardrails for off-topic and unsafe requests
Evaluation against real questions before launch
Monitoring so you see failures in production, not from angry users

Skip any one of these and the system looks fine in a scripted demo, then falls apart on the long tail of questions real users actually ask.

Chunking is where most builds quietly fail

How you split documents decides what retrieval can ever find. The common mistake is fixed-size chunking, cutting every document into uniform blocks regardless of structure. It is fast, but it neutralizes most of the gains from a good retriever, because the system ends up scoring ambiguous half-passages that start mid-sentence and end mid-clause.

Section-level or semantic chunking, where splits follow the document’s own headings, clauses and paragraphs, preserves meaning. A practical range for most knowledge bases is 100 to 600 tokens per chunk, tuned to the content: short clauses for policy and legal text, larger sections for narrative documentation. The chunking default we start every client on is section-aware splitting at around 300 tokens, then we move it based on what the evaluation set shows on their actual documents. Add a modest overlap between chunks so a fact that straddles a boundary is not lost. Attach metadata to every chunk too, such as source title, section, last-updated date and product version, so retrieval can filter and the answer can cite precisely. Clean ingestion is unglamorous, but it sets the ceiling on everything downstream.
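A minimal sketch of section-aware chunking with overlap and metadata follows. It rests on several assumptions that are ours, not part of any library: sections are marked by lines starting with "#", token counts are approximated by whitespace-separated words, and the names Chunk, split_sections and chunk_document are hypothetical. A real build would use the tokenizer of its embedding model and the heading conventions of its own documents.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_sections(document: str) -> list[tuple[str, str]]:
    """Split on heading lines (assumed to start with '#'); return (heading, body) pairs."""
    sections, heading, body = [], "Untitled", []
    for line in document.splitlines():
        if line.startswith("#"):
            if body:
                sections.append((heading, " ".join(body)))
            heading, body = line.lstrip("# ").strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:
        sections.append((heading, " ".join(body)))
    return sections


def chunk_document(document: str, source: str,
                   max_tokens: int = 300, overlap: int = 30) -> list[Chunk]:
    """Chunk each section separately so no chunk crosses a heading boundary."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for heading, body in split_sections(document):
        words = body.split()  # crude stand-in for real tokenization
        for start in range(0, len(words), step):
            window = words[start:start + max_tokens]
            chunks.append(Chunk(
                text=" ".join(window),
                metadata={"source": source, "section": heading, "start_word": start},
            ))
            if start + max_tokens >= len(words):
                break  # this window already reached the end of the section
    return chunks
```

Because each window starts overlap words before the previous one ended, a fact that straddles a boundary appears whole in at least one chunk, provided it is shorter than the overlap. Fields such as last-updated date or product version would be added to the metadata dictionary in the same way.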

Retrieve, then rerank

The retrieval pattern that has become standard practice in 2026 is two stages. First, a fast embedding-based search pulls the top 50 to 100 candidate chunks from a vector index. Then a cross-encoder reranker scores each candidate against the query and passes only the top 3 to 10 to the model. The first stage favors recall, getting the right passage into the pool; the second favors precision, putting it at the top.

The uplift is real. Cross-encoder reranking has been measured to improve retrieval quality by roughly a third on average, with some enterprise reports closer to a 48 percent gain. Mature rerankers in 2026 include Cohere Rerank 3.5, Voyage rerank-2.5, Jina Reranker v2 and the open BGE reranker-v2, the last of which you can self-host to avoid per-call fees. One caveat worth internalizing: reranking only helps when recall is already high. No reranker can rescue a correct passage that never made the candidate list, so the chunking and embedding work still comes first.
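The two-stage pattern can be sketched without committing to any particular vendor. In the example below, both scoring functions are passed in as plain callables; bow_cosine is a toy word-count similarity standing in for a real embedding search, and in practice the reranker argument would wrap a cross-encoder. All names here are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def bow_cosine(query: str, text: str) -> float:
    """Toy stand-in for embedding similarity: cosine over word counts."""
    a, b = Counter(query.lower().split()), Counter(text.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query: str, chunks: list[str],
                         first_stage: Scorer, reranker: Scorer,
                         candidates: int = 50, final: int = 5) -> list[str]:
    # Stage 1 (recall): cheap score over every chunk, keep a wide pool.
    pool = sorted(chunks, key=lambda c: first_stage(query, c), reverse=True)[:candidates]
    # Stage 2 (precision): expensive score over the pool only.
    return sorted(pool, key=lambda c: reranker(query, c), reverse=True)[:final]
```

Note that the reranker only ever sees the pool. If the first stage drops the right chunk, for example because candidates is set too low, nothing in the second stage can bring it back, which is the recall caveat in code form.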

What it actually costs to run

A common worry is that grounding every answer is expensive. In practice the embedding and retrieval layer is cheap, and the model call dominates. Embedding a corpus with a model like text-embedding-3-small runs about 0.02 dollars per million tokens, or half that through batch processing; the higher-dimension text-embedding-3-large is around 0.13 dollars per million. Embedding an entire mid-size knowledge base is typically a one-time cost of a few dollars.
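The arithmetic is simple enough to check yourself. This sketch uses the per-million-token prices quoted above; the function name and the 50-million-token corpus size in the usage note are assumptions for illustration, and prices should be confirmed against the provider's current list.

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float,
                       batch: bool = False) -> float:
    """One-time cost of embedding a corpus, in dollars."""
    cost = total_tokens / 1_000_000 * price_per_million
    # Batch processing is quoted above at half the standard price.
    return cost / 2 if batch else cost
```

For a hypothetical 50-million-token corpus, embedding_cost_usd(50_000_000, 0.02) comes to about 1 dollar, or 0.50 with batch=True, and the same corpus at 0.13 dollars per million comes to about 6.50 dollars.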

The vector store is similarly modest at small to mid scale. At around 10 million vectors, managed options land roughly between 45 and 135 dollars per month, with pgvector on a Postgres instance at the low end and fully managed services higher. If you already run Postgres and sit under 10 million vectors, pgvector keeps vectors and relational data in one system with no new infrastructure. Costs climb at very large scale, where self-hosted options like Qdrant or Milvus pull ahead.

The generation step is the real variable. As of mid-2026, mid-tier models such as Claude Sonnet 4.6 sit around 3 dollars per million input tokens and 15 dollars per million output tokens, while smaller models like Claude Haiku 4.5 are roughly 1 and 5 dollars. Two levers cut this sharply: prompt caching, which reduces repeated context cost by up to 90 percent, and batch processing for offline jobs at about half price. For high-volume support, routing easy questions to a small model and escalating only hard ones to a larger one keeps quality high and the bill predictable.
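A rough per-answer cost model shows how the levers interact. This is a simplified sketch under stated assumptions: the function names are ours, the 4,000 input and 300 output tokens in the usage note are hypothetical, cached input is billed at a flat 10 percent of the normal price, and cache-write surcharges are ignored. Check the provider's pricing page before budgeting on it.

```python
def answer_cost_usd(input_tokens: int, output_tokens: int,
                    in_price: float, out_price: float,
                    cached_fraction: float = 0.0,
                    cache_discount: float = 0.9) -> float:
    """Cost of one answer in dollars; prices are per million tokens."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    # Cached input tokens are billed at (1 - cache_discount) of the normal price.
    effective_input = fresh + cached * (1 - cache_discount)
    return (effective_input * in_price + output_tokens * out_price) / 1_000_000


def blended_cost_usd(small_cost: float, large_cost: float,
                     escalation_rate: float) -> float:
    """Average cost per answer when only a share of questions is escalated."""
    return (1 - escalation_rate) * small_cost + escalation_rate * large_cost
```

With 4,000 input and 300 output tokens, the mid-tier prices above give 0.0165 dollars per answer and the small-model prices give 0.0055. Escalating 20 percent of questions blends to 0.0077 dollars per answer, less than half the cost of sending everything to the larger model.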

Prove it before launch with real evaluation

The single most skipped step is evaluation. Before any launch, build a test set of real questions: the messy ones from support tickets and chat logs, not the tidy ones from a happy-path demo. Then score the system on two axes. Retrieval: did the right passage get pulled? Faithfulness: is every claim in the answer actually supported by what was retrieved?

Public benchmarks like RAGBench, CRAG and FACTS Grounding are useful reference points, but your own question set matters more because it reflects your customers. Treat the roughly 71 percent hallucination reduction as a target that depends on continuous benchmarking; a one-time check decays the moment your documents change. A practical bar before going live: retrieval should surface the correct source in the top results on the large majority of test questions, and the assistant should abstain rather than guess when the answer is not in the corpus.
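The retrieval half of that bar, plus the abstention check, can be scored with a few lines of code. The sketch below assumes each test case records the source that should have been retrieved (or None when the corpus does not cover the question); EvalCase and both function names are hypothetical. Faithfulness is deliberately left out, because it needs claim-by-claim judgment from a human reviewer or a grader model rather than a simple lookup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    question: str
    expected_source: Optional[str]  # None: the corpus does not cover this question
    retrieved_sources: list[str]    # ranked, best first
    abstained: bool                 # did the assistant decline to answer?


def retrieval_hit_rate(cases: list[EvalCase], k: int = 5) -> float:
    """Share of answerable questions whose expected source is in the top k."""
    answerable = [c for c in cases if c.expected_source is not None]
    if not answerable:
        return 0.0
    hits = sum(c.expected_source in c.retrieved_sources[:k] for c in answerable)
    return hits / len(answerable)


def correct_abstention_rate(cases: list[EvalCase]) -> float:
    """Share of unanswerable questions where the assistant abstained."""
    unanswerable = [c for c in cases if c.expected_source is None]
    if not unanswerable:
        return 0.0
    return sum(c.abstained for c in unanswerable) / len(unanswerable)
```

Running both functions on every change to the documents, the chunking or the prompt turns "good enough to launch" into two numbers you can track over time.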

Monitor and improve after launch

A RAG system is not finished at launch; it drifts. Documents change, products ship, and new question patterns appear that your test set never saw. Log every interaction with the retrieved sources and the model’s answer so a failure can be traced to its cause: bad retrieval, a stale document, or a generation slip. Track abstention rate, escalation to humans, and user feedback signals, then feed the recurring misses back into your test set and your ingestion pipeline.
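One way to make every interaction traceable is a structured log record written as one JSON object per line. The field names below are an assumed schema for illustration, not a standard; the point is that retrieved sources and the final answer are stored together.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class InteractionLog:
    question: str
    retrieved_sources: list[str]
    answer: str
    abstained: bool = False
    escalated: bool = False
    user_feedback: Optional[str] = None  # e.g. "up", "down", or None if not given
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_line(record: InteractionLog) -> str:
    """Serialize one interaction as a single JSON line for append-only logs."""
    return json.dumps(asdict(record), ensure_ascii=False)


def abstention_rate(records: list[InteractionLog]) -> float:
    """Share of interactions where the assistant declined to answer."""
    if not records:
        return 0.0
    return sum(r.abstained for r in records) / len(records)
```

With sources and answer in the same record, a bad response can be sorted into its cause: the right source missing from retrieved_sources points to retrieval, a right but outdated source points to ingestion, and a right source with a wrong answer points to generation.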

This is also where the business case becomes visible. In 2026, customer support is the dominant RAG use case, with AI assistants resolving a large share of routine Tier 1 questions without human escalation and cutting information search time by well over half. Those numbers only hold for teams that keep measuring; the ones who ship and walk away see them erode.

The target first

Before any of that, name the target: what should this assistant get right, for whom, and how will we measure it? That single decision saves more time than any model choice. It tells you which documents to ingest, which questions to evaluate, and what “good enough to launch” means in numbers rather than vibes. It is the same idea behind the name Nerai: aim before action.

Grounding is not a feature you bolt on at the end. It is the architecture, and it is what separates a chatbot that impresses in a meeting from one your customers actually rely on.

Send us a sample of your documents and the questions your customers actually ask, and we will tell you whether RAG is the right fit and what “good enough to launch” should mean in numbers for your case.