Evaluating AI Chatbot Responses Without Guessing: A Production Eval Framework
The Problem: Your Chatbot Feels Fine Until It Isn't
Staging demos are worthless. You hand-picked five questions, watched the bot return plausible paragraphs, and shipped. Two weeks later, a user asks, "How do I downgrade my plan?" and your bot confidently describes a non-existent "Plan Downgrade API" because the retriever surfaced a deprecated changelog from 2022. Another user asks for the refund policy, and the bot cites the privacy policy instead—same confident tone, wrong document. Manual review does not scale; you need numerical guardrails that fail CI when your RAG pipeline silently drifts.
The Three Metrics That Actually Matter
BLEU and ROUGE correlate poorly with human judgment for generative QA. Use task-specific metrics that map to failure modes:
Faithfulness — every claim in the generated answer must be verifiable in the retrieved context. This catches hallucinations and unsupported extrapolations.
Answer Relevancy — the answer must address the specific user question. High faithfulness paired with low relevancy means the bot is accurately quoting the wrong document.
Context Recall — the retrieved chunks must contain the information required to answer the question. This measures the retriever, not the LLM.
For extractive tasks (returning an order ID, a deadline date, or a policy clause), skip LLM judges entirely. Use exact match, token-level F1, or citation-overlap scoring (compute token overlap between generated source citations and the ground-truth source spans). These metrics are deterministic, auditable, and immune to judge-model bias. If you need a custom rubric (e.g., scoring multi-step reasoning or brand tone), use G-Eval, which prompts a judge model with chain-of-thought evaluation steps and a form-filling format to produce a scalar score.
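The deterministic metrics described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are made up for this example, normalization is deliberately crude, and citation overlap is simplified to a Jaccard score over source IDs rather than token overlap over spans.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when both strings normalize to the same token sequence."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def citation_overlap(cited_ids: set[str], gold_ids: set[str]) -> float:
    """Jaccard overlap between cited source IDs and ground-truth source IDs.

    Assumption: citations are compared at the ID level. For span-level
    scoring, token_f1 could be applied to the cited and gold span texts.
    """
    if not cited_ids and not gold_ids:
        return 1.0
    return len(cited_ids & gold_ids) / len(cited_ids | gold_ids)
```

For instance, an answer that cites only the billing-cycle article when the gold sources are both the billing-cycle and refund-policy articles would score 0.5 on citation overlap under this definition.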
Thresholds (estimated based on typical production deployments):
- Faithfulness < 0.70: block deploy; high hallucination risk.
- Answer Relevancy < 0.75: answers are drifting off-topic.
- Context Recall < 0.75: retrieval is broken; prompt tweaks are a waste of time.
A score of 0.85 or higher on all three indicates a healthy, ship-ready pipeline.
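One way to turn these thresholds into a CI gate is sketched below. The metric keys and function names are assumptions for this example, and treating all three floors as blocking (not only faithfulness) is a choice you may want to relax.

```python
# Floors copied from the thresholds above; they are estimates, so tune them.
THRESHOLDS = {
    "faithfulness": 0.70,
    "answer_relevancy": 0.75,
    "context_recall": 0.75,
}


def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Return the names of metrics that are missing or below their floor."""
    failures = []
    for name, floor in THRESHOLDS.items():
        value = scores.get(name)
        if value is None or value < floor:
            failures.append(name)
    return failures


def gate(scores: dict[str, float]) -> int:
    """Print one line per failing metric and return a process exit code."""
    failures = failing_metrics(scores)
    for name in failures:
        print(f"FAIL {name}: {scores.get(name)} < {THRESHOLDS[name]}")
    return 1 if failures else 0


# In a CI script you would end with something like: sys.exit(gate(scores))
```

A missing metric counts as a failure here, so a crashed judge run cannot pass the gate silently.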
Building a Ground-Truth Eval Set Without Dying
For a 200-article knowledge base, fifty well-stratified Q&A pairs beat five hundred synthetic ones. The goal is coverage, not volume.
Structure your 50 questions like your real support queue (suggested split based on typical SaaS support distributions):
- roughly 40% how-to (e.g., "How do I rotate an API key?")
- roughly 30% troubleshooting (e.g., "Why am I getting a 422 on /v1/invoices?")
- roughly 20% policy (e.g., "What is the refund window?")
- roughly 10% edge / multi-hop (e.g., "Do I need an enterprise plan to use webhooks in the EU?")
Mine questions from actual support tickets, not from article headings. For each item, store:
- A 1–2 sentence golden answer.
- The exact chunk IDs required to derive it.
- One "distractor" chunk that looks relevant but does not contain the answer (this lets you measure precision).
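A record layout for these fields, plus a check that the set still follows the suggested split, might look like the sketch below. The field names, category labels, and the 10-point tolerance are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalItem:
    question: str
    category: str                        # one of the TARGET_SPLIT keys
    golden_answer: str                   # 1-2 sentences
    required_chunk_ids: tuple[str, ...]  # chunks needed to derive the answer
    distractor_chunk_id: str             # looks relevant, lacks the answer


# Suggested split from the list above (hypothetical category labels).
TARGET_SPLIT = {"how_to": 0.40, "troubleshooting": 0.30, "policy": 0.20, "edge": 0.10}


def category_shares(items: list[EvalItem]) -> dict[str, float]:
    """Share of the eval set that falls into each target category."""
    if not items:
        return {category: 0.0 for category in TARGET_SPLIT}
    counts = Counter(item.category for item in items)
    return {category: counts.get(category, 0) / len(items) for category in TARGET_SPLIT}


def split_drift(items: list[EvalItem], tolerance: float = 0.10) -> dict[str, float]:
    """Categories whose share differs from the target by more than tolerance."""
    shares = category_shares(items)
    return {
        category: shares[category]
        for category, target in TARGET_SPLIT.items()
        if abs(shares[category] - target) > tolerance
    }
```

Storing the set as one JSON or YAML record per item keeps Git diffs readable, which helps when you need to explain why a trend line moved.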
Fully synthetic generation with GPT-4 is tempting, but it overfits to the exact phrasing and structure of your headings, inflating retrieval scores unrealistically. Use humans to label edge cases and adversarial pairs; use LLMs only to bootstrap the first roughly 20% as a starting draft. Version your eval set in Git; changing questions mid-flight invalidates trend lines.
Running RAGAS (or Your Own Equivalent) in CI
RAGAS automates the LLM-as-judge loop: it prompts a judge model to compare the answer against the retrieved context and the user question, returning scalar scores for faithfulness, relevancy, and recall. A 50-example run costs roughly 150 LLM calls (three per example, one for each metric); treat that as a lower bound, since a judge metric may need more than one call per example. With GPT-4o-mini, this is pennies and takes under a minute if parallelized; with GPT-4-class judges, budget a few dollars and ~90 seconds per CI run.
The circularity problem is real: using GPT-4 to judge GPT-4 output measures self-consistency, not truth. Mitigate it three ways:
- Model separation: Judge GPT-4 answers with Claude 3.5 Sonnet, Gemini 1.5 Pro, or a local Llama-3-70B.
- Rotating judges: Swap judge models monthly. If your pipeline score jumps while deterministic metrics stay flat, your judge is drifting, not your bot.
- Deterministic anchors: Pair every LLM-judge run with citation-overlap or exact-match baselines. If citation overlap drops but faithfulness stays high, your LLM judge is forgiving hallucinations.
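The two cross-checks in the last two bullets can be expressed as a small comparison between consecutive runs. This is a sketch under assumptions: the class and label names are invented for this example, and the 0.03 tolerance is an arbitrary starting point, not a recommended value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScores:
    faithfulness: float      # LLM-judge score
    citation_overlap: float  # deterministic anchor


def diagnose(previous: RunScores, current: RunScores, tolerance: float = 0.03) -> str:
    """Flag runs where the judge and the deterministic anchor disagree."""
    judge_delta = current.faithfulness - previous.faithfulness
    anchor_delta = current.citation_overlap - previous.citation_overlap

    # Judge score jumped while the anchor stayed flat: suspect the judge.
    if judge_delta > tolerance and abs(anchor_delta) <= tolerance:
        return "judge_drift"
    # Anchor dropped while the judge held steady: the judge may be
    # forgiving hallucinations.
    if anchor_delta < -tolerance and judge_delta >= -tolerance:
        return "judge_too_lenient"
    return "consistent"
```

When both scores fall together the function reports "consistent", because that pattern points at the bot rather than the judge.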
Run this suite as a CI gate on every PR touching prompts, chunk size, embeddings, or rerankers. Post-merge dashboards are too late.
A Worked Example: 50-Question Eval Run, Annotated
You ship a support bot backed by 200 articles: API reference and billing FAQs.
Your eval set includes:
- Q1: "How do I rotate my API key?" Golden answer cites article #14. Expected: "Go to Settings > API Keys and click Regenerate." This is a single-hop how-to with a precise answer.
- Q26: "Why was I charged twice this month?" Golden answer requires article #7 ("Billing Cycles") and article #11 ("Refund Policy"). This is multi-hop: the user needs both the explanation and the remedy.
- Q38: "Can I use webhooks on the free tier?" Golden answer cites article #22. Expected: "No, webhooks require a Pro plan or higher." This is a strict yes/no policy boundary.
You tweak the system prompt to make answers "friendlier" and rerun the suite. The scores below are illustrative estimates based on typical production deployments, not measurements:
- Faithfulness: 0.82. On average, roughly 18% of the claims in an answer are not supported by the retrieved context. In Q26, the bot added, "You will be refunded automatically within 24 hours," a detail that does not exist in article #11. The model invented a concrete timeline to be helpful.
- Answer Relevancy: 0.79. The bot sometimes answers adjacent questions instead of the one asked. Q38's response started defining webhooks and describing their architecture rather than stating the plan restriction. The user got a lecture, not an answer.
- Context Recall: 0.71. On average, roughly 29% of the information needed to answer was missing from the retrieved chunks. Q26 only retrieved the billing-cycle article and missed the refund policy entirely. Q38 retrieved the webhook setup guide but missed the plan-requirement callout.
What these numbers mean in practice: Context recall is your hard ceiling. Even a perfect LLM cannot answer what the retriever never fetched. As a rough rule of thumb, a 0.71 recall caps your best possible accuracy near 71% before the LLM writes a single token. Your fix is not prompt engineering; it is hybrid search (BM25 + dense embeddings), smaller chunks with overlap, or metadata filters so billing disputes surface article #11.
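To make the ceiling concrete, here is a deterministic proxy for context recall computed from chunk IDs on the three annotated questions. It is a sketch, not how a judge-based recall metric works internally, and the chunk IDs are hypothetical stand-ins for the articles named above.

```python
from statistics import mean


def chunk_recall(retrieved: set[str], required: set[str]) -> float:
    """Fraction of required chunk IDs that the retriever returned."""
    if not required:
        return 1.0
    return len(retrieved & required) / len(required)


# (retrieved, required) per question. Chunk IDs are hypothetical.
RUNS = {
    "Q1": ({"a14"}, {"a14"}),              # single hop, chunk found
    "Q26": ({"a7"}, {"a7", "a11"}),        # refund policy missing
    "Q38": ({"a22-setup"}, {"a22-plan"}),  # plan callout missing
}

per_question = {qid: chunk_recall(got, need) for qid, (got, need) in RUNS.items()}
mean_recall = mean(per_question.values())

# Share of questions where every required chunk was retrieved. Only these
# can be answered completely, whatever the prompt says.
fully_covered = sum(score == 1.0 for score in per_question.values()) / len(per_question)
```

On these three questions the mean recall is 0.5, but only one in three has everything it needs. The mean alone can therefore overstate how many questions are fully answerable.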
What to Do When a Metric Drops
Do not touch the eval set. Re-run the identical 50 questions and compare distributions.
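Comparing runs per question, not only as an average, shows which items moved. The helper below is a minimal sketch; the function name and the 0.10 drop threshold are assumptions for this example.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    min_drop: float = 0.10,
) -> list[tuple[str, float]]:
    """Return (question_id, delta) pairs that dropped by at least min_drop.

    Results are sorted with the largest drop first. Raises if the two runs
    do not cover the same question IDs, since the scores would not be
    comparable.
    """
    if baseline.keys() != current.keys():
        raise ValueError("eval set changed between runs; scores are not comparable")
    drops = [
        (qid, round(current[qid] - baseline[qid], 4))
        for qid in baseline
        if baseline[qid] - current[qid] >= min_drop
    ]
    return sorted(drops, key=lambda pair: pair[1])
```

The key check enforces the "do not touch the eval set" rule in code: a renamed or removed question fails loudly instead of shifting the average.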
- Faithfulness drops → Add a strict citation constraint: "Append the source document ID in brackets after every factual claim." Check if truncation is cutting off the bottom of retrieved chunks, which forces the model to hallucinate missing details.
- Answer relevancy drops → The prompt is likely over-optimized for helpfulness, causing the model to generalize. Revert to: "Answer only the user's specific question using the provided context. Do not extrapolate."
- Context recall drops → Retriever failure. Reduce chunk size to 256–512 tokens with roughly 20% overlap, or add a reranker (e.g., Cohere Rerank or a local cross-encoder). If recall remains below roughly 0.75 after reranking (estimated threshold), your embedding model is mismatched to the domain; fine-tune on your 200 articles or switch to a domain-specific model.
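The chunking change in the last bullet can be sketched as a sliding window. This example assumes the text is already tokenized; whitespace tokens are only a rough proxy for model tokens, so swap in your embedding model's tokenizer for real sizing.

```python
def chunk_tokens(
    tokens: list[str],
    chunk_size: int = 256,
    overlap_ratio: float = 0.20,
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share overlap_ratio of their length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    # Each window starts `step` tokens after the previous one.
    step = max(1, chunk_size - int(chunk_size * overlap_ratio))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

With the defaults, each window starts 205 tokens after the previous one, so neighbouring chunks share 51 tokens. A sentence that straddles a boundary then appears whole in at least one chunk, as long as it is shorter than the overlap.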
One-Line Summary: What Simple Agent Logs by Default
Simple Agent streams faithfulness, answer relevancy, context recall, and citation-overlap scores to stdout on every turn, so you can grep for regressions without leaving the terminal.