You’ve built an AI feature. It works brilliantly in testing. You ship it. Within 48 hours, a paying customer screenshots a response that’s confidently, completely wrong — and shares it publicly.
This is the founding moment for every AI product team in 2026. Traditional QA assumes deterministic outputs: given input X, you always get output Y. But LLMs don’t work that way. The same prompt can produce 100 different responses, each valid, each slightly different — and occasionally, one that’s a disaster.
AI coding agents now author roughly 4% of all public GitHub commits. A three-person YC startup called Relayboard shipped a full production SaaS to paying users in 17 days using Claude Opus 4.7 and GPT-5 Codex. The velocity is real. The risk is real too. Here’s the QA playbook that works.
The core problem is probabilistic outputs. Unit tests work because 2 + 2 == 4 is always true. But “summarise this support ticket” has hundreds of acceptable answers and a long tail of unacceptable ones. You can’t assert equality.
There’s also the prompt regression problem. You improve your system prompt to fix one class of failures and inadvertently break another class you weren’t watching. And there’s model drift: the provider silently updates their model. Your prompts worked perfectly with Claude Opus 4.7 (87.6% SWE-bench Verified); they behave subtly differently once the provider swaps in an updated version. You find out from a customer complaint, not a test failure.
Instead of asserting exact outputs, assert properties of outputs. Does the response stay under 200 words? Does it contain the customer’s name? Does it avoid mentioning a competitor? These property checks are your unit evals: fast, cheap, and deterministic. They run in CI. Build 20–30 of them before you ship anything to users.
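Here’s a minimal sketch of the three property checks above as a pytest-style unit eval. The function names, the sample response, and the competitor name "Acme" are hypothetical; in a real suite the response would come from your model call.

```python
import re


def word_count(text: str) -> int:
    return len(text.split())


def check_summary(response: str, customer_name: str, competitors: list[str]) -> list[str]:
    """Return the names of failed properties; an empty list means pass."""
    failures = []
    if word_count(response) > 200:
        failures.append("too_long")
    if customer_name.lower() not in response.lower():
        failures.append("missing_customer_name")
    for name in competitors:
        # Whole-word, case-insensitive match so "Acme" does not flag "acmeology"
        if re.search(rf"\b{re.escape(name)}\b", response, flags=re.IGNORECASE):
            failures.append(f"mentions_competitor:{name}")
    return failures


def test_summary_properties():
    # Hypothetical canned response standing in for a live model output
    response = "Hi Dana, your refund was issued on Monday and should arrive within five days."
    assert check_summary(response, "Dana", ["Acme"]) == []
```

Returning a list of failed properties, instead of a single boolean, tells you which rule broke when CI goes red.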
Integration evals test multi-step flows end-to-end. If your product uses an agent that retrieves context, reasons, and then generates a response, run that entire flow against representative inputs and check final output quality. They catch failures that unit evals miss — tool call errors, context truncation, unexpected model behaviour mid-chain.
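One way to structure an integration eval, sketched under assumptions: your flow is a single callable that returns the final answer plus whatever diagnostics you record along the way. `FlowResult`, `Case`, and `run_integration_eval` are illustrative names, not part of any framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FlowResult:
    answer: str                                   # final text shown to the user
    tool_errors: list[str] = field(default_factory=list)
    context_truncated: bool = False               # assumed to be set by your retrieval step


@dataclass
class Case:
    user_input: str
    must_mention: str                             # a fact the final answer needs


def run_integration_eval(flow: Callable[[str], FlowResult], cases: list[Case]) -> list[str]:
    """Run the whole flow for each case and return human-readable failures."""
    failures = []
    for case in cases:
        try:
            result = flow(case.user_input)
        except Exception as exc:                  # a crash mid-chain is a failure too
            failures.append(f"{case.user_input!r}: flow raised {exc!r}")
            continue
        if result.tool_errors:
            failures.append(f"{case.user_input!r}: tool errors {result.tool_errors}")
        if result.context_truncated:
            failures.append(f"{case.user_input!r}: context was truncated")
        if case.must_mention.lower() not in result.answer.lower():
            failures.append(f"{case.user_input!r}: answer omits {case.must_mention!r}")
    return failures
```

Run it in CI against a small set of representative inputs and fail the build when the returned list is non-empty.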
Use a second LLM to judge whether responses meet your quality criteria. Write a judge prompt describing your rubric: accuracy, helpfulness, tone, format. Run outputs through the judge. Set a threshold, for example that 85% of responses must score 4/5 or higher, and fail the eval suite if you fall below it. Use a cheap model as judge (DeepSeek V4 at $0.30/M tokens, Gemini Flash). Reserve expensive models for production.
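The threshold logic is simple enough to sketch. The example below assumes the judge is any callable that takes a response and returns a 1–5 score; wiring it to a real judge model is left out, and `judge_pass_rate` and `suite_passes` are hypothetical names.

```python
from typing import Callable


def judge_pass_rate(outputs: list[str], judge: Callable[[str], int], min_score: int = 4) -> float:
    """Fraction of outputs the judge scores at min_score or higher."""
    if not outputs:
        raise ValueError("no outputs to judge")
    passed = sum(1 for output in outputs if judge(output) >= min_score)
    return passed / len(outputs)


def suite_passes(outputs: list[str], judge: Callable[[str], int],
                 min_score: int = 4, threshold: float = 0.85) -> bool:
    """True when the pass rate meets the threshold; fail CI otherwise."""
    return judge_pass_rate(outputs, judge, min_score) >= threshold


def keyword_judge(output: str) -> int:
    # Stand-in judge for local testing only: a real judge would call an LLM
    # with your rubric prompt and parse the score out of its reply.
    return 5 if "refund" in output.lower() else 2
```

Keeping the judge behind a plain callable means you can swap judge models, or use a stub in tests, without touching the pass/fail logic.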
In shadow mode, you run the new system in parallel with live production, processing real inputs, but you don’t show its outputs to users. Compare the two, log divergences, and flag where the new system would have been worse. Relayboard used shadow mode before rolling out their LLM-powered invoice feature. It caught a class of hallucinations on edge-case formats that their eval suite missed, because those cases only appeared in real customer data.
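A minimal sketch of a shadow-mode request handler, assuming both systems are plain callables and that you supply your own `is_divergent` comparison. All names here are illustrative, and a production version would usually run the candidate off the request path so it adds no latency.

```python
import logging
from typing import Callable

logger = logging.getLogger("shadow_mode")


def handle_request(
    user_input: str,
    production: Callable[[str], str],
    candidate: Callable[[str], str],
    is_divergent: Callable[[str, str], bool],
    divergences: list[dict],
) -> str:
    """Serve the production output; run the candidate only for comparison."""
    live_output = production(user_input)
    try:
        shadow_output = candidate(user_input)
        if is_divergent(live_output, shadow_output):
            # Keep enough detail to review later whether the candidate was worse
            divergences.append(
                {"input": user_input, "live": live_output, "shadow": shadow_output}
            )
    except Exception:
        # A failing candidate must never affect what the user sees
        logger.exception("shadow candidate failed; user response unaffected")
    return live_output
```

The key property is that the function always returns the production output, whatever the candidate does.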
Log every LLM call. Capture feedback signals (thumbs up/down, session abandonment). Alert on response latency spikes, drops in positive feedback rate, and unusual output length distributions. These are your canary signals before a real incident.
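As a rough sketch, the three canary signals can be computed by comparing a recent window of logged calls against a baseline window. The `CallRecord` fields and every threshold below (2x latency, a 10-point feedback drop, 3 standard deviations on length) are assumptions to tune against your own traffic.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional


@dataclass
class CallRecord:
    latency_ms: float
    output_words: int
    thumbs_up: Optional[bool] = None   # None when the user gave no feedback


def _positive_rate(records: list[CallRecord]) -> Optional[float]:
    rated = [r.thumbs_up for r in records if r.thumbs_up is not None]
    return sum(rated) / len(rated) if rated else None


def check_alerts(baseline: list[CallRecord], recent: list[CallRecord],
                 latency_factor: float = 2.0, max_feedback_drop: float = 0.10,
                 length_sigmas: float = 3.0) -> list[str]:
    """Compare a recent window against a baseline window and name any alerts."""
    alerts = []
    if mean(r.latency_ms for r in recent) > latency_factor * mean(r.latency_ms for r in baseline):
        alerts.append("latency_spike")
    base_rate, recent_rate = _positive_rate(baseline), _positive_rate(recent)
    if base_rate is not None and recent_rate is not None:
        if base_rate - recent_rate > max_feedback_drop:
            alerts.append("feedback_drop")
    base_lengths = [r.output_words for r in baseline]
    spread = pstdev(base_lengths) or 1.0   # avoid a zero spread on uniform baselines
    if abs(mean(r.output_words for r in recent) - mean(base_lengths)) > length_sigmas * spread:
        alerts.append("unusual_output_length")
    return alerts
```

Session abandonment could feed the same comparison as another feedback field; it is left out here to keep the sketch short.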
| Tool | Best For | Cost | Setup Time |
|---|---|---|---|
| PromptFoo | Prompt regression, multi-provider comparison | Free / open-source | 30 min |
| Braintrust | Logged evals, human annotation, CI integration | Free tier; from $150/mo | 1–2 hours |
| LangSmith | LangChain apps, tracing, dataset management | Free tier; from $39/mo | 1 hour |
| Ragas | RAG pipelines (faithfulness, context recall) | Free / open-source | 2–3 hours |
| Pytest + custom evals | Full control, property-based assertions | Free | 2–4 hours |
Critical rule: If you cannot roll back an AI feature within 10 minutes, you are not ready to ship it. The ability to revert is the fundamental safety mechanism that makes fast shipping sustainable.
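One common way to make rollback a matter of minutes is a kill switch checked on every request. The sketch below assumes a `flags` mapping backed by whatever flag store or config service you can change without a deploy; the flag name `ai_summary_enabled` and the fallback behaviour are hypothetical.

```python
from typing import Callable, Mapping


def fallback_summary(ticket_text: str, max_chars: int = 200) -> str:
    """Non-AI fallback: a plain truncation of the ticket text."""
    text = ticket_text.strip()
    return text if len(text) <= max_chars else text[:max_chars].rstrip() + "..."


def summarise_ticket(ticket_text: str,
                     llm_summarise: Callable[[str], str],
                     flags: Mapping[str, bool]) -> str:
    # The flag is read on every call, so flipping it takes effect immediately
    if not flags.get("ai_summary_enabled", False):
        return fallback_summary(ticket_text)
    try:
        return llm_summarise(ticket_text)
    except Exception:
        # Degrade to the fallback instead of surfacing a model error
        return fallback_summary(ticket_text)
```

Defaulting the flag to off means a missing or unreachable flag store fails safe.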
1. What is the worst realistic failure? Embarrassing but recoverable = probably shippable with monitoring. Catastrophic (wrong financial figure, PII leak) = hold until you have coverage for that failure class.
2. Do you have at least 50 eval cases and an 85% LLM-judge pass rate? Below this, you are conducting user research, not shipping a product feature.
3. Can you roll back in under 10 minutes? If yes, risk is bounded. If no, fix that first.
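If you want the three questions enforced instead of remembered, they fit in a small pre-ship gate. This is a sketch using the thresholds stated above; the `Readiness` fields and `ship_blockers` are hypothetical names, and you fill in the values yourself.

```python
from dataclasses import dataclass


@dataclass
class Readiness:
    worst_failure_catastrophic: bool   # e.g. wrong financial figure, PII leak
    catastrophic_class_covered: bool   # evals exist for that failure class
    eval_cases: int
    judge_pass_rate: float             # 0.0 to 1.0
    rollback_minutes: float


def ship_blockers(r: Readiness) -> list[str]:
    """Return the reasons not to ship; an empty list means go."""
    blockers = []
    if r.worst_failure_catastrophic and not r.catastrophic_class_covered:
        blockers.append("catastrophic failure class has no eval coverage")
    if r.eval_cases < 50:
        blockers.append("fewer than 50 eval cases")
    if r.judge_pass_rate < 0.85:
        blockers.append("LLM-judge pass rate below 85%")
    if r.rollback_minutes >= 10:
        blockers.append("rollback takes 10 minutes or more")
    return blockers
```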
The goal is not zero risk. It’s bounded, recoverable risk. The Relayboard team rolled back twice during their 17-day sprint — both times in under 5 minutes. They shipped faster because they could recover faster. Build the safety net, and you can move without hesitation.
The founders making the most progress share eval patterns, judge prompts, and staging strategies openly. That’s exactly what we do in the community below.
Join our free community of AI-first founders. Weekly hands-on sessions on Claude Code, Cursor, agent frameworks, and practical engineering topics — plus templates and playbooks you can use the same day.