Evals for QA agents without losing your mind
A pragmatic guide to evaluating an autonomous tester at scale, with the tradeoffs we made along the way.
Building a QA agent is, in the end, a meta-QA problem. Who tests the tester? If you ship a model change and the agent quietly starts missing a class of bugs, every customer downstream is now shipping those bugs to production. Evals are the only thing standing between you and that outcome.
We tried five different ways to run evals before settling on what we have today. Here is the short version of what worked and what we threw away.
What we tried first (and why it failed)
The obvious first move is a giant golden set: a few hundred fixed flows, each with a known correct verdict, replayed every time the agent changes. We built it. It worked for about three weeks.
- The set went stale fast — apps move, flows move, fixtures rot.
- It told us when the agent regressed, but not where or why.
- It was expensive to run on every change, so engineers stopped running it on every change.
What actually works
We split evals into three layers, each with a different cost and a different question.
- Unit evals on the perception layer — given this screenshot, can the agent name the elements correctly? Cheap, fast, runs on every PR.
- Behavior evals on individual steps — given this state and this instruction, does the agent take the right next action? Run on every model change.
- End-to-end evals on real customer flows (anonymized, with consent) — does the full agent reach the right verdict? Run nightly and on every release candidate.
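One way to keep the three layers honest is to write the schedule down as data rather than tribal knowledge. The sketch below is illustrative only: the names `Trigger`, `EvalLayer`, `LAYERS`, and `layers_for` are hypothetical, not part of any real tool, and the mapping simply restates the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    PULL_REQUEST = "pull_request"
    MODEL_CHANGE = "model_change"
    NIGHTLY = "nightly"
    RELEASE_CANDIDATE = "release_candidate"


@dataclass(frozen=True)
class EvalLayer:
    name: str            # hypothetical layer identifier
    question: str        # the question this layer answers
    triggers: frozenset  # events that should run this layer


# Mapping taken from the three bullets above; everything else is assumed.
LAYERS = [
    EvalLayer(
        "perception",
        "Given this screenshot, can the agent name the elements correctly?",
        frozenset({Trigger.PULL_REQUEST}),
    ),
    EvalLayer(
        "behavior",
        "Given this state and instruction, is the next action right?",
        frozenset({Trigger.MODEL_CHANGE}),
    ),
    EvalLayer(
        "end_to_end",
        "Does the full agent reach the right verdict?",
        frozenset({Trigger.NIGHTLY, Trigger.RELEASE_CANDIDATE}),
    ),
]


def layers_for(trigger: Trigger) -> list[str]:
    """Return the names of the layers that should run for a given event."""
    return [layer.name for layer in LAYERS if trigger in layer.triggers]
```

With this shape, a CI job only needs to ask `layers_for(Trigger.PULL_REQUEST)` and run whatever comes back, so adding a layer or moving it to a different trigger is a one-line change.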
Treat the eval set like product code
The single most important habit: every customer-reported false positive or false negative becomes an eval the same day. The set grows from real failures, not from imagined ones. After a year of this, the set is a fairly good map of "the actual web," not "the web we thought we were testing."
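The habit gets easier when turning a report into an eval is mechanical. Here is a minimal sketch, assuming verdicts are a binary "pass" or "fail" and that a confirmed report means the agent's verdict was wrong. `FailureReport`, `EvalCase`, and `to_eval_case` are hypothetical names for illustration, not an existing schema.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: verdicts are binary, so the correct label is the opposite
# of whatever the agent said in a confirmed failure report.
OPPOSITE = {"pass": "fail", "fail": "pass"}


@dataclass(frozen=True)
class FailureReport:
    flow_id: str
    agent_verdict: str  # what the agent concluded: "pass" or "fail"
    kind: str           # "false_positive" or "false_negative", kept as a tag
    reported_on: date


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    flow_id: str
    expected_verdict: str
    source: str
    added_on: date


def to_eval_case(report: FailureReport) -> EvalCase:
    """Turn a confirmed customer report into a regression eval case."""
    if report.agent_verdict not in OPPOSITE:
        raise ValueError(f"unknown verdict: {report.agent_verdict!r}")
    return EvalCase(
        case_id=f"{report.flow_id}:{report.reported_on.isoformat()}",
        flow_id=report.flow_id,
        expected_verdict=OPPOSITE[report.agent_verdict],
        source=report.kind,
        added_on=report.reported_on,  # same-day rule: added when reported
    )
```

Keeping the `source` tag on each case is a small design choice that pays off later: it lets you see how much of the set came from real false positives versus false negatives.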
Cost discipline
Evals can quietly become the single biggest line item on your model bill. We keep them honest with three rules:
- Sample, don't sweep. Every PR runs a stratified sample, not the full set.
- Cache aggressively. Perception evals run against fixed screenshots, so for a given model version the same input gives the same answer and the result can be reused.
- Promote slowly. A change that looks good on the sample still has to run against the full set before it merges to main.
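The first two rules can be sketched in a few lines. This is a rough illustration under stated assumptions, not our actual harness: `stratified_sample` and `cache_key` are hypothetical helpers, the stratum key is whatever you group by (layer, customer, flow type), and the cache key includes a model version string on the assumption that cached results should not survive a model change.

```python
import hashlib
import random
from collections import defaultdict


def stratified_sample(cases, key, fraction, seed=0):
    """Pick about `fraction` of the cases from each stratum, at least one each.

    `key` maps a case to its stratum name. A fixed seed keeps the sample
    reproducible, so a re-run of the same PR sees the same cases.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[key(case)].append(case)
    sample = []
    for name in sorted(strata):  # sorted for a stable order across runs
        group = strata[name]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


def cache_key(screenshot_bytes: bytes, model_version: str) -> str:
    """Content-addressed key: same screenshot and same model, same key."""
    digest = hashlib.sha256()
    digest.update(model_version.encode("utf-8"))
    digest.update(b"\0")  # separator between version and image bytes
    digest.update(screenshot_bytes)
    return digest.hexdigest()
```

The "at least one per stratum" floor matters more than it looks: without it, small strata drop out of the sample entirely, and those are often the rare flows where regressions hide.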
What we still don't have
We do not have a clean answer for evaluating verdicts where the ground truth is genuinely ambiguous. "Is this confirmation page good enough?" is a real question with multiple defensible answers. For now we route those to a human reviewer and treat the human's call as the label. It is not glamorous, but it is honest.
If you are building anything in this space, the meta-lesson is simple: the eval set is the product. The agent is the thing it produces.