Apr 18, 2026 · 6 min read · The Swarmcheck team

Why a tuned QA agent beats a scripted suite (and where it doesn't)

Selectors are the worst part of E2E testing. Here's how we replaced them with intent-based agents — and the tradeoffs you should know going in.

Most engineering teams have the same E2E story. Someone wires up Playwright in a weekend, the suite runs green for a month, and then the redesign lands. Selectors break, waits get racy, retries pile up, and the suite quietly turns into a tax everyone pays and nobody trusts.

We built Swarmcheck because we kept hitting the same wall — and we wanted to test the boring hypothesis: what if the test agent never sees a selector at all?

What an intent-based agent actually does

A Swarmcheck check is a sentence, for example: "Sign up with a new email, verify the inbox, land on the onboarding page." The agent opens a real browser, looks at the screen the way a person would, decides what to click, and records what happened: a pass or fail verdict with a replay attached.

There is no locator. There is no waitForSelector. There is no "flaky test" Slack channel. When the redesign moves the sign-up button into a modal, the agent finds the modal.

Where scripts still win

We are not going to pretend agents win every benchmark. Here is the honest list of places a Playwright suite is still the right tool:

Per-component unit-style assertions where you want near-instant feedback in CI.
Deterministic API contract tests — agents add nothing here, just call the endpoint.
High-frequency smoke checks against an internal-only admin tool that nobody is going to redesign.
Anything where the assertion is "this exact byte appeared in this exact response" — that is a script, not a check.

Where the agent compounds

The interesting wins show up when the app is changing. A scripted suite is most expensive on the days you ship the most — exactly when you can least afford the maintenance. An agent suite gets cheaper relative to a script suite the faster the product moves.
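
To make that claim concrete, here is a back-of-the-envelope cost model. It is a sketch, not a measurement: the `SuiteCost` class and every number in it are invented placeholders, chosen only to show how the comparison shifts as UI-changing releases become more frequent. Plug in your own figures before drawing any conclusion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteCost:
    """Monthly cost model for one E2E suite (all inputs are assumptions)."""
    run_cost: float        # cost of one suite run; higher for agents (model in the loop)
    runs_per_month: int    # how often the suite runs
    repair_hours: float    # engineer hours spent fixing tests per UI-changing release
    hourly_rate: float     # loaded cost of one engineer hour

    def monthly(self, ui_releases: int) -> float:
        """Total monthly cost given the number of UI-changing releases."""
        running = self.run_cost * self.runs_per_month
        maintenance = self.repair_hours * self.hourly_rate * ui_releases
        return running + maintenance


# Hypothetical figures: the agent run is 6x pricier, but needs far less repair.
scripted = SuiteCost(run_cost=0.5, runs_per_month=600, repair_hours=6.0, hourly_rate=100.0)
agent = SuiteCost(run_cost=3.0, runs_per_month=600, repair_hours=0.5, hourly_rate=100.0)


def cost_ratio(ui_releases: int) -> float:
    """Agent cost divided by scripted cost; below 1.0 means the agent is cheaper."""
    return agent.monthly(ui_releases) / scripted.monthly(ui_releases)


for releases in (0, 2, 8):
    print(releases, scripted.monthly(releases), agent.monthly(releases),
          round(cost_ratio(releases), 2))
```

With these placeholder inputs the agent suite costs more at zero and at two UI-changing releases a month, and less at eight. The break-even point depends entirely on the numbers you feed in, which is the part worth measuring on your own team.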

Pre-merge: every PR gets a smoke check on the preview deploy. The agent reads the PR title for a hint and re-runs the affected flows.
Release suites: hundreds of plain-English checks fan out in parallel. Coverage that used to take an overnight run finishes before standup.
Production heartbeats: the same plain-English check runs hourly against prod. The on-call gets a video the moment checkout breaks, not three support tickets later.
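
The release-suite fan-out can be pictured as a bounded-concurrency loop. The sketch below is an illustration only: `run_check` is a hypothetical stub standing in for "hand one sentence to an agent", and `Result`, `run_suite`, and the `max_parallel` cap are names invented here, not Swarmcheck's API.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Result:
    check: str
    passed: bool


async def run_check(check: str) -> Result:
    """Hypothetical stand-in for running one English check in a browser."""
    await asyncio.sleep(0.01)  # placeholder for the real agent session
    return Result(check=check, passed=True)


async def run_suite(checks: list[str], max_parallel: int = 50) -> list[Result]:
    """Run all checks concurrently, with at most max_parallel in flight."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(check: str) -> Result:
        async with gate:
            return await run_check(check)

    # gather returns results in the same order as the input checks
    return await asyncio.gather(*(guarded(c) for c in checks))


checks = [f"Check out with saved card #{i}" for i in range(200)]
results = asyncio.run(run_suite(checks))
failed = [r.check for r in results if not r.passed]
print(f"{len(results)} checks run, {len(failed)} failed")
```

The cap matters in practice: each check holds a browser session and model calls, so wall-clock time is roughly the slowest batch rather than the sum of all runs.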

The honest tradeoffs

Agents are not free. A run costs more than a Playwright run because there is a model in the loop. Latency is in seconds, not milliseconds. And you have to trust the verdict — which means you need a great replay, a great reasoning trace, and a way to disagree with the agent when it gets it wrong.

We spend most of our engineering effort on those three things, in that order. The clicking part is the easy part.
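
One way to picture those three things together is as a single verdict record. This is a hypothetical data shape for illustration; the `Verdict` class, its field names, and the example URL are assumptions, not Swarmcheck's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Verdict:
    """What an agent run might hand back (hypothetical shape)."""
    check: str                      # the English sentence that was run
    passed: bool                    # the agent's own verdict
    replay_url: str                 # recording of the browser session
    reasoning: list[str] = field(default_factory=list)  # step-by-step trace
    override: Optional[bool] = None  # a human's correction, if any
    override_note: str = ""

    def disagree(self, actual_passed: bool, note: str) -> None:
        """Record that a human reviewed the replay and overruled the agent."""
        self.override = actual_passed
        self.override_note = note

    @property
    def final(self) -> bool:
        """The human override wins when present; otherwise the agent's verdict."""
        return self.passed if self.override is None else self.override


verdict = Verdict(
    check="Sign up with a new email, verify the inbox, land on the onboarding page.",
    passed=False,
    replay_url="https://example.com/replays/123",  # placeholder URL
    reasoning=["Opened sign-up modal", "Submitted form", "Saw a banner instead of onboarding"],
)
verdict.disagree(True, "Banner was a planned maintenance notice, flow still works.")
print(verdict.final, verdict.override_note)
```

Keeping the agent's original verdict alongside the override, rather than overwriting it, preserves a record of where the agent and the reviewer disagreed.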

Try it on your worst flow

If you want to see whether this fits your team, pick the flow you trust the least — usually checkout, sign-up, or billing — and point Swarmcheck at it. You will know within a day whether it earns a slot in your CI.
