Mar 30, 2026·7 min read·The Swarmcheck team

Killing the flake: how Swarmcheck handles waits, retries, and timing

Most 'flaky' tests are timing bugs in disguise. We rebuilt our wait strategy from the ground up. Here's what shipped.

Ask any QA engineer what they hate most about E2E testing and you will get the same answer: flake. A test that passes nine times and fails once is worse than a test that always fails — at least the latter is honest.

When we started Swarmcheck, our first runs had a flake rate north of 8%. After a quarter of work it sits under 0.4%. Here is what actually moved the number.

Most flake is a waiting bug

If you stare at a hundred flaky failures in a row, a pattern jumps out. Almost none of them are real bugs. Almost all of them are the test arriving at the page a beat before the page is ready to be tested.

Scripts paper over this with timeouts and polling: waitForSelector, sleep 500, retry once on fail. That works until it doesn't, and then it fails in a way that is nearly impossible to debug, because the failure happened in a race that the logs cannot reconstruct.

What we do instead

Our agent treats "is this page ready" as a first-class question. Before it tries the next action, it looks at the screen and asks: does this match the state I expect to be in? If yes, proceed. If no, wait one beat and look again. If the answer is still no once the wait budget is spent, fail loudly and attach the screenshot.

  • Network idle is a hint, not a verdict. We watch for in-flight requests but never block on them.
  • DOM mutations are a hint, not a verdict. A spinner stopping does not mean the data arrived.
  • The verdict is visual: does the screen look like a page a user could act on right now?
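
The look, wait a beat, look again loop can be sketched roughly as follows. This is a minimal illustration, not Swarmcheck's implementation: wait_until_ready, NotReadyError, looks_ready, and take_screenshot are hypothetical names, the default budget and beat are made-up values, and the visual verdict is passed in as a plain callable so the sketch stays independent of any browser library.

```python
import time
from typing import Callable


class NotReadyError(AssertionError):
    """Raised when the screen never reaches the expected state within the budget."""

    def __init__(self, message: str, screenshot: bytes):
        super().__init__(message)
        self.screenshot = screenshot  # attached so the failure can be debugged


def wait_until_ready(
    looks_ready: Callable[[], bool],       # hypothetical visual verdict
    take_screenshot: Callable[[], bytes],  # hypothetical screenshot hook
    budget_s: float = 10.0,                # assumed default, not from the post
    beat_s: float = 0.25,                  # assumed default, not from the post
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll the verdict once per beat and return the settle time in seconds."""
    start = clock()
    while True:
        if looks_ready():
            return clock() - start
        if clock() - start >= budget_s:
            # Fail loudly: the error carries the screenshot of the unready page.
            raise NotReadyError(
                f"screen not ready after {budget_s:.1f}s", take_screenshot()
            )
        sleep(beat_s)
```

Network-idle and DOM-mutation signals do not appear in the loop at all, which is the point: they could be used to decide when to look again, but only looks_ready decides whether to proceed. The clock and sleep parameters exist so the loop can be tested without real waiting.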

Retries are a smell

We do retry — but only the cheap, deterministic stuff: a network blip on the initial page load, a 502 from a CDN, a timeout reaching the screenshot service. We never retry an assertion. If the agent decided "this page does not show the order confirmation," running it again is not going to help — it is going to hide a real bug behind a green tick.
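
One way to encode that rule is to make retryability a property of the error type. The sketch below is an assumption about how such a policy could look, not Swarmcheck's code: TransientInfraError and run_with_retries are hypothetical names, and the limit of three attempts is arbitrary.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientInfraError(Exception):
    """Hypothetical marker for cheap infrastructure failures (network blip, CDN 502)."""


def run_with_retries(step: Callable[[], T], max_attempts: int = 3) -> T:
    """Retry only TransientInfraError; everything else propagates immediately."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientInfraError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the infrastructure failure
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Because AssertionError is never caught, a failed assertion ends the run on its first occurrence instead of being rerun until it turns green.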

Time is the enemy you can measure

The single biggest unlock was treating wall-clock time as an observable, not a constant. Every action records how long it took the screen to settle. We chart it per check, and any check whose settle time drifts above its baseline gets a quiet alert before it ever flakes.
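
As a rough illustration of drift detection, the sketch below keeps settle times per check and compares a recent median against a baseline median. Everything in it is an assumption for the sake of example: the SettleTimeTracker name, the use of medians, the window sizes, and the 1.5x threshold are not taken from Swarmcheck.

```python
from collections import defaultdict
from statistics import median
from typing import DefaultDict, List


class SettleTimeTracker:
    """Flags checks whose recent settle times drift above their baseline."""

    def __init__(self, baseline_runs: int = 20, window: int = 5,
                 drift_factor: float = 1.5):
        self.baseline_runs = baseline_runs  # first N runs define the baseline
        self.window = window                # most recent runs to compare
        self.drift_factor = drift_factor    # allowed slowdown before alerting
        self._samples: DefaultDict[str, List[float]] = defaultdict(list)

    def record(self, check_id: str, settle_s: float) -> bool:
        """Store one settle time; return True if the check should raise an alert."""
        history = self._samples[check_id]
        history.append(settle_s)
        if len(history) < self.baseline_runs + self.window:
            return False  # not enough data to judge yet
        baseline = median(history[: self.baseline_runs])
        recent = median(history[-self.window:])
        return recent > baseline * self.drift_factor
```

Using a median over a window means one slow run does not trigger an alert, while a sustained slowdown does, which fits the goal of warning before a check starts to flake.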

Flake is what timing bugs look like when nobody is measuring time.

What's left

We are nowhere near done. The remaining 0.4% is mostly third-party iframes (payments, captchas, support widgets) that genuinely behave differently across runs. We are working on isolating those into their own check class so a flaky Stripe iframe never red-lights an unrelated checkout test again.

If you have war stories, we want them. Half of what shipped this quarter started as a customer screenshot of a confusing failure.
