How to triage flaky tests without rewriting them

A working framework for engineering teams whose CI is being eaten by flaky tests, focused on finding the small set of tests that produce most of the pain.

Every team eventually arrives at the same realization. The CI pipeline that used to run green most of the time now runs red about as often as green, and the engineers know which builds to ignore and which to rerun. The noise has trained everyone to retry on red, so the tests are no longer catching real regressions; the regressions get through, and the team pays for them elsewhere.

The instinct at this point is to declare flake bankruptcy and rewrite the test suite. The instinct is wrong. A wholesale rewrite produces a worse outcome than a targeted triage operation, takes longer, and rarely actually finishes. The teams that recover from a flake-eaten CI tend to do it without rewriting most of the tests.

This is the working version of how to triage flaky tests when the budget is small and the goal is to ship product again.

The eighty-twenty law of flakes

The most consistent empirical finding in CI reliability work is that a small number of tests produce most of the noise. In a typical suite of a few thousand tests, somewhere between ten and fifty tests are responsible for the majority of the false-positive failures. The rest of the suite either runs reliably or fails so rarely that nobody has paid attention.

The implication is that the right first move is not "rewrite the test suite." It is "find the ten to fifty tests that are eating CI and fix them." That work is a finite, achievable project. The team that frames it that way ships in weeks. The team that frames it as a cultural problem to solve through code review hygiene is the team that is still flaking a year later.

Instrument before you fix

The team needs data on which tests are flaking and how often. Most CI systems do not surface this well. The team needs to either install a tool that aggregates test results across runs and identifies flaky ones, or build a small in-house version that does the same thing.

The minimum data the system needs to produce, per test, is the count of runs in the last thirty days, the count of failures, and the count of cases where a failure on one run was followed by a success on the same commit in the next run. The last metric is the flake signal. A test that fails and then passes on the same commit is, by definition, not deterministically detecting anything about that commit.

Once the data exists, the team can produce a list of tests in descending order of flakes. The first dozen on the list are usually responsible for more pain than the rest of the suite combined.
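
A minimal sketch of that aggregation, assuming the CI system can export per-run results as (commit, test name, passed) records in run order; the record shape and function name here are illustrative, not any particular tool's API.

```python
from collections import defaultdict

def rank_flakes(results):
    """results: (commit_sha, test_name, passed) tuples in run order,
    covering the last thirty days of CI runs."""
    runs, failures, flakes = defaultdict(int), defaultdict(int), defaultdict(int)
    last = {}  # (commit, test) -> outcome of the previous run on that commit

    for commit, test, passed in results:
        runs[test] += 1
        if not passed:
            failures[test] += 1
        key = (commit, test)
        # The flake signal: a failure followed by a pass on the same commit.
        if last.get(key) is False and passed:
            flakes[test] += 1
        last[key] = passed

    ranked = sorted(runs, key=lambda t: flakes[t], reverse=True)
    return [(t, runs[t], failures[t], flakes[t]) for t in ranked]
```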

Categorize before you act

Each flaky test on the list is one of a small number of types. The right fix depends on the type, and the wrong fix can make the test worse without removing the flake.

Timing-dependent tests. The test waits for an asynchronous operation and assumes the operation completes within a fixed duration. The fix is to wait for a specific signal rather than a wall clock, or to extend the timeout to a value that handles the slowest realistic case. Tests in this category are usually fixable in twenty minutes each.
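
For the timing category, the shape of the fix is the same in most languages. A sketch in Python, with job_finished standing in for whatever completion check the code under test actually exposes:

```python
import time

def wait_until(condition, timeout=30.0, interval=0.1):
    """Poll for the actual signal instead of sleeping a fixed duration."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout:.1f}s")

# Before: time.sleep(2); assert job_finished()  # flakes whenever the job takes 2.1s
# After:  wait_until(job_finished)              # waits on the signal, not the clock
```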

Order-dependent tests. The test passes when run in isolation and fails when run alongside another test that mutates shared state. The fix is to isolate the shared state, either by giving each test its own copy or by ensuring deterministic teardown. Tests in this category often expose a deeper problem with how the test suite manages state, and fixing them tends to reveal additional latent flakes that were masked.
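
A sketch of the isolation fix, assuming a pytest suite where tests share a module-level dict; the names are invented for illustration, but the reset-before, tear-down-after pattern is the general one:

```python
import pytest

SHARED_CACHE = {}  # stands in for whatever state the tests were sharing

@pytest.fixture(autouse=True)
def fresh_cache():
    """Reset shared state before every test and tear it down after,
    so the outcome no longer depends on which test ran first."""
    SHARED_CACHE.clear()
    yield
    SHARED_CACHE.clear()

def test_writes_entry():
    SHARED_CACHE["k"] = 1
    assert SHARED_CACHE["k"] == 1

def test_assumes_empty():
    assert "k" not in SHARED_CACHE  # now passes in any order
```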

Environment-dependent tests. The test depends on a specific configuration, a specific port, a specific environment variable, or a specific version of an external dependency. It passes on most CI runners and fails on the one with the wrong configuration. The fix is to make the dependency explicit or to mock it. Tests in this category often signal a CI runner inconsistency that is itself worth investigating.
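
One way to make such a dependency explicit, sketched with pytest; the SERVICE_URL variable is a made-up example of the configuration one runner had wrong:

```python
import os
import pytest

def service_url():
    # Fail loudly if the environment is wrong, instead of letting one
    # misconfigured runner surface the problem as a mysterious flake.
    url = os.environ.get("SERVICE_URL")
    if url is None:
        raise RuntimeError("SERVICE_URL is not set; test environment is misconfigured")
    return url

def test_service_url(monkeypatch):
    # Pin the variable per test rather than inheriting runner state.
    monkeypatch.setenv("SERVICE_URL", "http://localhost:8080")
    assert service_url() == "http://localhost:8080"
```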

Concurrency-dependent tests. The test exercises code that has a race condition. The unlucky interleaving almost never occurs, but when it does the test fails, producing rare flakes. The fix is to find and fix the race in the production code, not the test. Tests in this category are doing their job, and rewriting the test to ignore the flake is hiding a real bug.
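
A toy illustration of the pattern: an unsynchronized counter in Python whose read-modify-write usually survives interleaving and occasionally does not. Note that the fix lives in the production class, not in the test:

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_racy(self):
        self.value += 1    # read-modify-write: two threads can interleave here

    def increment(self):
        with self._lock:   # the fix belongs in the production code
            self.value += 1

def test_counter_is_thread_safe():
    c = Counter()
    workers = [threading.Thread(target=lambda: [c.increment() for _ in range(1000)])
               for _ in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    assert c.value == 8000  # flaked rarely with increment_racy
```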

Truly nondeterministic tests. The test depends on a source of randomness or external state that is not actually deterministic. The fix is to seed the randomness explicitly or remove the dependency. These are easy to identify and the fix is mechanical once they are.
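
The mechanical fix, sketched for Python's standard-library RNG; an autouse fixture pins the seed so every run draws the same sequence:

```python
import random
import pytest

@pytest.fixture(autouse=True)
def seeded_rng():
    # Same seed, same sequence, same result on every run.
    random.seed(1234)

def test_sampler_is_stable():
    sample = random.sample(range(100), 5)
    # A fresh generator with the same seed must produce the same draw.
    assert sample == random.Random(1234).sample(range(100), 5)
```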

The categorization itself is the first hour of the triage. The team that categorizes the top dozen flakes before fixing any of them tends to fix them faster, because the fixes follow a pattern within each category.

The order of operations

A useful order for the work, from highest leverage to lowest.

Quarantine the worst offenders immediately. Move the top ten flaky tests into a separate quarantined suite that runs but does not block CI. The CI pipeline becomes signal again on the same day. The quarantined tests are not gone, but they are no longer eating the team's attention.
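
One low-ceremony way to implement the quarantine in a pytest suite, with a custom marker; the marker name and the job split are illustrative:

```python
import pytest

# Registered once in pytest.ini:
#   [pytest]
#   markers =
#       quarantine: flaky test under triage; runs in a non-blocking CI job

@pytest.mark.quarantine
def test_known_flaky_checkout_flow():
    ...  # body unchanged; the marker alone moves it out of the blocking suite
```

The blocking CI job then runs pytest -m "not quarantine", and a separate, non-blocking job runs pytest -m quarantine so the quarantined tests keep producing data while they wait for a fix.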

Fix or delete the quarantined tests over the next two weeks. Each test in the quarantine gets either a fix that returns it to the main suite or a deletion if the test is not actually testing anything important. Deletions are uncomfortable. Some of the quarantined tests are decoration on top of better-tested code paths. Removing them improves the suite.

Address the next tier of flakes. After the worst offenders are handled, the next twenty or so are usually meaningfully less painful but still worth fixing. Run the same categorization and the same fix-or-delete pass.

Install ongoing measurement. The team should be able to see, at any time, the current flake list and how it is changing week to week. New flakes get caught early, before they accumulate into another wave.
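
Given the per-test flake counts from the aggregation earlier, the week-to-week view is a small diff; the dict-of-counts input shape is assumed:

```python
def flake_delta(this_week, last_week):
    """Surface new and worsening flakes before they accumulate.
    Inputs: {test_name: flake_count} for each week."""
    report = []
    for test, count in sorted(this_week.items(), key=lambda kv: -kv[1]):
        previous = last_week.get(test, 0)
        if count > previous:
            tag = "NEW" if previous == 0 else "WORSE"
            report.append(f"{tag:5} {test}: {previous} -> {count}")
    return report
```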

Educate during code review. Once the existing flakes are handled, the goal is to prevent new ones. Code review of new tests should explicitly check for the categories above. This is the long-term cost of keeping the suite reliable, and it is much smaller than the cost of letting flakes recur.

What rewriting is actually for

There is a real case for rewriting parts of the test suite, and it should be reserved for that case. If the suite is structurally unreliable because of how it manages fixtures, state, or environment setup, no amount of fixing individual tests will produce a stable suite. The structural rewrite, in that case, is the right answer.

The structural rewrite is a much smaller project than the wholesale rewrite. It tends to involve a single file or a single helper module that affects how every test sets up state. Fixing that one piece often fixes a category of flakes that would otherwise need to be fixed test by test.
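
In a pytest suite, that single piece is often one autouse fixture in conftest.py; the reset hooks below are placeholders for whatever the suite actually shares:

```python
# conftest.py -- the one file a structural rewrite usually touches.
import pytest

def _reset_database():
    pass  # placeholder for the suite's real reset logic

def _reset_caches():
    pass  # placeholder

@pytest.fixture(autouse=True)
def clean_slate():
    """Wraps every test in the suite. Getting setup and teardown right
    here fixes a whole category of flakes at once."""
    _reset_database()
    _reset_caches()
    yield
    _reset_caches()
```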

The team should still do the triage work first. The structural problem is much easier to see once the noise has been quieted. The team that starts with the structural rewrite often discovers, three weeks in, that the structural problem they were rewriting was not the real source of the flakes. By then they have wasted three weeks.

The honest mood

A flake-ridden CI is an environment where the team has lost trust in the signals they instrumented. The trust is recoverable. The recovery is not glamorous. It is a couple of weeks of unromantic work on the most painful tests, followed by ongoing maintenance.

The teams that complete this recovery tend to look back on it as one of the highest-leverage projects of the year. The CI returns to being something the team trusts, real regressions get caught again, the on-call rotation gets quieter, and the engineering manager stops fielding the same complaint at every retro. None of that requires the wholesale rewrite. All of it requires sitting down with a list of the worst offenders and doing the work, in order, until the worst offenders are no longer the worst.

The hardest part is starting. The work is small enough that "we'll get to it next sprint" is the answer for years if nobody forces the issue. Forcing the issue is what the recovery is.