lesspoo.com

When your test suite is actually a performance test

A guest post from the SaaSPerform team on what production performance work has taught us about reading test suites as performance signals.

A team that has run performance engagements long enough develops a particular reflex. When a customer complains that something is slow, the first instinct is to measure what the system is actually doing rather than to assume the database or the network is at fault. After enough engagements, the same reflex starts to apply to test suites. The slow test, in many cases, is not slow because the test framework is slow. It is slow because the production code being tested is slow, and the test is exercising that slowness.

This is not always true. Some slow tests are slow because the fixture is unnecessary, the setup is over-engineered, or the assertions are wasteful. Most CI work focuses on these. But a meaningful share of slow tests are slow because they are accidentally functioning as performance tests, and the team treating them as flakes or fixture problems is missing what the tests are actually telling them.

The signs are recognizable. The test takes longer than its peers, but the slowness is consistent rather than variable. The slowness scales with data size in a way that suggests the production code is not handling growth well. The slowness disappears when the test is run with mocked dependencies and reappears when the dependencies are real. Each of these is a hint that the production code path the test exercises has its own performance issue, which the test is dutifully measuring even though nobody scoped it that way.
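The data-size hint can be checked directly: time the production code at two input sizes and compare. A minimal sketch, where quadratic_lookup is a made-up stand-in for production code with an accidental O(n²) scan:

```python
import time

def runtime_for(n, production_fn):
    """Time one call of the production function at input size n."""
    data = list(range(n))
    t0 = time.perf_counter()
    production_fn(data)
    return time.perf_counter() - t0

def quadratic_lookup(data):
    # Hypothetical production code: membership tests against a list
    # instead of a set, so each `in` is a linear scan.
    hits = 0
    for x in data:
        if x in data:
            hits += 1
    return hits

# For linear code, doubling the input roughly doubles the runtime.
# A much larger ratio is the scaling signature the paragraph above
# describes: the test is measuring the production code's growth.
t_small = runtime_for(2000, quadratic_lookup)
t_large = runtime_for(4000, quadratic_lookup)
ratio = t_large / t_small
```

A ratio near 4 at a doubling of input points at quadratic behavior in the code under test, not at the test framework.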

For a team facing a stubbornly slow test, the diagnostic question is whether the test is slow because of the test itself or because of the production code. The way to find out is to look at where the test spends its time. If most of the time is in the test setup, the fix is in the test. If most of the time is in the call to the production code, the fix is in the production code, and the test is incidentally surfacing a real performance issue.
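The split described above can be made explicit with two timestamps. A minimal sketch, assuming the test can be separated into a setup callable and a call into production code (both stand-ins here are toy examples):

```python
import time

def profile_test_phases(setup, call_production):
    """Time a test's setup phase separately from its call into the
    production code, to see which side owns the slowness."""
    t0 = time.perf_counter()
    fixture = setup()
    t1 = time.perf_counter()
    call_production(fixture)
    t2 = time.perf_counter()
    return {"setup_s": t1 - t0, "production_s": t2 - t1}

# Toy stand-ins: a cheap setup and a production call that sleeps,
# simulating a slow production code path.
timings = profile_test_phases(
    setup=lambda: {"rows": list(range(1000))},
    call_production=lambda fixture: time.sleep(0.05),
)

# If production_s dominates, the fix belongs in the production code;
# if setup_s dominates, the fix belongs in the test.
slow_side = max(timings, key=timings.get)
```

Real test frameworks report some of this already (pytest's --durations flag, for instance, lists setup and call times separately), but the principle is the same: attribute the time before deciding where the fix goes.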

This perspective changes how the team should think about tests that are slow. The instinct is to fix the test or to quarantine it. The right move is sometimes neither. The right move can be to recognize that the test has measured a real performance problem in production and to fix that problem first. The test then becomes fast as a side effect, and the production system becomes faster too.

The same perspective applies to integration tests that take a long time to set up databases or external services. Some of the time is unavoidable test infrastructure cost. Some of it is production code that does too many round trips, fetches too much data, or waits on slow upstream calls in a way that the test is reproducing accurately. The team that distinguishes the two and addresses each appropriately makes more progress than the team that treats all integration test slowness as fixture overhead.
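One way to separate the two kinds of integration-test slowness is to count round trips rather than time them. A minimal sketch, with a hypothetical counting wrapper around a fake database client (all names here are invented for illustration):

```python
class CountingClient:
    """Wraps a database client and counts round trips, so a test can
    show that the production code is chatty rather than the fixture
    being heavy."""
    def __init__(self, client):
        self._client = client
        self.round_trips = 0

    def query(self, sql, params=None):
        self.round_trips += 1
        return self._client.query(sql, params)

class FakeDB:
    """Stand-in for a real client; returns a canned row."""
    def query(self, sql, params=None):
        return [{"id": 1}]

def fetch_users_one_by_one(db, ids):
    # N+1-style production code: one round trip per id instead of a
    # single batched query. The integration test reproduces this
    # chattiness accurately, which is why it is slow.
    return [db.query("SELECT * FROM users WHERE id = ?", (i,)) for i in ids]

db = CountingClient(FakeDB())
fetch_users_one_by_one(db, range(50))
```

Fifty ids producing fifty round trips is a production finding, not a fixture problem; the same test against batched code would show one.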

The pattern that ties this together is that the test suite is, among other things, a sample of the production code's performance characteristics. A test suite that has gradually slowed over the last year is often telling the team that the production code has gradually slowed too, and the team has not been measuring it elsewhere. The CI duration is the place the slowness shows up first because CI runs on every change, where production performance might only be measured during incidents or capacity reviews.

For a CI reliability team that has not been thinking about tests this way, the working pattern is to keep a running list of the slowest tests and ask, for each one, whether the slowness is intrinsic to the test or whether it is reflecting something about the production code under test. The slowness intrinsic to the test gets the usual treatment. The slowness reflecting production performance issues gets routed to the team that owns the production code, with the test as the evidence.
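The routing decision in that working pattern can be expressed as a small triage step. A sketch under the assumption that each slow test has a setup-versus-production time breakdown available (the data shape and names here are hypothetical):

```python
def triage_slow_tests(durations, threshold_s=1.0):
    """Split a slowest-tests list into two buckets: tests whose
    slowness is intrinsic (fixture/setup dominates) and tests whose
    slowness reflects the production code under test.

    `durations` maps test name to a (setup_s, production_s) pair.
    """
    fix_the_test, route_to_owners = [], []
    for name, (setup_s, production_s) in durations.items():
        if setup_s + production_s < threshold_s:
            continue  # fast enough; not on the list
        if production_s > setup_s:
            route_to_owners.append(name)  # evidence of a production issue
        else:
            fix_the_test.append(name)     # fixture or setup overhead
    return fix_the_test, route_to_owners

# Illustrative numbers only.
slow = {
    "test_report_render": (0.2, 4.8),    # time is in the production call
    "test_user_import":   (3.5, 0.4),    # time is in the fixture
    "test_ping":          (0.01, 0.02),  # fast; dropped by the threshold
}
test_side, prod_side = triage_slow_tests(slow)
```

The route_to_owners bucket is what gets handed to the team that owns the production code, with the test timing as the evidence.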

The teams that adopt this discipline find that test suite improvements and production performance improvements start to compound. The pull requests that fix slow tests often also fix slow production behavior, and the engineers writing the fixes understand the relationship. The teams that do not adopt it tend to fix tests in ways that mask the underlying issue and find that the production issue surfaces somewhere else, weeks or months later, in a context where the connection back to the test is no longer obvious.

The test suite is not separate from production. It is a continuous, automated experiment running on the production code. Reading it that way produces faster systems on both sides.


This is a guest post from the team at SaaSPerform, who run performance engineering retainers for SaaS companies dealing with database, queue, and rendering bottlenecks at production scale.