
An eval harness that survives past month two

How to build the evaluation pipeline a RAG team will actually keep using, in five pieces, with the loop that prevents abandonment.

An evaluation harness for a RAG system needs to survive past month two. Most do not, and the reason is almost always the same.

The team launches the system, builds a benchmark to validate it, runs the benchmark, ships, and then never runs the benchmark again. The questions the team chose for the benchmark were the questions the team thought of during launch. The questions the system actually receives in production are different, and they shift over time as the user population and their problems shift. The benchmark that validated the launch becomes a museum piece by the second quarter.

We have worked on RAG systems long enough to have watched this happen at most of our customers, and we now insist on a different design from the start. The eval harness that survives is not a benchmark that gets run before each release. It is a continuously-fed pipeline that takes real user queries, samples them, scores the results, surfaces regressions, and gets enough attention from the team that it is uncomfortable to ignore. We want to describe how that pipeline is structured, because the structure is where the survival comes from.

The first piece is the input. The harness should pull from real production queries, not from a static list. The list ages. Production does not. We sample queries continuously, with a rotating window that includes recent queries and queries that have been part of the eval for longer. The recent ones catch new failure modes. The older ones catch regressions that would otherwise drift in.
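As a concrete illustration, here is a minimal sketch of that sampling step in Python. It assumes the production query log is available as (timestamp, query) records; the function name, window size, and sample counts are illustrative, not prescriptive.

```python
import random
from datetime import datetime, timedelta

def build_eval_sample(query_log, existing_eval, recent_days=7,
                      n_recent=50, n_retained=150, now=None):
    """Rotating eval sample: fresh production queries plus a retained core.

    query_log     -- iterable of (timestamp, query_text) pairs from production
    existing_eval -- queries already in the harness, oldest first
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=recent_days)

    # Recent queries catch new failure modes as the user population shifts.
    recent = [q for ts, q in query_log if ts >= cutoff]
    fresh = random.sample(recent, min(n_recent, len(recent)))

    # Retained queries catch regressions that would otherwise drift in.
    retained = existing_eval[-n_retained:]

    return retained + fresh
```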

The second piece is the labels. Each query in the harness has at least a partial label of the right answer or the right kind of answer. Producing the labels is the most expensive part of the work, and the eval is only as good as the labels. The pattern that has worked for us is to label a small set of queries thoroughly, label a larger set with structured rubrics that a non-expert can apply, and let the harness grow incrementally rather than trying to fully label a large set up front. A small high-quality eval is more useful than a large fuzzy one.
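One way to represent the two label tiers is sketched below. The field names and rubric items are our own illustration rather than a fixed schema; a real rubric is written for the domain.

```python
from dataclasses import dataclass, field

@dataclass
class GoldLabel:
    """A thoroughly labeled query: expected sources plus a reference answer."""
    query: str
    expected_source_ids: list[str]
    reference_answer: str

@dataclass
class RubricLabel:
    """A lighter label: yes/no rubric items a non-expert can apply consistently."""
    query: str
    rubric: dict[str, bool] = field(default_factory=dict)

# Hypothetical example of a rubric-labeled query.
example = RubricLabel(
    query="How do I rotate the API key for the billing service?",
    rubric={
        "cites at least one retrieved source": True,
        "answers the question that was asked": True,
        "contains no steps absent from the sources": False,
    },
)
```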

The third piece is the scoring. We score on multiple dimensions deliberately, because RAG systems can fail in non-obvious ways. Retrieval correctness, source quality, factual accuracy of the synthesized answer, hallucination rate, latency at the relevant percentiles, cost per query. The dimensions matter because a single overall score lets the team trade off in ways they should not. A system that is faster but hallucinates more is worse, even if its overall score is higher.
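A sketch of how those dimensions might be carried through the pipeline follows. The field names, tolerances, and delta handling are illustrative; the one deliberate choice is that there is no overall score to collapse into.

```python
from dataclasses import dataclass

@dataclass
class EvalScores:
    """Scores for one eval run, kept as separate dimensions on purpose."""
    retrieval_recall: float      # fraction of expected sources retrieved
    source_quality: float        # rubric score for the sources actually cited
    factual_accuracy: float      # rubric score for the synthesized answer
    hallucination_rate: float    # fraction of answers with unsupported claims
    latency_p95_ms: float
    cost_per_query_usd: float

def regressed_dimensions(current: EvalScores, baseline: EvalScores, tol: float = 0.02):
    """Name the dimensions that got worse, rather than collapsing to one number."""
    # Quality dimensions compared as absolute deltas, latency and cost as relative change.
    checks = {
        "retrieval_recall": baseline.retrieval_recall - current.retrieval_recall,
        "source_quality": baseline.source_quality - current.source_quality,
        "factual_accuracy": baseline.factual_accuracy - current.factual_accuracy,
        "hallucination_rate": current.hallucination_rate - baseline.hallucination_rate,
        "latency_p95_ms": (current.latency_p95_ms - baseline.latency_p95_ms)
                          / max(baseline.latency_p95_ms, 1.0),
        "cost_per_query_usd": (current.cost_per_query_usd - baseline.cost_per_query_usd)
                              / max(baseline.cost_per_query_usd, 1e-9),
    }
    return [f"{name}: worse by {delta:.2%}" for name, delta in checks.items() if delta > tol]
```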

The fourth piece is the surfacing. The eval has to be visible to the people who can act on it. A dashboard nobody opens is a benchmark that runs in a vacuum. We wire the harness into the team's regular operating cadence. A weekly report. A regression alarm that fires in Slack. A required review during release planning. The mechanism that works depends on the team. The principle that does not change is that the eval has to be in front of someone who can do something about it.
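For the alarm variant, a minimal sketch is shown below, assuming a Slack incoming webhook URL is configured; it takes the list of regressed dimensions from the scoring sketch, and the message format is illustrative.

```python
import json
import urllib.request

def post_regression_alarm(regressed, webhook_url):
    """Post a regression alarm to a Slack incoming webhook.

    regressed -- list of human-readable strings, e.g. the output of
    regressed_dimensions() from the scoring sketch.
    """
    if not regressed:
        return
    payload = {"text": "RAG eval regression:\n" + "\n".join(regressed)}
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```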

The fifth piece is the loop. When the harness identifies a regression, the team needs a path from "the eval is failing" to "the issue is fixed." The most common reason eval harnesses get abandoned is that this loop is broken. The eval surfaces failures, nobody owns the fix, the failures persist, the team stops looking at the eval, the eval atrophies. We build the loop explicitly. The harness produces a categorized failure list. Categories have owners. Owners have an SLA. The output of the loop is either a fix or an explicit decision to accept the failure mode. Either is fine. The thing that is not fine is silence.
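The loop's bookkeeping can be as simple as the sketch below. The category names, owners, and SLA are hypothetical placeholders; the point it encodes is that every failure ends in a fix or an explicit acceptance, never in silence.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum

class Resolution(Enum):
    OPEN = "open"
    FIXED = "fixed"
    ACCEPTED = "accepted"   # an explicit decision to live with the failure mode

# Illustrative ownership map and SLA; the real ones are a team decision.
CATEGORY_OWNERS = {
    "retrieval_miss": "search-team",
    "hallucination": "generation-team",
    "stale_source": "content-team",
}
SLA = timedelta(days=14)

@dataclass
class EvalFailure:
    query: str
    category: str
    opened: date
    resolution: Resolution = Resolution.OPEN

    @property
    def owner(self) -> str:
        return CATEGORY_OWNERS.get(self.category, "unowned")

    def is_silent(self, today: date | None = None) -> bool:
        """The state the loop exists to prevent: open, past SLA, nobody acting."""
        today = today or date.today()
        return self.resolution is Resolution.OPEN and today - self.opened > SLA
```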

The teams that build this kind of harness tend to keep their RAG systems working for years. The teams that do not tend to ship a working system in week one and watch it degrade quietly until someone complains. The cost of building the harness is a few weeks of engineering plus an ongoing operational discipline. The cost of not building it is the system getting worse without anyone noticing until the customers do.

For a team that has shipped its first RAG system and is wondering whether to invest in eval infrastructure, the reasonable next step is to look at how often the team has actually rerun whatever benchmark they used at launch. If the answer is "rarely or never," the eval is already dead. The team is operating on hope. The work to fix this is bounded, the operating model is repeatable, and the system that comes out of the discipline is one the team can trust over time.