The cheapest part of RAG is the embeddings

Why most RAG systems return crap from a working retrieval pipeline, and the corpus-curation work that actually fixes it.

The cheapest part of a RAG system is the embeddings. The expensive part is the curation, and most teams skip it.

We have done enough RAG engagements at this point to recognize a pattern. The team that has just stood up its first internal search or assistant comes to us because the results are bad and they cannot figure out why. The vector database is fine. The embedding model is fine. The retrieval code is fine. The chunking is reasonable. None of those are the problem. The corpus has not been curated, and a system retrieving accurately from a bad corpus is going to produce bad output reliably.

The first time a team encounters this, the diagnosis is uncomfortable. It does not feel like the kind of work the team thought they were signing up for. The team built a search system. They did not plan to spend two weeks reading their own documents. The honest answer is that reading the documents is the work, and the team that decides not to do it is not actually shipping a search system. They are shipping a way to surface their own document quality problems to the people they were hoping to help.

A useful way to think about a corpus is as a set of layers. The top layer is content that is correct, current, and authoritative: internal docs that someone has actually maintained, source code that matches current behavior, customer-facing pages that have been reviewed in the last quarter. The middle layer is content that was correct at some point, may still be correct, but cannot be safely cited without checking: old design docs, legacy runbooks, support tickets that include workarounds for bugs that have since been fixed. The bottom layer is content that is wrong, stale, contradicts the top layer, or is itself a paraphrase of other content that already exists in the corpus.
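In code, the layering is nothing more than chunk-level metadata. A minimal sketch, with illustrative names that any real pipeline would replace with its own schema:

```python
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    AUTHORITATIVE = "authoritative"  # top: correct, current, maintained
    UNVERIFIED = "unverified"        # middle: was correct once, check before citing
    DEPRECATED = "deprecated"        # bottom: wrong, stale, or derivative

@dataclass
class Chunk:
    text: str
    source_uri: str     # provenance: the document this chunk came from
    last_reviewed: str  # ISO date of the last human review, empty if never
    layer: Layer
```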

The bottom layer is what produces the bad output. The system retrieves from the top, the middle, and the bottom equally. The model summarizes what it found. The user gets a confident-sounding answer that is wrong, because part of what was retrieved was wrong. The team's instinct is to tune the model. The model has nothing to do with it. The retrieval was correct. The corpus was wrong.

The work to fix this is mostly mechanical. Identify the corpus. Establish provenance for each chunk. Decide which layer each piece sits in. Remove the bottom layer. Mark the middle layer with a flag the system uses to either de-prioritize it or surface a warning when it is the basis for an answer. The top layer becomes the default source. The team that does this work in week one of the engagement has a system that works in week two. The team that skips it is doing model tuning indefinitely and getting marginal improvements.
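Here is roughly what wiring the flag into retrieval looks like. The `vector_store.search` call and its return shape are placeholders for whatever client a team actually uses, and the penalty value is a starting point to tune, not a recommendation:

```python
MIDDLE_LAYER_PENALTY = 0.5  # illustrative: halve similarity scores for unverified chunks

def retrieve(vector_store, query: str, k: int = 5):
    # Over-fetch so that filtering still leaves k candidates.
    candidates = vector_store.search(query, top_k=k * 4)  # -> [(Chunk, score), ...]
    kept = []
    for chunk, score in candidates:
        if chunk.layer is Layer.DEPRECATED:
            continue  # safety net; the bottom layer should not be in the store at all
        if chunk.layer is Layer.UNVERIFIED:
            score *= MIDDLE_LAYER_PENALTY  # de-prioritize rather than hide
        kept.append((chunk, score))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    top = kept[:k]
    # Tell the caller to surface a warning when the answer leans on the middle layer.
    needs_warning = any(c.layer is Layer.UNVERIFIED for c, _ in top)
    return top, needs_warning
```

The deprecated-chunk filter is belt and suspenders. The point of the curation work is that the bottom layer never gets embedded in the first place.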

The painful truth is that most internal corpora have a meaningful bottom layer. Documents written by people who have since left and were not reviewed before publication. Old code that was never deleted. Tickets that documented fixes that were later reverted. Wikis that were maintained at one point and have not been touched in three years. None of this is unusual. All of it is in the search results until someone removes it.

The other painful truth is that a lot of recent corpora include AI-generated content that is itself a confident summary of older AI-generated content. The recursion produces a particular flavor of plausible-sounding wrongness. Removing this layer matters more than people often realize: summary text is keyword-dense and written in the same flat register as most queries, so once it is in the corpus it tends to outrank the human-written content it was derived from in retrieval scoring.
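Finding candidate derivative content can start with something as crude as pairwise embedding similarity. This is a sketch, not a dedup pipeline: `embed` stands in for whatever embedding call already exists, the threshold has to be tuned by hand, and the quadratic loop only works on a corpus small enough to read, which is the kind we are talking about:

```python
import numpy as np

def near_duplicate_pairs(chunks, embed, threshold: float = 0.95):
    """Flag chunk pairs whose embeddings are suspiciously similar.

    `embed` is assumed to return one unit-normalized vector per text,
    so the dot product below is cosine similarity. Flagged pairs go to
    a human, who keeps the better-sourced copy and drops the paraphrase.
    """
    vectors = np.array([embed(c.text) for c in chunks])
    sims = vectors @ vectors.T
    pairs = []
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            if sims[i, j] >= threshold:
                pairs.append((chunks[i], chunks[j], float(sims[i, j])))
    return pairs
```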

The teams that come out of these engagements with working systems usually share a few practices. They have a stated source-of-truth for each topic the system answers. They have a process for moving content between layers as it ages. They have an eval harness that catches regressions when bad content sneaks back into the top layer. And they have a designated person who is responsible for the corpus, separately from the people who own the model and the retrieval code. The corpus is the system. The rest of it is plumbing.
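The eval harness does not need to be elaborate. A handful of hand-written questions, each pinned to the source-of-truth document its answer must come from, catches most regressions. Everything below is illustrative and reuses the `retrieve` sketch from earlier:

```python
# Each case pins a question to the source-of-truth document that must
# appear in the retrieved set. The entries here are made up.
EVAL_CASES = [
    {"query": "how do we rotate the signing key",
     "must_cite": "runbooks/key-rotation.md"},
    {"query": "what is the current refund window",
     "must_cite": "policies/refunds.md"},
]

def run_evals(vector_store, k: int = 5) -> list[str]:
    failures = []
    for case in EVAL_CASES:
        top, _ = retrieve(vector_store, case["query"], k=k)
        cited = {chunk.source_uri for chunk, _ in top}
        if case["must_cite"] not in cited:
            failures.append(case["query"])  # regression: something displaced the truth source
    return failures
```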

For a team starting on its first internal RAG system, the working pattern is to budget at least as much time for the corpus as for the engineering. Most of the engineering decisions can be revisited cheaply later. The corpus decisions are the ones that produce or block useful answers, and they are the ones the team has the most leverage to get right at the start. Skipping the corpus work and assuming the model will compensate is the most common reason RAG systems do not work, and it is the most expensive thing to fix later.