When your retrieval is slow, the answer is rarely the embeddings

A guest post from the SaaSPerform team on the five places production RAG latency actually lives, and why the embeddings are usually fine.

The teams we work with on production RAG systems usually arrive with a latency problem they have decided is the embeddings. The embeddings take some time to compute. The vector store takes some time to query. The numbers add up to more than the team's latency budget. The proposed fix is a different embedding model or a different vector database. We have lost track of how many engagements have started this way and ended somewhere completely different, with the actual fix in a part of the system the team had not considered.

The reason is straightforward. Embeddings and vector lookups are usually fast in production, often faster than the team expected once they are measured carefully. The slow part of a typical RAG query is somewhere else. Either the upstream context preparation is doing too much work. Or the downstream synthesis is waiting on a model whose latency has not actually been optimized. Or the network between the components is paying for round trips that should have been collapsed. We have seen all three. The embeddings rarely turn out to be the problem when the system is profiled honestly.

The first place to look is the upstream work. RAG systems often build a query before they embed it. The query construction includes pulling user context, fetching session state, augmenting with retrieved history, and sometimes calling another model to refine the question. Each of these is small in isolation. Together they can dominate the path. The team that profiles the upstream and discovers that two-thirds of the query latency is in context preparation, not in retrieval, has a much cheaper fix available than swapping the embedding model.
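As a concrete illustration, here is a minimal timing sketch in Python for the upstream path. The helper functions are stand-ins for whatever your system actually calls to pull context, session state, and history, and the sleeps simulate their cost; the only real point is the per-stage breakdown.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    # Record wall-clock time for one stage of query construction.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Stubs standing in for the real upstream calls; replace with your own.
def fetch_user_context(user_id):   time.sleep(0.05); return {}
def fetch_session_state(session):  time.sleep(0.03); return {}
def fetch_history(session):        time.sleep(0.08); return []
def refine_question(q, history):   time.sleep(0.20); return q   # often an LLM call

def build_query(user_id, session_id, raw_question):
    with timed("user_context"):
        ctx = fetch_user_context(user_id)
    with timed("session_state"):
        state = fetch_session_state(session_id)
    with timed("history"):
        history = fetch_history(session_id)
    with timed("refine"):
        question = refine_question(raw_question, history)
    return question

build_query("u1", "s1", "why is my dashboard slow?")
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>14}: {seconds * 1000:.0f} ms")
```

Run against real traffic rather than stubs, a breakdown like this is usually enough to show whether context preparation or retrieval is the larger share.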

The second place to look is the downstream synthesis. The model generating the final answer is usually the largest single contributor to query latency, and the team's intuition for its cost is often outdated. Models that were slow last quarter are faster now. Caching that the team set up at launch may not be in the path anymore because of a refactor. Streaming that should have been enabled is not. The synthesis layer rewards detailed measurement and is often the highest-leverage place to spend an optimization week.
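If the synthesis model supports streaming, the number worth watching is time to first token rather than total generation time. A rough sketch, with a simulated token stream standing in for whatever client your synthesis layer actually uses:

```python
import time

def synthesize_streaming(prompt, stream_fn):
    # stream_fn yields tokens from the synthesis model; here it is a
    # placeholder, not any specific provider's client.
    first_token_at = None
    start = time.perf_counter()
    chunks = []
    for token in stream_fn(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        chunks.append(token)
    total = time.perf_counter() - start
    return "".join(chunks), first_token_at, total

# Simulated model: ~300 ms to first token, then fast.
def fake_stream(prompt):
    time.sleep(0.3)
    for word in ("The", " answer", " is", " here."):
        time.sleep(0.02)
        yield word

text, ttft, total = synthesize_streaming("example question", fake_stream)
print(f"time to first token: {ttft * 1000:.0f} ms, full answer: {total * 1000:.0f} ms")
```

When the answer is streamed to the user, the perceived latency is the first number, not the second, which is why enabling streaming is often the cheapest win in the synthesis layer.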

The third place to look is the network. RAG systems often have a microservice architecture in which the embedding service, the vector database, the retrieval orchestrator, and the synthesis model all live separately. Each call between them carries a TLS handshake, a serialization cost, a queueing cost, and a tail-latency contribution. A system that makes six round trips to assemble a query response can spend more time on network than on actual computation. Collapsing the round trips, co-locating the components, and using persistent connections can produce double-digit percentage latency wins without changing any of the model choices.
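One cheap piece of this is connection reuse. As a sketch, assuming an internal HTTP embedding service (the endpoint and payload shape here are illustrative, not any particular vendor's API), a single requests.Session keeps TCP and TLS connections open across calls instead of paying the handshake on every hop:

```python
import requests

# One Session per process reuses TCP/TLS connections across calls, so the
# handshake cost is paid once rather than on every retrieval round trip.
session = requests.Session()

def embed(text: str) -> list:
    # Hypothetical internal endpoint and response shape, for illustration only.
    resp = session.post(
        "http://embedding-service.internal/embed",
        json={"text": text},
        timeout=2.0,
    )
    resp.raise_for_status()
    return resp.json()["vector"]
```

The same principle applies to the vector database and the synthesis model: whatever client you use, keep it alive for the life of the process rather than constructing it per request.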

The fourth place to look, and the one that usually surprises teams the most, is the cache. RAG systems are unusually amenable to caching because the same questions get asked repeatedly. Question-level caches, embedding caches, and chunk-level caches all have positive hit rates in real production traffic, and the cumulative effect is meaningful. Teams that have not added caching deliberately are usually running with a hit rate of zero on cache layers that exist in the platform but are not configured. Turning them on is often a one-day fix with a multi-percent latency improvement.
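A question-level cache does not need to be sophisticated to show whether caching is worth configuring properly. The sketch below uses an in-process dict purely to illustrate the key shape and the hit-rate bookkeeping; a production deployment would normally back this with Redis or the platform's own cache layer.

```python
import hashlib

# Minimal in-process question-level cache, for illustration only.
_answer_cache = {}
hits = misses = 0

def cache_key(question: str) -> str:
    # Normalise before hashing so trivially different phrasings of the same
    # question ("What is X?" vs "what is x? ") share a key.
    return hashlib.sha256(question.strip().lower().encode()).hexdigest()

def answer_with_cache(question, compute_answer):
    global hits, misses
    key = cache_key(question)
    if key in _answer_cache:
        hits += 1
        return _answer_cache[key]
    misses += 1
    result = compute_answer(question)  # the full RAG pipeline goes here
    _answer_cache[key] = result
    return result

# The second, superficially different question is served from the cache.
answer_with_cache("What is our SLA?", lambda q: f"answer to: {q}")
answer_with_cache("what is our sla? ", lambda q: f"answer to: {q}")
print(hits, misses)  # -> 1 1
```

Even a crude counter like this, run against a day of production traffic, tells you whether the repeated-question assumption holds before you invest in a proper cache tier.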

The fifth place to look, after the upstream, downstream, network, and cache have been profiled, is the embeddings. By the time the team has been through the four other surfaces, the embeddings have usually been ruled in or out cleanly. If they are the bottleneck, the team has a clear case for changing models or vector databases, and the decision is made on real measurement rather than on assumption. If they are not the bottleneck, the team has saved itself a model migration that was not going to help.

The pattern that ties this together is the same pattern that applies to other production performance work. Measure first, swap later. The instinct to swap a component before measuring it is usually wrong, because the component that is loudest in the team's mental model is rarely the component that actually dominates the latency. RAG systems are particularly prone to this because the components are new, the team's experience with the cost of each is limited, and the discourse online attributes most performance issues to whichever component the writer is most familiar with.

For a team facing slow retrieval, the working pattern is to instrument the full query path before changing anything. The upstream, the retrieval, the synthesis, the network, and the cache all need timing data on real production traffic. Once the data is in, the bottleneck is almost always obvious, and the fix is almost always cheaper than swapping the embeddings would have been.
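One way to do that is a per-request stage timer that accumulates samples across real traffic, so the comparison is between p50 and p95 per stage rather than a single anecdotal trace. A minimal sketch, with stage names chosen to mirror the surfaces above:

```python
import time
from collections import defaultdict

# Accumulates per-stage timings across many requests so percentiles, not
# one-off traces, drive the decision about where to optimize.
samples = defaultdict(list)

class StageTimer:
    def __init__(self):
        self.marks = {}
        self._last = time.perf_counter()

    def mark(self, stage):
        # Call mark() as each stage of the query path completes.
        now = time.perf_counter()
        self.marks[stage] = now - self._last
        self._last = now

    def flush(self):
        for stage, seconds in self.marks.items():
            samples[stage].append(seconds)

def report():
    for stage, values in samples.items():
        xs = sorted(values)
        p50 = xs[len(xs) // 2]
        p95 = xs[int(len(xs) * 0.95) - 1] if len(xs) >= 20 else xs[-1]
        print(f"{stage:>10}: p50={p50 * 1000:.1f} ms  p95={p95 * 1000:.1f} ms  n={len(xs)}")

# In the query handler:
#   t = StageTimer()
#   ... build query ...;  t.mark("upstream")
#   ... retrieve ...;     t.mark("retrieval")
#   ... synthesize ...;   t.mark("synthesis")
#   t.flush()
# and periodically call report() to see where the time actually goes.
```

Once a table like this exists for real production traffic, the argument about whether to swap the embedding model tends to settle itself.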


This is a guest post from the team at SaaSPerform, who run performance engineering retainers for SaaS teams dealing with database, queue, rendering, and now retrieval-layer bottlenecks at production scale.