
Your RAG Pipeline Sucks and Nobody Is Coming to Save You

llms · infrastructure · tool-design · context-engineering

Embed your docs. Chunk them. Throw them in a vector store. Retrieve the top-k. Stuff them in the prompt. Ship it.

That’s the pitch. Every RAG tutorial, every vector DB landing page, every “production-ready” template. And it’s wrong in ways the usual fixes (better chunking, rerankers, hybrid search) can’t solve. Because the architecture is the problem.

I’ve been building search systems for almost a decade. LDA and topic modeling. Lucene, Solr, Elasticsearch. Universal Sentence Encoder. Fine-tuned BERT models. I implemented embedding pipelines by hand (before LLMs existed, before Hugging Face made it a one-liner). At startups. At Fortune 100 companies. I watched the entire transformation happen from the trenches.

And then vector databases showed up with $2B in funding and mass amnesia set in.

RAG is a data pipeline. Act accordingly.

The moment you commit to embeddings, you’ve signed up for data engineering. Processing pipelines. Chunking strategies. Embedding model selection. Index management.

And backfills. God, the backfills.

Change your chunking strategy? Rerun everything. Swap embedding models? Rerun everything. Update your source documents? Rerun everything. Add metadata extraction? Rerun everything.

You’re not building a search feature. You’re operating a data pipeline. Every change to any stage forces a full reprocessing of every document. You wanted a retrieval layer. You got ETL hell.
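
To make the coupling concrete, here is a minimal sketch. The versioning scheme is my own illustration, not any particular vector store’s API: every stored vector is only valid for the exact document, chunker, and embedding model that produced it, so changing any one of them invalidates all of them.

import hashlib

CHUNKER_VERSION = "sliding-window-512-v3"   # illustrative: assume you version your chunker
EMBED_MODEL = "text-embedding-3-small"      # swap this and every key below changes

def chunk_key(doc_id: str, doc_text: str) -> str:
    # A stored vector is only valid for this exact (doc, chunker, model) triple.
    raw = f"{doc_id}:{CHUNKER_VERSION}:{EMBED_MODEL}:{doc_text}".encode()
    return hashlib.sha256(raw).hexdigest()

# Bump CHUNKER_VERSION or EMBED_MODEL and no existing key matches:
# the "incremental" update degenerates into a full backfill.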

Two black boxes doing the same job

Here’s what nobody talks about. You have an LLM that UNDERSTANDS SEMANTICS. It’s the whole point. The model comprehends meaning, context, nuance. That’s why you’re building with it.

And then you bolt on an embedding model. Another neural network that also claims to understand semantics. A smaller, dumber one. To pre-process the information before the smart one sees it.

You now have two black boxes. One that genuinely understands language, and one that produces 1536-dimensional approximations of understanding. The embedding model makes retrieval decisions (what’s relevant, what’s not) before the LLM ever gets a chance to weigh in.

Why is the dumber model making the important decisions?

RAG breaks progressive disclosure

This is the deeper problem. RAG front-loads context. You retrieve before you understand what’s needed.

Think about what happens: a user asks a question. Before the LLM processes anything, you’ve already decided what to search for, what to retrieve, how many results to return, and what to stuff into the context window. You made all these decisions with a similarity score and a prayer.

What are you even querying? The user’s raw input? The conversation history? Some reformulated version? And who decides the reformulation, another LLM call? Now you have three models involved before the actual work starts.

This violates everything I know about good tool design. Search, View, Use. Let the consumer decide what it needs, when it needs it. Don’t pre-stuff context. Don’t force decisions before they’re necessary.
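
For concreteness, a sketch of what Search, View, Use looks like when handed to the model as tools. The names and schema shape are illustrative, not any particular SDK’s format; the point is that nothing is retrieved until the model asks.

# Three narrow tools; the model decides which to call and when.
TOOLS = [
    {
        "name": "search",
        "description": "Keyword search over the corpus. Returns ids and short snippets.",
        "input_schema": {"type": "object",
                         "properties": {"query": {"type": "string"}},
                         "required": ["query"]},
    },
    {
        "name": "view",
        "description": "Fetch the full content of a single result by id.",
        "input_schema": {"type": "object",
                         "properties": {"id": {"type": "string"}},
                         "required": ["id"]},
    },
    {
        "name": "use",
        "description": "Act on a viewed document: cite it, edit it, run it.",
        "input_schema": {"type": "object",
                         "properties": {"id": {"type": "string"},
                                        "action": {"type": "string"}},
                         "required": ["id", "action"]},
    },
]
# Context stays empty until the model decides it needs something.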

RAG does the opposite. It reveals more information than required, before it’s required. And when the next model is 2x smarter and needs different context? Your pipeline breaks, because it was designed for today’s model, not tomorrow’s.

You’ve created an open-ended research problem: one you can never fully deliver on, and one that breaks every time expectations change.

BM25. Full-text search. Weighted scoring. The model decides what to search for and when.

I know. Not sexy. No pitch deck material. But hear me out.

Things in the real world are organized by semantic importance. A class name carries more signal than a function name. A function name carries more signal than a variable. A page title matters more than a paragraph buried in the footer. This hierarchy exists naturally in your data. BM25 with field-level weighting exploits it directly. No embeddings. No pipeline. No backfills.
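
A sketch of what that looks like in practice, using Whoosh (a pure-Python search library whose default scorer is BM25F). The field names, boosts, and documents are illustrative, not a recommendation:

import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import MultifieldParser
from whoosh.scoring import BM25F

# Fields mirror the natural hierarchy: titles and symbol names outrank body text.
schema = Schema(path=ID(stored=True),
                title=TEXT(stored=True),
                symbols=TEXT(stored=True),
                body=TEXT)

os.makedirs("idx", exist_ok=True)
ix = index.create_in("idx", schema)
w = ix.writer()
w.add_document(path="src/retry.py", title="retry",
               symbols="RetryPolicy backoff", body="...")
w.commit()

# Field-level weighting: a hit in title or symbols counts more than one in body.
parser = MultifieldParser(["title", "symbols", "body"], schema=ix.schema,
                          fieldboosts={"title": 4.0, "symbols": 2.0, "body": 1.0})
with ix.searcher(weighting=BM25F()) as s:
    for hit in s.search(parser.parse("retry backoff"), limit=5):
        print(hit["path"], hit.score)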

And here’s the twist.

If the model knows what to search for, the ROI of FTS over a RAG pipeline is enormous. It’s fast. It’s cheap. It retrieves amazingly well.

So how does the model know? You JIT-parse whatever you need, throw it in a small index, and let the model use it like it would use grep.

# The "pipeline"
1. Parse source on demand
2. Build lightweight FTS index
3. Give the model a search tool
4. Let it query what it needs, when it needs it

No pre-computed embeddings. No chunking decisions. No backfills. The model drives retrieval because it already understands the query. You just gave it grep with better ranking.
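
Here’s a sketch of that loop, with rank_bm25 as the in-memory index. The corpus, tokenizer, and paths are illustrative, and wiring search up as a tool for your model is the part left out:

from pathlib import Path
from rank_bm25 import BM25Okapi

# 1. Parse source on demand: here, the Python files under the current directory.
paths = list(Path(".").rglob("*.py"))
texts = [p.read_text(errors="ignore") for p in paths]

# 2. Build a lightweight index. No embeddings, no backfill:
#    rebuilding it costs about as much as re-reading the files.
bm25 = BM25Okapi([t.lower().split() for t in texts])

# 3. The search tool you hand to the model.
def search(query: str, k: int = 5) -> list[str]:
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(scores, paths), key=lambda pair: pair[0], reverse=True)
    return [str(p) for _, p in ranked[:k]]

# 4. The model formulates its own queries and reads only what it needs.
print(search("retry backoff"))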

This is the same pattern that makes Claude Code’s architecture work. Four primitives. The model decides what to read. Progressive disclosure. Context stays lean until the moment it’s needed.

“But it doesn’t scale”

The best solution to big data has always been to make the data smaller.

Partition correctly. Scope by category, by domain, by relevance tier. Nobody needs to search across a terabyte of unstructured text with a single query. If that’s your problem, it’s not a retrieval problem. It’s an information architecture problem. No amount of vector similarity will fix bad data organization.

The teams that ship working search don’t have better embeddings. They have better partitioning. They scoped the problem before they searched it.
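
Roughly, in code (partition names and paths are made up; the index-per-partition shape is the point):

from pathlib import Path
from rank_bm25 import BM25Okapi

def build_index(root: str):
    # One small index per partition instead of one giant one.
    paths = list(Path(root).rglob("*.md"))
    tokens = [p.read_text(errors="ignore").lower().split() for p in paths]
    return paths, (BM25Okapi(tokens) if tokens else None)

PARTITIONS = {scope: build_index(root) for scope, root in
              {"api-docs": "docs/api", "runbooks": "ops/runbooks"}.items()}

def search(scope: str, query: str, k: int = 5) -> list[str]:
    # Scope first, then search: the query only touches the slice of data
    # that could plausibly contain the answer.
    paths, bm25 = PARTITIONS[scope]
    if bm25 is None:
        return []
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(scores, paths), key=lambda pair: pair[0], reverse=True)
    return [str(p) for _, p in ranked[:k]]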

The stack

BM25 is thirty years old. grep is fifty. The model that knows what to search for shipped last quarter. The stack was always there. We just forgot to use it.