This glosses over a fundamental scaling problem that undermines the entire argument. The author's main example is Claude Code searching through local codebases with grep and ripgrep, then extrapolates this to claim RAG is dead for all document retrieval. That's a massive logical leap.
Grep works great when you have thousands of files on a local filesystem that you can scan in milliseconds. But most enterprise RAG use cases involve millions of documents across distributed systems. Even with 2M token context windows, you can't fit an entire enterprise knowledge base into context. The author acknowledges this briefly ("might still use hybrid search") but then continues arguing RAG is obsolete.
The bigger issue is semantic understanding. Grep does exact keyword matching. If a user searches for "revenue growth drivers" and the document discusses "factors contributing to increased sales," grep returns nothing. This is the vocabulary mismatch problem that embeddings actually solve. The author spent half the article complaining about RAG's limitations with this exact scenario (his $5.1B litigation example), then proposes grep as the solution, which would perform even worse.
Also, the claim that "agentic search" replaces RAG is misleading. Recent research shows agentic RAG systems embed agents INTO the RAG pipeline to improve retrieval; they don't replace chunking and embeddings. LlamaIndex's "agentic retrieval" still uses vector databases and hybrid search, just with smarter routing.
Context windows are impressive, but they're not magic. The article reads like someone who solved a specific problem (code search) and declared victory over a much broader domain.
Agentic retrieval is really more a form of deep research (from a product standpoint there is very little difference). The key is that LLMs > rerankers, at least when you're not at webscale where the cost differential is prohibitive.
LLMs > rerankers. Yes! I don't like rerankers. They are slow, the context window is small (4096 tokens), and they're expensive... It's better when the LLM reads the whole file rather than some top_chunks.
But couldn’t an LLM search for documents in that enterprise knowledge base just like humans do, using the same kind of queries and the same underlying search infrastructure?
I wouldn't say humans are efficient at that, so there's no reason to copy them, other than as a starting point.
Appreciate the feedback. I’m not saying grep replaces RAG. The shift is that bigger context windows let LLMs just read whole files, so you don’t need the whole chunk/embed pipeline anymore. Grep is just a quick way to filter down candidates.
From there the model can handle 100–200 full docs and jot notes into a markdown file to stay within context. That’s a very different workflow than classic RAG.
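A minimal sketch of that workflow, where `llm_summarize` is just a hypothetical stand-in for the model call:

```python
import subprocess
from pathlib import Path

def shortlist(pattern: str, root: str) -> list[str]:
    """Use ripgrep to narrow a big corpus down to candidate files."""
    out = subprocess.run(["rg", "-l", "-i", pattern, root],
                         capture_output=True, text=True)
    return out.stdout.splitlines()

def read_and_take_notes(files: list[str], question: str, notes_path: str = "notes.md") -> None:
    """Read whole files one at a time, appending condensed findings to a
    markdown scratchpad so the running context stays small."""
    notes = Path(notes_path)
    for f in files[:200]:  # the 100-200 doc budget mentioned above
        text = Path(f).read_text(errors="ignore")
        summary = llm_summarize(question, text)  # hypothetical model call
        with notes.open("a") as fh:
            fh.write(f"## {f}\n{summary}\n\n")
```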
I think the most important insight from your article, which I also felt, is that agentic search is really different. The ability to retarget a search iteratively fixes the issues of both the RAG and grep approaches - they don't need to be perfect from the start, they only need to get there after 2-10 iterations. This really changes the problem. LLMs have become so smart they can compensate for chunking and for not knowing the right word.
But on top of this I would also use AI to create semantic maps, like hierarchical structure of content, and put that table of contents in the context, let the AI explore it. This helps with information spread across documents/chapters. It provides a directory to access anything without RAG, by simply following links in a tree. Deep Research agents build this kind of schema while they operate across sources.
To explore this I built a graph MCP memory system where the agent can search both by RAG and by text matching, and when it finds the top-k nodes it can expand outward along links. Writing a node implies having the relevant nodes loaded first, and while generating the text the agent places contextual links embedded [1] like this. So simply writing a node also connects it to the graph at all the right points. This structure fits better with the kind of iterative work LLMs do.
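A toy sketch of the search-then-expand step (the node store and helper here are illustrative, not the actual MCP server):

```python
from collections import deque

# Illustrative node store: text plus outgoing links embedded at write time.
nodes = {
    "n1": {"text": "Q3 revenue grew on cloud demand [n2]", "links": ["n2"]},
    "n2": {"text": "Cloud segment pricing notes [n3]", "links": ["n3"]},
    "n3": {"text": "Vendor contract terms", "links": []},
}

def expand(seed_ids, depth=2):
    """Breadth-first walk outward from the top-k seed nodes via embedded links."""
    seen, queue = set(seed_ids), deque((i, 0) for i in seed_ids)
    while queue:
        node_id, d = queue.popleft()
        yield nodes[node_id]["text"]
        if d < depth:
            for link in nodes[node_id]["links"]:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, d + 1))

# The seeds would come from RAG or text matching; here we just hard-code one.
context = list(expand(["n1"]))
```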
That's fair, but how do you grep down to the right 100-200 documents from millions without semantic understanding? If someone asks "What's our supply chain exposure?" grep won't find documents discussing "vendor dependencies" or "sourcing risks."
You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG. And doing that intelligently means you're back to using embeddings anyway.
The workflow works great for codebases with consistent terminology. For enterprise knowledge bases with varied language and conceptual queries, grep alone can't get you to the right candidates.
the agent greps for the obvious term or terms, reads the resulting documents, discovers new terms to grep for, and the process repeats until it's satisfied it has enough info to answer the question
> You could expand grep queries with synonyms, but now you're reimplementing query expansion, which is already part of modern RAG.
in this scenario "you" are not implementing anything - the agent will do this on its own
this is based on my experience using claude code in a codebase that definitely does not have consistent terminology
it doesn't always work but it seemed like you were thinking in terms of trying to get things right in a single grep when it's actually a series of greps that are informed by the results of previous ones
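something like this loop, where the llm_* helpers are hypothetical model calls:

```python
import subprocess
from pathlib import Path

def grep_files(term: str, root: str) -> list[str]:
    out = subprocess.run(["rg", "-l", "-i", term, root],
                         capture_output=True, text=True)
    return out.stdout.splitlines()

def iterative_search(question: str, root: str, max_rounds: int = 10) -> str:
    terms = llm_propose_terms(question, findings=[])  # start with the obvious terms (hypothetical)
    findings = []
    for _ in range(max_rounds):
        # grep for the current terms, read the matching docs in full
        hits = {f for t in terms for f in grep_files(t, root)}
        texts = [Path(f).read_text(errors="ignore") for f in sorted(hits)[:20]]
        findings.append(llm_read_and_note(question, texts))  # hypothetical
        if llm_has_enough(question, findings):                # hypothetical
            break
        terms = llm_propose_terms(question, findings)         # new vocabulary discovered in the docs
    return "\n".join(findings)
```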
Classical search
Which is RAG. How you decide to take a set of documents too large for an LLM context window and narrow it down to a set that does fit is an implementation issue.
The chunk, embed, similarity search method was just a way to get a decent classical search pipeline up and running with not too much effort.
I was previously working at https://autonomy.computer, and building out a platform for autonomous products (i.e., agents) there. I started to observe a similar opportunity. We had an actor-based approach to concurrency that meant it was super cheap performance-wise to spin up a new agent. _That_ in turn meant a lot of problems could suddenly become embarrassingly parallel, and that rather than pre-computing/caching a bunch of stuff into a RAG system you could process whatever you needed in a just-in-time approach. List all the documents you've got, spawn a few thousand agents and give each a single document to process, aggregate/filter the relevant answers when they come back.
Obviously that's not the optimal approach for every use case, but there's a lot where IMO it was better. In particular I was hoping to spend more time exploring it in an enterprise context where you've got complicated sharing and permission models to take into consideration. If you have agents simply passing through the permissions of the user executing the search, whatever you get back is automatically constrained to only the things they had access to in that moment. As opposed to other approaches where you're storing a representation of data in one place, and then trying to work out the intersection of permissions from one or more other systems, and sanitise the results on the way out. Always seemed messy and fraught with problems and the risk of leaking something you shouldn't.
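Shape-wise, the fan-out looked something like this (fetch_document and llm_answer are placeholders, not our actual APIs):

```python
from concurrent.futures import ThreadPoolExecutor

def ask_doc_agent(doc_id: str, question: str, user_token: str) -> str | None:
    # Each agent reads one document with the *user's* credentials, so results
    # are automatically limited to what that user can access right now.
    doc = fetch_document(doc_id, auth=user_token)  # hypothetical permission-aware fetch
    answer = llm_answer(question, doc)             # hypothetical model call
    return answer or None                          # drop empty/irrelevant answers

def fan_out(doc_ids, question, user_token, workers=64):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = pool.map(lambda d: ask_doc_agent(d, question, user_token), doc_ids)
    return [a for a in answers if a]
```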
> Grep works great when you have thousands of files on a local filesystem that you can scan in milliseconds. But most enterprise RAG use cases involve millions of documents across distributed systems
Great point, but this grep-in-a-loop approach probably falls apart (i.e. becomes non-performant) at 1000s of docs (not millions) and 10s of simultaneous users
Why does grep in a loop fall apart? It’s expensive, sure, but LLM costs are trending toward zero. With Sonnet 4.5, we’ve seen models get better at parallelization and memory management (compacting conversations and highlighting findings).
Not to mention, unless you want to ship entire containers, you are beholden to the unknown quirks of tools on whatever system your agent happens to execute on. It's like taking something already nondeterministic and extremely risky and ceding even more control—let's all embrace chaos.
Generative AI is here to stay, but I have a feeling we will look back on this period of time in software engineering as a sort of dark age of the discipline. We've seemingly decided to abandon almost every hard won insight and practice about building robust and secure computational systems overnight. It's pathetic that this industry so easily sold itself to the illogical sway of marketers and capital.
Mostly, I agree, except that the industry (from where I'm standing) has never done much else but sell itself to marketers and capital.
Feels like saying Elasticsearch (and similar) tools are dead because we can just grep our way through things. I'd love to see more data on this.
I'm always amazed at Claude Code's ability to build context by just putting grep in a for loop.
It's pretty much the same process I would use in an unfamiliar code base. Just ctrl+f the file system till I find the right starting point.
It's mind blowing. It's so simple, elegant and... effective! Grep+glob and a lot of iterations is all we need.
We always suspected find+grep+xargs was Turing-complete, and now Claude is proving it.
That's one of the most nonsensical comments on all of Hacker News. A Markov chain could have written it.
What do you mean Turing complete? Obviously all 3 programs are running on a Turing complete machine. Xargs is a runner for other commands, obviously those commands can be Turing complete.
I haven't heard of anybody working on a _proof_ for the Turing completeness of xargs, and I think the only conference willing to publish it would be Sigbovik.
Exactly. AGI implies minimal tooling and very primitive tools.
AGI implies that a system is financially viable to let run 24 hours a day with little to no direction.
No amount of find+grep+LLM is even remotely there yet.
That's what I used to use as a human, but then I finally overcame my laziness in setting up integration between my editor and compiler (and similar) and got 'jump to definition' working.
(Well, I didn't overcome my laziness directly. I just switched from being lazy and not setting up vim and Emacs with the integrations, to trying out vscode where this was trivial or already built in.)
We're processing tenders for the construction industry - this comes with a 'free' bucket sort from the start, namely that people practically always operate only on a single tender.
Still, that single tender can be on the order of a billion tokens. Even if the LLM supported that insane context window, it's roughly 4GB that need to be moved and with current LLM prices, inference would be thousands of dollars. I detailed this a bit more at https://www.tenderstrike.com/en/blog/billion-token-tender-ra...
And that's just one (though granted, a very large) tender.
For the corpus of a larger company, you'd probably be looking at trillions of tokens.
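Back-of-envelope, assuming roughly 4 bytes of raw text per token and an illustrative $1-$3 per million input tokens (actual prices vary):

```python
# Sanity-check of the numbers above under the stated assumptions.
tokens = 1_000_000_000
bytes_per_token = 4                          # rough assumption for plain text
gb = tokens * bytes_per_token / 1e9          # ~4.0 GB to move per full read
cost_low = tokens / 1e6 * 1                  # $1 per million input tokens
cost_high = tokens / 1e6 * 3                 # $3 per million input tokens
print(f"{gb:.1f} GB, ${cost_low:,.0f}-${cost_high:,.0f} per full read")
# A trillion-token corpus scales both by 1000x: terabytes and millions of dollars.
```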
While I agree that delivering tiny, chopped up parts of context to the LLM might not be a good strategy anymore, sending thousands of ultimately irrelevant pages isn't either, and embeddings definitely give you a much superior search experience compared to (only) classic BM25 text search.
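One simple way to get that hybrid behaviour is reciprocal rank fusion over the two ranked lists; the BM25 and embedding rankers themselves are assumed to already exist:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids; k dampens the head of each list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# bm25_ranking and embedding_ranking would come from your existing indexes.
fused = rrf([bm25_ranking, embedding_ranking])[:50]
```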
I work at an AI startup, and we've explored a solution where we preprocess documents to make a short summary of each document, then provide these summaries with a tool call instruction to the bot so it can decide which document is relevant. This seems to scale to a few hundred documents of 100k-1m tokens, but then we run into issues with context window size and rot. I've thought about extending this as a tree-based structure, kind of like an LLM file system, but have other priorities at the moment.
Embeddings had some context size limitations in our case - we were looking at large technical manuals. Gemini was the first to have a 1m context window, but for some reason its embedding window is tiny. I suspect the embeddings might start to break down when there's too much information.
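For what it's worth, the summary-catalog routing from the first paragraph can be sketched like this (llm_summarize, llm_pick_docs, and llm_answer are hypothetical model calls):

```python
from pathlib import Path

def build_catalog(doc_dir: str) -> dict[str, str]:
    """Precompute one short summary per document; cache this in practice."""
    return {p.name: llm_summarize(p.read_text(errors="ignore")[:20_000])
            for p in Path(doc_dir).glob("*.txt")}

def answer(question: str, doc_dir: str) -> str:
    catalog = build_catalog(doc_dir)
    chosen = llm_pick_docs(question, catalog)        # model sees only the summaries
    full_texts = [Path(doc_dir, name).read_text(errors="ignore") for name in chosen]
    return llm_answer(question, full_texts)          # then reads the chosen docs in full
```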
Has it not dawned on the author how ironic it is to call embeddings and retrieval pipelines "a nightmare of edge cases" while talking about LLMs?
Haha! LLMs themselves are pure edge cases because they are non-deterministic. But if you add a 7-step pipeline on top of that, it's edge cases on top of edge cases.
Agentic search with a handful of basic tools (drawn from BM25, semantic search, tags, SQL, knowledge graph, and a handful of custom retrieval functions) blows the lid off RAG in my experience. The downside is it takes longer. A single “investigation” can easily use 20-30 different function calls. RAG is like a static one-shot version of this and while the results are inferior the process is also a lot faster.
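The shape of that investigation loop, with the tool implementations and llm_* calls as hypothetical placeholders:

```python
# Each value is a retrieval function defined elsewhere; all are placeholders here.
TOOLS = {
    "bm25": bm25_search,        # keyword search
    "semantic": vector_search,  # embedding search
    "sql": run_sql,             # structured lookups
    "graph": graph_neighbors,   # knowledge-graph hops
}

def investigate(question: str, max_steps: int = 30) -> str:
    scratchpad = []
    for _ in range(max_steps):
        tool, args = llm_choose_tool(question, scratchpad, list(TOOLS))  # hypothetical
        scratchpad.append((tool, args, TOOLS[tool](**args)))             # run the chosen tool
        if llm_can_answer(question, scratchpad):                         # hypothetical
            break
    return llm_answer(question, scratchpad)                              # hypothetical
```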
RAG isn't dead, RAG is just fiddly, you need to tune retrieval to the task. Also, grep is a form of RAG, it just doesn't use embeddings.
Yes, my point is that the entire RAG pipeline (ingest, chunk, embed, search with Elastic, rerank) is in decline. Grep is far simpler. It's trivial.
I don't find this surprising. We are constantly finding workarounds for technical limitations, then ditch them when the limitation no longer exists. We will probably be saying the same thing for LLMs in a few years (when a new machine learning related TLA becomes the hype).
100%. The speed of change is wild. With each new model, we end up deleting thousands of lines of code (old scaffolding we built to patch the models’ failures.)
Weird to see the use case referenced specifically code search when that's a very targeted one rather than what general purpose agents (or RAG) use cases might target.
The main use case I referenced is SEC filings search, which is quite different from code. Filings are much longer, less structured, and more complex, with tables and footnotes.
I’m sure that was your intent but why did you get bogged down talking about code?
Hmm, because Claude Code pioneered the 'grep/glob/read' paradigm, I felt the need to explain that what works well for coding files can also be applied to more complex documents.
Did you consider using words to explain that? I don’t think you pay yourself by the word.
This is a great example of a piece with enough meaningful and useful content in it that it's very clear the author had something of value to deliver, and I'm grateful for that... but enough repetitive LLM-output that I'm very annoyed by the end.
Actually, let me be specific: everything from "The Rise of Retrieval-Augmented Generation" up to "The Fundamental Limitations of RAG for Complex Documents" is good and fine as given, then from "The Emergence of Agentic Search - A New Paradigm" to "The Claude Code Insight: Why Context Changes Everything" (okay, so the tone of these generated headings is cringey but not entirely beyond the pale) is also workable. Everything else should have been cut. The last four paragraphs are embarrassing and I really want to caution non-native English speakers: you may not intuitively pick up on the associations that your reader has built with this loudly LLM prose style, but they're closer to quotidian versions of the [NYT] delusion reporting than you likely mean to associate with your ideas.
[NYT]: https://www.nytimes.com/2025/08/08/technology/ai-chatbots-de...
I wonder if something like LSP or IntelliJ's reverse index would work better for AI than RAG.
I'm not feeling it. Constantly pinging these yuge LLMs is not economical and not good for sensitive docs.
But don’t you think LLM pricing is heading toward zero? It seems to halve every six months. And on privacy, you can hope model providers won’t train on your data (but there’s no guarantee).
I don't see how it can trend to zero when none of the vendors are profitable. Uber and DoorDash et al. increased in price over time. The era of "free" LLM usage can't be permanent.
Google’s inference is profitable
Oh, it's going to be "free" alright, in the same way that most web services are today. I.e., you will pay for it with your data and attention.
The only difference is that the advertising will be much more insidious and manipulative, the data collection far easier since people are already willingly giving it up, and the business much more profitable.
I can hardly wait.
That's quite the over-generalization. RAG fundamentally is: topic -> search -> context -> output. Agents can enhance it by iterating in a loop, but what's inside the loop is not going away.
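In code form, the point is just that the agentic version wraps the same search -> context -> output core in a loop (all names here are illustrative):

```python
def rag_once(topic):
    context = search(topic)        # the retrieval step that isn't going away
    return generate(topic, context)

def rag_agentic(topic, max_iters=5):
    notes = []
    for _ in range(max_iters):
        notes.append(rag_once(refine(topic, notes)))  # same core, iterated
        if good_enough(notes):
            break
    return synthesize(topic, notes)
```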
... for this specific use case (financial documents).
These corpora have a high degree of semantic ambiguity among other tricky and difficult to alleviate issues.
Other types of text are far more amenable to RAG and some are large enough that RAG will probably be the best approach for a good while.
For example: maintenance manuals and regulation compendiums.
Why? What if LLMs could parallelize much of their reading and then summarize the findings into a markdown file, eliminating the need for complicated search?
>The winners will not be the ones who maintain the biggest vector databases, but the ones who design the smartest agents to traverse abundant context and connect meaning across documents.
So if one were building say a memory system for an AI chat bot, how would you save all the data related to a user? Mother's name, favorite meals, allergies? If not a Vector database like pinecone, then what? Just a big .txt file per user?
That is what Claude Sonnet 4.5 is doing: https://youtu.be/pidnIHdA1Y8?si=GqNEYBFyF-3Klh4-
Exactly. Just a markdown file per user. Anthropic recommends that.
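A minimal sketch of that per-user markdown memory (paths and helper names are illustrative):

```python
from pathlib import Path

MEMORY_DIR = Path("memories")

def remember(user_id: str, fact: str) -> None:
    """Append one fact as a bullet to the user's markdown memory file."""
    MEMORY_DIR.mkdir(exist_ok=True)
    with (MEMORY_DIR / f"{user_id}.md").open("a") as fh:
        fh.write(f"- {fact}\n")

def recall(user_id: str) -> str:
    """Return the whole file; it gets prepended to the chat context each turn."""
    path = MEMORY_DIR / f"{user_id}.md"
    return path.read_text() if path.exists() else ""

# remember("u42", "Mother's name: Maria; allergic to peanuts; favorite meal: ramen")
```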
Any kind of database is far too efficient for an LLM, just take all your markdown and turn it into less markdown.
grep was invented at a time when computers had very small amounts of memory, so small that you might not even be able to load a full text file. So you had tools that would edit one line at a time, or search through a text file one line at a time.
LLMs have a similar issue with their context windows. Go back to GPT-2 and you wouldn't have been able to load a text file into its memory. Slowly the memory is increasing, same as it did for the early computers.
Agree. It's a context/memory issue. Soon LLMs will have a 10M context window and they won't need to search. Most codebases are less than 10M tokens.
This reads like someone AI-generated prose to defend something they want to invest in and decry something it competes with. It does not come off as honest, written by a human, or useful to anyone outside of the specific, narrow contexts the "author" sees for the technologies mentioned.
Frankly, reading through this makes me feel as though I am a business analyst or engineering manager being presented with a project proposal from someone very worried that a competing proposal will take away their chance to shine.
As it reaches the end, I feel like I'm reading the same thing, but presented to a Buzzfeed reader.
RAG is the new US dollar, now every year someone will predict its looming death…
HAHAHA. Ok, let's call it "transformation." As I wrote: "The next decade of AI search will belong to systems that read and reason end-to-end. Retrieval isn’t dead—it’s just been demoted."
We've been asking people not to comment like this on HN. We can never know exactly how much an individual's writing is LLM-generated, and the negative consequences of a false accusation outweigh the positive consequences of a valid one.
We don't want LLM-generated content on HN, but we also don't want a substantial portion of any thread being devoted to meta-discussion about whether a post is LLM-generated, and the merits of discussing whether a post is LLM-generated, etc. This all belongs in the generic tangent category that we're explicitly trying to avoid here.
If you suspect it, please use the established approaches for reacting to inappropriate content: if it's bad content for HN, flag it; if it's a bad comment, downvote it; and if there's evidence that it's LLM-generated, email us to point it out. We'll investigate it the same way we do when there are accusations of shilling etc., and we'll take the appropriate action. This way we can cut down on repetitive, generic tangents, and unfair accusations.
I don’t mind articles that have a hint of “an AI helped write this” as long as the content is actually informationally dense and well explained. But this article is an obvious ad, has almost no interesting information or summaries or insights, and has the… weirdly chipper? tone that AI loves to glaze readers with.
How is this an ad? It's a couple thousand words about how they built something complicated that was then obsoleted.
in the same vein that a 'Behind The Scenes Look At The Making of Jurassic Park' is, in fact, an ad.
having a company name pitched at you within the first two sentences is a pretty good giveaway.
3/4 of what hits the front page is an "ad" by that standard. I don't see how you can get less promotional than a long-form piece about why your tech is obsolete. Seems just mean-spirited.
It’s because the article’s main goal is to sell me the company’s product, not inform me about RAG. It’s a zero calorie article.
> 3/4 of what hits the front page is an "ad" by that standard.
Is anyone disagreeing with that?
haha so true!
Why call it an ad? It’s not even on the company site. I only mentioned my company upfront so people get context (why we had to build a complex RAG pipeline, what kinds of documents we’re working with, and why the examples come from real production use cases).
It stands out because the flow and tone was clearly AI generated. It’s fluff, and I don’t trust it was written by a human who wasn’t hallucinating the non-company related talking points.
I'm guessing the first draft was AI. I had to re-read that part a couple of times because the flow was off. That second paragraph was completely unnecessary too, since the previous paragraph already got the point across that "context window small in 2022".
On the whole though, I still learned a lot.
Thanks! Sorry if the flow was off
There are typos in it, too. I don't think this kind of style critique is really on topic for HN.
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
https://news.ycombinator.com/newsguidelines.html
Those guidelines that you reference talk almost exclusively about annoyances on the webpage itself, not the content of the article.
I think it's fair to point out that many articles today are essentially a little bit of a human wrapper around a core of ChatGPT content.
Whether or not this was AI-generated, the tells of AI-written text are all throughout it. There are some people who have learned to write like the AI talks to them, which is really not much of an improvement over just using the AI as your word processor.
Do you agree that bickering over AI-generated vs. not AI-generated makes for dull discussion? Sliding sewing needles deep into my fingernail bed sounds more appealing than nagging over such minutiae.
It’s also dull to brush my teeth, but I still do it because it is necessary.
The problem is that HN is one of the few places left where original thoughts are the main reason people are here. Letting LLMs write articles for us here is just not all that useful or fun.
Maybe quarantining AI related articles to their own thing a la Show HN would be a good move. I know it is the predominant topic here for the moment but like there is other interesting stuff too. And articles about AI written by AI so that Google’s AI can rank it higher and show it to more AI models to train on is just gross.
I'm not the person you're replying to, but for my part I do actually like to hear when people think it sounds like it's AI-generated.
minutiae to me is the effort of loading a page and reading half a paragraph in order to determine the AI tone for myself. The new AI literature frontier has actually added value to reading the comments first on HN in a surprising twist -- saves me the trouble.
Almost as dull as being spoon-fed AI slop articles, yeah.
There's an idea - create a website which can accurately assess "Slop-o-Meter" for any link, kind of like what FakeSpot of old did for Amazon products with fake reviews.
I've tried doing this, but LLMs are shockingly bad at differentiating between their own slopware and true wetware thoughts.
It's more akin to complaining about how Google search results have gotten worse.
It certainly makes a dull discussion, but frankly we need to have it. Post-AI HN is now a checkbox on a marketing plan - like a GitHub repository is - and I’m sick of being manipulated and sold to in one of the few forums that wasn’t gamed. It’s not minutiae, it’s an overarching theme that’s enshittifying the third places. Heck even having to discuss this is ruining it (yet here I am lol).
I hate to ruin the magic for you, but HN has been part of marketing plans long before AI.
"This wasn't written by a person" isn't a tangential style critique.
It truly is unfortunate. Thankfully most people seem to have an innate immune response to this kind of RLHF slop.
Unfortunately this can't be true, otherwise it wouldn't be a product of RLHF.
Go on an average college campus, and almost anyone can tell you when an essay was written with AI vs when it wasn't. Is this a skill issue? Are better prompters able to evade that innate immune response? Probably yes. But the revulsion is innate.
Crowds can have terrible taste, even if they're made up of people with good (or at least middling) taste