My advice for building something like this: don't get hung up on a need for vector databases and embeddings.
Full text search or even grep/rg are a lot faster and cheaper to work with - no need to maintain a vector database index - and turn out to work really well if you put them in some kind of agentic tool loop.
The big benefit of semantic search was that it could handle fuzzy searching - returning results that mention dogs if someone searches for canines, for example.
Give a good LLM a search tool and it can come up with searches like "dog OR canine" on its own - and refine those queries over multiple rounds of searches.
Plus it means you don't have to solve the chunking problem!
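To make that concrete, here's a minimal sketch of the kind of loop I mean: the model proposes queries, a plain ripgrep tool runs them, and the model refines or answers. `call_llm()` is a hypothetical stand-in for whichever model and harness you're using.

```python
# Minimal agentic search loop sketch: the model proposes queries, ripgrep
# executes them, results go back to the model, and it refines or answers.
# call_llm() is a hypothetical stand-in for whatever model/harness you use.
import json
import subprocess

def rg_search(query: str, path: str = ".") -> str:
    """Run ripgrep and return matching lines, truncated to keep context small."""
    result = subprocess.run(
        ["rg", "--max-count", "5", "--ignore-case", query, path],
        capture_output=True, text=True,
    )
    return result.stdout[:4000]

def answer(question: str, max_rounds: int = 5) -> str:
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_rounds):
        # The model replies with either {"search": "<pattern>"} or {"answer": "..."}
        reply = json.loads(call_llm(transcript))  # hypothetical LLM call
        if "answer" in reply:
            return reply["answer"]
        hits = rg_search(reply["search"])
        transcript.append({"role": "user", "content": f"Search results:\n{hits}"})
    return "No answer found within the search budget."
```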
I created a small app that shows the difference between embedding-based ("semantic") and BM25 search:
http://search-sensei.s3-website-us-east-1.amazonaws.com/
(warning! It will download ~50MB of data for the model weights and ONNX runtime on first load, but should otherwise run smoothly even on a phone)
It runs a small embedding model in the browser and returns search results in "real time".
It has a few illustrative examples where semantic search returns the intended results. For example, BM25 does not understand that "j lo" or "jlo" refer to Jennifer Lopez. Similarly, embedding-based methods can better handle things like typos.
EDIT: search is performed over 1000 news articles randomly sampled from 2016 to 2024
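For a rough offline analogue of what the demo compares, here's a sketch using rank_bm25 and sentence-transformers (the model choice is just an example, and whether a small model actually captures the "jlo" alias is an empirical question, but the mechanics look like this):

```python
# BM25 only sees token overlap, while an embedding model can place "jlo"
# near "Jennifer Lopez" in vector space. Libraries and model are examples.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "Jennifer Lopez announces a new world tour",
    "City council approves funding for public swimming pools",
]
query = "jlo tour"

# Lexical: "jlo" shares no token with any document, so its BM25 contribution is zero.
bm25 = BM25Okapi([d.lower().split() for d in docs])
print("BM25 scores:", bm25.get_scores(query.lower().split()))

# Semantic: cosine similarity in embedding space can still rank doc 0 first.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)
print("Embedding similarities:", util.cos_sim(query_vec, doc_vecs))
```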
https://www.anthropic.com/engineering/contextual-retrieval
Anthropic found embeddings + BM25 (keyword search) gave the best results. (Well, after contextual summarization, and fusion, and reranking, and shoving the whole thing into an LLM...)
But sadly they didn't say how BM25 did on its own, which is the really interesting part to me.
In my own (small scale) tests with embeddings, I found that I'd be looking right at the page that contained the literal words in my query and embeddings would fail to find it... Ctrl+F wins again!
FWIW, the org decided against vector embeddings for Claude Code due in part to maintenance. See 41:05 here: https://youtu.be/IDSAMqip6ms
It would also blow up the price and latency of Claude Code if every chunk of every file had to be read by Haiku, summarized, sent to an embedding model, reindexed into a project index, and that index stored somewhere. Since there's a lot of context inherent in things like the file structure, storing the central context in CLAUDE.md is a lot simpler. I don't think their not using vector embeddings in the project space is anything other than an indication that it's hard to manage embeddings in Claude Code.
Some agents integrate with code intelligence tools which do use embeddings, right? (As well as "mechanical" solutions like LSPs, I imagine.)
I think it's just a case of "this isn't something we need to solve, other companies solve it already and then our thing can integrate with that."
Or maybe it's really just marginal gains compared with iterative grepping. I don't know. (Still amazed how well that works!)
I think your last point captures it: for various reasons (RL, the inherent structure of code) iterative grepping is unreasonably effective. Interestingly, Cursor does use embedding vectors for codebase indexing:
https://cursor.com/docs/context/codebase-indexing
Seems like sometimes Cursor has a better understanding of the vibe of my codebase than Claude Code; maybe this is part of it. Or maybe it really is just marginally important in codebase indexing. Vector DBs still have a huge benefit in less verifiable domains.
who's "the org"?
No cross encoders?
In my experience the semantic/lexical search problem is better understood as a precision/recall tradeoff. Lexical search (along with boolean operators, exact phrase matching, etc.) has very high precision at the expense of lower recall, whereas semantic search sits at a higher recall/lower precision point on the curve.
Yeah, that sounds about right to me. The most effective approach does appear to be a hybrid of embeddings and BM25, which is worth exploring if you have the capacity to do so.
For most cases though sticking with BM25 is likely to be "good enough" and a whole lot cheaper to build and run.
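If you do go hybrid, the usual glue is reciprocal rank fusion over the two ranked lists. A minimal sketch (the doc IDs below are placeholders for whatever your lexical and vector searches return):

```python
# Reciprocal rank fusion (RRF): merge a BM25 ranking and an embedding ranking
# by summing 1 / (k + rank) per document. k=60 is the commonly used constant.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]        # placeholder output of a lexical index
embedding_hits = ["doc1", "doc9", "doc3"]   # placeholder output of a vector search
print(rrf([bm25_hits, embedding_hits]))     # fused order: doc1, doc3, then the rest
```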
Depends on the app and how often you need to change your embeddings, but I run my own hybrid semantic/bm25 search on my MacBook Pro across millions of documents without too much trouble.
I recently came across a "prefer the most common synonym" problem in Google Maps while searching for a pool hall: even the literal query "billiards" returned results for swimming pools and chlorine. I wonder if some more NOTs aren't necessary. I'm interested in learning about RAG, though I'm a little behind the curve.
In my app the best lexical search approaches completely broke my agent. For my RAG system the LLM would take on average 2.1 lexical searches to get the results it needed, which wasn't terrible, but it sometimes needed up to 5 searches, which blew up user latency. Now that I have hybrid semantic + lexical search it only requires 1.1 searches per result.
The problem is that you're not using parallel tool calling or returning a search array. We do this across large data sets and don't see much of a problem. It also means you can swap algorithms on the fly. Building a BM25 index over a few thousand documents is not very expensive locally; rg and grep are free-ish. If you have information on folder contents you can let your agent decide at execution time based on its information need.
Embeddings just aren't the most interesting thing here if you're running a frontier foundation model.
Search arrays help, but parallel tool calling assumes you’ve solved two hard problems: generating diverse query variations, and verifying which result is correct. Most retrieval doesn’t have clean verification. The better approach is making search good enough that you sidestep verification as much as possible (hopefully you are only requiring the model to make a judgment call within its search array). In my case (OpenStreetMap data), lexical recall is unstable, but embeddings usually get it right if you narrow the search space enough—and a missed query is a stronger signal to the model that it’s done something wrong.
Besides, if you could reliably verify results, you’ve essentially built an RL harness—which is a lot harder to do than building an effective search system and probably worth more.
Hmm it can capture more than just single words though, e.g. meaningful phrases or paragraphs that could be written in many ways.
Alternative advice: just test and see what works best for your use case. Totally agreed that embeddings are often overkill. However, sometimes they really help. The flow is something like:
- Iterate over your docs to build eval data: hundreds of pairs of [synthetic query, correct answer]. Focus on content from the docs not general LLM knowledge.
- Kick off a few parallel evaluations of different RAG configurations to see what works best for your use case: BM25, Vector, Hybrid. You can do a second pass to tune parameters: embedding model, top k, re-ranking, etc.
I built a free system that does all this (synthetic data from docs, evals, testing various RAG configs without coding each version). https://docs.kiln.tech/docs/evaluations/evaluate-rag-accurac...
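As a sketch of what the evaluation step amounts to (independent of any particular tool), it's basically recall@k over those [synthetic query, expected doc] pairs for each retriever config; the retriever callables below stand in for whatever BM25, vector, or hybrid implementations you're testing:

```python
# Given (query, id of the doc that answers it) pairs, measure recall@k for
# each retriever config. A retriever is any callable(query, k) -> list of doc ids.

def recall_at_k(retriever, eval_pairs, k=5):
    hits = 0
    for query, expected_doc_id in eval_pairs:
        if expected_doc_id in retriever(query, k):
            hits += 1
    return hits / len(eval_pairs)

def compare(retrievers: dict, eval_pairs, k=5):
    for name, retriever in retrievers.items():
        print(f"{name}: recall@{k} = {recall_at_k(retriever, eval_pairs, k):.2f}")

# Example usage with your own implementations:
# compare({"bm25": bm25_search, "vector": vector_search, "hybrid": hybrid_search},
#         eval_pairs)  # eval_pairs = [("query text", "doc_id"), ...]
```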
That's excellent advice, the only downside being that collecting that eval data remains difficult and time-consuming.
But if you want to build truly great search that's the approach to take.
Agree totally. I'm spending half my time focused on that problem (mostly synthetic data gen with guidance), and the other half on how to optimize once it works.
At this point you could also optimize your agentic flow directly in DSPy, using a ColBERT model / RAGatouille for retrieval.
Not there yet. The biggest levers for optimization aren't in the agents yet (RAG method, embedding model, etc.).
This matches what I found building an AI app for kids. Started with embeddings because everyone said to, then ripped it out and went with simple keyword matching. The extra complexity wasn't worth it for my use case. Most of the magic comes from the LLM anyway, not the retrieval layer.
Simon have you ever given a talk or written about this sort of pragmatism? A spin on how to achieve this with Datasette is an easy thing to imagine IMO.
I did a livestream thing about building RAG against FTS search in Datasette last year: https://simonwillison.net/2024/Jun/21/search-based-rag/
No reason to try to avoid semantic search. It's dead easy to implement, works across languages to some extent, and the fuzziness is worth quite a lot.
You're realistically going to need chunks of some kind anyway to feed the LLM, and once you've got those it's just a few lines of code to get a basic persistent ChromaDB going.
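For example, a minimal sketch of a persistent Chroma setup (the collection name and chunks are placeholders, and Chroma falls back to its default embedding function unless you configure another):

```python
# A basic persistent ChromaDB setup: add your chunks once, query them later.
# Chroma embeds documents with its default embedding function unless you
# pass one; the collection name and chunks below are placeholders.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("notes")

chunks = ["First chunk of a document...", "Second chunk about canines..."]
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

results = collection.query(query_texts=["dogs"], n_results=2)
print(results["documents"])
```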
Are multiple LLM queries faster than vector search? Even with the example "dog OR canine" that leads to two LLM inference calls vs one. LLM inference is also more expensive than vector search.
In general RAG != Vector Search though. If a SQL query, grep, full text search or other does the job then by all means. But for relevance-based search, vector search shines.
Do you have a standard prompt you use for this? I have definitely seen agentic tools doing this for me, e.g., when searching the local file system, but I'm not sure if it is native behaviour for tool-using LLMs or if it is coerced via prompts.
No, I've not got a good one for this yet. I've found the modern models (or the Claude Code etc. harness) know how to do this already by default: you can ask them a question and give them a search tool and they'll start running and iterating on searches by themselves.
I built a simple Emacs package based on this idea [0]. It works surprisingly well, but I don't know how far it scales. It's likely not as frugal from a token usage perspective.
So kinda GAR - Generation-Augmented Retrieval :-)
Yes, exactly. We have our AI feature configured to use our pre-existing TypeSense integration and it's stunningly competent at figuring out exactly what search queries to use across which collections in order to find relevant results.
If this is coupled with powerful search engines beyond Elastic, then we are getting somewhere. Other non-monotonic engines that can find structural information are out there.
Perhaps SQLite with FTS5? Or even better, getting DuckDB into the party, as its ecosystem seems ripe for this type of work.
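A minimal sketch of the FTS5 route from Python's standard library, assuming the bundled SQLite was compiled with FTS5 (most builds are):

```python
# SQLite FTS5 keyword search: create a virtual table, insert documents, then
# run boolean queries ranked by SQLite's built-in bm25() (lower score = better).
import sqlite3

conn = sqlite3.connect("docs.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(title, body)")
conn.execute(
    "INSERT INTO docs VALUES (?, ?)",
    ("Pet care", "Tips for keeping your dog healthy"),
)
conn.commit()

rows = conn.execute(
    "SELECT title, bm25(docs) AS score FROM docs WHERE docs MATCH ? ORDER BY score",
    ("dog OR canine",),
).fetchall()
print(rows)
```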
Burying the lede here: your solution for avoiding vector search is either 1) offloading to the user, expecting them to remember the right terms, or 2) using an LLM to craft the search query? And having it iterate multiple times? Holy mother of inefficiency, this agentic focus is making us all brain dead.
Vector DBs and embeddings are dead simple to figure out, implement, and maintain, especially for a local RAG, which is the primary context here. If I want to find my latest tabular notes on some obscure game dealing with medical concepts, I should be able to just literally type that. It shouldn't require me remembering the medical terms, or having some local (or god forbid, remote) LLM iterate through a dozen combos.
FWIW I also think this is a matter of how well one structures their personal KB. If you follow strict metadata/structure and have coherent, logical writing, you'll have a better chance of getting results with text matching. For someone optimizing for vector-space search and minimizing the need for upfront logical structuring, it will not work out well.
My opinion on this really isn't very extreme.
Claude Code is widely regarded as the best coding agent tool right now, and it uses search, not embeddings.
I use it to answer questions about files on my computer all the time.