For the retrieval stage, we have developed a highly efficient, CPU-only-friendly text embedding model:
https://huggingface.co/MongoDB/mdbr-leaf-ir
It ranks #1 on a bunch of leaderboards for models of its size. It can be used interchangeably with the model it has been distilled from (https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1...).
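If you want to try it, here is a minimal sketch of the usual sentence-transformers flow (the query and docs are made up, and whether the model expects a specific query prompt/prefix is something to check on the model card):

    from sentence_transformers import SentenceTransformer

    # Load from the Hugging Face Hub; CPU is fine for a model this size.
    model = SentenceTransformer("MongoDB/mdbr-leaf-ir", device="cpu")

    query = "j lo first album"
    docs = [
        "Jennifer Lopez released her debut album On the 6 in 1999.",
        "BM25 is a bag-of-words ranking function based on term statistics.",
    ]

    # Encode and rank by similarity (cosine by default in recent sentence-transformers).
    scores = model.similarity(model.encode([query]), model.encode(docs))
    print(scores)  # shape (1, len(docs)); higher = more relevant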
You can see an example comparing semantic (i.e., embeddings-based) search vs BM25 vs hybrid search here: http://search-sensei.s3-website-us-east-1.amazonaws.com (warning! It will download ~50MB of data for the model weights and ONNX runtime on first load, but should otherwise run smoothly even on a phone).
This mini app illustrates the advantage of semantic search over BM25. For instance, embedding models "know" that "j lo" refers to Jennifer Lopez.
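The demo doesn't spell out how the hybrid mode fuses the two signals, but one common recipe is reciprocal rank fusion over the BM25 ranking and the embedding ranking. A rough sketch (corpus is made up, rank_bm25 is just one convenient BM25 implementation):

    from rank_bm25 import BM25Okapi            # pip install rank-bm25
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("MongoDB/mdbr-leaf-ir", device="cpu")
    docs = [
        "Jennifer Lopez released On the 6 in 1999.",
        "The Hustlers soundtrack features several pop artists.",
        "BM25 ranks documents by term-frequency statistics.",
    ]
    query = "j lo first album"

    # Lexical ranking: BM25 over whitespace tokens.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_rank = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])

    # Semantic ranking: cosine similarity of embeddings.
    sem_scores = model.similarity(model.encode([query]), model.encode(docs))[0]
    sem_rank = sorted(range(len(docs)), key=lambda i: -float(sem_scores[i]))

    # Reciprocal rank fusion: each ranking contributes 1 / (k + rank).
    k, fused = 60, [0.0] * len(docs)
    for ranking in (bm25_rank, sem_rank):
        for rank, i in enumerate(ranking):
            fused[i] += 1.0 / (k + rank + 1)
    print(sorted(range(len(docs)), key=lambda i: -fused[i]))  # hybrid order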
We have also published the recipe to train this type of model, in case you're interested in doing so; we show that it can be done on relatively modest hardware, and that training data is easy to obtain: https://arxiv.org/abs/2509.12539
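For a rough idea of what embedding distillation looks like in general (this is a generic sketch, not the exact recipe from the paper; the model ids are placeholders, and it assumes teacher and student output the same embedding dimension):

    import torch
    from sentence_transformers import SentenceTransformer

    teacher = SentenceTransformer("teacher-model-id", device="cpu")  # frozen teacher
    student = SentenceTransformer("student-model-id", device="cpu")  # small model being trained
    optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

    texts = ["any pile of unlabeled text can serve as distillation data"]

    # Teacher embeddings are the regression targets; the teacher is never updated.
    with torch.no_grad():
        targets = teacher.encode(texts, convert_to_tensor=True)

    # One training step: push the student's embeddings toward the teacher's (MSE).
    features = student.tokenize(texts)
    student_emb = student(features)["sentence_embedding"]
    loss = torch.nn.functional.mse_loss(student_emb, targets)
    loss.backward()
    optimizer.step()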
Hmmm. I recently created https://github.com/rcarmo/asterisk-embedding-model, need to look at this since I had very limited training resources.
How does performance (embedding speed and recall) compare to minish / model2vec static word embeddings?
I interacted with the authors of these models quite a bit!
These are very interesting models.
The tradeoff here is that you get even faster inference, but lose on retrieval accuracy [0].
Specifically, inference will be faster because essentially you are only doing tokenization + a lookup table + an average. So despite the fact that their largest model is 32M params, you can expect inference speeds to be higher than ours, which is 23M params but transformer-based.
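By "tokenization + a lookup table + an average" I mean roughly this (toy vocabulary and random vectors, just to show the shape of the idea; real static models ship a proper tokenizer and a much larger table):

    import numpy as np

    # Toy static-embedding table: one fixed vector per token.
    vocab = {"jennifer": 0, "lopez": 1, "sings": 2}
    table = np.random.default_rng(0).normal(size=(len(vocab), 8)).astype(np.float32)

    def embed(text: str) -> np.ndarray:
        ids = [vocab[t] for t in text.lower().split() if t in vocab]
        return table[ids].mean(axis=0)  # lookup + average; no attention layers to run

    print(embed("Jennifer Lopez sings"))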
I am not sure about typical inference speeds on a CPU for their models, but with ours you can expect to do ~22 docs per second, and ~120 queries per second on a standard 2vCPU server.
As far as retrieval accuracy goes, on BEIR we score 53.55, all-MiniLM-L12-v2 (a widely adopted compact text embedding model) scores 42.69, while potion-8M scores 30.43.
I can't find their larger models on the leaderboard, but you can generally get an idea of the relative strength of different embedding models here: https://huggingface.co/spaces/mteb/leaderboard
If you want to run them on a CPU it may make sense to filter for smaller models (e.g., <100M params). On the other hand, our models achieve higher retrieval scores.
[0] "accuracy" in layman terms, not in accuracy vs recall terms. The correct word here would be "effectiveness".
And honestly, BM25 has been the best approach in many of the projects we've deployed.