Don't use all-MiniLM-L6-v2 for new vector embeddings datasets.
Yes, it's the open-weights embedding model used in all the tutorials, and it was the most pragmatic model to use in sentence-transformers when vector stores were in their infancy. But it's old: it doesn't incorporate the newest advances in architectures and training-data pipelines, and its context length of 512 is low when newer embedding models handle 2k+ with even more efficient tokenizers.
For open weights, I would recommend EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m) instead, which has incredible benchmarks and a 2k context window: although it's larger and slower to encode, the payoff is worth it. For a compromise, bge-base-en-v1.5 (https://huggingface.co/BAAI/bge-base-en-v1.5) or nomic-embed-text-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) are also good.
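If it helps, swapping in any of these is basically a one-line change in sentence-transformers. A minimal sketch (the sentences are placeholders, and nomic additionally needs trust_remote_code=True):

```python
from sentence_transformers import SentenceTransformer

# Load one of the recommended models; at the API level they are all
# drop-in replacements for all-MiniLM-L6-v2.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

sentences = ["a quick brown fox", "an embedding model benchmark"]
embeddings = model.encode(sentences, normalize_embeddings=True)  # unit vectors, cosine-ready
print(embeddings.shape)  # (2, 768) for bge-base-en-v1.5
```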
I am partial to https://huggingface.co/Qwen/Qwen3-Embedding-0.6B nowadays.
Open weights, multilingual, 32k context.
Also matryoshka and the ability to guide matches by using prefix instructions on the query.
I have ~50 million sentences from English Project Gutenberg novels embedded with this.
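For anyone curious, the prefix instructions and Matryoshka truncation look roughly like this with sentence-transformers (a minimal sketch following the model card, not my exact pipeline; the query/document strings are made up):

```python
from sentence_transformers import SentenceTransformer

# truncate_dim leans on the Matryoshka property to shrink vectors below the native size.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", truncate_dim=256)

queries = ["Which scene introduces the antagonist?"]            # hypothetical query
documents = ["The stranger appeared at the inn at midnight."]   # hypothetical passage

# prompt_name="query" prepends the model's built-in query instruction,
# which is how you guide matches from the query side.
query_emb = model.encode(queries, prompt_name="query")
doc_emb = model.encode(documents)

print(model.similarity(query_emb, doc_emb))
```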
Why would you do that? I'd love to know more.
The larger project is to allow analyzing stories for developmental editing.
Back in June and August I wrote some LLM-assisted blog posts about a few of the experiments.
They are here: sjsteiner.substack.com
What are you using those embeddings for, if you don't mind me asking? I'd love to know more about the workflow and what the prefix instructions look like.
It's junk compared to BGE M3 on my retrieval tasks
It's a shame EmbeddingGemma is under the shonky Gemma license. I'll be honest: I don't remember what was shonky about it, but that in itself is a problem, because now I have to care about it, read it, and maybe even get legal advice before I build anything interesting on top of it!
(Just took a look and it has the problem that it forbids certain "restricted uses" that are listed in another document which it says it "is hereby incorporated by reference into this Agreement" - in other words Google could at any point in the future decide that the thing you are building is now a restricted use and ban you from continuing to use Gemma.)
For embedding use cases anyway, the issues with the Gemma license should be less significant.
I am trying sentence-transformers/multi-qa-MiniLM-L6-cos-v1 for deploying a lightweight transformer on a CPU machine; its output dimension is 384. I want to keep the dimension as low as possible. nomic-embed-text offers lower dimensions, as low as 64. I will need to test on my dataset and will come back with the results.
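For the lower dimensions, the plan is to use nomic's Matryoshka support via truncate_dim in sentence-transformers. A rough sketch of the loading code, not yet validated on my data (note that nomic expects the search_query:/search_document: prefixes):

```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the leading 64 Matryoshka dimensions;
# re-normalize after truncation so cosine similarity still behaves.
model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",
    trust_remote_code=True,
    truncate_dim=64,
)

docs = ["search_document: Example passage to index."]            # hypothetical text
query = ["search_query: example question about the passage"]

doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
print(doc_emb.shape)  # (1, 64)
```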
https://agentset.ai/leaderboard/embeddings is a good rundown of other open-source embedding models.
Can someone explain what's technically better about the recent embedding models? Has there been a big change in architecture, are they lighter on memory, or can they handle longer context because of improved training?
Great comment. For what it's worth, really think about your vectors before creating them! Any model can be a vector model; you just use the final hidden states. With that in mind, think about your corpus and the model's latent space and try to pair them appropriately. For instance, I vectorize and search network data using a model trained on coding, systems, data, etc.
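To make the "final hidden states" point concrete, here's an illustrative sketch with a hypothetical code/systems-flavored model (mean-pooling over non-padding tokens; not my actual setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any encoder can emit vectors: mean-pool the final hidden states over real tokens.
name = "microsoft/codebert-base"  # hypothetical pick of a code/systems model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

texts = ["tcp retransmission timeout on eth0", "dns lookup failure for internal zone"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state              # (batch, seq_len, dim)

mask = batch["attention_mask"].unsqueeze(-1).float()       # zero out padding positions
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling
embeddings = torch.nn.functional.normalize(embeddings, dim=1)
print(embeddings.shape)
```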
How do the commercial embedding models compare against each other? E.g. Cohere vs. OpenAI small vs. OpenAI large, etc.?
I have trouble navigating this space as there's so much choice, and I don't know exactly how to "benchmark" an embedding model for my use cases.
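One pragmatic answer to the benchmarking part: public leaderboards rank models on generic tasks, but a small labeled set of your own queries and the passages that should answer them gets you further. A hedged sketch with open models (the pairs and model names are placeholders; the same recall@k loop works for Cohere/OpenAI if you swap the encode calls for their API calls):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder eval set: each query is paired with the passage that should answer it.
pairs = [
    ("reset a forgotten password", "To reset your password, open Settings > Account ..."),
    ("export data as csv", "Use the Export button on the Reports page to download a CSV ..."),
]
queries = [q for q, _ in pairs]
corpus = [d for _, d in pairs]

def recall_at_k(model_name: str, k: int = 1) -> float:
    model = SentenceTransformer(model_name)
    q_emb = model.encode(queries, normalize_embeddings=True)
    d_emb = model.encode(corpus, normalize_embeddings=True)
    scores = q_emb @ d_emb.T                    # cosine similarity matrix
    hits = 0
    for i in range(len(queries)):
        top_k = np.argsort(-scores[i])[:k]      # indices of the k most similar passages
        hits += int(i in top_k)                 # passage i is the labeled answer for query i
    return hits / len(queries)

for name in ["sentence-transformers/all-MiniLM-L6-v2", "BAAI/bge-base-en-v1.5"]:
    print(name, recall_at_k(name))
```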
One thing that's still compelling about all-MiniLM is that it's feasible to use it client-side. IIRC it's a 70MB download, versus 300MB for EmbeddingGemma (or perhaps it was 700MB?).
Are there any solid models that can be downloaded client-side in less than 100MB?
This is the smallest model in the top 100 of HF's MTEB Leaderboard: https://huggingface.co/Mihaiii/Ivysaur
Never used it, can't vouch for it. But it's under 100 MB. The model it's based on, gte-tiny, is only 46 MB.
For something under 100 MB, this is probably the strongest option right now.
Yeah, this. There are much better open-weights models out there...
I tried out EmbeddingGemma a few weeks back in A/B testing against nomic-embed-text-v1. I got way better results out of the nomic model. It runs fine on CPU as well.