The first thing that comes to mind when reading “custom tokenizer” and “slice off the embedding layers” is that this sounds very much like pre-training from scratch, for which 2GB is far from enough.
Assuming you do get the data, though, for a model at the sizes you’re evaluating you’re most likely looking at weeks on a Colab A100-40GB.
My recommendation would be to approach this with a smaller model and a different training method that doesn’t involve a new tokenizer or new embedding layers, because that’s what’s causing the cost × time to balloon beyond feasibility.
Yeah... I'm far from an expert on state-of-the-art ML, but it feels like a new embedding would invalidate any of the layers you keep. Taking off a late layer makes sense to me, like in cases where you want to use an LLM with a different kind of output head for scoring or something like that, because the basic "understanding" layers are still operating in the same numerical space - they're still producing the same "concepts", which are just used in a different way, like applying a different algorithm to the same data structure. But if you have a brand-new embedding, then you're taking the bottom layer off, and everything else is based on those dimensions. I suppose it's possible that this "just works", in that there's enough language-agnostic structure in the intermediate layers that the model can sort of self-heal over the initial embeddings... but that intuitively seems kind of incredible to me. A transformation over vectors from a completely different basis space feels vanishingly unlikely to do anything useful. And doubly so given that we're talking about a low-resource language, which might be more likely to have unusual grammatical or linguistic quirks that self-attention may not know how to handle.
Thank you! I have thought about initializing the new embeddings based on equivalent tokens in the old ones (e.g. by translating a token to English and finding the closest old token), but this is all getting convoluted.
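Something like this is what I had in mind, as a rough sketch - the `translate` helper (token to English gloss) is a placeholder I'd still have to supply, and none of this is tested:

```python
# Rough sketch of seeding new embeddings from translated equivalents.
# `translate` is a hypothetical token -> English gloss function I'd have to supply.
import torch

def init_new_embeddings(old_tok, new_tok, old_emb: torch.Tensor, translate) -> torch.Tensor:
    """old_emb: (old_vocab_size, dim) embedding matrix of the original model."""
    dim = old_emb.size(1)
    # Fallback for tokens we can't map: the mean of the old embedding rows.
    new_emb = old_emb.mean(dim=0).expand(len(new_tok), dim).clone()
    for token, new_id in new_tok.get_vocab().items():
        gloss = translate(token)                                 # placeholder translator
        old_ids = old_tok.encode(gloss, add_special_tokens=False)
        if old_ids:                                              # average the old token pieces
            new_emb[new_id] = old_emb[old_ids].mean(dim=0)
    return new_emb
```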
A new tokenizer and embeddings will probably be required anyway, since the language is practically missing from any model worth playing with, but at that point isn't simply creating a small specialized model from scratch perhaps a better bet than trying to graft it onto a big ready-made model?
Here are a couple of quick sanity checks before you embark on the real thing:
- Tokenize your entire corpus with a few off-the-shelf multilingual tokenizers like Llama, Qwen and Gemma and calculate the ratio of characters to tokens. The higher the better, ideally in the 3-5 range (see the first sketch after this list).
- Manually produce or select sentences that are similar in meaning but not in surface form (embedding models also leverage graphemic overlap, not just semantic similarity), and then check whether the similar sentences consistently show higher cosine similarity than the dissimilar ones. This is for embedding models like XLM-RoBERTa rather than LLMs, but it has similar insight potential (see the second sketch after this list).
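For the first check, something along these lines - the model ids and the corpus path are just examples of what I'd try, and some of them may require access approval:

```python
# Characters-per-token check: how well do off-the-shelf multilingual tokenizers
# compress your corpus? Model ids and file path are assumptions, swap in your own.
from transformers import AutoTokenizer

with open("corpus.txt", encoding="utf-8") as f:   # hypothetical path to your raw text
    text = f.read()

for name in ["meta-llama/Llama-3.1-8B", "Qwen/Qwen2.5-0.5B", "google/gemma-2-2b"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    print(f"{name}: {len(text) / n_tokens:.2f} chars/token")    # roughly 3-5 is healthy
```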
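And for the second check, with a multilingual embedding model - the model name and the sentence pairs below are placeholders you'd replace with your own:

```python
# Paraphrase check: similar-meaning pairs should score consistently higher cosine
# similarity than unrelated pairs. Model name and example pairs are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # XLM-R based

similar_pairs = [("sentence A", "same meaning as A, different wording")]   # fill in
dissimilar_pairs = [("sentence A", "unrelated sentence B")]                # fill in

def mean_cos(pairs):
    a = model.encode([p for p, _ in pairs], convert_to_tensor=True)
    b = model.encode([q for _, q in pairs], convert_to_tensor=True)
    return util.cos_sim(a, b).diagonal().mean().item()

print("similar pairs:   ", mean_cos(similar_pairs))
print("dissimilar pairs:", mean_cos(dissimilar_pairs))   # want a clear, consistent gap
```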
If both of these tests are promising, then you likely don’t need a custom tokenizer or custom embeddings.
Personally, if I were you, I would just take Qwen 0.6B Base (not Instruct, since you want text completion) and continue pretraining it on the data that you have. It is very likely to work decently out of the box.
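A bare-bones version of that continued-pretraining run, just to show the shape of it - the model id, file path and hyperparameters are assumptions you'd tune for your GPU and corpus, not a tested recipe:

```python
# Continued pretraining of a small base model on a plain-text corpus.
# Model id, corpus path and hyperparameters below are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen3-0.6B-Base"          # assumed HF id for "Qwen 0.6B Base"
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

raw = load_dataset("text", data_files={"train": "corpus.txt"})   # hypothetical path
raw = raw.filter(lambda x: x["text"].strip())                    # drop empty lines

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=1024)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen-continued-pretrain",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
        save_strategy="epoch",
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM labels
)
trainer.train()
```

For a real run you'd pack sequences into fixed-length blocks instead of truncating per line, but this shows the overall shape of the job.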