The first thing that comes to mind when reading “custom tokenizer” and “slice off the embedding layers” is that this sounds very much like pre-training from scratch, for which 2GB is far from enough.
Assuming you do get the data, though, for a model at the sizes you’re evaluating you’re most likely looking at weeks on a Colab A100-40GB.
My recommendation would be to approach this with a smaller model and a different training method that doesn’t involve a new tokenizer or new embedding layers, because that’s what’s causing the cost × time to balloon beyond feasibility.
Yeah... I'm far from an expert on state-of-the-art ML, but it feels like a new embedding would invalidate any of the layers you keep. Taking off a late layer makes sense to me, like in cases where you want to use an LLM with a different kind of output head for scoring or something like that, because the basic "understanding" layers are still operating in the same numerical space - they're still producing the same "concepts", which are just used in a different way, like applying a different algorithm to the same data structure. But if you have a brand-new embedding, then you're taking the bottom layer off, and everything else is based on those dimensions. I suppose it's possible that this "just works", in that there's enough language-agnostic structure in the intermediate layers that the model can sort of self-heal over the initial embeddings... but that intuitively seems kind of incredible to me. A transformation over vectors from a completely different basis space feels vanishingly unlikely to do anything useful. And doubly so given that we're talking about a low-resource language, which might be more likely to have unusual grammatical or linguistic quirks that self-attention may not know how to handle.
Thank you! I have thought about initializing the new embeddings based on equivalent tokens in the old ones (e.g. by translating a token to English and finding the closest old token), but this is all getting convoluted.
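Something like this is what I had in mind, as a rough sketch - the `translate` helper (token to English gloss) is a placeholder I'd still have to supply, and none of this is tested:

```python
# Rough sketch of seeding new embeddings from translated equivalents.
# `translate` is a hypothetical token -> English gloss function I'd have to supply.
import torch

def init_new_embeddings(old_tok, new_tok, old_emb: torch.Tensor, translate) -> torch.Tensor:
    """old_emb: (old_vocab_size, dim) embedding matrix of the original model."""
    dim = old_emb.size(1)
    # Fallback for tokens we can't map: the mean of the old embedding rows.
    new_emb = old_emb.mean(dim=0).expand(len(new_tok), dim).clone()
    for token, new_id in new_tok.get_vocab().items():
        gloss = translate(token)                                 # placeholder translator
        old_ids = old_tok.encode(gloss, add_special_tokens=False)
        if old_ids:                                              # average the old token pieces
            new_emb[new_id] = old_emb[old_ids].mean(dim=0)
    return new_emb
```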
A new tokenizer and embeddings will probably be required anyway, since the language is practically missing from any model worth playing with, but at that point isn't simply creating a small specialized model from scratch perhaps a better bet than trying to graft it onto a big ready-made model?
Here are a couple of quick sanity checks before you embark on the real thing:
- Tokenize your entire corpus with a few off-the-shelf multilingual tokenizers like Llama, Qwen and Gemma and calculate the ratio of characters to tokens. The higher the better, ideally in the 3-5 range (see the first sketch after this list).
- Manually produce or select sentences that are similar in meaning but not in surface form (embedding models also leverage graphemic overlap, not just semantic similarity), and then check whether the similar sentences consistently show higher cosine similarity than the dissimilar ones. This is for embedding models like XLM-RoBERTa rather than LLMs, but it has similar insight potential (see the second sketch after this list).
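For the first check, something along these lines - the model ids and the corpus path are just examples of what I'd try, and some of them may require access approval:

```python
# Characters-per-token check: how well do off-the-shelf multilingual tokenizers
# compress your corpus? Model ids and file path are assumptions, swap in your own.
from transformers import AutoTokenizer

with open("corpus.txt", encoding="utf-8") as f:   # hypothetical path to your raw text
    text = f.read()

for name in ["meta-llama/Llama-3.1-8B", "Qwen/Qwen2.5-0.5B", "google/gemma-2-2b"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    print(f"{name}: {len(text) / n_tokens:.2f} chars/token")    # roughly 3-5 is healthy
```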
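And for the second check, with a multilingual embedding model - the model name and the sentence pairs below are placeholders you'd replace with your own:

```python
# Paraphrase check: similar-meaning pairs should score consistently higher cosine
# similarity than unrelated pairs. Model name and example pairs are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # XLM-R based

similar_pairs = [("sentence A", "same meaning as A, different wording")]   # fill in
dissimilar_pairs = [("sentence A", "unrelated sentence B")]                # fill in

def mean_cos(pairs):
    a = model.encode([p for p, _ in pairs], convert_to_tensor=True)
    b = model.encode([q for _, q in pairs], convert_to_tensor=True)
    return util.cos_sim(a, b).diagonal().mean().item()

print("similar pairs:   ", mean_cos(similar_pairs))
print("dissimilar pairs:", mean_cos(dissimilar_pairs))   # want a clear, consistent gap
```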
If both of these tests are promising, then you likely don’t need a custom tokenizer or custom embeddings.
Personally, if I were you, I would just take Qwen 0.6B Base (not Instruct, since you want text completion) and continue pretraining it on the data that you have. It is very likely to work decently out of the box.
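A bare-bones version of that continued-pretraining run, just to show the shape of it - the model id, file path and hyperparameters are assumptions you'd tune for your GPU and corpus, not a tested recipe:

```python
# Continued pretraining of a small base model on a plain-text corpus.
# Model id, corpus path and hyperparameters below are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen3-0.6B-Base"          # assumed HF id for "Qwen 0.6B Base"
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

raw = load_dataset("text", data_files={"train": "corpus.txt"})   # hypothetical path
raw = raw.filter(lambda x: x["text"].strip())                    # drop empty lines

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=1024)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen-continued-pretrain",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
        save_strategy="epoch",
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM labels
)
trainer.train()
```

For a real run you'd pack sequences into fixed-length blocks instead of truncating per line, but this shows the overall shape of the job.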