Show HN: Fine-tuned Llama 3.2 3B to match 70B models for local transcripts

I wrote a small local tool to transcribe audio notes (Whisper/Parakeet). Code: https://github.com/bilawalriaz/lazy-notes

I wanted to process raw transcripts locally without OpenRouter. Llama 3.2 3B with a prompt was decent but incomplete, so I tried SFT. I fine-tuned Llama 3.2 3B to clean/analyze dictation and emit structured JSON (title, tags, entities, dates, actions).

Data: 13 real memos → Kimi K2 gold JSON → ~40k synthetic + gold; keys canonicalized. Chutes.ai (5k req/day).
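A minimal sketch of what "keys canonicalized" could look like, assuming the goal is that every training target uses the same key names in the same order. The alias table and the `canonicalize` / `to_training_target` helpers are hypothetical illustrations, not taken from the repo; only the five field names come from the post.

```python
import json

# Hypothetical alias table: the post does not list the real key variants.
KEY_ALIASES = {
    "action_items": "actions",
    "todos": "actions",
    "keywords": "tags",
    "named_entities": "entities",
}

# Field names from the post, in a fixed (assumed) output order.
CANONICAL_ORDER = ["title", "tags", "entities", "dates", "actions"]


def canonicalize(record: dict) -> dict:
    """Normalize key spelling, map aliases, and emit keys in a stable order."""
    renamed = {}
    for key, value in record.items():
        norm = key.strip().lower().replace(" ", "_").replace("-", "_")
        norm = KEY_ALIASES.get(norm, norm)
        # If two source keys collapse to the same name, the later one wins.
        renamed[norm] = value
    # Known fields first in fixed order, then any extras alphabetically.
    ordered = {k: renamed[k] for k in CANONICAL_ORDER if k in renamed}
    for k in sorted(renamed):
        if k not in ordered:
            ordered[k] = renamed[k]
    return ordered


def to_training_target(record: dict) -> str:
    """Serialize a canonicalized record as the completion string."""
    return json.dumps(canonicalize(record), ensure_ascii=False)
```

For example, `to_training_target({"Action Items": ["call Sam"], "Title": "Standup"})` puts `title` before `actions` regardless of the order the teacher model produced.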

Training: RTX 4090 24GB, ~4h, LoRA (r=128, α=128, dropout=0.05), max seq 2048, bs=16, lr=5e-5, cosine, Unsloth. On 2070 Super 8GB it was ~8h.
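To make two of those numbers concrete, here is a rough sketch of the standard LoRA scaling factor (α / r) and a cosine learning-rate decay from the peak of 5e-5. The step count, the absence of warmup, and the decay floor of 0 are assumptions for illustration; the post does not state them, and the actual run used Unsloth's scheduler rather than this function.

```python
import math

PEAK_LR = 5e-5                  # from the post
LORA_R, LORA_ALPHA = 128, 128   # from the post


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling: the adapter update is multiplied by alpha / r."""
    return alpha / r


def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr, with optional linear warmup.

    warmup_steps and min_lr are assumed defaults, not values from the post.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    print("LoRA scale:", lora_scale(LORA_ALPHA, LORA_R))
    for s in (0, total // 4, total // 2, total):
        print(s, cosine_lr(s, total))
```

With r = α the adapter update is applied at scale 1.0, so changing r alone would also change the effective update size unless α is adjusted with it.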

Inference: merged to GGUF, Q4_K_M (llama.cpp), runs in LM Studio.

Evals (100-sample, scored by GLM 4.5 FP8): overall 5.35 (base 3B) → 8.55 (fine-tuned); completeness 4.12 → 7.62; factual 5.24 → 8.57.
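For a quick read of those numbers, this small sketch computes the absolute and relative gain per metric. It uses only the scores reported above; the `gains` helper is just an illustration and is not part of the eval scripts.

```python
# Scores as reported above: (base 3B, fine-tuned), judged by GLM 4.5 FP8.
SCORES = {
    "overall": (5.35, 8.55),
    "completeness": (4.12, 7.62),
    "factual": (5.24, 8.57),
}


def gains(scores: dict) -> dict:
    """Return absolute delta and relative improvement (%) for each metric."""
    out = {}
    for metric, (base, tuned) in scores.items():
        delta = tuned - base
        out[metric] = {
            "delta": round(delta, 2),
            "relative_pct": round(100 * delta / base, 1),
        }
    return out


for metric, g in gains(SCORES).items():
    print(f"{metric:>12}: +{g['delta']:.2f} ({g['relative_pct']:.1f}% over base)")
```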

Head-to-head (10 samples): ~8.40 vs Hermes-70B 8.18, Mistral-Small-24B 7.90, Gemma-3-12B 7.76, Qwen3-14B 7.62. Teacher Kimi K2 ~8.82.

Why: task specialization + JSON canonicalization reduces variance; the model learns the exact structure/fields.

Lessons: train on completions only; synthetic is fine for narrow tasks; Llama is straightforward to train. Dataset pipeline + training script + evals: https://github.com/bilawalriaz/local-notes-transcribe-llm
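"Train on completions only" can be sketched as label masking: prompt tokens stay in the input but get the ignore label, so the loss only covers the JSON answer and the end-of-sequence token. This is a library-free illustration over plain token-id lists; `build_example` is a hypothetical helper, and the real training script may do this differently.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prompt_ids, completion_ids, eos_id, max_len=2048):
    """Build input_ids/labels so that only the completion contributes to the loss.

    max_len=2048 mirrors the max sequence length mentioned in the post.
    """
    input_ids = list(prompt_ids) + list(completion_ids) + [eos_id]
    # Prompt positions are masked; completion and EOS stay trainable so the
    # model also learns when to stop.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids) + [eos_id]
    # Simple right-truncation; a long prompt can push the completion out entirely.
    return {"input_ids": input_ids[:max_len], "labels": labels[:max_len]}


if __name__ == "__main__":
    ex = build_example(prompt_ids=[11, 12, 13], completion_ids=[21, 22], eos_id=2)
    print(ex["input_ids"])
    print(ex["labels"])
```

Note that the EOS token is deliberately left unmasked here; masking it along with the prompt is an easy way to end up with a model that never stops generating.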

bilawal.net ・ 29 points ・ phantompeace ・ 3 days ago ・ 9 comments

syntaxing ・ 3 days ago

This is amazing. I’ve been having this problem with live STT (mainly for voice assistants). I’m curious whether your model + Whisper tiny would outperform Whisper small or even medium. I’ve been having issues where even Fast Whisper small takes too long.

Also bummed that a purely non-thinking Qwen3-1.7B hasn’t been released. Otherwise, I’m curious about “how low can you go”.

phantompeace ・ 3 days ago

What hardware are you running? Parakeet runs on NVIDIA and Mac and it’s way faster than Whisper. And I’ve had issues with training Qwen3 (and even Qwen2.5, but I think I was masking stop tokens wrong). I’ve had success with Gemma 3 though, and they have some really small models (270M and 1B). Maybe 270M for just transcript cleaning? I wonder if the 1B model can handle the transcript analysis…
- syntaxing ・ 2 days ago
  
  I’m running on a Jetson Orin Nano. Do you know if there is a Parakeet + Wyoming repo?
  
  phantompeace ・ 2 days ago
  
  Unfortunately I have zero experience with the Jetson family, and Parakeet itself is a pain to get running IMO - I took the easy option and used the ONNX version
  
  jiehong ・ 2 days ago
  
  Try the inkvoice app, for example. It can run Parakeet with a simple click.

ruben81ad ・ 3 days ago

Thanks for sharing. It is impressive to see how fine-tuning makes such a huge difference. Whether you use a large LLM or fine-tune a small one comes down to cost.

As a noob in fine-tuning, a question: how did you decide the values of the hyperparameters?

phantompeace ・ 3 days ago

Thank you for reading!
Cost is a big factor - I really want to make models that can run on average CPU-only machines so most of the world can benefit, rather than needing expensive GPUs or an internet connection + subscriptions. Another big factor is privacy (you don't need to trust a 3rd party with your inputs).
As for the hyperparameters, pure brute-force trial and error. It feels more like a dark art than a science. You roll the dice and then start tweaking things until the loss looks like it's dropping nicely and consistently, and the checkpoints are starting to output things resembling what we want. I sometimes run inference on checkpoints just to get a feel for whether the model is learning (regardless of loss).

stuaxo ・ 2 days ago

I can't trust Llama3, because I have no idea what they did to the model to make it "less woke".