Phi-4 Bug Fixes

unsloth.ai

92 points

danielhanchen

7 hours ago


38 comments

danielhanchen 7 hours ago

Hey HN family! I found a few bugs in Phi-4 - Microsoft's latest MIT-licensed LLM, said to be on par with GPT-4o mini:

1. The end-of-sentence (EOS) token should be <|im_end|>, not <|endoftext|>

2. The chat template should not auto-add an assistant prompt

3. The padding token should not be the EOS token but <|dummy_87|> - see the sketch of fixes 1 and 3 below
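
For anyone patching a local checkpoint by hand, here's a minimal sketch of fixes 1 and 3 using Hugging Face transformers (our uploads already have all the fixes baked in, so treat this as an illustration rather than the exact patch):

  from transformers import AutoTokenizer

  # Load the upstream Phi-4 tokenizer
  tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

  # Fix 1: generation should stop at <|im_end|>, not <|endoftext|>
  tokenizer.eos_token = "<|im_end|>"

  # Fix 3: padding must not alias EOS, otherwise EOS gets masked out of the
  # loss during finetuning; point it at an unused dedicated token instead
  tokenizer.pad_token = "<|dummy_87|>"

  assert tokenizer.pad_token_id != tokenizer.eos_token_id
  tokenizer.save_pretrained("phi-4-fixed-tokenizer")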

I also converted Phi-4 to Llama arch. I uploaded GGUFs, 4-bit quants, dynamic quants and all fixes to https://huggingface.co/unsloth

I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...

  • CGamesPlay 4 hours ago

    > We converted Phi-4 to Llama’s architecture for better accuracy and easier use.

    What does this mean? When I think about "model architecture", I think about the number of weights in each layer, the organization of the layers, etc. And AFAIK, it's untenable to "port" a model from one to the other without effectively retraining it. So what does it actually mean to "convert to Llama's architecture"?

    • danielhanchen 3 hours ago

      Oh, Phi-4's architecture is inspired by Llama itself, except they merged the attention matrices into 1 large matrix for better FLOP utilization, and likewise merged the gate/up matrices in the MLP.

      Phi-3 used to use sliding window attention, but they got rid of that in Phi-4.

      So, you can "Mistral-fy" Phi-3 and convert it to Mistral arch (by unmerging the merges), and now you can "Llama-fy" Phi-4 to Llama arch.

      The reason accuracy increases in finetuning is that LoRA on the merged QKV learns only 1 A matrix, whilst unmerging it creates 3 separate A matrices - this gives the model more freedom to learn new features.
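
      To make the unmerging concrete, here's a rough sketch of splitting a fused QKV weight back into Llama-style q/k/v projections (shapes assume Phi-4's config - hidden size 5120, 40 query heads, 10 KV heads - and the usual [Q; K; V] row layout; it's an illustration, not our full conversion script):

        import torch

        hidden = 5120                    # Phi-4 hidden size
        n_heads, n_kv_heads = 40, 10     # query heads vs grouped KV heads
        head_dim = hidden // n_heads     # 128

        q_rows = n_heads * head_dim      # 5120 rows for Q
        kv_rows = n_kv_heads * head_dim  # 1280 rows each for K and V

        # Stand-in for the fused qkv_proj.weight, rows laid out as [Q; K; V]
        qkv_weight = torch.randn(q_rows + 2 * kv_rows, hidden)

        # Unmerge into three Llama-style projection weights
        q_w, k_w, v_w = torch.split(qkv_weight, [q_rows, kv_rows, kv_rows], dim=0)

        # In a Llama-arch checkpoint these become q_proj / k_proj / v_proj, so
        # LoRA learns a separate A matrix for each instead of one shared A.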

    • Sn0wCoder 4 hours ago

      Would guess GGUF so you can run on llama.cpp, LM Studio, etc..., but OP can hopefully clarify further for you.

      • danielhanchen 3 hours ago

        Yep converting to Llama arch definitely makes accessibility much better - also many fast LLM serving libraries normally support Llama, so it makes it easier to port and use!

  • simonw 5 hours ago

    Huh! That may explain why I kept on getting visible <|im_end|> output when I tried running a Phi-4 GGUF file using llama.cpp.

    • danielhanchen 4 hours ago

      Oh yes exactly! I trimmed it out now :)

      The better chat template should be:

      {% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
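
      To sanity-check what the template produces, you can render it with plain jinja2 (a quick standalone sketch - no model download needed, and the messages are just an example):

        from jinja2 import Template

        chat_template = (
            "{% for message in messages %}"
            "{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}"
            "{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}"
            "{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}"
            "{% endif %}{% endfor %}"
            "{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}"
        )

        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ]

        # The trailing assistant header only appears when explicitly requested,
        # matching fix #2 (no auto-added assistant prompt)
        print(Template(chat_template).render(messages=messages, add_generation_prompt=True))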

  • sunaookami 6 hours ago

    Wasn't Phi-3 also bugged/is still bugged? Seems like Microsoft just doesn't care.

    >to be on par with GPT-4o mini

    Phi is known to overfit benchmarks. It's way, way worse than that.

NooneAtAll3 an hour ago

  • danielhanchen an hour ago

    Sorry, are there some issues with our website?

t1amat 5 hours ago

Daniel's fixes to Phi-4 make it the best-scoring Phi-4 on HF's Open LLM Leaderboard. Great job on that.

Unsloth is a masterpiece, keep up the great work!

lostmsu 5 hours ago

The benchmark results of the model before and after the "fixes" do not match numbers reported in the model card: https://huggingface.co/microsoft/phi-4

According to Microsoft, the MATH score should be 80.4, while both the original and the "fixed" models as run by unsloth score only just over 12.3. So either Microsoft made a few huge mistakes, or unsloth was not able to run their model correctly.

  • danielhanchen 4 hours ago

    Oh yes I found this to be a bit strange - I uploaded our versions and Microsoft's own version to Hugging Face's public LLM leaderboard - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

    You can see Microsoft's own original Phi-4 scores 12.31% on MATH - I'm unsure why. My fixes at least push it to 20%.

    It's possibly because HF's benchmark scores with "Exact match: was the generated solution correct and in the expected format" - that strict format requirement might be the issue.
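
    As a toy illustration of why that hurts (not the Leaderboard's actual harness): exact-match scoring marks a mathematically correct answer wrong whenever the formatting differs.

      def exact_match(prediction: str, gold: str) -> bool:
          # Strict string comparison after trimming whitespace
          return prediction.strip() == gold.strip()

      gold = "\\boxed{42}"
      print(exact_match("\\boxed{42}", gold))        # True
      print(exact_match("The answer is 42.", gold))  # False: right value, wrong format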

wsintra2022 3 hours ago

>Reddit comments show our fixes make Phi-4 inference much better

I’d like to try ‘Reddit comments show my fixes make app better’ in my next review

  • danielhanchen 2 hours ago

    Fixed versions are also independently scored by Hugging Face's Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

    The Reddit LocalLlama community is actually pretty cool - tonnes of research comes from the community - for example kaiokendev's linear RoPE scaling, YaRN, NTK-aware RoPE scaling, and many LLM benchmarks - many researchers use LocalLlama to share research and discuss new stuff.

    I know a lot of AI researchers use the "LocalLlama vibe check", which is essentially an anecdotal approach to LLM evaluation - i.e. instead of relying on LMSYS Chatbot Arena or LLM benchmarks, third-party crowd-sourced vibe checks sometimes do much better.

adultSwim 4 hours ago

Are there alternatives to unsloth?

I would love to use it but the open/free version only handles one GPU, and it's unclear how much the paid version would cost. I have some limited access to multiple older NVidia cards and would love to make better use of them while I'm still learning. My budget for learning/projects is rather modest.

Hopefully they succeed. At work I could make a strong case for going with them as they allow keeping data local only, instead of relying on an API.

  • danielhanchen 3 hours ago

    Multi-GPU support is definitely coming to Unsloth OSS! Our goal was to release it this month, but I'm unsure on exact timelines - maybe next month!!

make3 4 hours ago

"Yes it improves performance!" proceeds to show the most unconvincing stats ever

you can probably blow on your GPU and get a similar performance change

TZubiri 4 hours ago

Ah yes, drawing ASCII art, the de facto benchmark for evaluating LLM quality.