Phi-4 Bug Fixes

unsloth.ai

92 points

danielhanchen

7 hours ago


38 comments

danielhanchen 7 hours ago

Hey HN family! I found a few bugs in Phi-4 - Microsoft's latest MIT-licensed LLM, said to be on par with GPT-4o mini:

1. The end-of-sentence (EOS) token should be <|im_end|>, not <|endoftext|>

2. The chat template should not auto-add an assistant prompt

3. The padding token should not be the EOS token but <|dummy_87|> - see the sketch of fixes 1 and 3 below
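
For anyone patching a local checkpoint by hand, here's a minimal sketch of fixes 1 and 3 using Hugging Face transformers (our uploads already have all the fixes baked in, so treat this as an illustration rather than the exact patch):

  from transformers import AutoTokenizer

  # Load the upstream Phi-4 tokenizer
  tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

  # Fix 1: generation should stop at <|im_end|>, not <|endoftext|>
  tokenizer.eos_token = "<|im_end|>"

  # Fix 3: padding must not alias EOS, otherwise EOS gets masked out of the
  # loss during finetuning; point it at an unused dedicated token instead
  tokenizer.pad_token = "<|dummy_87|>"

  assert tokenizer.pad_token_id != tokenizer.eos_token_id
  tokenizer.save_pretrained("phi-4-fixed-tokenizer")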

I also converted Phi-4 to Llama arch. I uploaded GGUFs, 4-bit quants, dynamic quants and all fixes to https://huggingface.co/unsloth

I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...

  • CGamesPlay 4 hours ago

    > We converted Phi-4 to Llama’s architecture for better accuracy and easier use.

    What does this mean? When I think about "model architecture", I think about the number of weights in each layer, the organization of the layers, etc. And AFAIK, it's untenable to "port" a model from one to the other without effectively retraining it. So what does it actually mean to "convert to Llama's architecture"?

    • danielhanchen 3 hours ago

      Oh, Phi-4's architecture is inspired by Llama itself, except they merged the attention matrices into 1 large matrix for better FLOP utilization, and likewise merged the gate/up matrices in the MLP.

      Phi-3 used to use sliding window attention, but they got rid of that in Phi-4.

      So, you can "Mistral-fy" Phi-3 and convert it to Mistral arch (by unmerging the merges), and now you can "Llama-fy" Phi-4 to Llama arch.

      The reason accuracy increases in finetuning is that LoRA on the merged QKV learns only 1 A matrix, whilst unmerging it creates 3 separate A matrices - this gives the model more freedom to learn new features.
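
      To make the unmerging concrete, here's a rough sketch of splitting a fused QKV weight back into Llama-style q/k/v projections (shapes assume Phi-4's config - hidden size 5120, 40 query heads, 10 KV heads - and the usual [Q; K; V] row layout; it's an illustration, not our full conversion script):

        import torch

        hidden = 5120                    # Phi-4 hidden size
        n_heads, n_kv_heads = 40, 10     # query heads vs grouped KV heads
        head_dim = hidden // n_heads     # 128

        q_rows = n_heads * head_dim      # 5120 rows for Q
        kv_rows = n_kv_heads * head_dim  # 1280 rows each for K and V

        # Stand-in for the fused qkv_proj.weight, rows laid out as [Q; K; V]
        qkv_weight = torch.randn(q_rows + 2 * kv_rows, hidden)

        # Unmerge into three Llama-style projection weights
        q_w, k_w, v_w = torch.split(qkv_weight, [q_rows, kv_rows, kv_rows], dim=0)

        # In a Llama-arch checkpoint these become q_proj / k_proj / v_proj, so
        # LoRA learns a separate A matrix for each instead of one shared A.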

    • Sn0wCoder 4 hours ago

      Would guess GGUF so you can run on llama.cpp, LM Studio, etc..., but OP can hopefully clarify further for you.

      • danielhanchen 3 hours ago

        Yep converting to Llama arch definitely makes accessibility much better - also many fast LLM serving libraries normally support Llama, so it makes it easier to port and use!

  • simonw 5 hours ago

    Huh! That may explain why I kept on getting visible <|im_end|> output when I tried running a Phi-4 GGUF file using llama.cpp.

    • danielhanchen 4 hours ago

      Oh yes exactly! I trimmed it out now :)

      The better chat template should be:

      {% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
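
      To sanity-check what the template produces, you can render it with plain jinja2 (a quick standalone sketch - no model download needed, and the messages are just an example):

        from jinja2 import Template

        chat_template = (
            "{% for message in messages %}"
            "{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}"
            "{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}"
            "{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}"
            "{% endif %}{% endfor %}"
            "{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}"
        )

        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ]

        # The trailing assistant header only appears when explicitly requested,
        # matching fix #2 (no auto-added assistant prompt)
        print(Template(chat_template).render(messages=messages, add_generation_prompt=True))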

  • sunaookami 6 hours ago

    Wasn't Phi-3 also bugged/is still bugged? Seems like Microsoft just doesn't care.

    >to be on par with GPT-4o mini

    Phi is known to overfit benchmarks. It's way, way worse than that.

NooneAtAll3 an hour ago

  • danielhanchen an hour ago

    Sorry, are there some issues with our website?

t1amat 5 hours ago

Daniel's fixes to Phi-4 make it the best-scoring Phi-4 on HF's Open LLM Leaderboard. Great job on that.

Unsloth is a masterpiece, keep up the great work!

lostmsu 5 hours ago

The benchmark results of the model before and after the "fixes" do not match numbers reported in the model card: https://huggingface.co/microsoft/phi-4

According to Microsoft, the MATH score should be 80.4, while both the original and the "fixed" models as run by unsloth score only just over 12.3. So either Microsoft made a few huge mistakes, or unsloth was not able to run their model correctly.

  • danielhanchen 4 hours ago

    Oh yes I found this to be a bit strange - I uploaded our versions and Microsoft's own version to Hugging Face's public LLM leaderboard - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

    You can see Microsoft's own original Phi-4 scores 12.31% on MATH - I'm unsure why. My fixes at least push it to 20%.

    It's possibly because HF's benchmark scores with "Exact match: was the generated solution correct and in the expected format" - that strict format requirement might be the issue.
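
    As a toy illustration of why that hurts (not the Leaderboard's actual harness): exact-match scoring marks a mathematically correct answer wrong whenever the formatting differs.

      def exact_match(prediction: str, gold: str) -> bool:
          # Strict string comparison after trimming whitespace
          return prediction.strip() == gold.strip()

      gold = "\\boxed{42}"
      print(exact_match("\\boxed{42}", gold))        # True
      print(exact_match("The answer is 42.", gold))  # False: right value, wrong format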

wsintra2022 3 hours ago

>Reddit comments show our fixes make Phi-4 inference much better

I’d like to try ‘Reddit comments show my fixes make app better’ in my next review

  • danielhanchen 2 hours ago

    Fixed versions are also independently scored by Hugging Face's Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

    The Reddit LocalLlama community is actually pretty cool - tonnes of research comes from the community - for example kaiokendev's linear RoPE scaling, YaRN, NTK-aware RoPE scaling, and many LLM benchmarks - many researchers use LocalLlama to share research and discuss new stuff.

    I know a lot of AI researchers use the "LocalLlama vibe check", which is essentially an anecdotal approach to LLM evaluation - i.e. instead of relying on LMSYS Chatbot Arena or LLM benchmarks, third-party crowd-sourced vibe checks sometimes do much better.

adultSwim 4 hours ago

Are there alternatives to unsloth?

I would love to use it but the open/free version only handles one GPU, and it's unclear how much the paid version would cost. I have some limited access to multiple older NVidia cards and would love to make better use of them while I'm still learning. My budget for learning/projects is rather modest.

Hopefully they succeed. At work I could make a strong case for going with them as they allow keeping data local only, instead of relying on an API.

  • danielhanchen 3 hours ago

    Multi-GPU support is definitely coming to Unsloth OSS! Our goal was to release it this month, but I'm unsure on exact timelines - maybe next month!!

make3 4 hours ago

"Yes it improves performance!" proceeds to show the most unconvincing stats ever

you can probably blow on your GPU and get a similar performance change

TZubiri 4 hours ago

Ah yes, drawing ASCII art, the de facto benchmark for evaluating LLM quality.