Hey HN family! I found a few bugs in Phi-4 - Microsoft's latest MIT-licensed LLM, claimed to be on par with GPT-4o mini
1. The end-of-sequence token should be <|im_end|>, not <|endoftext|>
2. The chat template should not auto-add an assistant prompt
3. The padding token should not be the EOS token but <|dummy_87|>
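For concreteness, here's a minimal sketch of applying those three fixes to a plain Hugging Face tokenizer (illustrative only - the uploads below already have them baked in):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")

    # 1. End-of-sequence should be <|im_end|>, not <|endoftext|>
    tokenizer.eos_token = "<|im_end|>"

    # 3. Pad with an unused token so padding is never confused with EOS
    tokenizer.pad_token = "<|dummy_87|>"

    # 2. Only append the assistant header when you explicitly ask for it
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Hello!"}],
        tokenize=False,
        add_generation_prompt=True,
    )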
I also converted Phi-4 to the Llama architecture. I uploaded GGUFs, 4-bit quants, dynamic quants, and all the fixes to https://huggingface.co/unsloth
I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...
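If you just want a feel for what the notebook does, here's a rough sketch of loading one of the fixed uploads with Unsloth for LoRA finetuning (the repo name, sequence length, and rank here are illustrative, not the notebook's exact settings):

    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/phi-4",   # one of the fixed uploads
        max_seq_length=2048,          # illustrative; pick what fits your data
        load_in_4bit=True,            # fits on a free Colab GPU
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,                         # LoRA rank
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )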
> We converted Phi-4 to Llama’s architecture for better accuracy and easier use.
What does this mean? When I think about "model architecture", I think about the number of weights in each layer, the organization of the layers, etc. And AFAIK, it's untenable to "port" a model from one architecture to another without effectively retraining it. So what does it actually mean to "convert to Llama's architecture"?
Oh, Phi-4's architecture is inspired by Llama itself, except they merged the Q/K/V attention matrices into 1 large matrix for better FLOP utilization, and likewise merged the gate/up matrices in the MLP.
Phi-3 used to use sliding window attention, but they got rid of that in Phi-4.
So, you can "Mistral-fy" Phi-3 and convert it to the Mistral arch (by splitting the merged matrices back apart), and likewise you can "Llama-fy" Phi-4 to the Llama arch.
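As a toy illustration of that unmerging (the shapes below are placeholders, not Phi-4's real config), the fused projections just get sliced back into the separate matrices that Llama-style code expects:

    import torch

    hidden, q_dim, kv_dim = 5120, 5120, 1280               # assumed sizes, illustrative only

    fused_qkv = torch.randn(q_dim + 2 * kv_dim, hidden)    # qkv_proj.weight
    q_proj = fused_qkv[:q_dim]                              # -> q_proj.weight
    k_proj = fused_qkv[q_dim : q_dim + kv_dim]              # -> k_proj.weight
    v_proj = fused_qkv[q_dim + kv_dim:]                     # -> v_proj.weight

    intermediate = 17920                                    # assumed MLP size
    fused_gate_up = torch.randn(2 * intermediate, hidden)   # gate_up_proj.weight
    gate_proj, up_proj = fused_gate_up.chunk(2, dim=0)      # -> gate_proj / up_proj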
The reason accuracy increases in finetuning is that with a merged QKV matrix, LoRA learns only 1 shared A matrix, whilst unmerging it gives 3 separate A matrices - this allows the model more freedom to learn new features.
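A small illustration of why that matters for LoRA (the hidden size and rank below are made up):

    import torch.nn as nn

    d, r = 5120, 16    # illustrative hidden size and LoRA rank

    # Merged QKV: one shared low-rank A projection feeds Q, K and V
    lora_A_merged = nn.Linear(d, r, bias=False)     # 1 A matrix
    lora_B_merged = nn.Linear(r, 3 * d, bias=False)

    # Unmerged Q, K, V: each projection gets its own A matrix,
    # so the adapter can learn different low-rank directions for each
    lora_A_q = nn.Linear(d, r, bias=False)
    lora_A_k = nn.Linear(d, r, bias=False)
    lora_A_v = nn.Linear(d, r, bias=False)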
Would guess GGUF so you can run it on llama.cpp, LM Studio, etc., but OP can hopefully clarify further for you.
Yep, converting to the Llama arch definitely makes it much more accessible - many fast LLM serving libraries support Llama out of the box, so it's easier to port and use!
Huh! That may explain why I kept on getting visible <|im_end|> output when I tried running a Phi-4 GGUF file using llama.cpp.
Oh yes exactly! I trimmed it out now :)
The corrected chat template should be:
{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
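And roughly how you'd apply it (the repo name is just whichever Phi-4 upload you're using; the template string is the one above, split across lines purely for readability):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("unsloth/phi-4")

    tokenizer.chat_template = (
        "{% for message in messages %}"
        "{% if (message['role'] == 'system') %}"
        "{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}"
        "{% elif (message['role'] == 'user') %}"
        "{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}"
        "{% elif (message['role'] == 'assistant') %}"
        "{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}"
        "{% endif %}{% endfor %}"
        "{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}"
    )

    messages = [{"role": "user", "content": "Hello!"}]

    # No assistant header unless you explicitly ask for one:
    print(tokenizer.apply_chat_template(messages, tokenize=False))
    print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))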
Wasn't Phi-3 also bugged/is still bugged? Seems like Microsoft just doesn't care.
>to be on par with GPT-4o mini
Phi is known to overfit benchmarks. It's way, way worse than that.
Anecdotally, I've been experimenting with Phi-4 the past hour or so (so, yeah, not very comprehensive) and it's certainly a strong model. Definitely better than the previous Phi models.
Yep Phi-4 definitely is better than Phi-3.5!
Phi-3 should be fixed now as well - but yes, it had bugs too! https://x.com/danielhanchen/status/1782853167572832650
Phi-3's sliding window should be 2048 and not 2047, and they also had chat template issues - I uploaded correct versions to https://huggingface.co/unsloth/Phi-3.5-mini-instruct
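For reference, a minimal sketch of the sliding-window fix on a Phi-3 config (the repo name here is the original 4k checkpoint; swap in whichever Phi-3 variant you're using):

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
    if getattr(config, "sliding_window", None) == 2047:   # off-by-one in the original upload
        config.sliding_window = 2048
        config.save_pretrained("./phi-3-fixed-config")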
Can you convert it to ONNX so I can try it in the web browser?
I'd like to update this:
https://huggingface.co/spaces/webml-community/phi-3.5-webgpu
Oh I can probs try doing this!
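For anyone who wants to try in the meantime, a rough (untested for Phi-4) sketch of an ONNX export with Hugging Face Optimum - the result could then be loaded in the browser via transformers.js:

    from optimum.onnxruntime import ORTModelForCausalLM

    # Export the PyTorch checkpoint to ONNX and save it locally
    model = ORTModelForCausalLM.from_pretrained("microsoft/phi-4", export=True)
    model.save_pretrained("./phi-4-onnx")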