Instructions for using unsloth/Phi-4-mini-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use unsloth/Phi-4-mini-instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="unsloth/Phi-4-mini-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/Phi-4-mini-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("unsloth/Phi-4-mini-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use unsloth/Phi-4-mini-instruct with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "unsloth/Phi-4-mini-instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/Phi-4-mini-instruct",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker
```shell
docker model run hf.co/unsloth/Phi-4-mini-instruct
```
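The curl request above can also be issued from Python. Below is a minimal sketch using only the standard library; the URL assumes the default `vllm serve` port 8000 (the same payload also works against the SGLang server on port 30000). `SEND` is left off so the snippet can be read and run without a live server; flip it once the server is up.

```python
import json
import urllib.request

# Endpoint of the OpenAI-compatible server started with `vllm serve`
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "unsloth/Phi-4-mini-instruct",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
}

SEND = False  # set True once the server is running
if SEND:
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])

print(json.dumps(payload, indent=2))
```

The payload mirrors the curl example exactly; only the transport differs.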
- SGLang
How to use unsloth/Phi-4-mini-instruct with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "unsloth/Phi-4-mini-instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/Phi-4-mini-instruct",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "unsloth/Phi-4-mini-instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/Phi-4-mini-instruct",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
- Unsloth Studio
How to use unsloth/Phi-4-mini-instruct with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```shell
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/Phi-4-mini-instruct to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/Phi-4-mini-instruct to start chatting
```
Using HuggingFace Spaces for Unsloth
```shell
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/Phi-4-mini-instruct to start chatting
```
Load model with FastModel
```shell
pip install unsloth
```

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/Phi-4-mini-instruct",
    max_seq_length=2048,
)
```
- Docker Model Runner
How to use unsloth/Phi-4-mini-instruct with Docker Model Runner:
```shell
docker model run hf.co/unsloth/Phi-4-mini-instruct
```
Phi-4 mini does not work inside of unsloth.
The Phi-4 mini release is very promising, but sadly it cannot load in the Unsloth framework: "RuntimeError: rope_scaling's short_factor field must have length 64, got 48".
Will unsloth possibly release a fixed version?
It seems the modeling_phi3.py file is not included.
Unfortunately it doesn't currently work in any framework: not in Unsloth, Ollama, llama.cpp, etc., because of the new architecture.
We'll update you all when it does.
The architecture isn't even particularly new; it's just that none of these frameworks respect the "partial_rotary_factor" config entry. (Only 3/4 of the embeddings are subject to RoPE, presumably to weight recent context more heavily than long context.) I took a crack at adding it to ExLlamaV2, and while quantization now appears to work, inference is wildly broken. I guess it'll take a while before we see this in a usable state, since all the upstream packages need to be updated to support it.
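To make the point above concrete: with a `partial_rotary_factor` of 0.75, RoPE rotates only the first three quarters of each attention head's dimensions, and the remaining quarter passes through unchanged. Below is a small NumPy sketch of that behavior; the function name and toy shapes are mine for illustration, not the model's actual implementation.

```python
import numpy as np

def apply_partial_rope(x, positions, partial_rotary_factor=0.75, base=10000.0):
    """Rotate only the first partial_rotary_factor * head_dim dimensions
    of x with RoPE; pass the remaining dimensions through unchanged."""
    head_dim = x.shape[-1]
    rotary_dim = int(head_dim * partial_rotary_factor)
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]

    # Standard RoPE on the rotary slice: pair up even/odd dims and rotate
    # each pair by a position- and frequency-dependent angle.
    inv_freq = 1.0 / base ** (np.arange(0, rotary_dim, 2) / rotary_dim)
    angles = np.outer(positions, inv_freq)        # (seq_len, rotary_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = np.empty_like(x_rot)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)

np.random.seed(0)
seq_len, head_dim = 4, 8   # toy sizes; the real head dims are larger
x = np.random.randn(seq_len, head_dim)
out = apply_partial_rope(x, np.arange(seq_len))

# With factor 0.75, the last quarter of each head dimension is untouched:
print(np.allclose(out[:, 6:], x[:, 6:]))  # True
```

A framework that ignores the config entry effectively assumes `rotary_dim == head_dim`, which is where length mismatches like the `short_factor` error can come from.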
Hello @shimmyshimmer
We already added 'partial_rotary_factor' support to the latest HF and vLLM before the release; the new model feature is already in HF (v4.49.0) and vLLM (v0.7.3).
Can you take a look at the PRs? They are relatively simple if the new config is utilized.
vLLM: https://github.com/vllm-project/vllm/pull/12718
HF: https://github.com/huggingface/transformers/pull/35947
Can you guys prepare a complete fine-tuning solution for general users? I have tried a lot of methods and nothing works.
RuntimeError: rope_scaling's short_factor field must have length 64, got 48
I was getting this issue today on vLLM 0.8.3 serving Phi-4.
Because `vllm --version` reported 0.8.3 (newer than 0.7.3, meaning my vLLM should be fine), I ran `pip install transformers==4.49.0` as ykin362 suggested (although it was presented there as a PR), and now it works fine.
I had the old transformers version 4.48.2, but that was not new enough; 4.49.0 (or higher) is required.
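Version strings compare field by field as numbers, not as text, which is why 4.48.2 falls short of the required 4.49.0 while vLLM 0.8.3 satisfies its separate 0.7.3 minimum. A small pure-Python sketch of that comparison (the helper name is mine):

```python
def parse(v: str) -> tuple:
    """Turn a dotted version string into a tuple of ints,
    so comparisons are numeric: (4, 48, 2) < (4, 49, 0)."""
    return tuple(int(part) for part in v.split("."))

# transformers needs >= 4.49.0 for partial_rotary_factor support:
print(parse("4.48.2") >= parse("4.49.0"))  # False: too old
print(parse("4.49.0") >= parse("4.49.0"))  # True

# vLLM has its own, separate minimum of 0.7.3:
print(parse("0.8.3") >= parse("0.7.3"))    # True
```

The vLLM and transformers version requirements are independent, so an up-to-date vLLM does not imply an up-to-date transformers.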
Coming back to this: we were able to fine-tune phi-4-mini by doing the following (full fine-tuning and LoRA both work).
After installing unsloth, if you are on a Google Colab notebook or any other Jupyter notebook, make this your second cell, right after your pip installs:

```python
import os

os.environ["TORCH_DYNAMO_DISABLE"] = "1"
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
```

And to start fine-tuning:

```python
import torch
import torch.nn as nn

torch._dynamo.config.disable = True
trainer_stats = trainer.train()
```

These steps allowed us to fine-tune phi-4-mini models. Keep in mind it uses quite a bit of memory, but the loss is good.
Thanks so much for the input, I'm sure people will find this useful! If only there were some way to pin this so others could find it.