Any tips to speed up inference?
Hi all,
and thanks Microsoft for this amazing model!
I'm using the Hugging Face pipeline to run inference with Phi-4.
However, it feels really slow for a 14B model.
I'm using 4 Tesla V100 GPUs (32 GB each) to distribute the model at inference time.
Is there a quick and easy way to make inference faster?
It would be awesome if it were just some kind of parameter in the pipeline function.
Thanks!
Lino Hong.
Using llama.cpp is an excellent way to speed up inference for large language models like Phi-4, especially if you want to run the model efficiently on CPUs or even GPUs with minimal overhead. llama.cpp is optimized for inference and supports quantization, which can significantly reduce the model size and improve speed without a large drop in accuracy.
Here’s how you can use llama.cpp with your Phi-4 model:
1. Convert the Model to GGUF Format
llama.cpp uses the GGUF format for models. You need to convert your Hugging Face model to this format.
Steps:
- Clone the `llama.cpp` repository:
  ```bash
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  ```
- Install dependencies:
  ```bash
  make
  ```
- Convert the Hugging Face model to GGUF format:
  - First, install the required Python dependencies:
    ```bash
    pip install torch transformers
    ```
  - Use the `convert-hf-to-gguf.py` script provided in `llama.cpp`, passing the model directory as a positional argument:
    ```bash
    python3 convert-hf-to-gguf.py /path/to/phi-4 --outfile /path/to/phi-4-gguf
    ```
    Replace `/path/to/phi-4` with the path to your Hugging Face model and `/path/to/phi-4-gguf` with the desired output path.
2. Quantize the Model (Optional but Recommended)
Quantization reduces the model size and speeds up inference. llama.cpp supports several quantization levels (e.g., q4_0, q4_1, q5_0, etc.).
Steps:
- Run the quantization tool:
  ```bash
  ./quantize /path/to/phi-4-gguf /path/to/phi-4-gguf-q4_0 q4_0
  ```
  This will create a quantized version of the model at `/path/to/phi-4-gguf-q4_0`.
3. Run Inference with llama.cpp
Once the model is converted and optionally quantized, you can run inference using llama.cpp.
Steps:
- Run the `main` executable:
  ```bash
  ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time"
  ```
  - `-m`: Path to the GGUF model.
  - `-p`: Prompt for inference.
- For GPU acceleration (if supported), use the `--n-gpu-layers` flag:
  ```bash
  ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --n-gpu-layers 20
  ```
  Replace `20` with the number of layers you want to offload to the GPU.
4. Advanced Options
- Batch Size: Use the `-b` flag to set the batch size.
- Threads: Use the `-t` flag to specify the number of CPU threads.
- Temperature and Top-p Sampling: Use `--temp` and `--top-p` for better control over text generation.

Example:
```bash
./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --temp 0.7 --top-p 0.9 -t 8 -b 512 --n-gpu-layers 20
```
5. Benchmarking
To measure performance, use the `llama-bench` tool that is built alongside the other llama.cpp binaries:
```bash
./llama-bench -m /path/to/phi-4-gguf-q4_0
```
6. Using llama.cpp in Python
If you prefer to use llama.cpp in a Python script, you can use the llama-cpp-python package.
Steps:
- Install the package:
  ```bash
  pip install llama-cpp-python
  ```
- Load and run the model:
  ```python
  from llama_cpp import Llama

  # Load the model
  llm = Llama(model_path="/path/to/phi-4-gguf-q4_0", n_gpu_layers=20)

  # Run inference
  output = llm("Once upon a time", max_tokens=50)
  print(output["choices"][0]["text"])
  ```
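Since Phi-4 is an instruction-tuned chat model, you will usually get better results from the chat-completion interface, which applies the chat template for you. A minimal sketch, assuming the converted GGUF carries Phi-4's chat template metadata:

```python
from llama_cpp import Llama

# Load the quantized model (path and layer count reused from the steps above)
llm = Llama(
    model_path="/path/to/phi-4-gguf-q4_0",
    n_gpu_layers=20,
    n_ctx=4096,
)

# create_chat_completion applies the model's built-in chat template
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain GGUF quantization in two sentences."},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```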
Benefits of Using llama.cpp
- Efficiency: Optimized for CPU and GPU inference.
- Quantization: Reduces model size and speeds up inference.
- Portability: Runs on a wide range of hardware, including CPUs and GPUs.
- Minimal Dependencies: Lightweight and easy to set up.
By using llama.cpp, you can achieve faster inference times and lower resource usage compared to running the model directly through Hugging Face Transformers.
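For comparison, if you want to stay on the Transformers pipeline that the question uses, the quickest wins are loading the weights in half precision and letting `device_map="auto"` shard the model across the four V100s. A minimal sketch of that baseline (the generation settings are illustrative, not tuned):

```python
import torch
from transformers import pipeline

# float16 halves memory traffic on V100s; device_map="auto" (via accelerate)
# spreads the layers across all visible GPUs.
generator = pipeline(
    "text-generation",
    model="microsoft/phi-4",
    torch_dtype=torch.float16,
    device_map="auto",
)

output = generator(
    "Once upon a time",
    max_new_tokens=100,  # keep outputs short if you do not need long text
    do_sample=False,
)
print(output[0]["generated_text"])
```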
A few tips for faster inference:
- Use a hosted API instead of running locally — providers optimize for throughput with batching and better hardware
- Reduce max_tokens if you do not need long outputs
- Use streaming to get first tokens faster (see the sketch after this list)
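To make the streaming point concrete for the local llama.cpp route, llama-cpp-python can stream tokens as they are generated, so the first words appear almost immediately instead of after the full completion. A minimal sketch, reusing the quantized model path from the steps above:

```python
from llama_cpp import Llama

llm = Llama(model_path="/path/to/phi-4-gguf-q4_0", n_gpu_layers=20)

# stream=True yields partial results as tokens are produced
for chunk in llm("Once upon a time", max_tokens=100, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```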
If you want to try Phi-4 via API without managing infrastructure, Crazyrouter provides access to 600+ models including Phi-4 through an OpenAI-compatible endpoint. Typically faster than self-hosting on consumer hardware.
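If you do try a hosted, OpenAI-compatible endpoint, the standard `openai` Python client works against it; only the base URL and model identifier change. A rough sketch, where the base URL, key, and model name below are placeholders you would replace with your provider's actual values:

```python
from openai import OpenAI

# Placeholder endpoint and credentials: substitute your provider's values
client = OpenAI(
    base_url="https://api.example.com/v1",
    api_key="YOUR_API_KEY",
)

# stream=True so the first tokens arrive as soon as they are generated
stream = client.chat.completions.create(
    model="phi-4",  # exact model ID depends on the provider
    messages=[{"role": "user", "content": "Once upon a time"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```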