Instructions for using microsoft/Phi-4-mini-instruct with libraries, inference providers, notebooks, and local apps.

Libraries

How to use microsoft/Phi-4-mini-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-4-mini-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
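
When `pipe` is called with a list of chat messages, recent Transformers releases return a list holding one dict whose `generated_text` is the conversation with the model's reply appended as the last message. The sketch below pulls that reply out. `last_reply` is a hypothetical helper name, and `sample_result` is a hand-written stand-in for real pipeline output, not something the model produced.

```python
def last_reply(result):
    """Return the text of the final message in a chat pipeline result."""
    conversation = result[0]["generated_text"]
    # Plain-string prompts yield a plain string rather than a message list.
    if isinstance(conversation, str):
        return conversation
    return conversation[-1]["content"]


# Hand-written stand-in with the same shape as a chat pipeline result.
sample_result = [
    {
        "generated_text": [
            {"role": "user", "content": "Who are you?"},
            {"role": "assistant", "content": "(model reply goes here)"},
        ]
    }
]

print(last_reply(sample_result))
```

With a real pipeline, the call would be `last_reply(pipe(messages))`.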

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use microsoft/Phi-4-mini-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/Phi-4-mini-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-4-mini-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
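
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server started by `vllm serve` is listening on `localhost:8000` as in the curl call; `build_chat_request` and `send` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="microsoft/Phi-4-mini-instruct",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request("What is the capital of France?")
print(request.full_url)
# With a server running: reply = send(request)
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.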

Use Docker

docker model run hf.co/microsoft/Phi-4-mini-instruct

SGLang

How to use microsoft/Phi-4-mini-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/Phi-4-mini-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-4-mini-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
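
An OpenAI-compatible chat endpoint answers with a JSON body whose `choices` list carries the assistant message. The sketch below extracts the reply text from such a body. `extract_reply` is a hypothetical helper name, and `sample_body` is a hand-written illustration of the response shape, not output captured from SGLang.

```python
import json


def extract_reply(body):
    """Return the assistant text from a chat-completions JSON body."""
    data = json.loads(body)
    choices = data.get("choices") or []
    if not choices:
        # Error responses typically carry no choices; surface them as-is.
        raise ValueError(f"no choices in response: {data}")
    return choices[0]["message"]["content"]


# Hand-written stand-in for a server response.
sample_body = json.dumps({
    "model": "microsoft/Phi-4-mini-instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "(model reply goes here)"},
            "finish_reason": "stop",
        }
    ],
})

print(extract_reply(sample_body))
```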

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/Phi-4-mini-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-4-mini-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use microsoft/Phi-4-mini-instruct with Docker Model Runner:
```
docker model run hf.co/microsoft/Phi-4-mini-instruct
```

text-generation-inference error

#25

by BenjaminAslamy - opened Apr 17, 2025

I'm using the latest version of text-generation-inference, but I'm getting this error:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/src/server/text_generation_server/cli.py:119 in serve │
│ │
│ 116 │ │ raise RuntimeError( │
│ 117 │ │ │ "Only 1 can be set between dtype and quantize, as they │
│ 118 │ │ ) │
│ ❱ 119 │ server.serve( │
│ 120 │ │ model_id, │
│ 121 │ │ lora_adapters, │
│ 122 │ │ revision, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ dtype = None │ │
│ │ json_output = True │ │
│ │ kv_cache_dtype = None │ │
│ │ logger_level = 'INFO' │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'microsoft/Phi-4-mini-instruct' │ │
│ │ otlp_endpoint = None │ │
│ │ otlp_service_name = 'text-generation-inference.router' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ server = <module 'text_generation_server.server' from │ │
│ │ '/usr/src/server/text_generation_server/server.py'> │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/server.py:315 in serve │
│ │
│ 312 │ │ while signal_handler.KEEP_PROCESSING: │
│ 313 │ │ │ await asyncio.sleep(0.5) │
│ 314 │ │
│ ❱ 315 │ asyncio.run( │
│ 316 │ │ serve_inner( │
│ 317 │ │ │ model_id, │
│ 318 │ │ │ lora_adapters, │
│ │
│ ╭─────────────────────────── locals ───────────────────────────╮ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'microsoft/Phi-4-mini-instruct' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────╯ │
│ │
│ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │
│ /asyncio/runners.py:190 in run │
│ │
│ 187 │ │ │ "asyncio.run() cannot be called from a running event loop" │
│ 188 │ │
│ 189 │ with Runner(debug=debug) as runner: │
│ ❱ 190 │ │ return runner.run(main) │
│ 191 │
│ 192 │
│ 193 def _cancel_all_tasks(loop): │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ debug = None │ │
│ │ main = <coroutine object serve..serve_inner at 0x75c34d6c81f0> │ │
│ │ runner = <asyncio.runners.Runner object at 0x75c3c84b9ed0> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │
│ /asyncio/runners.py:118 in run │
│ │
│ 115 │ │ │
│ 116 │ │ self._interrupt_count = 0 │
│ 117 │ │ try: │
│ ❱ 118 │ │ │ return self._loop.run_until_complete(task) │
│ 119 │ │ except exceptions.CancelledError: │
│ 120 │ │ │ if self._interrupt_count > 0: │
│ 121 │ │ │ │ uncancel = getattr(task, "uncancel", None) │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ context = <_contextvars.Context object at 0x75c35ff50a80> │ │
│ │ coro = <coroutine object serve.<locals>.serve_inner at │ │
│ │ 0x75c34d6c81f0> │ │
│ │ self = <asyncio.runners.Runner object at 0x75c3c84b9ed0> │ │
│ │ sigint_handler = functools.partial(<bound method Runner._on_sigint of │ │
│ │ <asyncio.runners.Runner object at 0x75c3c84b9ed0>>, │ │
│ │ main_task=<Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:244> │ │
│ │ exception=RuntimeError('The size of tensor a (48) must │ │
│ │ match the size of tensor b (64) at non-singleton │ │
│ │ dimension 0')>) │ │
│ │ task = <Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:244> │ │
│ │ exception=RuntimeError('The size of tensor a (48) must │ │
│ │ match the size of tensor b (64) at non-singleton │ │
│ │ dimension 0')> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │
│ /asyncio/base_events.py:654 in run_until_complete │
│ │
│ 651 │ │ if not future.done(): │
│ 652 │ │ │ raise RuntimeError('Event loop stopped before Future comp │
│ 653 │ │ │
│ ❱ 654 │ │ return future.result() │
│ 655 │ │
│ 656 │ def stop(self): │
│ 657 │ │ """Stop running the event loop. │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ future = <Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:244> │ │
│ │ exception=RuntimeError('The size of tensor a (48) must match │ │
│ │ the size of tensor b (64) at non-singleton dimension 0')> │ │
│ │ new_task = False │ │
│ │ self = <_UnixSelectorEventLoop running=False closed=True │ │
│ │ debug=False> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/server.py:268 in serve_inner │
│ │
│ 265 │ │ │ server_urls = [local_url] │
│ 266 │ │ │
│ 267 │ │ try: │
│ ❱ 268 │ │ │ model = get_model_with_lora_adapters( │
│ 269 │ │ │ │ model_id, │
│ 270 │ │ │ │ lora_adapters, │
│ 271 │ │ │ │ revision, │
│ │
│ ╭──────────────────────────── locals ─────────────────────────────╮ │
│ │ adapter_to_index = {} │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ local_url = 'unix:///tmp/text-generation-server-0' │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'microsoft/Phi-4-mini-instruct' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ server_urls = ['unix:///tmp/text-generation-server-0'] │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ │ unix_socket_template = 'unix://{}-{}' │ │
│ ╰─────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/models/__init__.py:1814 in │
│ get_model_with_lora_adapters │
│ │
│ 1811 │ adapter_to_index: Dict[str, int], │
│ 1812 ): │
│ 1813 │ lora_adapter_ids = [adapter.id for adapter in lora_adapters] │
│ ❱ 1814 │ model = get_model( │
│ 1815 │ │ model_id, │
│ 1816 │ │ lora_adapter_ids, │
│ 1817 │ │ revision, │
│ │
│ ╭────────────────────── locals ───────────────────────╮ │
│ │ adapter_to_index = {} │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ lora_adapter_ids = [] │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'microsoft/Phi-4-mini-instruct' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ ╰─────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/models/__init__.py:998 in get_model │
│ │
│ 995 │ │
│ 996 │ elif model_type == LLAMA or model_type == PHI3 or model_type == G │
│ 997 │ │ if FLASH_ATTENTION: │
│ ❱ 998 │ │ │ return FlashCausalLM( │
│ 999 │ │ │ │ model_id=model_id, │
│ 1000 │ │ │ │ model_class=FlashLlamaForCausalLM, │
│ 1001 │ │ │ │ revision=revision, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ _ = {} │ │
│ │ compressed_tensors_config = None │ │
│ │ config_dict = { │ │
│ │ │ '_name_or_path': 'Phi-4-mini-instruct', │ │
│ │ │ 'architectures': ['Phi3ForCausalLM'], │ │
│ │ │ 'attention_bias': False, │ │
│ │ │ 'attention_dropout': 0.0, │ │
│ │ │ 'auto_map': { │ │
│ │ │ │ 'AutoConfig': │ │
│ │ 'microsoft/Phi-4-mini-instruct--configurati… │ │
│ │ │ │ 'AutoModelForCausalLM': │ │
│ │ 'microsoft/Phi-4-mini-instruct--modeling_ph… │ │
│ │ │ │ 'AutoTokenizer': │ │
│ │ 'microsoft/Phi-4-mini-instruct--Xenova/gpt-… │ │
│ │ │ }, │ │
│ │ │ 'bos_token_id': 199999, │ │
│ │ │ 'embd_pdrop': 0.0, │ │
│ │ │ 'eos_token_id': 199999, │ │
│ │ │ 'full_attn_mod': 1, │ │
│ │ │ 'hidden_act': 'silu', │ │
│ │ │ ... +25 │ │
│ │ } │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ kv_cache_scheme = None │ │
│ │ lora_adapter_ids = [] │ │
│ │ max_input_tokens = None │ │
│ │ method = 'n-gram' │ │
│ │ model_id = 'microsoft/Phi-4-mini-instruct' │ │
│ │ model_type = 'phi3' │ │
│ │ needs_sliding_window = False │ │
│ │ quantization_config = None │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ sliding_window = 262144 │ │
│ │ speculate = 0 │ │
│ │ speculator = None │ │
│ │ trust_remote_code = False │ │
│ │ use_sliding_window = True │ │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (48) must match the size of tensor b (64) at
non-singleton dimension 0 rank=0
Error: ShardCannotStart
2025-04-17T08:38:24.705756Z ERROR text_generation_launcher: Shard 0 failed to start
2025-04-17T08:38:24.705774Z INFO text_generation_launcher: Shutting down shards
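
The thread was closed without a diagnosis. One reading that fits the numbers in the error, offered as an unconfirmed guess: if the model applies rotary embeddings to only part of each attention head, its rotary scaling factors have fewer entries than a Llama-style loader expects for a full head. The values below (`hidden_size`, `num_attention_heads`, `partial_rotary_factor`) are assumptions about the model's `config.json`, which is truncated in the traceback, and should be checked against the actual file.

```python
# Assumed config values; verify against the model's config.json.
hidden_size = 3072
num_attention_heads = 24
partial_rotary_factor = 0.75

# Width of one attention head.
head_dim = hidden_size // num_attention_heads

# Number of dimensions per head that receive rotary embeddings.
rotary_dim = int(head_dim * partial_rotary_factor)

# Rotary frequencies come in pairs, so each count is half the dimension.
partial_freqs = rotary_dim // 2   # count under partial rotary
full_freqs = head_dim // 2        # count if the whole head is rotated

print(partial_freqs, full_freqs)
```

Under these assumptions the two counts come out to 48 and 64, the same sizes reported in the `RuntimeError`. That would be consistent with the `FlashLlamaForCausalLM` path in the traceback treating the whole head as rotary, but the thread itself does not confirm the cause.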

BenjaminAslamy changed discussion status to closed Apr 17, 2025
