Instructions to use pytorch/Phi-4-mini-instruct-INT4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use pytorch/Phi-4-mini-instruct-INT4 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="pytorch/Phi-4-mini-instruct-INT4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pytorch/Phi-4-mini-instruct-INT4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("pytorch/Phi-4-mini-instruct-INT4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use pytorch/Phi-4-mini-instruct-INT4 with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "pytorch/Phi-4-mini-instruct-INT4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pytorch/Phi-4-mini-instruct-INT4",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```shell
docker model run hf.co/pytorch/Phi-4-mini-instruct-INT4
```
- SGLang
How to use pytorch/Phi-4-mini-instruct-INT4 with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "pytorch/Phi-4-mini-instruct-INT4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pytorch/Phi-4-mini-instruct-INT4",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "pytorch/Phi-4-mini-instruct-INT4" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pytorch/Phi-4-mini-instruct-INT4",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Docker Model Runner
How to use pytorch/Phi-4-mini-instruct-INT4 with Docker Model Runner:
docker model run hf.co/pytorch/Phi-4-mini-instruct-INT4
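The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `send` are hypothetical, and the default base URL assumes the vLLM server from the snippet above is listening on port 8000 (use port 30000 for the SGLang server).

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000",  # assumption: local vLLM server
    model: str = "pytorch/Phi-4-mini-instruct-INT4",
) -> urllib.request.Request:
    """Build the same POST request that the curl examples send."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text.

    Requires a running server; the response is read in the
    OpenAI chat-completions shape.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `send(build_chat_request("What is the capital of France?"))` would return the generated reply; building the request alone needs no network access.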
Update model card: Add TorchAO paper, code, documentation links and correct license
#3
by nielsr HF Staff - opened
README.md
CHANGED
```diff
@@ -1,5 +1,11 @@
 ---
+base_model:
+- microsoft/Phi-4-mini-instruct
+language:
+- multilingual
 library_name: transformers
+license: bsd-3-clause
+pipeline_tag: text-generation
 tags:
 - torchao
 - phi
@@ -9,15 +15,20 @@ tags:
 - math
 - chat
 - conversational
-license: mit
-language:
-- multilingual
-base_model:
-- microsoft/Phi-4-mini-instruct
-pipeline_tag: text-generation
 ---
 
-
+This repository hosts the **Phi4-mini-instruct** model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) using int4 weight-only quantization and the [hqq](https://mobiusml.github.io/hqq_blog/) algorithm. This work is brought to you by the PyTorch team. This model can be used directly or served using [vLLM](https://docs.vllm.ai/en/latest/) for significant VRAM reduction and speedup on A100 GPUs.
+
+## Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization
+The model's quantization is powered by **TorchAO**, a framework presented in the paper [TorchAO: PyTorch-Native Training-to-Serving Model Optimization](https://huggingface.co/papers/2507.16099).
+
+**Abstract:** We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at this https URL .
+
+## Resources
+* **Official TorchAO GitHub Repository:** [https://github.com/pytorch/ao](https://github.com/pytorch/ao)
+* **TorchAO Documentation:** [https://docs.pytorch.org/ao/stable/index.html](https://docs.pytorch.org/ao/stable/index.html)
+
+---
 
 # Inference with vLLM
 Install vllm nightly and torchao nightly to get some recent changes:
@@ -49,7 +60,9 @@ if __name__ == '__main__':
     # that contain the prompt, generated text, and other information.
     outputs = llm.generate(prompts, sampling_params)
     # Print the outputs.
-    print("
+    print("
+Generated Outputs:
+" + "-" * 60)
     for output in outputs:
         prompt = output.prompt
         generated_text = output.outputs[0].text
```
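In the last hunk, the added `print(` call is shown with its double-quoted string split across three physical lines. Python rejects a plain double-quoted literal that contains a raw line break, so the snippet would not run if the file really contains those breaks. Assuming the intent is a blank line, the header text, and a 60-character rule, the escaped form below is one way to write it. The helper name `print_header` is hypothetical and only there to make the sketch testable.

```python
def print_header(title: str = "Generated Outputs:", width: int = 60) -> str:
    """Print a header preceded by a blank line and followed by a dashed rule.

    Line breaks are written as "\n" escapes so the string literal
    stays on one source line.
    """
    header = "\n" + title + "\n" + "-" * width
    print(header)
    return header
```

Inside the model card's example, the equivalent single statement would be `print("\nGenerated Outputs:\n" + "-" * 60)`.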
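The model card text added in this PR describes the checkpoint as using int4 weight-only quantization. As a rough illustration of that idea, the sketch below quantizes a row of weights group by group to 4-bit codes with a per-group scale and offset, then packs two codes per byte. It uses plain round-to-nearest, not the hqq algorithm, and it does not reproduce torchao's kernels or storage layout. The group size of 128 and all function names are assumptions made for the example.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (4-bit codes, scale, offset)


def quantize_group(weights: List[float]) -> Group:
    """Map one group of floats to integer codes in 0..15 (asymmetric, round-to-nearest)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 if hi > lo else 1.0  # 15 steps span the group's range
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate floats from the codes."""
    return [c * scale + lo for c in codes]


def quantize_row(row: List[float], group_size: int = 128) -> List[Group]:
    """Quantize a weight row in consecutive groups of `group_size` values."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]


def pack_int4(codes: List[int]) -> bytes:
    """Pack two 4-bit codes into each byte (high nibble first), zero-padding odd lengths."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
```

In this scheme each weight occupies half a byte plus a small per-group overhead for the scale and offset, compared with two bytes per weight in bfloat16. That is the source of the memory saving, and the reconstruction error per weight is bounded by half the group's scale.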