---
license: mit
license_link: https://huggingface.co/microsoft/Phi-4-reasoning/resolve/main/LICENSE
language:
- en
base_model:
- microsoft/Phi-4-reasoning
pipeline_tag: text-generation
tags:
- phi
- nlp
- math
- code
- chat
- conversational
- reasoning
- red hat
- FP8
- compressed-tensors
- llm-compressor
---

## Model Overview

- **Model Architecture:** Phi3ForCausalLM
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Intended Use Cases:** This model is designed to accelerate research on language models and to serve as a building block for generative-AI-powered features. It is intended for general-purpose AI systems and applications (primarily in English) that require:
  1. Memory/compute-constrained environments.
  2. Latency-bound scenarios.
  3. Math reasoning and logic.
- **Release Date:** 01/26/2026
- **Version:** 1.0
- **Model Developers:** Red Hat

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Phi-4-reasoning](https://huggingface.co/microsoft/Phi-4-reasoning) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x.
Weight quantization also reduces disk size requirements by approximately 50%.
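
The ~50% figures follow directly from halving the bytes per parameter. A back-of-the-envelope sketch, assuming a parameter count of roughly 14.7 billion for Phi-4-reasoning (an approximation used only for illustration):

```python
# Back-of-the-envelope weight-memory estimate for FP8 quantization.
# The ~14.7B parameter count is an assumption for illustration only.
NUM_PARAMS = 14.7e9

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30

bf16 = weight_memory_gib(NUM_PARAMS, 2)  # 16-bit weights: 2 bytes each
fp8 = weight_memory_gib(NUM_PARAMS, 1)   # 8-bit weights: 1 byte each

print(f"BF16 weights: ~{bf16:.1f} GiB")
print(f"FP8 weights:  ~{fp8:.1f} GiB ({fp8 / bf16:.0%} of BF16)")
```

This counts weights only; activations, KV cache, and runtime overhead add to the actual GPU memory footprint.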

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
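
The two granularities differ in which axis gets its own scale and when the scale is computed. A minimal NumPy sketch of how such scales could be derived with max-abs symmetric scaling (this illustrates the granularity only, not llm-compressor's actual implementation or the FP8 e4m3 encoding itself):

```python
import numpy as np

# Illustrative only: max-abs symmetric scale computation for the two
# quantization granularities described above.
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def per_channel_scales(weight: np.ndarray) -> np.ndarray:
    # Static per-channel: one scale per output channel (row), fixed at
    # quantization time from the weight tensor itself.
    return np.abs(weight).max(axis=1) / FP8_E4M3_MAX

def per_token_scales(activations: np.ndarray) -> np.ndarray:
    # Dynamic per-token: one scale per token (row), recomputed at runtime
    # for every batch of activations.
    return np.abs(activations).max(axis=1) / FP8_E4M3_MAX

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))   # [out_channels, in_features]
x = rng.normal(size=(3, 8))   # [tokens, hidden_dim]
print(per_channel_scales(w))  # 4 scales, one per output channel
print(per_token_scales(x))    # 3 scales, one per token
```

"Dynamic" here means the activation scales are not stored in the checkpoint; the inference engine computes them on the fly per token.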

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```bash
vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1
```

```python
from openai import OpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.chat.completions.create(
    model="RedHatAI/Phi-4-reasoning-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
)
print(completion.choices[0].message.content)
```

## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "microsoft/Phi-4-reasoning"
model_name = model_stub.split("/")[-1]

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

</details>

## Evaluation

The model was evaluated on the AIME25, GPQA Diamond, and MATH-500 benchmarks using [lighteval](https://github.com/huggingface/lighteval), and on MMLU-Pro using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
In both cases, [vLLM](https://vllm.ai) was used as the backend.

|
<details> |
|
|
<summary>Evaluation commands</summary> |
|
|
|
|
|
### Start vLLM server |
|
|
```bash |
|
|
vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1 |
|
|
``` |
|
|
|
|
|
### lm-evaluation-harness |
|
|
```bash |
|
|
lm_eval --model local-chat-completions \ |
|
|
--tasks mmlu_pro_chat \ |
|
|
--model_args "model=RedHatAI/Phi-4-reasoning-FP8-dynamic,max_length=32000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=2400,tokenizer_backend=None" \ |
|
|
--apply_chat_template \ |
|
|
--num_fewshot 0 \ |
|
|
--output_path mmlu_pro_phi4_reasoning_fp8_dynamic \ |
|
|
--gen_kwargs "do_sample=True,temperature=0.8,top_k=50,top_p=0.95,max_gen_toks=24000" |
|
|
``` |
|
|
|
|
|
### lighteval |
|
|
litellm_config.yaml |
|
|
```yaml |
|
|
model_parameters: |
|
|
provider: "hosted_vllm" |
|
|
model_name: "hosted_vllm/RedHatAI/Phi-4-reasoning-FP8-dynamic" |
|
|
base_url: "http://0.0.0.0:8000/v1" |
|
|
api_key: "" |
|
|
timeout: 1200 |
|
|
concurrent_requests: 64 |
|
|
generation_parameters: |
|
|
temperature: 0.8 |
|
|
top_k: 50 |
|
|
top_p: 0.95 |
|
|
max_new_tokens: 24000 |
|
|
``` |
|
|
|
|
|
```bash |
|
|
lighteval endpoint litellm litellm_config.yaml \ |
|
|
gpqa:diamond|0,math_500|0,aime25|0 \ |
|
|
--output-dir phi4_reasoning_fp8_dynamic \ |
|
|
--save-details |
|
|
``` |
|
|
</details> |

### Accuracy

| Benchmark | Phi-4-reasoning | Phi-4-reasoning-FP8-dynamic<br>(this model) | Recovery |
|---|---|---|---|
| AIME25 | 61.25 | 64.58 | 105.4% |
| GPQA Diamond | 64.65 | 66.50 | 102.9% |
| MATH-500 | 90.01 | 88.60 | 98.4% |
| MMLU-Pro | 76.49 | 76.85 | 100.5% |
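
Recovery is the quantized model's score expressed as a percentage of the baseline score; values above 100% mean the quantized model scored higher on that benchmark. A quick sketch of the computation, using the scores from the table above:

```python
# Recovery = quantized score / baseline score, as a percentage.
# Scores are taken from the accuracy table above.
scores = {
    #                (baseline, FP8-dynamic)
    "AIME25":        (61.25, 64.58),
    "GPQA Diamond":  (64.65, 66.50),
    "MATH-500":      (90.01, 88.60),
    "MMLU-Pro":      (76.49, 76.85),
}

for benchmark, (baseline, quantized) in scores.items():
    recovery = 100 * quantized / baseline
    print(f"{benchmark}: {recovery:.1f}%")
```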