---
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- 'no'
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
license: mit
license_name: mit
license_link: https://huggingface.co/microsoft/Phi-4-mini-instruct/resolve/main/LICENSE
name: RedHatAI/Phi-4-mini-instruct-FP8-dynamic
base_model:
- microsoft/Phi-4-mini-instruct
provider: Microsoft
description: This model was obtained by quantizing the activations and weights of Phi-4-mini-instruct to the FP8 data type.
validated_on:
  - RHOAI 3.4 EA1
  - RHAIIS 3.4 EA1
readme: https://huggingface.co/RedHatAI/Phi-4-mini-instruct-FP8-dynamic/blob/main/README.md
pipeline_tag: text-generation
tags:
- nlp
- code
- red hat
- FP8
- compressed-tensors
- llm-compressor
---

<h1 align="center" style="display: flex; align-items: center; gap: 10px; margin: 0;">
  Phi-4-mini-instruct-FP8-dynamic
  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
</h1>
<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
<img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
</a>


## Model Overview
- **Model Architecture:** Phi3ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Intended Use Cases:** The model is intended for broad multilingual commercial and research use. It targets general-purpose AI systems and applications that involve:
  1. Memory- or compute-constrained environments.
  2. Latency-bound scenarios.
  3. Math reasoning and logic.
- **Release Date:** 03/03/2025
- **Version:** 1.0
- **Model Developers:** Red Hat


### Model Optimizations

This model was obtained by quantizing the activations and weights of [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.
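
As a rough worked example of the 16-to-8-bit arithmetic, the sketch below estimates weight storage for a given parameter count. The helper name `weight_gib` and the 3.8 billion parameter figure are assumptions chosen for illustration, not numbers taken from this card. Real savings are somewhat smaller than the ideal 50%, because any layer left unquantized stays at 16 bits and the quantization scales add a small overhead.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB, ignoring scales and metadata."""
    return num_params * bits_per_param / 8 / 2**30


# Assumed parameter count, for illustration only.
NUM_PARAMS = 3.8e9

bf16 = weight_gib(NUM_PARAMS, 16)
fp8 = weight_gib(NUM_PARAMS, 8)
print(f"16-bit: {bf16:.2f} GiB, 8-bit: {fp8:.2f} GiB, saving: {1 - fp8 / bf16:.0%}")
```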

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
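
To make the two schemes concrete, here is a minimal pure-Python sketch of symmetric FP8 quantization with one scale per row. It is an illustration under stated assumptions, not the llm-compressor implementation: the helper names (`round_to_e4m3`, `quantize_rows`, `dequantize_rows`) are hypothetical, and the FP8 format is assumed to be E4M3 with a maximum finite value of 448. For a weight matrix shaped `[out_channels, in_features]`, one scale per row is a per-channel scale and is computed once ahead of time (static). For an activation matrix shaped `[tokens, hidden]`, one scale per row is a per-token scale and is recomputed for every input at run time (dynamic).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest FP8 E4M3 value (3 mantissa bits), saturating at the max."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), FP8_E4M3_MAX)
    _, exp = math.frexp(mag)  # mag = m * 2**exp with 0.5 <= m < 1
    # Spacing between neighbouring values; binades below 2**-6 are subnormal.
    step = 2.0 ** (max(exp - 1, -6) - 3)
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_rows(matrix):
    """Symmetric quantization (no zero point) with one scale per row."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(x) for x in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(x / scale) for x in row])
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map FP8 values back to the original range using the per-row scales."""
    return [[x * s for x in row] for row, s in zip(quantized, scales)]


# Weights: rows are output channels, so scales are per-channel and static.
weights = [[0.02, -0.5, 0.13], [1.5, 0.25, -3.0]]
w_q, w_scales = quantize_rows(weights)

# Activations: rows are tokens, so scales are per-token and dynamic.
activations = [[0.7, -1.2, 0.05], [10.0, 2.0, -4.0]]
a_q, a_scales = quantize_rows(activations)

print(dequantize_rows(w_q, w_scales))
print(dequantize_rows(a_q, a_scales))
```

Because each row gets its own scale, a token or channel with unusually large values does not reduce the precision available to the others.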

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```bash
vllm serve RedHatAI/Phi-4-mini-instruct-FP8-dynamic --max_model_len 131072
```

```python
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

generated_text = client.chat.completions.create(
    model="RedHatAI/Phi-4-mini-instruct-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ],
)
print(generated_text.choices[0].message.content)
```

## Creation

<details>
  <summary>Creation details</summary>
  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below. 


  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from llmcompressor.modifiers.quantization import QuantizationModifier
  from llmcompressor import oneshot
  
  # Load model
  model_stub = "microsoft/Phi-4-mini-instruct"
  model_name = model_stub.split("/")[-1]
  
  tokenizer = AutoTokenizer.from_pretrained(model_stub)
  
  model = AutoModelForCausalLM.from_pretrained(
      model_stub,
      device_map="auto",
      torch_dtype="auto",
  )
  
  # Configure the quantization algorithm and scheme
  recipe = QuantizationModifier(
      targets="Linear",
      scheme="FP8_dynamic",
      ignore=["lm_head"],
  )
  
  # Apply quantization
  oneshot(
      model=model,
      recipe=recipe,
  )
  
  # Save to disk in compressed-tensors format
  save_path = model_name + "-FP8-dynamic"
  model.save_pretrained(save_path)
  tokenizer.save_pretrained(save_path)
  print(f"Model and tokenizer saved to: {save_path}")
  ```
</details>
 


## Evaluation

The model was evaluated on the Math 500 benchmark using [lighteval](https://github.com/huggingface/lighteval), and on GSM8k-Platinum, MMLU CoT, MMLU-Pro, and IFEval using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
In both cases, [vLLM](https://vllm.ai) is used as the backend.

<details>
  <summary>Evaluation commands</summary>

### Start vLLM server
```bash
vllm serve RedHatAI/Phi-4-mini-instruct-FP8-dynamic --max_model_len 131072
```

### lm-evaluation-harness
```bash
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
  --apply_chat_template \
  --num_fewshot 5 \
  --fewshot_as_multiturn \
  --output_path gsm8k_platinum_phi4_mini_instruct_fp8_dynamic \
  --gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
```

```bash
lm_eval --model local-chat-completions \
  --tasks mmlu_cot_llama \
  --model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
  --apply_chat_template \
  --output_path mmlu_cot_phi4_mini_instruct_fp8_dynamic \
  --gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
```

```bash
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
  --apply_chat_template \
  --num_fewshot 5 \
  --fewshot_as_multiturn \
  --output_path mmlu_pro_phi4_mini_instruct_fp8_dynamic \
  --gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
```

```bash
lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=RedHatAI/Phi-4-mini-instruct-FP8-dynamic,max_length=131072,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=600,tokenizer_backend=None" \
  --apply_chat_template \
  --output_path ifeval_phi4_mini_instruct_fp8_dynamic \
  --gen_kwargs "do_sample=False,temperature=0.0,max_gen_toks=16000"
```

### lighteval
Save the following as `litellm_config.yaml`:
```yaml
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Phi-4-mini-instruct-FP8-dynamic"
  base_url: "http://0.0.0.0:8000/v1"
  api_key: ""
  timeout: 600
  concurrent_requests: 128
  generation_parameters:
    temperature: 0.0
    max_new_tokens: 16000
```

```bash
lighteval endpoint litellm litellm_config.yaml \
    "math_500|0" \
    --output-dir phi4_mini_instruct_fp8_dynamic \
    --save-details
```
</details>

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>Phi-4-mini-instruct</strong>
   </td>
   <td><strong>Phi-4-mini-instruct-FP8-dynamic<br>(this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>Math 500
   </td>
   <td>57.60
   </td>
   <td>58.20
   </td>
   <td>101.0%
   </td>
  </tr>
  <tr>
   <td>GSM8k-Platinum
   </td>
   <td>84.12
   </td>
   <td>84.70
   </td>
   <td>100.7%
   </td>
  </tr>
  <tr>
   <td>MMLU CoT
   </td>
   <td>67.01
   </td>
   <td>66.97
   </td>
   <td>99.9%
   </td>
  </tr>
  <tr>
   <td>MMLU-Pro
   </td>
   <td>46.75
   </td>
   <td>45.60
   </td>
   <td>97.5%
   </td>
  </tr>
</table>
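
The Recovery column appears to be the quantized model's score expressed as a percentage of the baseline score. That definition is an assumption inferred from the table values, and the helper name `recovery` is hypothetical. A minimal sketch that recomputes it for three rows of the table:

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


# Scores copied from the table above: (baseline, quantized).
scores = {
    "GSM8k-Platinum": (84.12, 84.70),
    "MMLU CoT": (67.01, 66.97),
    "MMLU-Pro": (46.75, 45.60),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.1f}%")
```

A value above 100% means the quantized model scored slightly higher than the baseline on that benchmark, which is typically within run-to-run noise rather than a real improvement.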