---
license: mit
license_link: https://huggingface.co/microsoft/Phi-4-reasoning/resolve/main/LICENSE
language:
- en
base_model:
- microsoft/Phi-4-reasoning
pipeline_tag: text-generation
tags:
- phi
- nlp
- math
- code
- chat
- conversational
- reasoning
- red hat
- FP8
- compressed-tensors
- llm-compressor
---

## Model Overview

- **Model Architecture:** Phi3ForCausalLM
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Intended Use Cases:** This model is designed to accelerate research on language models and to serve as a building block for generative-AI-powered features. It is intended for general-purpose AI systems and applications (primarily in English) that require:
  1. Memory/compute-constrained environments.
  2. Latency-bound scenarios.
  3. Math reasoning and logic.
- **Release Date:** 01/26/2026
- **Version:** 1.0
- **Model Developers:** Red Hat

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Phi-4-reasoning](https://huggingface.co/microsoft/Phi-4-reasoning) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x.
Weight quantization also reduces disk size requirements by approximately 50%.
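
The ~50% figures follow directly from halving the bytes per parameter. A back-of-the-envelope sketch, assuming a parameter count of roughly 14.7 billion for Phi-4-reasoning (an approximation used only for illustration):

```python
# Back-of-the-envelope weight-memory estimate for FP8 quantization.
# The ~14.7B parameter count is an assumption for illustration only.
NUM_PARAMS = 14.7e9

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30

bf16 = weight_memory_gib(NUM_PARAMS, 2)  # 16-bit weights: 2 bytes each
fp8 = weight_memory_gib(NUM_PARAMS, 1)   # 8-bit weights: 1 byte each

print(f"BF16 weights: ~{bf16:.1f} GiB")
print(f"FP8 weights:  ~{fp8:.1f} GiB ({fp8 / bf16:.0%} of BF16)")
```

This counts weights only; activations, KV cache, and runtime overhead add to the actual GPU memory footprint.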

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
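
The two granularities differ in which axis gets its own scale and when the scale is computed. A minimal NumPy sketch of how such scales could be derived with max-abs symmetric scaling (this illustrates the granularity only, not llm-compressor's actual implementation or the FP8 e4m3 encoding itself):

```python
import numpy as np

# Illustrative only: max-abs symmetric scale computation for the two
# quantization granularities described above.
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def per_channel_scales(weight: np.ndarray) -> np.ndarray:
    # Static per-channel: one scale per output channel (row), fixed at
    # quantization time from the weight tensor itself.
    return np.abs(weight).max(axis=1) / FP8_E4M3_MAX

def per_token_scales(activations: np.ndarray) -> np.ndarray:
    # Dynamic per-token: one scale per token (row), recomputed at runtime
    # for every batch of activations.
    return np.abs(activations).max(axis=1) / FP8_E4M3_MAX

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))   # [out_channels, in_features]
x = rng.normal(size=(3, 8))   # [tokens, hidden_dim]
print(per_channel_scales(w))  # 4 scales, one per output channel
print(per_token_scales(x))    # 3 scales, one per token
```

"Dynamic" here means the activation scales are not stored in the checkpoint; the inference engine computes them on the fly per token.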

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```bash
vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1
```

```python
from openai import OpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.chat.completions.create(
    model="RedHatAI/Phi-4-reasoning-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
)
print(completion.choices[0].message.content)
```

## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "microsoft/Phi-4-reasoning"
model_name = model_stub.split("/")[-1]

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```

</details>

## Evaluation

The model was evaluated on the AIME25, GPQA Diamond, and MATH-500 benchmarks using [lighteval](https://github.com/huggingface/lighteval), and on MMLU-Pro using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
In both cases, [vLLM](https://vllm.ai) was used as the backend.

|
<details> |
|
|
<summary>Evaluation commands</summary> |
|
|
|
|
|
### Start vLLM server |
|
|
```bash |
|
|
vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1 |
|
|
``` |
|
|
|
|
|
### lm-evaluation-harness |
|
|
```bash |
|
|
lm_eval --model local-chat-completions \ |
|
|
--tasks mmlu_pro_chat \ |
|
|
--model_args "model=RedHatAI/Phi-4-reasoning-FP8-dynamic,max_length=32000,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,timeout=2400,tokenizer_backend=None" \ |
|
|
--apply_chat_template \ |
|
|
--num_fewshot 0 \ |
|
|
--output_path mmlu_pro_phi4_reasoning_fp8_dynamic \ |
|
|
--gen_kwargs "do_sample=True,temperature=0.8,top_k=50,top_p=0.95,max_gen_toks=24000" |
|
|
``` |
|
|
|
|
|
### lighteval |
|
|
litellm_config.yaml |
|
|
```yaml |
|
|
model_parameters: |
|
|
provider: "hosted_vllm" |
|
|
model_name: "hosted_vllm/RedHatAI/Phi-4-reasoning-FP8-dynamic" |
|
|
base_url: "http://0.0.0.0:8000/v1" |
|
|
api_key: "" |
|
|
timeout: 1200 |
|
|
concurrent_requests: 64 |
|
|
generation_parameters: |
|
|
temperature: 0.8 |
|
|
top_k: 50 |
|
|
top_p: 0.95 |
|
|
max_new_tokens: 24000 |
|
|
``` |
|
|
|
|
|
```bash |
|
|
lighteval endpoint litellm litellm_config.yaml \ |
|
|
gpqa:diamond|0,math_500|0,aime25|0 \ |
|
|
--output-dir phi4_reasoning_fp8_dynamic \ |
|
|
--save-details |
|
|
``` |
|
|
</details> |

### Accuracy

| Benchmark | Phi-4-reasoning | Phi-4-reasoning-FP8-dynamic<br>(this model) | Recovery |
|---|---|---|---|
| AIME25 | 61.25 | 64.58 | 105.4% |
| GPQA Diamond | 64.65 | 66.50 | 102.9% |
| MATH-500 | 90.01 | 88.60 | 98.4% |
| MMLU-Pro | 76.49 | 76.85 | 100.5% |
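
Recovery is the quantized model's score expressed as a percentage of the baseline score; values above 100% mean the quantized model scored higher on that benchmark. A quick sketch of the computation, using the scores from the table above:

```python
# Recovery = quantized score / baseline score, as a percentage.
# Scores are taken from the accuracy table above.
scores = {
    #                (baseline, FP8-dynamic)
    "AIME25":        (61.25, 64.58),
    "GPQA Diamond":  (64.65, 66.50),
    "MATH-500":      (90.01, 88.60),
    "MMLU-Pro":      (76.49, 76.85),
}

for benchmark, (baseline, quantized) in scores.items():
    recovery = 100 * quantized / baseline
    print(f"{benchmark}: {recovery:.1f}%")
```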