---
license: mit
license_link: https://huggingface.co/microsoft/Phi-4-reasoning/resolve/main/LICENSE
language:
- en
base_model:
- microsoft/Phi-4-reasoning
pipeline_tag: text-generation
tags:
- phi
- nlp
- math
- code
- chat
- conversational
- reasoning
- red hat
- FP8
- compressed-tensors
- llm-compressor
---

## Model Overview
- **Model Architecture:** Phi3ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** FP8
  - **Weight quantization:** FP8
- **Intended Use Cases:** This model is designed to accelerate research on language models, for use as a building block for generative-AI-powered features. It is intended for general-purpose AI systems and applications (primarily in English) that require:
  1. Memory/compute-constrained environments.
  2. Latency-bound scenarios.
  3. Math reasoning and logic.
- **Release Date:** 01/26/2026
- **Version:** 1.0
- **Model Developers:** Red Hat


### Model Optimizations

This model was obtained by quantizing the weights and activations of [Phi-4-reasoning](https://huggingface.co/microsoft/Phi-4-reasoning) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

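The difference between the two schemes can be illustrated with a minimal NumPy sketch (an illustration only, not the actual llm-compressor kernels; real FP8 casting additionally rounds values onto the E4M3 grid):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_weight_per_channel(w: np.ndarray):
    # Symmetric static per-channel: one scale per output channel (row),
    # computed once ahead of time from the weight tensor.
    scale = np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale  # dequantize with q * scale

def quantize_activation_per_token(x: np.ndarray):
    # Symmetric dynamic per-token: one scale per token (row), computed
    # on the fly at inference time from the incoming activations.
    scale = np.abs(x).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale
```

"Static" here means the weight scales are fixed at quantization time, while the activation scales are recomputed per token at inference, which is why no calibration dataset is needed for this scheme.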
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```bash
vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1
```

```python
from openai import OpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.chat.completions.create(
    model="RedHatAI/Phi-4-reasoning-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
)
print(completion.choices[0].message.content)
```
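With `--reasoning-parser deepseek_r1` enabled, vLLM separates the model's chain-of-thought from the final answer, exposing it as `message.reasoning_content` alongside `message.content`. Without a parser, the raw completion interleaves both; a hedged sketch of splitting such output manually, assuming the chain-of-thought is wrapped in `<think>...</think>` delimiters:

```python
def split_reasoning(raw: str) -> tuple[str, str]:
    # Split a raw completion into (reasoning, answer), assuming the
    # chain-of-thought is wrapped in <think>...</think> delimiters.
    open_tag, close_tag = "<think>", "</think>"
    if close_tag not in raw:
        return "", raw.strip()
    reasoning, _, answer = raw.partition(close_tag)
    reasoning = reasoning.replace(open_tag, "", 1)
    return reasoning.strip(), answer.strip()
```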

## Creation

<details>
<summary>Creation details</summary>
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "microsoft/Phi-4-reasoning"
model_name = model_stub.split("/")[-1]

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_dynamic",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>

## Evaluation

The model was evaluated on the AIME25, GPQA Diamond and Math 500 benchmarks using [lighteval](https://github.com/huggingface/lighteval) and [vLLM](https://vllm.ai).

<details>
<summary>Evaluation commands</summary>

litellm_config.yaml
```yaml
model_parameters:
  provider: "hosted_vllm"
  model_name: "hosted_vllm/RedHatAI/Phi-4-reasoning-FP8-dynamic"
  base_url: "http://0.0.0.0:8000/v1"
  api_key: ""
  timeout: 1200
  concurrent_requests: 64
  generation_parameters:
    temperature: 0.8
    top_k: 50
    top_p: 0.95
    max_new_tokens: 24000
```

```bash
lighteval endpoint litellm litellm_config.yaml \
  "gpqa:diamond|0,math_500|0,aime25|0" \
  --output-dir phi4_reasoning_fp8_dynamic \
  --save-details
```
</details>

### Accuracy

<table>
  <tr>
    <td><strong>Benchmark</strong></td>
    <td><strong>Phi-4-reasoning</strong></td>
    <td><strong>Phi-4-reasoning FP8-dynamic<br>(this model)</strong></td>
    <td><strong>Recovery</strong></td>
  </tr>
  <tr>
    <td>AIME25</td>
    <td>61.25</td>
    <td>64.58</td>
    <td>105.4%</td>
  </tr>
  <tr>
    <td>GPQA Diamond</td>
    <td>64.65</td>
    <td>66.50</td>
    <td>102.9%</td>
  </tr>
  <tr>
    <td>Math 500</td>
    <td>90.01</td>
    <td>88.60</td>
    <td>98.4%</td>
  </tr>
</table>
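
Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score, so values above 100% mean the quantized model scored higher on that benchmark. A quick sketch of the computation, using the scores from the table above:

```python
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return round(100 * quantized_score / baseline_score, 1)

# (benchmark, baseline score, this model's score), from the table above
results = [
    ("AIME25", 61.25, 64.58),
    ("GPQA Diamond", 64.65, 66.50),
    ("Math 500", 90.01, 88.60),
]
for name, base, fp8 in results:
    print(f"{name}: {recovery(fp8, base)}%")  # 105.4%, 102.9%, 98.4%
```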