---
license: apache-2.0
pipeline_tag: text-generation
tags:
- fp8
- quantized
- llm-compressor
- compressed-tensors
- red hat
base_model:
- ibm-granite/granite-4.0-h-small
---

# Granite-4.0-h-small

## Model Overview
- **Model Architecture:** GraniteMoeHybridForCausalLM
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:**
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [ibm-granite/granite-4.0-h-small](https://huggingface.co/ibm-granite/granite-4.0-h-small).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [ibm-granite/granite-4.0-h-small](https://huggingface.co/ibm-granite/granite-4.0-h-small) to the FP8 data type.
This optimization reduces the number of bits per parameter from 16 to 8, reducing disk size and GPU memory requirements by approximately 50%.
Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.

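The ~50% figure follows directly from the storage arithmetic: FP8 block quantization stores one byte per weight plus a small per-block scale, versus two bytes per BF16 weight. A rough sketch of the per-layer accounting (the 128×128 block size and fp32 scales are illustrative assumptions, not confirmed by this card):

```python
import math

def bf16_bytes(rows, cols):
    # BF16 baseline: 2 bytes per parameter
    return rows * cols * 2

def fp8_block_bytes(rows, cols, block=128):
    # FP8: 1 byte per parameter, plus one fp32 scale per (block x block) tile
    tiles = math.ceil(rows / block) * math.ceil(cols / block)
    return rows * cols * 1 + tiles * 4

rows, cols = 4096, 4096  # a hypothetical linear-layer shape
ratio = fp8_block_bytes(rows, cols) / bf16_bytes(rows, cols)
print(f"quantized/original size: {ratio:.3f}")  # ~0.500
```

The scale overhead is negligible (a few KB per layer), so each quantized linear layer shrinks by almost exactly half; unquantized components (embeddings, norms, the ignored `lm_head`) keep the whole-model reduction somewhat under 50%.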
## Deployment

### Use with vLLM

1. Initialize the vLLM server:
```shell
vllm serve RedHatAI/granite-4.0-h-small-FP8-block --tensor_parallel_size 4
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/granite-4.0-h-small-FP8-block"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```

## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library as shown below (the original draft referenced Qwen/Qwen3-8B; the model identifiers below have been corrected to this model's base).

<details>
<summary>Creation details</summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "ibm-granite/granite-4.0-h-small"

# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = replace_modules_for_calibration(model)

# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with per-block quantization
# * quantize the activations to fp8 with dynamic per-token scales
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
)

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-block"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
</details>

## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on coding benchmarks (HumanEval, MBPP), using [evalplus](https://github.com/evalplus/evalplus).
[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.

<details>
<summary>Evaluation details</summary>

**OpenLLM V1**
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/granite-4.0-h-small-FP8-block",dtype=auto,add_bos_token=True,max_model_len=16384,tensor_parallel_size=4,gpu_memory_utilization=0.9,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --show_config
```

**OpenLLM V2**
```shell
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/granite-4.0-h-small-FP8-block",dtype=auto,add_bos_token=False,max_model_len=16384,tensor_parallel_size=4,gpu_memory_utilization=0.7,disable_log_stats=True,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --write_out \
  --batch_size auto \
  --show_config
```

**Coding Benchmarks**
```shell
evalplus.evaluate --model "RedHatAI/granite-4.0-h-small-FP8-block" \
  --dataset "humaneval" \
  --backend vllm \
  --tp 4 \
  --greedy

evalplus.evaluate --model "RedHatAI/granite-4.0-h-small-FP8-block" \
  --dataset "mbpp" \
  --backend vllm \
  --tp 4 \
  --greedy
```

</details>

<b>*</b> Throughput measured with input length = 2048, output length = 2048, and 1024 requests.

### Accuracy

Scores in parentheses are recovery percentages relative to the unquantized baseline.

<table>
<thead>
<tr>
<th>Category</th>
<th>Metric</th>
<th>ibm-granite/granite-4.0-h-small</th>
<th>ibm-granite/granite-4.0-h-small-FP8</th>
<th>RedHatAI/granite-4.0-h-small-FP8-block</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Model Size (GB)</b></td>
<td></td>
<td>64.41</td>
<td>33.48</td>
<td>36.43</td>
</tr>
<tr>
<td><b>Throughput (Requests/sec)*</b></td>
<td></td>
<td>2.031</td>
<td>2.144</td>
<td>2.066</td>
</tr>
<tr>
<td rowspan="7"><b>OpenLLM V1</b></td>
<td>ARC-Challenge (Acc-Norm, 25-shot)</td>
<td>72.27</td>
<td>71.67 (99.17)</td>
<td>72.27 (100.00)</td>
</tr>
<tr>
<td>GSM8K (Strict-Match, 5-shot)</td>
<td>85.06</td>
<td>85.60 (100.62)</td>
<td>85.60 (100.62)</td>
</tr>
<tr>
<td>HellaSwag (Acc-Norm, 10-shot)</td>
<td>86.07</td>
<td>86.02 (99.94)</td>
<td>85.96 (99.87)</td>
</tr>
<tr>
<td>MMLU (Acc, 5-shot)</td>
<td>77.15</td>
<td>76.94 (99.73)</td>
<td>77.23 (100.10)</td>
</tr>
<tr>
<td>TruthfulQA (MC2, 0-shot)</td>
<td>57.97</td>
<td>57.62 (99.40)</td>
<td>57.85 (99.80)</td>
</tr>
<tr>
<td>Winogrande (Acc, 5-shot)</td>
<td>81.45</td>
<td>81.14 (99.61)</td>
<td>80.82 (99.22)</td>
</tr>
<tr>
<td><b>Average Score</b></td>
<td><b>76.66</b></td>
<td><b>76.50 (99.79)</b></td>
<td><b>76.62 (99.95)</b></td>
</tr>
<tr>
<td rowspan="7"><b>OpenLLM V2</b></td>
<td>IFEval (Inst Level Strict Acc, 0-shot)</td>
<td>87.41</td>
<td>87.65 (100.27)</td>
<td>87.89 (100.55)</td>
</tr>
<tr>
<td>BBH (Acc-Norm, 3-shot)</td>
<td>61.52</td>
<td>61.31 (99.66)</td>
<td>61.40 (99.80)</td>
</tr>
<tr>
<td>Math-Hard (Exact-Match, 4-shot)</td>
<td>46.60</td>
<td>44.34 (95.14)</td>
<td>44.94 (96.43)</td>
</tr>
<tr>
<td>GPQA (Acc-Norm, 0-shot)</td>
<td>32.55</td>
<td>32.05 (98.45)</td>
<td>34.23 (105.15)</td>
</tr>
<tr>
<td>MUSR (Acc-Norm, 0-shot)</td>
<td>46.43</td>
<td>46.30 (99.72)</td>
<td>45.77 (98.58)</td>
</tr>
<tr>
<td>MMLU-Pro (Acc, 5-shot)</td>
<td>47.96</td>
<td>47.91 (99.88)</td>
<td>47.93 (99.93)</td>
</tr>
<tr>
<td><b>Average Score</b></td>
<td><b>53.75</b></td>
<td><b>53.26 (99.09)</b></td>
<td><b>53.69 (99.89)</b></td>
</tr>
</tbody>
</table>
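
The recovery percentages in parentheses are simply the quantized score divided by the baseline score. A quick check against the average scores reported above:

```python
def recovery(quantized, baseline):
    """Recovery percentage: quantized score as a fraction of the baseline score."""
    return round(quantized / baseline * 100, 2)

# Average scores from the table above (FP8-block column)
print(recovery(76.62, 76.66))  # OpenLLM V1 -> 99.95
print(recovery(53.69, 53.75))  # OpenLLM V2 -> 99.89
```

Values above 100 (e.g., GPQA at 105.15) mean the quantized model scored higher than the baseline on that task, which is within normal run-to-run variance at these benchmark sizes.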