---
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
library_name: vllm
base_model:
- mistralai/Mistral-Small-3.1-24B-Instruct-2503
pipeline_tag: image-text-to-text
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- int8
---

# Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8

## Model Overview
- **Model Architecture:** Mistral3ForConditionalGeneration
  - **Input:** Text / Image
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** INT8
  - **Weight quantization:** INT8
- **Intended Use Cases:** Ideal for:
  - Fast-response conversational agents.
  - Low-latency function calling.
  - Subject matter experts via fine-tuning.
  - Local inference for hobbyists and organizations handling sensitive data.
  - Programming and math reasoning.
  - Long document understanding.
  - Visual understanding.
- **Out-of-scope:** This model is not specifically designed or evaluated for all downstream purposes; therefore:
  1. Developers should consider the common limitations of language models when selecting use cases, and evaluate and mitigate for accuracy, safety, and fairness before deploying in a specific downstream application, particularly in high-risk scenarios.
  2. Developers should be aware of and adhere to applicable laws and regulations (including privacy and trade-compliance laws) relevant to their use case.
  3. Nothing contained in this model card should be interpreted as a restriction or modification of the license under which the model is released.
- **Release Date:** 04/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](https://arxiv.org/abs/2210.17323) algorithms is applied, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
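
For intuition, the sketch below illustrates the weight-side scheme: symmetric per-channel INT8 quantization assigns one scale per output channel and rounds values onto the integer grid. This is a simplified illustration, not the llm-compressor implementation; the dynamic per-token activation scheme works the same way, except the scales are computed from each token's activations at runtime.

```python
import torch

# Simplified illustration (not the llm-compressor implementation):
# symmetric per-channel INT8 quantization of a linear layer's weight.
def quantize_weight_per_channel(weight: torch.Tensor):
    # One scale per output channel (row), mapping the max magnitude to 127
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

w = torch.randn(8, 16)
q, scales = quantize_weight_per_channel(w)
w_hat = q.float() * scales  # dequantize to inspect the rounding error
print((w - w_hat).abs().max())
```
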
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the request with the model's chat template
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
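
As a sketch, assuming a server was started locally with `vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8` (exact flags may vary by vLLM version), it can be queried with any OpenAI client:

```python
from openai import OpenAI

# Assumes a local vLLM server on the default port (8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```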

## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
import io

from datasets import load_dataset, interleave_datasets
from PIL import Image
from transformers import AutoProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableMistral3ForConditionalGeneration

# Load model
model_stub = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
model_name = model_stub.split("/")[-1]

num_text_samples = 1024
num_vision_samples = 1024
max_seq_len = 8192

processor = AutoProcessor.from_pretrained(model_stub)

model = TraceableMistral3ForConditionalGeneration.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Text-only data subset
def preprocess_text(example):
    input = {
        "text": processor.apply_chat_template(
            example["messages"],
            add_generation_prompt=False,
        ),
        "images": None,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dst = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_text_samples))
dst = dst.map(preprocess_text, remove_columns=dst.column_names)

# Text + vision data subset
def preprocess_vision(example):
    messages = []
    image = None
    for message in example["messages"]:
        message_content = []
        for content in message["content"]:
            if content["type"] == "text":
                message_content.append({"type": "text", "text": content["text"]})
            else:
                message_content.append({"type": "image"})
                image = Image.open(io.BytesIO(content["image"]))

        messages.append(
            {
                "role": message["role"],
                "content": message_content,
            }
        )

    input = {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
        ),
        "images": image,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dsv = load_dataset("neuralmagic/calibration", name="VLM", split="train").select(range(num_vision_samples))
dsv = dsv.map(preprocess_vision, remove_columns=dsv.column_names)

# Interleave the text-only and vision subsets
ds = interleave_datasets([dsv, dst])

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(),
    GPTQModifier(
        ignore=["language_model.lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
        sequential_targets=["MistralDecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Define data collator (calibration runs with batch size 1)
def data_collator(batch):
    import torch
    assert len(batch) == 1
    collated = {}
    for k, v in batch[0].items():
        if v is None:
            continue
        if k == "input_ids":
            collated[k] = torch.LongTensor(v)
        elif k == "pixel_values":
            collated[k] = torch.tensor(v, dtype=torch.bfloat16)
        else:
            collated[k] = torch.tensor(v)
    return collated

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    data_collator=data_collator,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print(f"Model and processor saved to: {save_path}")
```
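
As an optional sanity check, the saved `config.json` should record a compressed-tensors quantization config; a minimal sketch, assuming the `save_path` from the snippet above (the exact key layout may vary by compressed-tensors version):

```python
import json

# Inspect the exported quantization config
with open(save_path + "/config.json") as f:
    config = json.load(f)

print(config["quantization_config"].get("quant_method"))  # expected: "compressed-tensors"
```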
</details>


## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (version 1), MMLU-Pro, GPQA, HumanEval, and MBPP.
Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), whereas coding tasks were evaluated with a fork of [evalplus](https://github.com/neuralmagic/evalplus).
[vLLM](https://docs.vllm.ai/en/stable/) was used as the engine in all cases.

<details>
<summary>Evaluation details</summary>

**MMLU**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**ARC Challenge**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**GSM8k**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Hellaswag**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Winogrande**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks winogrande \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**TruthfulQA**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --apply_chat_template \
  --batch_size auto
```

**MMLU-Pro**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu_pro \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Coding**

The commands below are for HumanEval; to evaluate MBPP, replace the dataset name.

*Generation*
```
python3 codegen/generate.py \
  --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*
```
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2
```

*Evaluation*
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2-sanitized
```
</details>

### Accuracy

#### Open LLM Leaderboard evaluation scores
<table>
  <tr>
   <th>Category</th>
   <th>Benchmark</th>
   <th>Mistral-Small-3.1-24B-Instruct-2503</th>
   <th>Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8<br>(this model)</th>
   <th>Recovery</th>
  </tr>
  <tr>
   <td rowspan="7"><strong>OpenLLM v1</strong></td>
   <td>MMLU (5-shot)</td>
   <td>80.67</td>
   <td>80.40</td>
   <td>99.7%</td>
  </tr>
  <tr>
   <td>ARC Challenge (25-shot)</td>
   <td>72.78</td>
   <td>73.46</td>
   <td>100.9%</td>
  </tr>
  <tr>
   <td>GSM-8K (5-shot, strict-match)</td>
   <td>65.35</td>
   <td>70.58</td>
   <td>108.0%</td>
  </tr>
  <tr>
   <td>Hellaswag (10-shot)</td>
   <td>83.70</td>
   <td>82.26</td>
   <td>98.3%</td>
  </tr>
  <tr>
   <td>Winogrande (5-shot)</td>
   <td>83.74</td>
   <td>80.90</td>
   <td>96.6%</td>
  </tr>
  <tr>
   <td>TruthfulQA (0-shot, mc2)</td>
   <td>70.62</td>
   <td>69.15</td>
   <td>97.9%</td>
  </tr>
  <tr>
   <td><strong>Average</strong></td>
   <td><strong>76.14</strong></td>
   <td><strong>76.13</strong></td>
   <td><strong>100.0%</strong></td>
  </tr>
  <tr>
   <td rowspan="3"></td>
   <td>MMLU-Pro (5-shot)</td>
   <td>67.25</td>
   <td>66.54</td>
   <td>98.9%</td>
  </tr>
  <tr>
   <td>GPQA CoT main (5-shot)</td>
   <td>42.63</td>
   <td>44.64</td>
   <td>104.7%</td>
  </tr>
  <tr>
   <td>GPQA CoT diamond (5-shot)</td>
   <td>45.96</td>
   <td>41.92</td>
   <td>91.2%</td>
  </tr>
  <tr>
   <td rowspan="4"><strong>Coding</strong></td>
   <td>HumanEval pass@1</td>
   <td>84.70</td>
   <td></td>
   <td>%</td>
  </tr>
  <tr>
   <td>HumanEval+ pass@1</td>
   <td>79.50</td>
   <td></td>
   <td>%</td>
  </tr>
  <tr>
   <td>MBPP pass@1</td>
   <td>71.10</td>
   <td></td>
   <td>%</td>
  </tr>
  <tr>
   <td>MBPP+ pass@1</td>
   <td>60.60</td>
   <td></td>
   <td>%</td>
  </tr>
</table>
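
Recovery reports the quantized model's score as a percentage of the unquantized baseline's score; for example, for the MMLU row:

```python
# Recovery = 100 * quantized_score / baseline_score
baseline, quantized = 80.67, 80.40  # MMLU (5-shot) values from the table above
print(f"Recovery: {100 * quantized / baseline:.1f}%")  # 99.7%
```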