---
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
library_name: vllm
base_model:
- mistralai/Mistral-Small-3.1-24B-Instruct-2503
pipeline_tag: image-text-to-text
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- int8
---

# Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8

## Model Overview
- **Model Architecture:** Mistral3ForConditionalGeneration
  - **Input:** Text / Image
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** INT8
  - **Weight quantization:** INT8
- **Intended Use Cases:** Ideal for:
  - Fast-response conversational agents.
  - Low-latency function calling.
  - Subject matter experts via fine-tuning.
  - Local inference for hobbyists and organizations handling sensitive data.
  - Programming and math reasoning.
  - Long document understanding.
  - Visual understanding.
- **Out-of-scope:** This model is not specifically designed or evaluated for all downstream purposes; therefore:
  1. Developers should consider the common limitations of language models when selecting use cases, and evaluate and mitigate for accuracy, safety, and fairness before deploying in a specific downstream application, particularly in high-risk scenarios.
  2. Developers should be aware of and adhere to applicable laws and regulations (including privacy and trade-compliance laws) relevant to their use case.
  3. Nothing contained in this model card should be interpreted as a restriction or modification of the license under which the model is released.
- **Release Date:** 04/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](https://arxiv.org/abs/2210.17323) algorithms is applied, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
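
For intuition, the sketch below illustrates the weight-side scheme: symmetric per-channel INT8 quantization assigns one scale per output channel and rounds values onto the integer grid. This is a simplified illustration, not the llm-compressor implementation; the dynamic per-token activation scheme works the same way, except the scales are computed from each token's activations at runtime.

```python
import torch

# Simplified illustration (not the llm-compressor implementation):
# symmetric per-channel INT8 quantization of a linear layer's weight.
def quantize_weight_per_channel(weight: torch.Tensor):
    # One scale per output channel (row), mapping the max magnitude to 127
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scales), -127, 127).to(torch.int8)
    return q, scales

w = torch.randn(8, 16)
q, scales = quantize_weight_per_channel(w)
w_hat = q.float() * scales  # dequantize to inspect the rounding error
print((w - w_hat).abs().max())
```
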
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the request with the model's chat template
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
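
As a sketch, assuming a server was started locally with `vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8` (exact flags may vary by vLLM version), it can be queried with any OpenAI client:

```python
from openai import OpenAI

# Assumes a local vLLM server on the default port (8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```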

## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
import io

from datasets import load_dataset, interleave_datasets
from PIL import Image
from transformers import AutoProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableMistral3ForConditionalGeneration

# Load model
model_stub = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
model_name = model_stub.split("/")[-1]

num_text_samples = 1024
num_vision_samples = 1024
max_seq_len = 8192

processor = AutoProcessor.from_pretrained(model_stub)

model = TraceableMistral3ForConditionalGeneration.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Text-only data subset
def preprocess_text(example):
    input = {
        "text": processor.apply_chat_template(
            example["messages"],
            add_generation_prompt=False,
        ),
        "images": None,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dst = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_text_samples))
dst = dst.map(preprocess_text, remove_columns=dst.column_names)

# Text + vision data subset
def preprocess_vision(example):
    messages = []
    image = None
    for message in example["messages"]:
        message_content = []
        for content in message["content"]:
            if content["type"] == "text":
                message_content.append({"type": "text", "text": content["text"]})
            else:
                message_content.append({"type": "image"})
                image = Image.open(io.BytesIO(content["image"]))

        messages.append(
            {
                "role": message["role"],
                "content": message_content,
            }
        )

    input = {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
        ),
        "images": image,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dsv = load_dataset("neuralmagic/calibration", name="VLM", split="train").select(range(num_vision_samples))
dsv = dsv.map(preprocess_vision, remove_columns=dsv.column_names)

# Interleave the text-only and vision subsets
ds = interleave_datasets([dsv, dst])

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(),
    GPTQModifier(
        ignore=["language_model.lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
        sequential_targets=["MistralDecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Define data collator (calibration runs with batch size 1)
def data_collator(batch):
    import torch
    assert len(batch) == 1
    collated = {}
    for k, v in batch[0].items():
        if v is None:
            continue
        if k == "input_ids":
            collated[k] = torch.LongTensor(v)
        elif k == "pixel_values":
            collated[k] = torch.tensor(v, dtype=torch.bfloat16)
        else:
            collated[k] = torch.tensor(v)
    return collated

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    data_collator=data_collator,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print(f"Model and processor saved to: {save_path}")
```
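
As an optional sanity check, the saved `config.json` should record a compressed-tensors quantization config; a minimal sketch, assuming the `save_path` from the snippet above (the exact key layout may vary by compressed-tensors version):

```python
import json

# Inspect the exported quantization config
with open(save_path + "/config.json") as f:
    config = json.load(f)

print(config["quantization_config"].get("quant_method"))  # expected: "compressed-tensors"
```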
</details>


## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (version 1), MMLU-Pro, GPQA, HumanEval, and MBPP.
Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), whereas coding tasks were evaluated with a fork of [evalplus](https://github.com/neuralmagic/evalplus).
[vLLM](https://docs.vllm.ai/en/stable/) was used as the engine in all cases.

<details>
<summary>Evaluation details</summary>

**MMLU**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**ARC Challenge**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**GSM8k**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Hellaswag**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Winogrande**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks winogrande \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**TruthfulQA**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --apply_chat_template \
  --batch_size auto
```

**MMLU-Pro**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu_pro \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Coding**

The commands below are for HumanEval; to evaluate MBPP, replace the dataset name.

*Generation*
```
python3 codegen/generate.py \
  --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*
```
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2
```

*Evaluation*
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8_vllm_temp_0.2-sanitized
```
</details>

### Accuracy

#### Open LLM Leaderboard evaluation scores
<table>
  <tr>
   <th>Category</th>
   <th>Benchmark</th>
   <th>Mistral-Small-3.1-24B-Instruct-2503</th>
   <th>Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8<br>(this model)</th>
   <th>Recovery</th>
  </tr>
  <tr>
   <td rowspan="7"><strong>OpenLLM v1</strong></td>
   <td>MMLU (5-shot)</td>
   <td>80.67</td>
   <td>80.40</td>
   <td>99.7%</td>
  </tr>
  <tr>
   <td>ARC Challenge (25-shot)</td>
   <td>72.78</td>
   <td>73.46</td>
   <td>100.9%</td>
  </tr>
  <tr>
   <td>GSM-8K (5-shot, strict-match)</td>
   <td>65.35</td>
   <td>70.58</td>
   <td>108.0%</td>
  </tr>
  <tr>
   <td>Hellaswag (10-shot)</td>
   <td>83.70</td>
   <td>82.26</td>
   <td>98.3%</td>
  </tr>
  <tr>
   <td>Winogrande (5-shot)</td>
   <td>83.74</td>
   <td>80.90</td>
   <td>96.6%</td>
  </tr>
  <tr>
   <td>TruthfulQA (0-shot, mc2)</td>
   <td>70.62</td>
   <td>69.15</td>
   <td>97.9%</td>
  </tr>
  <tr>
   <td><strong>Average</strong></td>
   <td><strong>76.14</strong></td>
   <td><strong>76.13</strong></td>
   <td><strong>100.0%</strong></td>
  </tr>
  <tr>
   <td rowspan="3"></td>
   <td>MMLU-Pro (5-shot)</td>
   <td>67.25</td>
   <td>66.54</td>
   <td>98.9%</td>
  </tr>
  <tr>
   <td>GPQA CoT main (5-shot)</td>
   <td>42.63</td>
   <td>44.64</td>
   <td>104.7%</td>
  </tr>
  <tr>
   <td>GPQA CoT diamond (5-shot)</td>
   <td>45.96</td>
   <td>41.92</td>
   <td>91.2%</td>
  </tr>
  <tr>
   <td rowspan="4"><strong>Coding</strong></td>
   <td>HumanEval pass@1</td>
   <td>84.70</td>
   <td></td>
   <td>%</td>
  </tr>
  <tr>
   <td>HumanEval+ pass@1</td>
   <td>79.50</td>
   <td></td>
   <td>%</td>
  </tr>
  <tr>
   <td>MBPP pass@1</td>
   <td>71.10</td>
   <td></td>
   <td>%</td>
  </tr>
  <tr>
   <td>MBPP+ pass@1</td>
   <td>60.60</td>
   <td></td>
   <td>%</td>
  </tr>
</table>
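
Recovery reports the quantized model's score as a percentage of the unquantized baseline's score; for example, for the MMLU row:

```python
# Recovery = 100 * quantized_score / baseline_score
baseline, quantized = 80.67, 80.40  # MMLU (5-shot) values from the table above
print(f"Recovery: {100 * quantized / baseline:.1f}%")  # 99.7%
```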