---
language:
- en
base_model:
- mistralai/Devstral-Small-2507
pipeline_tag: text-generation
tags:
- mistral
- neuralmagic
- redhat
- llmcompressor
- quantized
- INT8
- compressed-tensors
license: apache-2.0
name: RedHatAI/Devstral-Small-2507-quantized.w8a8
description: This model was obtained by quantizing the weights and activations of Devstral-Small-2507 to the INT8 data type.
readme: https://huggingface.co/RedHatAI/Devstral-Small-2507-quantized.w8a8/blob/main/README.md
tasks:
- text-to-text
provider: mistralai
---
# Devstral-Small-2507-quantized.w8a8
## Model Overview
- **Model Architecture:** MistralForCausalLM
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
- **Activation quantization:** INT8
- **Weight quantization:** INT8
- **Release Date:** 08/29/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)
### Model Optimizations
This model was obtained by quantizing the weights and activations of [Devstral-Small-2507](https://huggingface.co/mistralai/Devstral-Small-2507) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, cutting GPU memory requirements by approximately 50%.
Weight quantization also reduces disk size requirements by approximately 50%. Only the linear operators within the transformer blocks are quantized; `lm_head` is kept in its original precision (see the recipe in the Creation section below).
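As a back-of-the-envelope illustration, halving the bits per weight halves the memory needed to store them. The ~24B parameter count below is an assumed figure for illustration, not a value taken from the model config:
```python
# Rough weight-memory estimate; the 24B parameter count is an assumption
# for illustration only.
params = 24e9

bf16_gb = params * 2 / 1e9  # 16-bit weights: 2 bytes per parameter
int8_gb = params * 1 / 1e9  # 8-bit weights: 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")         # ~48 GB
print(f"INT8 weights: ~{int8_gb:.0f} GB")         # ~24 GB
print(f"Reduction: {1 - int8_gb / bf16_gb:.0%}")  # 50%
```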
## Creation
<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
```bash
python quantize.py --model_path mistralai/Devstral-Small-2507 --calib_size 512 --dampening_frac 0.05
```
where `quantize.py` contains:
```python
import argparse
import os

from datasets import load_dataset
from transformers import AutoModelForCausalLM

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.protocol.instruct.messages import (
    SystemMessage, UserMessage
)


def load_system_prompt(repo_id: str, filename: str) -> str:
    """Read the model's system prompt from a local checkout of the repo."""
    file_path = os.path.join(repo_id, filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    return system_prompt


parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str)
parser.add_argument('--calib_size', type=int, default=256)
parser.add_argument('--dampening_frac', type=float, default=0.1)
args = parser.parse_args()

# Load the base model in its original (16-bit) precision.
model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    device_map="auto",
    torch_dtype="auto",
    use_cache=False,
    trust_remote_code=True,
)

# Calibration data: a random subset of Open-Platypus instructions.
ds = load_dataset("garage-bAInd/Open-Platypus", split="train")
ds = ds.shuffle(seed=42).select(range(args.calib_size))

SYSTEM_PROMPT = load_system_prompt(args.model_path, "SYSTEM_PROMPT.txt")
tokenizer = MistralTokenizer.from_hf_hub("mistralai/Devstral-Small-2507")


def tokenize(sample):
    # Apply Devstral's chat template (system prompt + user instruction)
    # so calibration inputs match the format the model sees at inference.
    tmp = tokenizer.encode_chat_completion(
        ChatCompletionRequest(
            messages=[
                SystemMessage(content=SYSTEM_PROMPT),
                UserMessage(content=sample['instruction']),
            ],
        )
    )
    return {'input_ids': tmp.tokens}


ds = ds.map(tokenize, remove_columns=ds.column_names)

recipe = [
    # SmoothQuant migrates activation outliers into the weights so that
    # both weights and activations quantize to INT8 with less accuracy loss.
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
    ),
    # GPTQ quantizes every Linear layer to W8A8, keeping lm_head in
    # its original precision.
    GPTQModifier(
        targets=["Linear"],
        ignore=["lm_head"],
        scheme="W8A8",
        dampening_frac=args.dampening_frac,
    )
]

# Apply the recipe in a single calibration pass (no training).
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=args.calib_size,
    max_seq_length=8192,
)

save_path = args.model_path + "-quantized.w8a8"
model.save_pretrained(save_path)
```
</details>
## Deployment
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
```bash
vllm serve RedHatAI/Devstral-Small-2507-quantized.w8a8 --tensor-parallel-size 1 --tokenizer_mode mistral
```
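Once the server is running, it exposes an OpenAI-compatible API. A minimal client sketch (the prompt is illustrative; the endpoint matches the default `vllm serve` address):
```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the api_key value is unused by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Devstral-Small-2507-quantized.w8a8",
    messages=[
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```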
## Evaluation
The model was evaluated on popular coding benchmarks (HumanEval, HumanEval+, MBPP, MBPP+) via [EvalPlus](https://github.com/evalplus/evalplus), using vLLM (v0.10.1.1) as the serving backend.
All evaluations use greedy sampling, and we report pass@1. The command below reproduces the evaluations against a server launched as in the Deployment section:
```bash
evalplus.evaluate --model "RedHatAI/Devstral-Small-2507-quantized.w8a8" \
--dataset [humaneval|mbpp] \
--base-url http://localhost:8000/v1 \
--backend openai --greedy
```
### Accuracy
| Benchmark                   | Recovery (%) | mistralai/Devstral-Small-2507 | RedHatAI/Devstral-Small-2507-quantized.w8a8<br>(this model) |
| --------------------------- | :----------: | :------------------: | :--------------------------------------------------: |
| HumanEval | 100.67 | 89.0 | 89.6 |
| HumanEval+ | 101.48 | 81.1 | 82.3 |
| MBPP | 98.71 | 77.5 | 76.5 |
| MBPP+ | 102.42 | 66.1 | 67.7 |
| **Average Score** | **100.77** | **78.43** | **79.03** |
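Recovery is the quantized model's score expressed as a percentage of the baseline score; the average row compares the mean scores. A short sketch reproducing the table's recovery numbers:
```python
# Recovery = quantized score / baseline score, as a percentage.
baseline = {"HumanEval": 89.0, "HumanEval+": 81.1, "MBPP": 77.5, "MBPP+": 66.1}
quantized = {"HumanEval": 89.6, "HumanEval+": 82.3, "MBPP": 76.5, "MBPP+": 67.7}

for task in baseline:
    recovery = 100 * quantized[task] / baseline[task]
    print(f"{task}: {recovery:.2f}%")  # e.g. HumanEval -> 100.67%

avg_base = sum(baseline.values()) / len(baseline)      # 78.43
avg_quant = sum(quantized.values()) / len(quantized)   # 79.03
print(f"Average: {100 * avg_quant / avg_base:.2f}%")   # 100.77%
```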