# Devstral-Small-2-24B-Instruct - Mixed Precision GPTQ (INT4/INT8)

Mixed-precision GPTQ quantization of `mistralai/Devstral-Small-2-24B-Instruct-2512`.
## Quantization Details

The original model ships with FP8 weights, which perform best on hardware with native FP8 support. This GPTQ quantization provides compatibility with a wider range of GPUs. Mixed precision is used to keep the model size down without significant quality degradation: relative to a uniform 4-bit quantization, it trades some speed and memory for better accuracy.
Quantization scheme:
- Attention layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`): INT4, group_size=128
- MLP layers (`gate_proj`, `up_proj`, `down_proj`): INT8, group_size=128
- Vision layers: unmodified
All layers use group quantization (not channelwise) for ROCm compatibility.
Quantized using `llmcompressor` with GPTQ. See `quantize.py` for the full quantization script.
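As a rough illustration, here is a minimal sketch of how such a mixed-precision recipe can be expressed with `llmcompressor`'s `GPTQModifier` and `compressed-tensors` schemes. Exact class and field names may differ between versions, and the `ignore` regex for the vision layers is an assumption about module naming; `quantize.py` remains the authoritative script.

```python
# Sketch only -- see quantize.py for the actual script used for this model.
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    # Vision layers (and lm_head) are left unmodified; the regex is an assumption
    # about this architecture's module names and may need adjusting.
    ignore=["lm_head", "re:vision_tower.*"],
    config_groups={
        # Attention projections: INT4, group_size=128
        "attn_int4": QuantizationScheme(
            targets=["re:.*q_proj", "re:.*k_proj", "re:.*v_proj", "re:.*o_proj"],
            weights=QuantizationArgs(
                num_bits=4, type="int", symmetric=True, strategy="group", group_size=128
            ),
        ),
        # MLP projections: INT8, group_size=128
        "mlp_int8": QuantizationScheme(
            targets=["re:.*gate_proj", "re:.*up_proj", "re:.*down_proj"],
            weights=QuantizationArgs(
                num_bits=8, type="int", symmetric=True, strategy="group", group_size=128
            ),
        ),
    },
)
```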
## Calibration
- Dataset: theblackcat102/evol-codealpaca-v1
- Samples: 256
- Sequence Length: 8192 tokens
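Continuing the recipe sketch above, the calibration pass might look roughly like this. Argument names follow recent `llmcompressor` releases, and dataset preprocessing/chat formatting is omitted for brevity:

```python
# Sketch only: one-shot GPTQ calibration run using the recipe defined above.
from datasets import load_dataset
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot

# 256 samples from the calibration dataset listed above
ds = load_dataset("theblackcat102/evol-codealpaca-v1", split="train")
ds = ds.shuffle(seed=42).select(range(256))

oneshot(
    model="mistralai/Devstral-Small-2-24B-Instruct-2512",
    dataset=ds,
    recipe=recipe,
    num_calibration_samples=256,
    max_seq_length=8192,
)
```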
## Model Size
| Version | Size |
|---|---|
| Original (FP8) | ~25 GB |
| Quantized (INT4/INT8) | ~24 GB |
## Perplexity

Evaluated on `wikitext-2-raw-v1` (test set):
| Model | Perplexity | Degradation |
|---|---|---|
| Original (FP8) | 4.5408 | - |
| Quantized (INT4/INT8) | 4.6044 | +1.4% |
| cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit | 5.0161 | +10.5% |
The 4-bit AWQ quantization runs significantly faster in my testing, but you lose some accuracy relative to the original model. That trade-off may be worth it depending on your hardware.
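For reference, a chunked perplexity check on `wikitext-2-raw-v1` can be run with plain `transformers` along the lines below. The chunk length here is an arbitrary choice and not necessarily the setting used for the numbers above:

```python
# Sketch of a chunked perplexity evaluation on wikitext-2-raw-v1.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Concatenate the test split and score it in fixed-size chunks.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

window = 4096  # chunk length is an assumption, not necessarily the evaluation setting used above
seq_len = encodings.input_ids.size(1)
nll_sum, n_tokens = 0.0, 0
for begin in range(0, seq_len, window):
    input_ids = encodings.input_ids[:, begin : begin + window].to(model.device)
    if input_ids.size(1) < 2:
        break
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss  # mean NLL over shifted tokens
    chunk_tokens = input_ids.size(1) - 1
    nll_sum += loss.item() * chunk_tokens
    n_tokens += chunk_tokens

print(f"Perplexity: {torch.exp(torch.tensor(nll_sum / n_tokens)).item():.4f}")
```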
## Usage

You will need transformers v5.0 or later to run this model. Install via:

```bash
pip install "transformers>=5.0.0"
```
### vLLM (Recommended)

```bash
vllm serve btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ \
    --tensor-parallel-size 4 \
    --quantization compressed-tensors
```
On ROCm, consider setting `VLLM_DISABLED_KERNELS=ConchLinearKernel`; on MI100s, performance was degraded with these kernels enabled.
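Once the server is running, it exposes an OpenAI-compatible API. A minimal client call, assuming the default port 8000 and no API key, looks like:

```python
# Query the vLLM server started above via the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```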
### HuggingFace Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
)
```
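A minimal generation call with the model and tokenizer loaded above, using the tokenizer's chat template (the prompt is just an example):

```python
# Simple chat-style generation with the model/tokenizer loaded above.
messages = [{"role": "user", "content": "Write a bash one-liner that counts lines of Python code in a repo."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```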
## Hardware Compatibility

- AMD GPUs (ROCm): Tested on MI100 using the exllama kernels with group quantization, with `ConchLinearKernel` disabled. The Conch kernels work, but may cause performance degradation.
- NVIDIA GPUs (CUDA): Untested, but should work with the marlin or exllama kernels.
## Credits
- Base Model: Mistral AI - Devstral-Small-2-24B-Instruct
- Quantization: GPTQ via llmcompressor
- Quantized by: btbtyler09
## License
This model inherits the license from the base model. See LICENSE for details.