Devstral-Small-2-24B-Instruct - Mixed Precision GPTQ (INT4/INT8)

Mixed-precision GPTQ quantization of mistralai/Devstral-Small-2-24B-Instruct-2512.

Quantization Details

The original model uses FP8 weights, which perform best on hardware with native FP8 support. This GPTQ quantization provides compatibility with a wider range of GPUs. Mixed precision is used to reduce model size without significant quality degradation: relative to a uniform 4-bit quantization, this scheme gives up some speed and memory savings in exchange for better accuracy.

Quantization scheme:

  • Attention layers (q_proj, k_proj, v_proj, o_proj): INT4, group_size=128
  • MLP layers (gate_proj, up_proj, down_proj): INT8, group_size=128
  • Vision layers: unmodified

All layers use group quantization (not channelwise) for ROCm compatibility.

Quantized using llmcompressor with GPTQ. See quantize.py for the full quantization script; a condensed sketch appears after the Calibration section below.

Calibration

  • Dataset: theblackcat102/evol-codealpaca-v1
  • Samples: 256
  • Sequence Length: 8192 tokens
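
The script below is a condensed sketch of such a quantization pipeline using llmcompressor's GPTQModifier, not the repo's quantize.py. The regex target patterns, the vision-module name in the ignore list, the dataset column names, and the top-level oneshot import (recent llmcompressor versions) are assumptions:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "mistralai/Devstral-Small-2-24B-Instruct-2512"
NUM_SAMPLES = 256
MAX_SEQ_LEN = 8192

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: render instruction/response pairs through the chat
# template, then tokenize (column names assumed from the dataset card).
ds = load_dataset("theblackcat102/evol-codealpaca-v1", split=f"train[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(
    [{"role": "user", "content": ex["instruction"]},
     {"role": "assistant", "content": ex["output"]}], tokenize=False)})
ds = ds.map(lambda ex: tokenizer(ex["text"], max_length=MAX_SEQ_LEN,
                                 truncation=True, add_special_tokens=False),
            remove_columns=ds.column_names)

# Mixed scheme: INT4 attention, INT8 MLP, both group_size=128 ("group"
# strategy rather than channelwise, for ROCm compatibility).
recipe = GPTQModifier(
    config_groups={
        "attention_int4": {
            "targets": ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj", "re:.*o_proj"],
            "weights": {"num_bits": 4, "type": "int", "symmetric": True,
                        "strategy": "group", "group_size": 128},
        },
        "mlp_int8": {
            "targets": ["re:.*gate_proj", "re:.*up_proj", "re:.*down_proj"],
            "weights": {"num_bits": 8, "type": "int", "symmetric": True,
                        "strategy": "group", "group_size": 128},
        },
    },
    ignore=["lm_head", "re:.*vision.*"],  # vision layers left unmodified (module name assumed)
)

oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=MAX_SEQ_LEN, num_calibration_samples=NUM_SAMPLES)

model.save_pretrained("Devstral-INT4-INT8-Mixed-GPTQ", save_compressed=True)
tokenizer.save_pretrained("Devstral-INT4-INT8-Mixed-GPTQ")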

Model Size

Version                  Size
Original (FP8)           ~25 GB
Quantized (INT4/INT8)    ~24 GB

Perplexity

Evaluated on wikitext-2-raw-v1 (test set):

Model                                                  Perplexity   Degradation
Original (FP8)                                         4.5408       -
Quantized (INT4/INT8)                                  4.6044       +1.4%
cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit   5.0161       +10.5%

The 4-bit AWQ quantization was significantly faster in my testing, but it loses some accuracy relative to the original model. That trade-off may be worth it depending on your hardware.
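
For reference, a stride-based evaluation over the wikitext-2-raw-v1 test split can reproduce numbers of this kind. This is a common sketch, not the exact script used for the table above; the context length and stride are assumptions, so results may differ slightly:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.eval()

# Concatenate the test split into one token stream.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 4096   # assumed evaluation context window
stride = 2048       # assumed stride; smaller strides give tighter estimates
seq_len = encodings.input_ids.size(1)

nll_sum, n_tokens, prev_end = 0.0, 0, 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # tokens actually scored in this window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context-only tokens from the loss
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nll_sum += loss.item() * trg_len
    n_tokens += trg_len
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.tensor(nll_sum / n_tokens)).item())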

Usage

You will need transformers v5 or later to run this model. Install via:

pip install "transformers>=5.0.0"

vLLM (Recommended)

vllm serve btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ \
  --tensor-parallel-size 4 \
  --quantization compressed-tensors

Consider setting VLLM_DISABLED_KERNELS=ConchLinearKernel on ROCm; on MI100s, performance was degraded when these kernels were used.
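
For example, setting the variable inline and then smoke-testing the OpenAI-compatible endpoint (vLLM serves on port 8000 by default; the prompt is illustrative):

# Disable the Conch kernels (slower on MI100 in my testing), then serve.
VLLM_DISABLED_KERNELS=ConchLinearKernel vllm serve \
  btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ \
  --tensor-parallel-size 4 \
  --quantization compressed-tensors

# Query the OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
        "messages": [{"role": "user", "content": "Write a hello world in Rust."}]
      }'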

HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized (compressed-tensors) checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
    device_map="auto",  # spread layers across available GPUs
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
)
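
A minimal generation example using the model and tokenizer loaded above (the prompt is illustrative):

messages = [{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))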

Hardware Compatibility

  • AMD GPUs (ROCm): Tested on MI100. Used ExLlama kernels with group quantization by disabling the Conch kernels (see above). The Conch kernels work, but may degrade performance.
  • NVIDIA GPUs (CUDA): Untested, but should work with Marlin or ExLlama kernels.

Credits

License

This model inherits the license from the base model. See LICENSE for details.
