Devstral-Small-2-24B-Instruct - Mixed Precision GPTQ (INT4/INT8)

Mixed-precision GPTQ quantization of mistralai/Devstral-Small-2-24B-Instruct-2512.

Quantization Details

The original model uses FP8 weights, which perform best on hardware with native FP8 support. This GPTQ quantization provides compatibility with a wider range of GPUs. Mixed precision is used to reduce model size without significant quality degradation: relative to a uniform 4-bit quantization, this scheme gives up some speed and memory savings in exchange for better accuracy.

Quantization scheme:

  • Attention layers (q_proj, k_proj, v_proj, o_proj): INT4, group_size=128
  • MLP layers (gate_proj, up_proj, down_proj): INT8, group_size=128
  • Vision layers: unmodified

All layers use group quantization (not channelwise) for ROCm compatibility.

Quantized using llmcompressor with GPTQ. See quantize.py for the full quantization script; a condensed sketch appears after the Calibration section below.

Calibration

  • Dataset: theblackcat102/evol-codealpaca-v1
  • Samples: 256
  • Sequence Length: 8192 tokens
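
The script below is a condensed sketch of such a quantization pipeline using llmcompressor's GPTQModifier, not the repo's quantize.py. The regex target patterns, the vision-module name in the ignore list, the dataset column names, and the top-level oneshot import (recent llmcompressor versions) are assumptions:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "mistralai/Devstral-Small-2-24B-Instruct-2512"
NUM_SAMPLES = 256
MAX_SEQ_LEN = 8192

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: render instruction/response pairs through the chat
# template, then tokenize (column names assumed from the dataset card).
ds = load_dataset("theblackcat102/evol-codealpaca-v1", split=f"train[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(
    [{"role": "user", "content": ex["instruction"]},
     {"role": "assistant", "content": ex["output"]}], tokenize=False)})
ds = ds.map(lambda ex: tokenizer(ex["text"], max_length=MAX_SEQ_LEN,
                                 truncation=True, add_special_tokens=False),
            remove_columns=ds.column_names)

# Mixed scheme: INT4 attention, INT8 MLP, both group_size=128 ("group"
# strategy rather than channelwise, for ROCm compatibility).
recipe = GPTQModifier(
    config_groups={
        "attention_int4": {
            "targets": ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj", "re:.*o_proj"],
            "weights": {"num_bits": 4, "type": "int", "symmetric": True,
                        "strategy": "group", "group_size": 128},
        },
        "mlp_int8": {
            "targets": ["re:.*gate_proj", "re:.*up_proj", "re:.*down_proj"],
            "weights": {"num_bits": 8, "type": "int", "symmetric": True,
                        "strategy": "group", "group_size": 128},
        },
    },
    ignore=["lm_head", "re:.*vision.*"],  # vision layers left unmodified (module name assumed)
)

oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=MAX_SEQ_LEN, num_calibration_samples=NUM_SAMPLES)

model.save_pretrained("Devstral-INT4-INT8-Mixed-GPTQ", save_compressed=True)
tokenizer.save_pretrained("Devstral-INT4-INT8-Mixed-GPTQ")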

Model Size

Version                  Size
Original (FP8)           ~25 GB
Quantized (INT4/INT8)    ~24 GB

Perplexity

Evaluated on wikitext-2-raw-v1 (test set):

Model                                                  Perplexity   Degradation
Original (FP8)                                         4.5408       -
Quantized (INT4/INT8)                                  4.6044       +1.4%
cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit   5.0161       +10.5%

The 4-bit AWQ quantization was significantly faster in my testing, but it loses some accuracy relative to the original model. That trade-off may be worth it depending on your hardware.
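
For reference, a stride-based evaluation over the wikitext-2-raw-v1 test split can reproduce numbers of this kind. This is a common sketch, not the exact script used for the table above; the context length and stride are assumptions, so results may differ slightly:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.eval()

# Concatenate the test split into one token stream.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 4096   # assumed evaluation context window
stride = 2048       # assumed stride; smaller strides give tighter estimates
seq_len = encodings.input_ids.size(1)

nll_sum, n_tokens, prev_end = 0.0, 0, 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # tokens actually scored in this window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context-only tokens from the loss
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nll_sum += loss.item() * trg_len
    n_tokens += trg_len
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.tensor(nll_sum / n_tokens)).item())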

Usage

You will need transformers v5 or later to run this model. Install via:

pip install "transformers>=5.0.0"

vLLM (Recommended)

vllm serve btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ \
  --tensor-parallel-size 4 \
  --quantization compressed-tensors

Consider setting VLLM_DISABLED_KERNELS=ConchLinearKernel on ROCm; on MI100s, performance was degraded when these kernels were used.
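
For example, setting the variable inline and then smoke-testing the OpenAI-compatible endpoint (vLLM serves on port 8000 by default; the prompt is illustrative):

# Disable the Conch kernels (slower on MI100 in my testing), then serve.
VLLM_DISABLED_KERNELS=ConchLinearKernel vllm serve \
  btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ \
  --tensor-parallel-size 4 \
  --quantization compressed-tensors

# Query the OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
        "messages": [{"role": "user", "content": "Write a hello world in Rust."}]
      }'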

HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized (compressed-tensors) checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
    device_map="auto",  # spread layers across available GPUs
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
)
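
A minimal generation example using the model and tokenizer loaded above (the prompt is illustrative):

messages = [{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))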

Hardware Compatibility

  • AMD GPUs (ROCm): Tested on MI100. Used ExLlama kernels with group quantization by disabling the Conch kernels (see above). The Conch kernels work, but may degrade performance.
  • NVIDIA GPUs (CUDA): Untested, but should work with Marlin or ExLlama kernels.

Credits

License

This model inherits the license from the base model. See LICENSE for details.
