Devstral-Small-2-24B TextOnly FP8

Text-only version of mistralai/Devstral-Small-2-24B-Instruct-2512 with the Pixtral vision encoder and multimodal projector removed.

Native FP8 weights, vLLM-compatible scale naming. No dtype conversion — tensors copied byte-for-byte from the original.

Requirements

  • transformers >= 5.0 — the ministral3 model type and Ministral3ForCausalLM class were added in transformers 5.0. Will not load on transformers 4.x.
  • vLLM nightly (0.18+) with transformers 5.3.0 — vLLM stable (0.16) pins transformers<5. The nightly allows the upgrade. vLLM does not have a native Ministral3ForCausalLM — it falls back to TransformersForCausalLM, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (llama_4_scaling_beta) and YaRN RoPE properly.

Warning: Do NOT override the architecture to MistralForCausalLM. While the model will load and serve, MistralForCausalLM silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output.

Model Details

| Property | Value |
|---|---|
| Architecture | Ministral3ForCausalLM |
| Model type | ministral3 |
| Parameters | 23.57B |
| Quantization | FP8 W8A8 static (float8_e4m3fn) |
| Layers | 40 |
| Hidden size | 5120 |
| Attention heads | 32 (8 KV heads) |
| Context length | 393K tokens (YaRN RoPE) |
| Vocab size | 131,072 |
| Size on disk | ~24.9 GB |

What Changed

The source model (Mistral3ForConditionalGeneration) is a VLM containing:

  • Language model (23.57B params, FP8) — kept
  • Vision tower (Pixtral, ~0.4B params, BF16) — removed
  • Multimodal projector (BF16) — removed

Changes from the original:

  1. Stripped language_model.* prefix from all tensor names
  2. Config: Ministral3ForCausalLM / model_type: "ministral3" (requires transformers >= 5.0)
  3. Quantization config: removed vision module references from modules_to_not_convert
  4. Renamed FP8 scale tensors for vLLM compatibility: activation_scale → input_scale, weight_scale_inv → weight_scale (same values, no inversion — both conventions use multiplication for dequantization)
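Steps 1, 3, and 4 are pure string transformations on tensor names. A minimal sketch of the mapping (hypothetical helper — the actual conversion script is not part of this card):

```python
# Sketch of the tensor-name rewrite applied during extraction.
# Vision tensors are dropped; language-model tensors are re-prefixed and
# FP8 scale names are switched to vLLM's convention (values unchanged).

VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")

def remap(name: str):
    """Map a source VLM tensor name to its text-only equivalent.

    Returns None for vision tensors (removed), otherwise the new name.
    """
    if name.startswith(VISION_PREFIXES):
        return None  # vision tower / projector: removed
    if name.startswith("language_model."):
        name = name[len("language_model."):]  # step 1: strip prefix
    # step 4: rename scales, no value change and no inversion
    if name.endswith(".activation_scale"):
        name = name[: -len("activation_scale")] + "input_scale"
    elif name.endswith(".weight_scale_inv"):
        name = name[: -len("weight_scale_inv")] + "weight_scale"
    return name

print(remap("language_model.model.layers.0.mlp.down_proj.weight_scale_inv"))
# model.layers.0.mlp.down_proj.weight_scale
```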

Usage

With vLLM (nightly + transformers 5)

```shell
pip install "transformers>=5.0"   # quote the spec so the shell doesn't treat >= as redirection

vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral
```

vLLM will resolve to the TransformersForCausalLM backend, which delegates to transformers 5's native Ministral3ForCausalLM.

With transformers (>= 5.0)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("levara/Devstral-Small-2-24B-TextOnly-FP8")
model = AutoModelForCausalLM.from_pretrained(
    "levara/Devstral-Small-2-24B-TextOnly-FP8",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

Note: native FP8 inference requires an SM 8.9+ GPU (e.g. RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM falls back to the Marlin kernel for weight-only dequantization. For CPU, set dequantize: true in the quantization config.
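For the CPU path, the flag goes in the quantization_config section of config.json. An illustrative fragment (only the dequantize key is described by this card; the other keys are elided):

```json
"quantization_config": {
    ...,
    "dequantize": true
}
```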

Verification

Verified against the original VLM:

  • 923 tensors, 40 layers, no vision keys
  • FP8 dtypes preserved on all linear weights
  • First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065
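The logprob check in the last bullet can be sketched as a small comparison helper (hypothetical — the actual verification harness is not part of this card). It takes one log-probability per vocabulary id from each model's first-token distribution under the same prompt:

```python
# Compare two first-token log-prob distributions (one float per vocab id).
def compare_first_token(lp_a, lp_b, k=20):
    """Return (top-1 match, top-k overlap fraction, max abs logprob diff)."""
    top_a = sorted(range(len(lp_a)), key=lp_a.__getitem__, reverse=True)[:k]
    top_b = sorted(range(len(lp_b)), key=lp_b.__getitem__, reverse=True)[:k]
    return (
        top_a[0] == top_b[0],                      # top-1 token agrees?
        len(set(top_a) & set(top_b)) / k,          # top-k overlap fraction
        max(abs(a - b) for a, b in zip(lp_a, lp_b)),  # worst-case logprob drift
    )
```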

Why Not MistralForCausalLM?

The original VLM avoids this problem because Mistral3ForConditionalGeneration loads the text backbone through its own internal code path, bypassing the model registry. When we extract the text model standalone, we need an architecture that preserves Ministral3-specific features:

  • Position-dependent attention scaling (llama_4_scaling_beta) — dampens attention at longer positions
  • YaRN RoPE with beta_fast, beta_slow, mscale — context length scaling

MistralForCausalLM ignores these config fields. Ministral3ForCausalLM (transformers 5) handles them correctly.
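As an illustration, the fields in question live in config.json roughly as below (placement and key names assumed from transformers' YaRN rope_scaling convention; values are checkpoint-specific and elided):

```json
{
  "llama_4_scaling_beta": ...,
  "rope_scaling": {
    "rope_type": "yarn",
    "beta_fast": ...,
    "beta_slow": ...,
    "mscale": ...
  }
}
```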
