# Devstral-Small-2-24B TextOnly FP8
Text-only version of `mistralai/Devstral-Small-2-24B-Instruct-2512` with the Pixtral vision encoder and multimodal projector removed.

Native FP8 weights, vLLM-compatible scale naming. No dtype conversion — tensors were copied byte-for-byte from the original.
## Requirements

- **transformers >= 5.0** — the `ministral3` model type and `Ministral3ForCausalLM` class were added in transformers 5.0. The model will not load on transformers 4.x.
- **vLLM nightly (0.18+) with transformers 5.3.0** — vLLM stable (0.16) pins `transformers<5`; the nightly allows the upgrade. vLLM has no native `Ministral3ForCausalLM`, so it falls back to `TransformersForCausalLM`, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (`llama_4_scaling_beta`) and YaRN RoPE properly.
> **Warning:** Do NOT override the architecture to `MistralForCausalLM`. The model will load and serve, but `MistralForCausalLM` silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output.
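As a quick guard against that misconfiguration, a hedged sketch of a pre-serve config check (the exact key names `architectures` and `llama_4_scaling_beta` are taken from this card; treating `llama_4_scaling_beta` as a top-level `config.json` key is an assumption):

```python
# Hypothetical sanity check: confirm the config still advertises the
# Ministral3 architecture and keeps the attention-scaling field that
# MistralForCausalLM would silently ignore. Key nesting is assumed.
def check_config(cfg: dict) -> list[str]:
    problems = []
    if cfg.get("architectures") != ["Ministral3ForCausalLM"]:
        problems.append("architecture overridden")
    if "llama_4_scaling_beta" not in cfg:
        problems.append("missing llama_4_scaling_beta")
    return problems
```

An empty list means the config looks intact; run it on the parsed `config.json` before pointing an inference server at the weights.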
## Model Details

| Property | Value |
|---|---|
| Architecture | `Ministral3ForCausalLM` |
| Model type | `ministral3` |
| Parameters | 23.57B |
| Quantization | FP8 W8A8 static (`float8_e4m3fn`) |
| Layers | 40 |
| Hidden size | 5120 |
| Attention heads | 32 (8 KV heads) |
| Context length | 393K tokens (YaRN RoPE) |
| Vocab size | 131,072 |
| Size on disk | ~24.9 GB |
## What Changed

The source model (`Mistral3ForConditionalGeneration`) is a VLM containing:

- Language model (23.57B params, FP8) — kept
- Vision tower (Pixtral, ~0.4B params, BF16) — removed
- Multimodal projector (BF16) — removed
Changes from the original:

- Stripped the `language_model.*` prefix from all tensor names
- Config: `Ministral3ForCausalLM` / `model_type: "ministral3"` (requires transformers >= 5.0)
- Quantization config: removed vision module references from `modules_to_not_convert`
- Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale` (same values, no inversion — both conventions multiply by the scale to dequantize)
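The tensor-name changes above amount to a pure string mapping. A minimal sketch (illustration only; the actual conversion also copies tensor data byte-for-byte):

```python
# Sketch of the renaming rules described above: strip the VLM prefix,
# then map the FP8 scale-name conventions to vLLM's expected names.
def rename(name: str) -> str:
    prefix = "language_model."
    if name.startswith(prefix):
        name = name[len(prefix):]
    if name.endswith(".activation_scale"):
        name = name[: -len("activation_scale")] + "input_scale"
    elif name.endswith(".weight_scale_inv"):
        name = name[: -len("weight_scale_inv")] + "weight_scale"
    return name
```

Tensors outside the language model prefix (e.g. `lm_head.weight`) pass through unchanged.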
## Usage

### With vLLM (nightly + transformers 5)

```shell
pip install "transformers>=5.0"
vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral
```

Quoting `transformers>=5.0` keeps the shell from interpreting `>` as output redirection.
vLLM will resolve to the `TransformersForCausalLM` backend, which delegates to transformers 5's native `Ministral3ForCausalLM`.
### With transformers (>= 5.0)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levara/Devstral-Small-2-24B-TextOnly-FP8")
model = AutoModelForCausalLM.from_pretrained(
    "levara/Devstral-Small-2-24B-TextOnly-FP8",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```
**Note:** Native FP8 inference requires an SM 8.9+ GPU (e.g. RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM uses the Marlin kernel for weight-only dequantization. For CPU, set `dequantize: true` in the quantization config.
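To illustrate the multiplication convention the scale tensors use, here is a toy per-tensor FP8 scale computation (a sketch, not this checkpoint's calibration code; it relies only on the fact that `float8_e4m3fn`'s largest finite magnitude is 448):

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def weight_scale(weights: list[float]) -> float:
    # Per-tensor scale chosen so that weight / scale fits the FP8 range.
    return max(abs(w) for w in weights) / FP8_E4M3_MAX

def dequantize(q: float, scale: float) -> float:
    # Both naming conventions here (weight_scale, input_scale) dequantize
    # by multiplying the stored FP8 value by the scale — no inversion.
    return q * scale
```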
## Verification

Verified against the original VLM:

- 923 tensors, 40 layers, no vision keys
- FP8 dtypes preserved on all linear weights
- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065
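The logprob comparison above can be reproduced with a small helper (a sketch over token → logprob mappings for the first generated position; the metric names are assumptions, not an existing API):

```python
def compare_top_logprobs(a: dict, b: dict, k: int = 20) -> dict:
    # a, b: token -> logprob from the two models at the same position.
    top_a = sorted(a, key=a.__getitem__, reverse=True)[:k]
    top_b = sorted(b, key=b.__getitem__, reverse=True)[:k]
    shared = set(top_a) & set(top_b)
    return {
        "top1_match": top_a[0] == top_b[0],
        "topk_overlap": len(shared) / k,
        "max_diff": max(abs(a[t] - b[t]) for t in shared),
    }
```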
## Why Not MistralForCausalLM?

The original VLM avoids this problem because `Mistral3ForConditionalGeneration` loads the text backbone through its own internal code path, bypassing the model registry. When we extract the text model standalone, we need an architecture that preserves Ministral3-specific features:

- Position-dependent attention scaling (`llama_4_scaling_beta`) — dampens attention at longer positions
- YaRN RoPE with `beta_fast`, `beta_slow`, `mscale` — context-length scaling

`MistralForCausalLM` ignores these config fields; `Ministral3ForCausalLM` (transformers 5) handles them correctly.
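For intuition only, here is a hedged illustration of what a position-dependent attention scaling factor looks like. The logarithmic form and all constants below are illustrative assumptions, not Ministral3's actual implementation:

```python
import math

# Illustrative only: a slowly varying, position-dependent factor of the
# kind a parameter like llama_4_scaling_beta controls. The real model's
# functional form and constants differ; the point is that the factor
# depends on absolute position, which MistralForCausalLM never computes.
def attn_scale(position: int, beta: float = 0.1, anchor: int = 8192) -> float:
    return 1.0 + beta * math.log(1.0 + position / anchor)
```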
## Model tree

Base model: `mistralai/Mistral-Small-3.1-24B-Base-2503`