---
license: apache-2.0
base_model: mistralai/Devstral-Small-2-24B-Instruct-2512
tags:
- mistral
- ministral3
- text-only
- fp8
- code
- vllm
library_name: transformers
pipeline_tag: text-generation
---

# Devstral-Small-2-24B TextOnly FP8

Text-only version of [mistralai/Devstral-Small-2-24B-Instruct-2512](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512) with the Pixtral vision encoder and multimodal projector removed. Native FP8 weights, vLLM-compatible scale naming. No dtype conversion — tensors copied byte-for-byte from the original.

## Requirements

- **transformers >= 5.0** — the `ministral3` model type and `Ministral3ForCausalLM` class were added in transformers 5.0. Will not load on transformers 4.x.
- **vLLM nightly (0.18+) with transformers 5.3.0** — vLLM stable (0.16) pins `transformers<5`; the nightly allows the upgrade. vLLM does not have a native `Ministral3ForCausalLM` — it falls back to `TransformersForCausalLM`, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (`llama_4_scaling_beta`) and YaRN RoPE properly.

> **Warning:** Do NOT override the architecture to `MistralForCausalLM`. While the model will load and serve, `MistralForCausalLM` silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output.
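A quick guard before loading can make the requirement explicit instead of surfacing as an opaque "unrecognized model type" error. A minimal sketch with a hypothetical helper name (`supports_ministral3` is not part of any library):

```python
# Hypothetical version guard: the `ministral3` model type exists only in
# transformers >= 5.0, so fail fast on 4.x installs.
def supports_ministral3(version: str) -> bool:
    # Compare the major version; local-version suffixes like "+cu121" are stripped.
    major = int(version.split(".")[0].split("+")[0])
    return major >= 5

supports_ministral3("5.0.0")   # True
supports_ministral3("4.57.1")  # False — AutoModel would reject `ministral3`
```

In practice you would call `supports_ministral3(transformers.__version__)` before `from_pretrained` and raise with an actionable message if it returns `False`.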
## Model Details

| Property | Value |
|---|---|
| Architecture | `Ministral3ForCausalLM` |
| Model type | `ministral3` |
| Parameters | 23.57B |
| Quantization | FP8 W8A8 static (`float8_e4m3fn`) |
| Layers | 40 |
| Hidden size | 5120 |
| Attention heads | 32 (8 KV heads) |
| Context length | 393K tokens (YaRN RoPE) |
| Vocab size | 131,072 |
| Size on disk | ~24.9 GB |

## What Changed

The source model (`Mistral3ForConditionalGeneration`) is a VLM containing:

- **Language model** (23.57B params, FP8) — kept
- **Vision tower** (Pixtral, ~0.4B params, BF16) — removed
- **Multimodal projector** (BF16) — removed

Changes from the original:

1. Stripped the `language_model.*` prefix from all tensor names
2. Config: `Ministral3ForCausalLM` / `model_type: "ministral3"` (requires transformers >= 5.0)
3. Quantization config: removed vision module references from `modules_to_not_convert`
4. Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale` (same values, no inversion — both conventions use multiplication for dequantization)

## Usage

### With vLLM (nightly + transformers 5)

```bash
pip install "transformers>=5.0"
vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
```

vLLM will resolve to the `TransformersForCausalLM` backend, which delegates to transformers 5's native `Ministral3ForCausalLM`.

### With transformers (>= 5.0)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("levara/Devstral-Small-2-24B-TextOnly-FP8")
model = AutoModelForCausalLM.from_pretrained(
    "levara/Devstral-Small-2-24B-TextOnly-FP8",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

**Note:** Native FP8 inference requires an SM 8.9+ GPU (RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM uses the Marlin kernel for weight-only dequantization.
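Which path a given GPU takes comes down to its compute capability. A minimal sketch (the helper name is hypothetical; in practice the `(major, minor)` pair comes from `torch.cuda.get_device_capability()`):

```python
# Hypothetical helper: native FP8 (float8_e4m3fn) matmul requires
# compute capability 8.9+ (Ada Lovelace / Hopper and newer).
def has_native_fp8(major: int, minor: int) -> bool:
    return (major, minor) >= (8, 9)

has_native_fp8(8, 9)  # RTX 4090 — native FP8
has_native_fp8(9, 0)  # H100 — native FP8
has_native_fp8(8, 6)  # RTX 3090 — vLLM falls back to the Marlin kernel
```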
For CPU, set `dequantize: true` in the quantization config.

## Verification

Verified against the original VLM:

- 923 tensors, 40 layers, no vision keys
- FP8 dtypes preserved on all linear weights
- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065

## Why Not MistralForCausalLM?

The original VLM avoids this problem because `Mistral3ForConditionalGeneration` loads the text backbone through its own internal code path, bypassing the model registry. When we extract the text model standalone, we need an architecture that preserves Ministral3-specific features:

- **Position-dependent attention scaling** (`llama_4_scaling_beta`) — dampens attention at longer positions
- **YaRN RoPE** with `beta_fast`, `beta_slow`, `mscale` — context length scaling

`MistralForCausalLM` ignores these config fields. `Ministral3ForCausalLM` (transformers 5) handles them correctly.
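To make the loss concrete, here is a hedged sketch (hypothetical helper operating on a plain `config.json`-style dict; field names taken from the list above) of what a `MistralForCausalLM` load would silently ignore:

```python
# Ministral3-specific config fields that MistralForCausalLM silently drops.
MINISTRAL3_FIELDS = ("llama_4_scaling_beta", "rope_scaling")

def silently_dropped(config: dict) -> list:
    """Return the Ministral3 fields in `config` that a plain
    MistralForCausalLM load would ignore."""
    return [f for f in MINISTRAL3_FIELDS if f in config]

silently_dropped({"hidden_size": 5120})
# [] — a vanilla Mistral config loses nothing

silently_dropped({"llama_4_scaling_beta": 0.5, "hidden_size": 5120})
# ["llama_4_scaling_beta"] — the attention-scaling field would be lost
```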