| --- |
| license: apache-2.0 |
| base_model: mistralai/Devstral-Small-2-24B-Instruct-2512 |
| tags: |
| - mistral |
| - ministral3 |
| - text-only |
| - fp8 |
| - code |
| - vllm |
| library_name: transformers |
| pipeline_tag: text-generation |
| --- |
| |
| # Devstral-Small-2-24B TextOnly FP8 |
|
|
| Text-only version of [mistralai/Devstral-Small-2-24B-Instruct-2512](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512) with the Pixtral vision encoder and multimodal projector removed. |
|
|
| Native FP8 weights, vLLM-compatible scale naming. No dtype conversion — tensors copied byte-for-byte from the original. |
|
|
| ## Requirements |
|
|
| - **transformers >= 5.0** — the `ministral3` model type and `Ministral3ForCausalLM` class were added in transformers 5.0; the model will not load on transformers 4.x. |
| - **vLLM nightly (0.18+) with transformers 5.3.0** — vLLM stable (0.16) pins `transformers<5`. The nightly allows the upgrade. vLLM does not have a native `Ministral3ForCausalLM` — it falls back to `TransformersForCausalLM`, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (`llama_4_scaling_beta`) and YaRN RoPE properly. |
|
|
| > **Warning:** Do NOT override the architecture to `MistralForCausalLM`. While the model will load and serve, `MistralForCausalLM` silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output. |
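
To guard against that misconfiguration, the checkpoint's `config.json` can be sanity-checked before serving. This is a minimal sketch; `check_architecture` is a hypothetical helper, not part of transformers or vLLM.

```python
# Verify the checkpoint advertises the Ministral3 architecture before serving,
# so a silent fallback to MistralForCausalLM is caught early.
# `check_architecture` is a hypothetical helper, not a library function.

def check_architecture(config: dict) -> None:
    archs = config.get("architectures", [])
    if "Ministral3ForCausalLM" not in archs:
        raise ValueError(
            f"Expected Ministral3ForCausalLM, got {archs}; "
            "MistralForCausalLM would drop llama_4_scaling_beta and YaRN RoPE."
        )

# In practice the dict would come from the repo's config.json, e.g. via
# AutoConfig.from_pretrained(...).to_dict().
check_architecture({"architectures": ["Ministral3ForCausalLM"],
                    "model_type": "ministral3"})
```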
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |---|---| |
| | Architecture | `Ministral3ForCausalLM` | |
| | Model type | `ministral3` | |
| | Parameters | 23.57B | |
| | Quantization | FP8 W8A8 static (`float8_e4m3fn`) | |
| | Layers | 40 | |
| | Hidden size | 5120 | |
| | Attention heads | 32 (8 KV heads) | |
| | Context length | 393K tokens (YaRN RoPE) | |
| | Vocab size | 131,072 | |
| | Size on disk | ~24.9 GB | |
|
|
| ## What Changed |
|
|
| The source model (`Mistral3ForConditionalGeneration`) is a VLM containing: |
| - **Language model** (23.57B params, FP8) — kept |
| - **Vision tower** (Pixtral, ~0.4B params, BF16) — removed |
| - **Multimodal projector** (BF16) — removed |
|
|
| Changes from the original: |
| 1. Stripped `language_model.*` prefix from all tensor names |
| 2. Config: `Ministral3ForCausalLM` / `model_type: "ministral3"` (requires transformers >= 5.0) |
| 3. Quantization config: removed vision module references from `modules_to_not_convert` |
| 4. Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale` (same values, no inversion — both conventions use multiplication for dequantization) |
|
|
| ## Usage |
|
|
| ### With vLLM (nightly + transformers 5) |
|
|
| ```bash |
| pip install "transformers>=5.0" |
| |
| vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \ |
| --tensor-parallel-size 2 \ |
| --max-model-len 32768 \ |
| --enable-auto-tool-choice \ |
| --tool-call-parser mistral |
| ``` |
|
|
| vLLM will resolve to the `TransformersForCausalLM` backend, which delegates to transformers 5's native `Ministral3ForCausalLM`. |
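
Once the server is up, it speaks the OpenAI-compatible API. A minimal request sketch using only the standard library follows; the endpoint and model name assume vLLM's defaults for the command above (adjust if you passed `--port` or `--served-model-name`).

```python
import json
from urllib import request

# Build an OpenAI-compatible chat completion payload for the vLLM server.
def build_chat_payload(prompt: str) -> dict:
    return {
        "model": "levara/Devstral-Small-2-24B-TextOnly-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.2,
    }

payload = build_chat_payload("Write a Python function that reverses a linked list.")
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Requires the server from the command above to be running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```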
|
|
| ### With transformers (>= 5.0) |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| |
| tokenizer = AutoTokenizer.from_pretrained("levara/Devstral-Small-2-24B-TextOnly-FP8") |
| model = AutoModelForCausalLM.from_pretrained( |
| "levara/Devstral-Small-2-24B-TextOnly-FP8", |
| device_map="auto", |
|     dtype=torch.bfloat16, |
| ) |
| ``` |
|
|
| **Note:** Native FP8 inference requires an SM 8.9+ GPU (RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM falls back to the Marlin kernel for weight-only dequantization. For CPU, set `dequantize: true` in the quantization config. |
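
The capability cutoff above can be expressed as a small helper. This is a hypothetical sketch, not vLLM's actual dispatch logic; compute capabilities are as reported by `torch.cuda.get_device_capability()`.

```python
# Decide which FP8 execution path applies for a given CUDA compute capability.
# Native FP8 matmuls need SM 8.9+ (Ada/Hopper); older GPUs get weight-only
# dequantization kernels instead.

def fp8_path(capability: tuple[int, int]) -> str:
    if capability >= (8, 9):
        return "native-fp8"       # e.g. RTX 4090 (8.9), H100 (9.0)
    return "weight-only-dequant"  # e.g. RTX 3090 (8.6), via Marlin

# On a CUDA machine: fp8_path(torch.cuda.get_device_capability())
print(fp8_path((8, 9)))  # native-fp8
print(fp8_path((8, 6)))  # weight-only-dequant
```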
|
|
| ## Verification |
|
|
| Verified against the original VLM: |
| - 923 tensors, 40 layers, no vision keys |
| - FP8 dtypes preserved on all linear weights |
| - First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065 |
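
The top-20 overlap figure above can be computed with a metric along these lines. This is a hypothetical reconstruction of the check, shown on toy data, not the actual verification script.

```python
# Given per-token logprob dicts from the original VLM and the extracted
# model, measure top-k agreement and the worst logprob drift on shared tokens.

def topk_overlap(a: dict[str, float], b: dict[str, float], k: int = 20):
    top = lambda d: set(sorted(d, key=d.get, reverse=True)[:k])
    shared = top(a) & top(b)
    overlap = len(shared) / k
    max_diff = max(abs(a[t] - b[t]) for t in shared)
    return overlap, max_diff

# Toy example with k=3 (real run compared first-token distributions, k=20):
ref  = {"def": -0.10, "class": -2.3, "import": -3.0, "print": -4.1}
new  = {"def": -0.12, "class": -2.4, "import": -3.1, "return": -4.5}
overlap, max_diff = topk_overlap(ref, new, k=3)
```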
|
|
| ## Why Not MistralForCausalLM? |
|
|
| The original VLM sidesteps this issue because `Mistral3ForConditionalGeneration` loads the text backbone through its own internal code path, bypassing the model registry. Once the text model is extracted to stand alone, it needs an architecture that preserves the Ministral3-specific features: |
|
|
| - **Position-dependent attention scaling** (`llama_4_scaling_beta`) — dampens attention at longer positions |
| - **YaRN RoPE** with `beta_fast`, `beta_slow`, `mscale` — context length scaling |
|
|
| `MistralForCausalLM` ignores these config fields. `Ministral3ForCausalLM` (transformers 5) handles them correctly. |
|
|