---
license: apache-2.0
base_model: mistralai/Devstral-Small-2-24B-Instruct-2512
tags:
- mistral
- ministral3
- text-only
- fp8
- code
- vllm
library_name: transformers
pipeline_tag: text-generation
---
# Devstral-Small-2-24B TextOnly FP8
Text-only version of [mistralai/Devstral-Small-2-24B-Instruct-2512](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512) with the Pixtral vision encoder and multimodal projector removed.
Native FP8 weights with vLLM-compatible scale naming. No dtype conversion was performed; tensors were copied byte-for-byte from the original.
## Requirements
- **transformers >= 5.0** — the `ministral3` model type and `Ministral3ForCausalLM` class were added in transformers 5.0. Will not load on transformers 4.x.
- **vLLM nightly (0.18+) with transformers 5.3.0** — vLLM stable (0.16) pins `transformers<5`. The nightly allows the upgrade. vLLM does not have a native `Ministral3ForCausalLM` — it falls back to `TransformersForCausalLM`, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (`llama_4_scaling_beta`) and YaRN RoPE properly.
> **Warning:** Do NOT override the architecture to `MistralForCausalLM`. While the model will load and serve, `MistralForCausalLM` silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output.
## Model Details
| Property | Value |
|---|---|
| Architecture | `Ministral3ForCausalLM` |
| Model type | `ministral3` |
| Parameters | 23.57B |
| Quantization | FP8 W8A8 static (`float8_e4m3fn`) |
| Layers | 40 |
| Hidden size | 5120 |
| Attention heads | 32 (8 KV heads) |
| Context length | 393K tokens (YaRN RoPE) |
| Vocab size | 131,072 |
| Size on disk | ~24.9 GB |
## What Changed
The source model (`Mistral3ForConditionalGeneration`) is a VLM containing:
- **Language model** (23.57B params, FP8) — kept
- **Vision tower** (Pixtral, ~0.4B params, BF16) — removed
- **Multimodal projector** (BF16) — removed
Changes from the original:
1. Stripped `language_model.*` prefix from all tensor names
2. Config: `Ministral3ForCausalLM` / `model_type: "ministral3"` (requires transformers >= 5.0)
3. Quantization config: removed vision module references from `modules_to_not_convert`
4. Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale` (same values, no inversion; both conventions use multiplication for dequantization)
## Usage
### With vLLM (nightly + transformers 5)
```bash
# Requires a vLLM nightly build (stable 0.16 pins transformers<5); see Requirements.
pip install "transformers>=5.0"
vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
```
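Once the server is up, it exposes vLLM's standard OpenAI-compatible API. A minimal request sketch using only the standard library (the endpoint URL assumes `vllm serve` defaults; any OpenAI-compatible client works equally well):

```python
import json
import urllib.request

# Chat request payload for the OpenAI-compatible /v1/chat/completions endpoint
payload = {
    "model": "levara/Devstral-Small-2-24B-TextOnly-FP8",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    "max_tokens": 256,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```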
vLLM will resolve to the `TransformersForCausalLM` backend, which delegates to transformers 5's native `Ministral3ForCausalLM`.
### With transformers (>= 5.0)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("levara/Devstral-Small-2-24B-TextOnly-FP8")
model = AutoModelForCausalLM.from_pretrained(
    "levara/Devstral-Small-2-24B-TextOnly-FP8",
    device_map="auto",
    dtype=torch.bfloat16,  # transformers 5 uses `dtype` (formerly `torch_dtype`)
)

messages = [{"role": "user", "content": "Write a function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```
**Note:** Native FP8 inference requires an SM 8.9+ GPU (e.g. RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM falls back to the Marlin kernel for weight-only dequantization. For CPU inference, set `dequantize: true` in the quantization config.
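A quick way to check whether the current GPU clears the SM 8.9 bar. The helper below is a sketch, not part of any library; compute capability tuples compare element-wise, which is exactly the check needed:

```python
def supports_native_fp8(capability: tuple[int, int]) -> bool:
    """True if a CUDA compute capability supports native FP8 (e4m3) matmuls."""
    # SM 8.9 (Ada) and newer: native FP8 tensor cores
    return capability >= (8, 9)

# On a machine with torch and a CUDA GPU:
# import torch
# print(supports_native_fp8(torch.cuda.get_device_capability()))
```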
## Verification
Verified against the original VLM:
- 923 tensors, 40 layers, no vision keys
- FP8 dtypes preserved on all linear weights
- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065
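The logprob comparison can be reproduced with a small helper along these lines. This is a sketch under assumed inputs (per-model dicts of `{token_id: logprob}` for the first generated token); the actual verification script is not included in this repo.

```python
def compare_first_token(lp_ref: dict, lp_new: dict, k: int = 20):
    """Compare first-token logprobs from a reference and a converted model.

    Returns (top-1 match, top-k overlap fraction, max abs logprob diff
    over the tokens appearing in both models' top-k).
    """
    top_ref = sorted(lp_ref, key=lp_ref.get, reverse=True)[:k]
    top_new = sorted(lp_new, key=lp_new.get, reverse=True)[:k]
    shared = set(top_ref) & set(top_new)
    top1_match = top_ref[0] == top_new[0]
    overlap = len(shared) / k
    max_diff = max(abs(lp_ref[t] - lp_new[t]) for t in shared)
    return top1_match, overlap, max_diff
```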
## Why Not MistralForCausalLM?
The original VLM avoids this problem because `Mistral3ForConditionalGeneration` loads the text backbone through its own internal code path, bypassing the model registry. When we extract the text model standalone, we need an architecture that preserves Ministral3-specific features:
- **Position-dependent attention scaling** (`llama_4_scaling_beta`) — dampens attention at longer positions
- **YaRN RoPE** with `beta_fast`, `beta_slow`, `mscale` — context length scaling
`MistralForCausalLM` ignores these config fields. `Ministral3ForCausalLM` (transformers 5) handles them correctly.