---
license: apache-2.0
base_model: mistralai/Devstral-Small-2-24B-Instruct-2512
tags:
- mistral
- ministral3
- text-only
- fp8
- code
- vllm
library_name: transformers
pipeline_tag: text-generation
---

# Devstral-Small-2-24B TextOnly FP8

Text-only version of [mistralai/Devstral-Small-2-24B-Instruct-2512](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512) with the Pixtral vision encoder and multimodal projector removed. Native FP8 weights, vLLM-compatible scale naming. No dtype conversion — tensors copied byte-for-byte from the original.

## Requirements

- **transformers >= 5.0** — the `ministral3` model type and `Ministral3ForCausalLM` class were added in transformers 5.0. Will not load on transformers 4.x.
- **vLLM nightly (0.18+) with transformers 5.3.0** — vLLM stable (0.16) pins `transformers<5`; the nightly allows the upgrade. vLLM does not have a native `Ministral3ForCausalLM` — it falls back to `TransformersForCausalLM`, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (`llama_4_scaling_beta`) and YaRN RoPE properly.

> **Warning:** Do NOT override the architecture to `MistralForCausalLM`. While the model will load and serve, `MistralForCausalLM` silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output.
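A quick guard before loading can make the requirement explicit instead of surfacing as an opaque "unrecognized model type" error. A minimal sketch with a hypothetical helper name (`supports_ministral3` is not part of any library):

```python
# Hypothetical version guard: the `ministral3` model type exists only in
# transformers >= 5.0, so fail fast on 4.x installs.
def supports_ministral3(version: str) -> bool:
    # Compare the major version; local-version suffixes like "+cu121" are stripped.
    major = int(version.split(".")[0].split("+")[0])
    return major >= 5

supports_ministral3("5.0.0")   # True
supports_ministral3("4.57.1")  # False — AutoModel would reject `ministral3`
```

In practice you would call `supports_ministral3(transformers.__version__)` before `from_pretrained` and raise with an actionable message if it returns `False`.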
## Model Details

| Property | Value |
|---|---|
| Architecture | `Ministral3ForCausalLM` |
| Model type | `ministral3` |
| Parameters | 23.57B |
| Quantization | FP8 W8A8 static (`float8_e4m3fn`) |
| Layers | 40 |
| Hidden size | 5120 |
| Attention heads | 32 (8 KV heads) |
| Context length | 393K tokens (YaRN RoPE) |
| Vocab size | 131,072 |
| Size on disk | ~24.9 GB |

## What Changed

The source model (`Mistral3ForConditionalGeneration`) is a VLM containing:

- **Language model** (23.57B params, FP8) — kept
- **Vision tower** (Pixtral, ~0.4B params, BF16) — removed
- **Multimodal projector** (BF16) — removed

Changes from the original:

1. Stripped the `language_model.*` prefix from all tensor names
2. Config: `Ministral3ForCausalLM` / `model_type: "ministral3"` (requires transformers >= 5.0)
3. Quantization config: removed vision module references from `modules_to_not_convert`
4. Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale` (same values, no inversion — both conventions use multiplication for dequantization)

## Usage

### With vLLM (nightly + transformers 5)

```bash
pip install "transformers>=5.0"
vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
```

vLLM will resolve to the `TransformersForCausalLM` backend, which delegates to transformers 5's native `Ministral3ForCausalLM`.

### With transformers (>= 5.0)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("levara/Devstral-Small-2-24B-TextOnly-FP8")
model = AutoModelForCausalLM.from_pretrained(
    "levara/Devstral-Small-2-24B-TextOnly-FP8",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

**Note:** Native FP8 inference requires an SM 8.9+ GPU (RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM uses the Marlin kernel for weight-only dequantization.
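Which path a given GPU takes comes down to its compute capability. A minimal sketch (the helper name is hypothetical; in practice the `(major, minor)` pair comes from `torch.cuda.get_device_capability()`):

```python
# Hypothetical helper: native FP8 (float8_e4m3fn) matmul requires
# compute capability 8.9+ (Ada Lovelace / Hopper and newer).
def has_native_fp8(major: int, minor: int) -> bool:
    return (major, minor) >= (8, 9)

has_native_fp8(8, 9)  # RTX 4090 — native FP8
has_native_fp8(9, 0)  # H100 — native FP8
has_native_fp8(8, 6)  # RTX 3090 — vLLM falls back to the Marlin kernel
```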
For CPU, set `dequantize: true` in the quantization config.

## Verification

Verified against the original VLM:

- 923 tensors, 40 layers, no vision keys
- FP8 dtypes preserved on all linear weights
- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065

## Why Not MistralForCausalLM?

The original VLM avoids this problem because `Mistral3ForConditionalGeneration` loads the text backbone through its own internal code path, bypassing the model registry. When we extract the text model standalone, we need an architecture that preserves Ministral3-specific features:

- **Position-dependent attention scaling** (`llama_4_scaling_beta`) — dampens attention at longer positions
- **YaRN RoPE** with `beta_fast`, `beta_slow`, `mscale` — context length scaling

`MistralForCausalLM` ignores these config fields. `Ministral3ForCausalLM` (transformers 5) handles them correctly.
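To make the loss concrete, here is a hedged sketch (hypothetical helper operating on a plain `config.json`-style dict; field names taken from the list above) of what a `MistralForCausalLM` load would silently ignore:

```python
# Ministral3-specific config fields that MistralForCausalLM silently drops.
MINISTRAL3_FIELDS = ("llama_4_scaling_beta", "rope_scaling")

def silently_dropped(config: dict) -> list:
    """Return the Ministral3 fields in `config` that a plain
    MistralForCausalLM load would ignore."""
    return [f for f in MINISTRAL3_FIELDS if f in config]

silently_dropped({"hidden_size": 5120})
# [] — a vanilla Mistral config loses nothing

silently_dropped({"llama_4_scaling_beta": 0.5, "hidden_size": 5120})
# ["llama_4_scaling_beta"] — the attention-scaling field would be lost
```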