---
license: apache-2.0
base_model: mistralai/Devstral-Small-2-24B-Instruct-2512
tags:
- mistral
- ministral3
- text-only
- fp8
- code
- vllm
library_name: transformers
pipeline_tag: text-generation
---
# Devstral-Small-2-24B TextOnly FP8
Text-only version of [mistralai/Devstral-Small-2-24B-Instruct-2512](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512) with the Pixtral vision encoder and multimodal projector removed.
Native FP8 weights with vLLM-compatible scale naming. No dtype conversion was performed; tensors were copied byte-for-byte from the original.
## Requirements
- **transformers >= 5.0** — the `ministral3` model type and `Ministral3ForCausalLM` class were added in transformers 5.0. Will not load on transformers 4.x.
- **vLLM nightly (0.18+) with transformers 5.3.0** — vLLM stable (0.16) pins `transformers<5`. The nightly allows the upgrade. vLLM does not have a native `Ministral3ForCausalLM` — it falls back to `TransformersForCausalLM`, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (`llama_4_scaling_beta`) and YaRN RoPE properly.
> **Warning:** Do NOT override the architecture to `MistralForCausalLM`. While the model will load and serve, `MistralForCausalLM` silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output.
## Model Details
| Property | Value |
|---|---|
| Architecture | `Ministral3ForCausalLM` |
| Model type | `ministral3` |
| Parameters | 23.57B |
| Quantization | FP8 W8A8 static (`float8_e4m3fn`) |
| Layers | 40 |
| Hidden size | 5120 |
| Attention heads | 32 (8 KV heads) |
| Context length | 393K tokens (YaRN RoPE) |
| Vocab size | 131,072 |
| Size on disk | ~24.9 GB |
## What Changed
The source model (`Mistral3ForConditionalGeneration`) is a VLM containing:
- **Language model** (23.57B params, FP8) — kept
- **Vision tower** (Pixtral, ~0.4B params, BF16) — removed
- **Multimodal projector** (BF16) — removed
Changes from the original:
1. Stripped `language_model.*` prefix from all tensor names
2. Config: `Ministral3ForCausalLM` / `model_type: "ministral3"` (requires transformers >= 5.0)
3. Quantization config: removed vision module references from `modules_to_not_convert`
4. Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale` (same values, no inversion; both conventions use multiplication for dequantization)
## Usage
### With vLLM (nightly + transformers 5)
```bash
# Requires a vLLM nightly build (stable 0.16 pins transformers<5); see Requirements.
pip install "transformers>=5.0"
vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
```
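Once the server is up, it exposes vLLM's standard OpenAI-compatible API. A minimal request sketch using only the standard library (the endpoint URL assumes `vllm serve` defaults; any OpenAI-compatible client works equally well):

```python
import json
import urllib.request

# Chat request payload for the OpenAI-compatible /v1/chat/completions endpoint
payload = {
    "model": "levara/Devstral-Small-2-24B-TextOnly-FP8",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    "max_tokens": 256,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```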
vLLM will resolve to the `TransformersForCausalLM` backend, which delegates to transformers 5's native `Ministral3ForCausalLM`.
### With transformers (>= 5.0)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("levara/Devstral-Small-2-24B-TextOnly-FP8")
model = AutoModelForCausalLM.from_pretrained(
    "levara/Devstral-Small-2-24B-TextOnly-FP8",
    device_map="auto",
    dtype=torch.bfloat16,  # transformers 5 uses `dtype` (formerly `torch_dtype`)
)

messages = [{"role": "user", "content": "Write a function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```
**Note:** Native FP8 inference requires an SM 8.9+ GPU (e.g. RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM falls back to the Marlin kernel for weight-only dequantization. For CPU inference, set `dequantize: true` in the quantization config.
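A quick way to check whether the current GPU clears the SM 8.9 bar. The helper below is a sketch, not part of any library; compute capability tuples compare element-wise, which is exactly the check needed:

```python
def supports_native_fp8(capability: tuple[int, int]) -> bool:
    """True if a CUDA compute capability supports native FP8 (e4m3) matmuls."""
    # SM 8.9 (Ada) and newer: native FP8 tensor cores
    return capability >= (8, 9)

# On a machine with torch and a CUDA GPU:
# import torch
# print(supports_native_fp8(torch.cuda.get_device_capability()))
```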
## Verification
Verified against the original VLM:
- 923 tensors, 40 layers, no vision keys
- FP8 dtypes preserved on all linear weights
- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065
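The logprob comparison can be reproduced with a small helper along these lines. This is a sketch under assumed inputs (per-model dicts of `{token_id: logprob}` for the first generated token); the actual verification script is not included in this repo.

```python
def compare_first_token(lp_ref: dict, lp_new: dict, k: int = 20):
    """Compare first-token logprobs from a reference and a converted model.

    Returns (top-1 match, top-k overlap fraction, max abs logprob diff
    over the tokens appearing in both models' top-k).
    """
    top_ref = sorted(lp_ref, key=lp_ref.get, reverse=True)[:k]
    top_new = sorted(lp_new, key=lp_new.get, reverse=True)[:k]
    shared = set(top_ref) & set(top_new)
    top1_match = top_ref[0] == top_new[0]
    overlap = len(shared) / k
    max_diff = max(abs(lp_ref[t] - lp_new[t]) for t in shared)
    return top1_match, overlap, max_diff
```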
## Why Not MistralForCausalLM?
The original VLM avoids this problem because `Mistral3ForConditionalGeneration` loads the text backbone through its own internal code path, bypassing the model registry. When we extract the text model standalone, we need an architecture that preserves Ministral3-specific features:
- **Position-dependent attention scaling** (`llama_4_scaling_beta`) — dampens attention at longer positions
- **YaRN RoPE** with `beta_fast`, `beta_slow`, `mscale` — context length scaling
`MistralForCausalLM` ignores these config fields. `Ministral3ForCausalLM` (transformers 5) handles them correctly.