# Devstral-Small-2-24B TextOnly FP8
Text-only version of `mistralai/Devstral-Small-2-24B-Instruct-2512` with the Pixtral vision encoder and multimodal projector removed.

Native FP8 weights, vLLM-compatible scale naming. No dtype conversion — tensors were copied byte-for-byte from the original.
## Requirements

- **transformers >= 5.0** — the `ministral3` model type and `Ministral3ForCausalLM` class were added in transformers 5.0. The model will not load on transformers 4.x.
- **vLLM nightly (0.18+) with transformers 5.3.0** — vLLM stable (0.16) pins `transformers<5`; the nightly allows the upgrade. vLLM has no native `Ministral3ForCausalLM`, so it falls back to `TransformersForCausalLM`, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (`llama_4_scaling_beta`) and YaRN RoPE properly.
> **Warning:** Do NOT override the architecture to `MistralForCausalLM`. The model will load and serve, but `MistralForCausalLM` silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output.
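As a quick guard against that misconfiguration, a hedged sketch of a pre-serve config check (the exact key names `architectures` and `llama_4_scaling_beta` are taken from this card; treating `llama_4_scaling_beta` as a top-level `config.json` key is an assumption):

```python
# Hypothetical sanity check: confirm the config still advertises the
# Ministral3 architecture and keeps the attention-scaling field that
# MistralForCausalLM would silently ignore. Key nesting is assumed.
def check_config(cfg: dict) -> list[str]:
    problems = []
    if cfg.get("architectures") != ["Ministral3ForCausalLM"]:
        problems.append("architecture overridden")
    if "llama_4_scaling_beta" not in cfg:
        problems.append("missing llama_4_scaling_beta")
    return problems
```

An empty list means the config looks intact; run it on the parsed `config.json` before pointing an inference server at the weights.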
## Model Details

| Property | Value |
|---|---|
| Architecture | `Ministral3ForCausalLM` |
| Model type | `ministral3` |
| Parameters | 23.57B |
| Quantization | FP8 W8A8 static (`float8_e4m3fn`) |
| Layers | 40 |
| Hidden size | 5120 |
| Attention heads | 32 (8 KV heads) |
| Context length | 393K tokens (YaRN RoPE) |
| Vocab size | 131,072 |
| Size on disk | ~24.9 GB |
## What Changed

The source model (`Mistral3ForConditionalGeneration`) is a VLM containing:

- Language model (23.57B params, FP8) — kept
- Vision tower (Pixtral, ~0.4B params, BF16) — removed
- Multimodal projector (BF16) — removed
Changes from the original:

- Stripped the `language_model.*` prefix from all tensor names
- Config: `Ministral3ForCausalLM` / `model_type: "ministral3"` (requires transformers >= 5.0)
- Quantization config: removed vision module references from `modules_to_not_convert`
- Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale` (same values, no inversion — both conventions multiply by the scale to dequantize)
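The tensor-name changes above amount to a pure string mapping. A minimal sketch (illustration only; the actual conversion also copies tensor data byte-for-byte):

```python
# Sketch of the renaming rules described above: strip the VLM prefix,
# then map the FP8 scale-name conventions to vLLM's expected names.
def rename(name: str) -> str:
    prefix = "language_model."
    if name.startswith(prefix):
        name = name[len(prefix):]
    if name.endswith(".activation_scale"):
        name = name[: -len("activation_scale")] + "input_scale"
    elif name.endswith(".weight_scale_inv"):
        name = name[: -len("weight_scale_inv")] + "weight_scale"
    return name
```

Tensors outside the language model prefix (e.g. `lm_head.weight`) pass through unchanged.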
## Usage

### With vLLM (nightly + transformers 5)

```shell
pip install "transformers>=5.0"
vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral
```

Quoting `transformers>=5.0` keeps the shell from interpreting `>` as output redirection.
vLLM will resolve to the `TransformersForCausalLM` backend, which delegates to transformers 5's native `Ministral3ForCausalLM`.
### With transformers (>= 5.0)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("levara/Devstral-Small-2-24B-TextOnly-FP8")
model = AutoModelForCausalLM.from_pretrained(
    "levara/Devstral-Small-2-24B-TextOnly-FP8",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```
**Note:** Native FP8 inference requires an SM 8.9+ GPU (e.g. RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM uses the Marlin kernel for weight-only dequantization. For CPU, set `dequantize: true` in the quantization config.
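To illustrate the multiplication convention the scale tensors use, here is a toy per-tensor FP8 scale computation (a sketch, not this checkpoint's calibration code; it relies only on the fact that `float8_e4m3fn`'s largest finite magnitude is 448):

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def weight_scale(weights: list[float]) -> float:
    # Per-tensor scale chosen so that weight / scale fits the FP8 range.
    return max(abs(w) for w in weights) / FP8_E4M3_MAX

def dequantize(q: float, scale: float) -> float:
    # Both naming conventions here (weight_scale, input_scale) dequantize
    # by multiplying the stored FP8 value by the scale — no inversion.
    return q * scale
```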
## Verification

Verified against the original VLM:

- 923 tensors, 40 layers, no vision keys
- FP8 dtypes preserved on all linear weights
- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065
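The logprob comparison above can be reproduced with a small helper (a sketch over token → logprob mappings for the first generated position; the metric names are assumptions, not an existing API):

```python
def compare_top_logprobs(a: dict, b: dict, k: int = 20) -> dict:
    # a, b: token -> logprob from the two models at the same position.
    top_a = sorted(a, key=a.__getitem__, reverse=True)[:k]
    top_b = sorted(b, key=b.__getitem__, reverse=True)[:k]
    shared = set(top_a) & set(top_b)
    return {
        "top1_match": top_a[0] == top_b[0],
        "topk_overlap": len(shared) / k,
        "max_diff": max(abs(a[t] - b[t]) for t in shared),
    }
```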
## Why Not MistralForCausalLM?

The original VLM avoids this problem because `Mistral3ForConditionalGeneration` loads the text backbone through its own internal code path, bypassing the model registry. When we extract the text model standalone, we need an architecture that preserves Ministral3-specific features:

- Position-dependent attention scaling (`llama_4_scaling_beta`) — dampens attention at longer positions
- YaRN RoPE with `beta_fast`, `beta_slow`, `mscale` — context-length scaling

`MistralForCausalLM` ignores these config fields; `Ministral3ForCausalLM` (transformers 5) handles them correctly.
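For intuition only, here is a hedged illustration of what a position-dependent attention scaling factor looks like. The logarithmic form and all constants below are illustrative assumptions, not Ministral3's actual implementation:

```python
import math

# Illustrative only: a slowly varying, position-dependent factor of the
# kind a parameter like llama_4_scaling_beta controls. The real model's
# functional form and constants differ; the point is that the factor
# depends on absolute position, which MistralForCausalLM never computes.
def attn_scale(position: int, beta: float = 0.1, anchor: int = 8192) -> float:
    return 1.0 + beta * math.log(1.0 + position / anchor)
```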
## Model tree

Base model: `mistralai/Mistral-Small-3.1-24B-Base-2503`