levara committed · Commit a9b25b1 · verified · 1 Parent(s): b3671d5

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +34 -13
  2. config.json +2 -2
README.md CHANGED
@@ -3,6 +3,7 @@ license: apache-2.0
 base_model: mistralai/Devstral-Small-2-24B-Instruct-2512
 tags:
 - mistral
+- ministral3
 - text-only
 - fp8
 - code
@@ -17,11 +18,19 @@ Text-only version of [mistralai/Devstral-Small-2-24B-Instruct-2512](https://hugg
 
 Native FP8 weights, vLLM-compatible scale naming. No dtype conversion — tensors copied byte-for-byte from the original.
 
+## Requirements
+
+- **transformers >= 5.0** — the `ministral3` model type and `Ministral3ForCausalLM` class were added in transformers 5.0. Will not load on transformers 4.x.
+- **vLLM nightly (0.18+) with transformers 5.3.0** — vLLM stable (0.16) pins `transformers<5`. The nightly allows the upgrade. vLLM does not have a native `Ministral3ForCausalLM` — it falls back to `TransformersForCausalLM`, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (`llama_4_scaling_beta`) and YaRN RoPE properly.
+
+> **Warning:** Do NOT override the architecture to `MistralForCausalLM`. While the model will load and serve, `MistralForCausalLM` silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output.
+
 ## Model Details
 
 | Property | Value |
 |---|---|
-| Architecture | `MistralForCausalLM` |
+| Architecture | `Ministral3ForCausalLM` |
+| Model type | `ministral3` |
 | Parameters | 23.57B |
 | Quantization | FP8 W8A8 static (`float8_e4m3fn`) |
 | Layers | 40 |
@@ -40,23 +49,17 @@ The source model (`Mistral3ForConditionalGeneration`) is a VLM containing:
 
 Changes from the original:
 1. Stripped `language_model.*` prefix from all tensor names
-2. Config: `MistralForCausalLM` / `model_type: "mistral"` (compatible with transformers 4.x and vLLM)
+2. Config: `Ministral3ForCausalLM` / `model_type: "ministral3"` (requires transformers >= 5.0)
 3. Quantization config: removed vision module references from `modules_to_not_convert`
-4. Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale`
-
-## Verification
-
-Verified against the original VLM running on vLLM:
-- 923 tensors, 40 layers, no vision keys
-- FP8 dtypes preserved on all linear weights
-- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065 (FP8 dequantize vs native precision noise)
-- Successfully served on vLLM with tensor parallelism on 2x RTX 3090
+4. Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale` (same values, no inversion — both conventions use multiplication for dequantization)
 
 ## Usage
 
-### With vLLM
+### With vLLM (nightly + transformers 5)
 
 ```bash
+pip install "transformers>=5.0"
+
 vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
     --tensor-parallel-size 2 \
     --max-model-len 32768 \
@@ -64,7 +67,9 @@ vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
     --tool-call-parser mistral
 ```
 
-### With transformers
+vLLM will resolve to the `TransformersForCausalLM` backend, which delegates to transformers 5's native `Ministral3ForCausalLM`.
+
+### With transformers (>= 5.0)
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -79,3 +84,19 @@ model = AutoModelForCausalLM.from_pretrained(
 ```
 
 **Note:** For native FP8 inference, requires SM 8.9+ GPU (RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM uses the Marlin kernel for weight-only dequantization. For CPU, set `dequantize: true` in the quantization config.
+
+## Verification
+
+Verified against the original VLM:
+- 923 tensors, 40 layers, no vision keys
+- FP8 dtypes preserved on all linear weights
+- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065
+
+## Why Not MistralForCausalLM?
+
+The original VLM avoids this problem because `Mistral3ForConditionalGeneration` loads the text backbone through its own internal code path, bypassing the model registry. When we extract the text model standalone, we need an architecture that preserves Ministral3-specific features:
+
+- **Position-dependent attention scaling** (`llama_4_scaling_beta`) — dampens attention at longer positions
+- **YaRN RoPE** with `beta_fast`, `beta_slow`, `mscale` — context length scaling
+
+`MistralForCausalLM` ignores these config fields. `Ministral3ForCausalLM` (transformers 5) handles them correctly.
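The README's change list describes two pure tensor-name rewrites (stripping the `language_model.*` prefix, renaming the FP8 scale suffixes) with values left untouched. A minimal sketch of that key mapping — the function name is illustrative, not the author's conversion script, which would additionally stream safetensors shards byte-for-byte:

```python
def rename_key(name: str) -> str:
    """Map a source-VLM tensor name to the text-only, vLLM-friendly name.

    - strips the "language_model." prefix added by the VLM wrapper
    - renames FP8 scale tensors to vLLM's expected suffixes
      (values unchanged; both conventions multiply to dequantize)
    """
    if name.startswith("language_model."):
        name = name[len("language_model."):]
    if name.endswith(".activation_scale"):
        name = name[: -len("activation_scale")] + "input_scale"
    elif name.endswith(".weight_scale_inv"):
        name = name[: -len("weight_scale_inv")] + "weight_scale"
    return name


print(rename_key("language_model.model.layers.0.self_attn.q_proj.activation_scale"))
# → model.layers.0.self_attn.q_proj.input_scale
```

Names that carry neither the prefix nor a scale suffix (e.g. `model.embed_tokens.weight`) pass through unchanged, which matches the byte-for-byte copy the README describes.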
config.json CHANGED
@@ -6,7 +6,7 @@
   "initializer_range": 0.02,
   "intermediate_size": 32768,
   "max_position_embeddings": 393216,
-  "model_type": "mistral",
+  "model_type": "ministral3",
   "num_attention_heads": 32,
   "num_hidden_layers": 40,
   "num_key_value_heads": 8,
@@ -27,7 +27,7 @@
   "use_cache": true,
   "vocab_size": 131072,
   "architectures": [
-    "MistralForCausalLM"
+    "Ministral3ForCausalLM"
   ],
   "torch_dtype": "bfloat16",
   "quantization_config": {
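The README's point that both scale-naming conventions multiply during dequantization (so renaming `weight_scale_inv` to `weight_scale` needs no value inversion) can be illustrated with a toy static round trip. Plain Python floats stand in for `float8_e4m3fn`; only the scale-and-clamp step is modeled (real FP8 also rounds the mantissa), and the ±448 clamp is the e4m3fn finite range:

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn magnitude

def quantize(weights, scale):
    """Static quantization: divide by the scale, clamp to the FP8 range."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]

def dequantize(q_weights, scale):
    """Dequantization multiplies by the scale under BOTH naming
    conventions — `weight_scale` and the source's `weight_scale_inv`
    hold the same value, so no inversion is needed on rename."""
    return [q * scale for q in q_weights]

# In-range values survive the round trip; out-of-range values clamp.
print(dequantize(quantize([0.5, -1.25], 2.0), 2.0))  # [0.5, -1.25]
print(dequantize(quantize([1000.0], 2.0), 2.0))      # [896.0]
```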
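The Verification section of the README reports top-1 match, top-20 overlap, and a max logprob diff for the first generated token. Those metrics reduce to set operations over ranked token ids; a sketch of the metric only — the harness that queries both models is not shown, and this helper is assumed, not the author's code:

```python
def compare_first_token(logprobs_a, logprobs_b, k=20):
    """Compare two {token_id: logprob} maps for the first generated token.

    Returns (top1_match, topk_overlap, max_abs_diff over shared top-k tokens).
    """
    # Rank token ids by logprob, descending, and keep the top k.
    rank_a = sorted(logprobs_a, key=logprobs_a.get, reverse=True)[:k]
    rank_b = sorted(logprobs_b, key=logprobs_b.get, reverse=True)[:k]
    shared = set(rank_a) & set(rank_b)
    max_diff = max(abs(logprobs_a[t] - logprobs_b[t]) for t in shared)
    return rank_a[0] == rank_b[0], len(shared) / k, max_diff
```

On the README's numbers this would report `(True, 0.80, 0.065)` — the same model distribution up to FP8 dequantize noise.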