levara committed · Commit a9b25b1 · verified · 1 Parent(s): b3671d5

Upload folder using huggingface_hub

Files changed (2):
  1. README.md +34 -13
  2. config.json +2 -2
README.md CHANGED
@@ -3,6 +3,7 @@ license: apache-2.0
 base_model: mistralai/Devstral-Small-2-24B-Instruct-2512
 tags:
 - mistral
+- ministral3
 - text-only
 - fp8
 - code
@@ -17,11 +18,19 @@ Text-only version of [mistralai/Devstral-Small-2-24B-Instruct-2512](https://hugg
 
 Native FP8 weights, vLLM-compatible scale naming. No dtype conversion — tensors copied byte-for-byte from the original.
 
+## Requirements
+
+- **transformers >= 5.0** — the `ministral3` model type and `Ministral3ForCausalLM` class were added in transformers 5.0. Will not load on transformers 4.x.
+- **vLLM nightly (0.18+) with transformers 5.3.0** — vLLM stable (0.16) pins `transformers<5`. The nightly allows the upgrade. vLLM does not have a native `Ministral3ForCausalLM` — it falls back to `TransformersForCausalLM`, which delegates to transformers 5's implementation. This is the correct path: it handles Ministral3's attention scaling (`llama_4_scaling_beta`) and YaRN RoPE properly.
+
+> **Warning:** Do NOT override the architecture to `MistralForCausalLM`. While the model will load and serve, `MistralForCausalLM` silently drops the position-dependent attention scaling and YaRN RoPE parameters, producing wordier and less disciplined output.
+
 ## Model Details
 
 | Property | Value |
 |---|---|
-| Architecture | `MistralForCausalLM` |
+| Architecture | `Ministral3ForCausalLM` |
+| Model type | `ministral3` |
 | Parameters | 23.57B |
 | Quantization | FP8 W8A8 static (`float8_e4m3fn`) |
 | Layers | 40 |
@@ -40,23 +49,17 @@ The source model (`Mistral3ForConditionalGeneration`) is a VLM containing:
 
 Changes from the original:
 1. Stripped `language_model.*` prefix from all tensor names
-2. Config: `MistralForCausalLM` / `model_type: "mistral"` (compatible with transformers 4.x and vLLM)
+2. Config: `Ministral3ForCausalLM` / `model_type: "ministral3"` (requires transformers >= 5.0)
 3. Quantization config: removed vision module references from `modules_to_not_convert`
-4. Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale`
-
-## Verification
-
-Verified against the original VLM running on vLLM:
-- 923 tensors, 40 layers, no vision keys
-- FP8 dtypes preserved on all linear weights
-- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065 (FP8 dequantize vs native precision noise)
-- Successfully served on vLLM with tensor parallelism on 2x RTX 3090
+4. Renamed FP8 scale tensors for vLLM compatibility: `activation_scale` → `input_scale`, `weight_scale_inv` → `weight_scale` (same values, no inversion — both conventions use multiplication for dequantization)
 
 ## Usage
 
-### With vLLM
+### With vLLM (nightly + transformers 5)
 
 ```bash
+pip install "transformers>=5.0"
+
 vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
     --tensor-parallel-size 2 \
     --max-model-len 32768 \
@@ -64,7 +67,9 @@ vllm serve levara/Devstral-Small-2-24B-TextOnly-FP8 \
     --tool-call-parser mistral
 ```
 
-### With transformers
+vLLM will resolve to the `TransformersForCausalLM` backend, which delegates to transformers 5's native `Ministral3ForCausalLM`.
+
+### With transformers (>= 5.0)
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -79,3 +84,19 @@ model = AutoModelForCausalLM.from_pretrained(
 ```
 
 **Note:** For native FP8 inference, requires SM 8.9+ GPU (RTX 4090, H100). On older GPUs (e.g. RTX 3090), vLLM uses the Marlin kernel for weight-only dequantization. For CPU, set `dequantize: true` in the quantization config.
+
+## Verification
+
+Verified against the original VLM:
+- 923 tensors, 40 layers, no vision keys
+- FP8 dtypes preserved on all linear weights
+- First-token logprob comparison: top-1 match, 80% top-20 overlap, max logprob diff 0.065
+
+## Why Not MistralForCausalLM?
+
+The original VLM avoids this problem because `Mistral3ForConditionalGeneration` loads the text backbone through its own internal code path, bypassing the model registry. When we extract the text model standalone, we need an architecture that preserves Ministral3-specific features:
+
+- **Position-dependent attention scaling** (`llama_4_scaling_beta`) — dampens attention at longer positions
+- **YaRN RoPE** with `beta_fast`, `beta_slow`, `mscale` — context length scaling
+
+`MistralForCausalLM` ignores these config fields. `Ministral3ForCausalLM` (transformers 5) handles them correctly.
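The README's change list describes two pure tensor-name rewrites (stripping the `language_model.*` prefix, renaming the FP8 scale suffixes) with values left untouched. A minimal sketch of that key mapping — the function name is illustrative, not the author's conversion script, which would additionally stream safetensors shards byte-for-byte:

```python
def rename_key(name: str) -> str:
    """Map a source-VLM tensor name to the text-only, vLLM-friendly name.

    - strips the "language_model." prefix added by the VLM wrapper
    - renames FP8 scale tensors to vLLM's expected suffixes
      (values unchanged; both conventions multiply to dequantize)
    """
    if name.startswith("language_model."):
        name = name[len("language_model."):]
    if name.endswith(".activation_scale"):
        name = name[: -len("activation_scale")] + "input_scale"
    elif name.endswith(".weight_scale_inv"):
        name = name[: -len("weight_scale_inv")] + "weight_scale"
    return name


print(rename_key("language_model.model.layers.0.self_attn.q_proj.activation_scale"))
# → model.layers.0.self_attn.q_proj.input_scale
```

Names that carry neither the prefix nor a scale suffix (e.g. `model.embed_tokens.weight`) pass through unchanged, which matches the byte-for-byte copy the README describes.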
config.json CHANGED
@@ -6,7 +6,7 @@
   "initializer_range": 0.02,
   "intermediate_size": 32768,
   "max_position_embeddings": 393216,
-  "model_type": "mistral",
+  "model_type": "ministral3",
   "num_attention_heads": 32,
   "num_hidden_layers": 40,
   "num_key_value_heads": 8,
@@ -27,7 +27,7 @@
   "use_cache": true,
   "vocab_size": 131072,
   "architectures": [
-    "MistralForCausalLM"
+    "Ministral3ForCausalLM"
   ],
   "torch_dtype": "bfloat16",
   "quantization_config": {
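The README's point that both scale-naming conventions multiply during dequantization (so renaming `weight_scale_inv` to `weight_scale` needs no value inversion) can be illustrated with a toy static round trip. Plain Python floats stand in for `float8_e4m3fn`; only the scale-and-clamp step is modeled (real FP8 also rounds the mantissa), and the ±448 clamp is the e4m3fn finite range:

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn magnitude

def quantize(weights, scale):
    """Static quantization: divide by the scale, clamp to the FP8 range."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]

def dequantize(q_weights, scale):
    """Dequantization multiplies by the scale under BOTH naming
    conventions — `weight_scale` and the source's `weight_scale_inv`
    hold the same value, so no inversion is needed on rename."""
    return [q * scale for q in q_weights]

# In-range values survive the round trip; out-of-range values clamp.
print(dequantize(quantize([0.5, -1.25], 2.0), 2.0))  # [0.5, -1.25]
print(dequantize(quantize([1000.0], 2.0), 2.0))      # [896.0]
```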
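The Verification section of the README reports top-1 match, top-20 overlap, and a max logprob diff for the first generated token. Those metrics reduce to set operations over ranked token ids; a sketch of the metric only — the harness that queries both models is not shown, and this helper is assumed, not the author's code:

```python
def compare_first_token(logprobs_a, logprobs_b, k=20):
    """Compare two {token_id: logprob} maps for the first generated token.

    Returns (top1_match, topk_overlap, max_abs_diff over shared top-k tokens).
    """
    # Rank token ids by logprob, descending, and keep the top k.
    rank_a = sorted(logprobs_a, key=logprobs_a.get, reverse=True)[:k]
    rank_b = sorted(logprobs_b, key=logprobs_b.get, reverse=True)[:k]
    shared = set(rank_a) & set(rank_b)
    max_diff = max(abs(logprobs_a[t] - logprobs_b[t]) for t in shared)
    return rank_a[0] == rank_b[0], len(shared) / k, max_diff
```

On the README's numbers this would report `(True, 0.80, 0.065)` — the same model distribution up to FP8 dequantize noise.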