kleinpanic93 committed on
Commit
d196e2c
·
verified ·
1 Parent(s): d30664e

NVFP4 quantization of Qwen3-Coder-30B-A3B-Instruct via spark-maker v3

README.md CHANGED
@@ -5,88 +5,60 @@ tags:
5
  - qwen3
6
  - moe
7
  - nvfp4
8
- - 4-bit
9
  - quantized
10
  - nvidia-modelopt
11
- - blackwell
12
- - dgx-spark
13
  - coding
14
- - code-generation
15
- - mixture-of-experts
16
  model_type: qwen3_moe
17
  quantized_by: kleinpanic93
18
  pipeline_tag: text-generation
19
  library_name: transformers
20
- inference: false
21
  ---
22
 
23
- <div align="center">
24
-
25
- # 🧠 Qwen3-Coder-30B-A3B-Instruct — NVFP4
26
-
27
- **4-bit quantization of Qwen's 30B Mixture-of-Experts coding model**
28
- **Optimized for NVIDIA Blackwell (GB10 / GB200 / B200)**
29
 
30
- [![Base Model](https://img.shields.io/badge/Base-Qwen3--Coder--30B--A3B-blue)](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct)
31
- [![Quantization](https://img.shields.io/badge/Quant-NVFP4_(4--bit)-green)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
32
- [![Hardware](https://img.shields.io/badge/Hardware-DGX_Spark_GB10-76b900)](https://www.nvidia.com/en-us/products/workstations/dgx-spark/)
33
- [![License](https://img.shields.io/badge/License-Apache_2.0-orange)](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/LICENSE)
34
-
35
- </div>
36
-
37
- ---
38
 
39
- ## Overview
40
 
41
- NVFP4 post-training quantization of [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct) — a 30 billion parameter Mixture-of-Experts model specialized for code generation, with 3B parameters active per token across 128 experts per layer.
42
-
43
- Quantized on an **NVIDIA DGX Spark** (Blackwell GB10, 128 GB unified memory) using NVIDIA ModelOpt with 512 calibration samples. All MoE routing/gate layers are preserved in original precision to maintain expert selection quality.
44
-
45
- ## Specifications
46
-
47
- <table>
48
- <tr><td><b>Architecture</b></td><td>Qwen3MoeForCausalLM Mixture-of-Experts</td></tr>
49
- <tr><td><b>Parameters</b></td><td>30B total · 3B active per token</td></tr>
50
- <tr><td><b>Experts</b></td><td>128 per layer · 48 layers</td></tr>
51
- <tr><td><b>Quantization</b></td><td>NVFP4 4-bit NVIDIA floating point (weights + activations)</td></tr>
52
- <tr><td><b>KV Cache</b></td><td>FP8 quantized</td></tr>
53
- <tr><td><b>Block Size</b></td><td>16 (weights and activations)</td></tr>
54
- <tr><td><b>Preserved Layers</b></td><td><code>lm_head</code> + 48 MoE gate/router layers (full precision)</td></tr>
55
- <tr><td><b>Original Precision</b></td><td>BF16</td></tr>
56
- <tr><td><b>Model Size</b></td><td>~57 GB</td></tr>
57
- <tr><td><b>Context Length</b></td><td>Up to 131,072 tokens</td></tr>
58
- </table>
59
 
60
  ## Quantization Details
61
 
62
- | | |
63
- |---|---|
64
- | **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
65
- | **Config** | `NVFP4_DEFAULT_CFG` |
66
- | **Calibration** | 512 samples (synthetic) |
67
- | **Export** | `save_pretrained` + manual `quantization_config` injection |
68
- | **Quantization Time** | 7 minutes on DGX Spark GB10 |
69
- | **Hardware** | NVIDIA DGX Spark — Blackwell GB10, 128 GB unified memory |
70
-
71
- > **Note on export method:** ModelOpt 0.41.0's native HF checkpoint exporter does not yet support `Qwen3MoeExperts` in its allowlist (only `Llama4TextExperts` and `GptOssExperts`). The quantization math is identical — calibration and weight conversion run through ModelOpt's standard NVFP4 pipeline. The checkpoint is serialized via HuggingFace `save_pretrained()` with a manually constructed `quantization_config` matching NVIDIA's schema. NVIDIA's own pre-quantized MoE models (e.g., Qwen3-Next-80B-A3B-NVFP4) use an internal dev build of ModelOpt that includes this export path.
72
 
73
  ## Usage
74
 
75
- ### vLLM (Recommended)
76
 
77
  ```bash
78
  vllm serve kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4 \
79
  --quantization modelopt \
80
  --trust-remote-code \
81
- --max-model-len 32768 \
82
- --gpu-memory-utilization 0.90
83
  ```
84
 
85
- ### Transformers
86
 
87
  ```python
88
  from transformers import AutoModelForCausalLM, AutoTokenizer
89
- import torch
90
 
91
  model = AutoModelForCausalLM.from_pretrained(
92
  "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4",
@@ -96,77 +68,40 @@ model = AutoModelForCausalLM.from_pretrained(
96
  tokenizer = AutoTokenizer.from_pretrained(
97
  "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4"
98
  )
99
-
100
- messages = [{"role": "user", "content": "Write a Python async web scraper with rate limiting."}]
101
- text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
102
- inputs = tokenizer(text, return_tensors="pt").to(model.device)
103
-
104
- with torch.no_grad():
105
- output = model.generate(**inputs, max_new_tokens=2048, temperature=0.7)
106
-
107
- print(tokenizer.decode(output[0], skip_special_tokens=True))
108
- ```
109
-
110
- ### OpenAI-Compatible API (via vLLM)
111
-
112
- ```python
113
- from openai import OpenAI
114
-
115
- client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
116
- response = client.chat.completions.create(
117
- model="kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4",
118
- messages=[{"role": "user", "content": "Implement a B-tree in Rust."}],
119
- max_tokens=4096,
120
- )
121
- print(response.choices[0].message.content)
122
  ```
123
 
124
  ## Hardware Requirements
125
 
126
- | Configuration | Status |
127
- |--------------|--------|
128
- | NVIDIA DGX Spark (GB10, 128 GB UMA) | ✅ Tested — primary target |
129
- | NVIDIA GB200 / B200 (Blackwell HBM) | ✅ Should work |
130
- | NVIDIA H100 / A100 (80 GB HBM) | ⚠️ Tight — 57 GB model + KV cache may exceed 80 GB at long contexts |
131
- | NVIDIA L40S (48 GB) | ❌ Insufficient VRAM |
132
 
133
  ## Provenance
134
 
135
- This model was quantized using [spark-maker](https://github.com/kleinpanic), a quantization and fine-tuning toolkit for the NVIDIA DGX Spark platform.
136
-
137
  ```json
138
  {
139
  "source_model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
140
  "quantization": "NVFP4",
141
  "tool": "nvidia-modelopt 0.41.0",
142
  "export_method": "save_pretrained_manual",
143
- "calibration_samples": 512,
144
- "calibration_dataset": "synthetic-random",
145
- "hardware": "NVIDIA DGX Spark GB10 (Blackwell)",
146
- "quantization_time_seconds": 472
147
  }
148
  ```
149
 
150
  ## Limitations
151
 
152
- - **Synthetic calibration data**: Quantized with random token sequences because the container runs in offline mode. Real calibration data (C4, RedPajama, The Stack) would improve quantization quality, particularly for code-heavy workloads. Re-quantization with domain-specific calibration data is recommended for production use.
153
- - **Export path**: Uses `save_pretrained` serialization rather than ModelOpt's native checkpoint exporter. Functionally equivalent, but compatibility with all NVFP4-aware inference backends should be verified.
154
- - **MoE routing preserved**: All 48 gate/router layers remain in original BF16 precision by design — quantizing these would degrade expert selection quality.
155
 
156
- ## Citation
157
 
158
- ```bibtex
159
- @misc{kleinpanic2026qwen3codernvfp4,
160
- title={Qwen3-Coder-30B-A3B-Instruct-NVFP4},
161
- author={kleinpanic},
162
- year={2026},
163
- url={https://huggingface.co/kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4},
164
- note={NVFP4 quantization via NVIDIA ModelOpt on DGX Spark GB10}
165
- }
166
- ```
167
 
168
  ## Acknowledgments
169
 
170
- - **[Qwen Team](https://huggingface.co/Qwen)** Base model architecture and training
171
- - **[NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** Quantization toolkit
172
- - **[NVIDIA DGX Spark](https://www.nvidia.com/en-us/products/workstations/dgx-spark/)** — Hardware platform
 
5
  - qwen3
6
  - moe
7
  - nvfp4
 
8
  - quantized
9
  - nvidia-modelopt
 
 
10
  - coding
11
+ - dgx-spark
 
12
  model_type: qwen3_moe
13
  quantized_by: kleinpanic93
14
  pipeline_tag: text-generation
15
  library_name: transformers
 
16
  ---
17
 
18
+ # Qwen3-Coder-30B-A3B-Instruct-NVFP4
 
19
 
20
+ NVFP4 (4-bit floating point) quantization of [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct), optimized for NVIDIA Blackwell GPUs.
 
21
 
22
+ ## Model Details
23
 
24
+ | Property | Value |
25
+ |----------|-------|
26
+ | **Base Model** | Qwen/Qwen3-Coder-30B-A3B-Instruct |
27
+ | **Architecture** | Qwen3MoeForCausalLM (Mixture-of-Experts) |
28
+ | **Total Parameters** | 30B (3B active per token) |
29
+ | **Experts** | 128 per layer |
30
+ | **Quantization** | NVFP4 (4-bit NV floating point) |
31
+ | **KV Cache** | FP8 (8-bit float) |
32
+ | **Original Precision** | BF16 |
33
+ | **Quantized Size** | ~57 GB |
34
+ | **Quantization Tool** | NVIDIA ModelOpt 0.41.0 |
35
+ | **Calibration** | 512 samples (synthetic) |
36
+ | **Hardware** | NVIDIA DGX Spark GB10 (Blackwell) |
 
37
 
38
  ## Quantization Details
39
 
40
+ - **Method:** Post-training quantization via `nvidia-modelopt` with `NVFP4_DEFAULT_CFG`
41
+ - **Weights:** 4-bit NV floating point, group size 16
42
+ - **Activations:** 4-bit NV floating point, group size 16
43
+ - **KV Cache:** FP8 quantized for reduced memory during inference
44
+ - **Excluded layers:** `lm_head` and all MoE router/gate layers (48 total) — these remain in original precision to preserve routing quality
45
+ - **Export method:** HuggingFace `save_pretrained` with manual `quantization_config` injection (ModelOpt 0.41.0 native export does not yet support `Qwen3MoeExperts`)
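The block scheme in these bullets can be sketched in a few lines. This is a toy illustration only, assuming the published NVFP4 layout (FP4 E2M1 magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} with one shared scale per 16-element group); it is not ModelOpt's actual kernel.

```python
# Toy NVFP4-style block quantization: each group of 16 values shares
# one scale chosen so the group's max magnitude maps to 6 (the largest
# FP4 E2M1 value), then every element rounds to the nearest FP4 point.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one group of 16 floats to FP4 magnitudes + a shared scale."""
    assert len(block) == 16
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    quantized = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        quantized.append(mag if x >= 0 else -mag)
    return quantized, scale

def dequantize_block(quantized, scale):
    return [q * scale for q in quantized]

block = [0.1 * i for i in range(16)]   # 0.0 .. 1.5
q, s = quantize_block(block)
restored = dequantize_block(q, s)
```

Deployed NVFP4 additionally stores the group scales themselves in FP8 with a per-tensor scale; the sketch keeps plain floats to show only the rounding behaviour.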
 
 
47
  ## Usage
48
 
49
+ ### With vLLM (Recommended)
50
 
51
  ```bash
52
  vllm serve kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4 \
53
  --quantization modelopt \
54
  --trust-remote-code \
55
+ --max-model-len 32768
 
56
  ```
57
 
58
+ ### With Transformers
59
 
60
  ```python
61
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
62
 
63
  model = AutoModelForCausalLM.from_pretrained(
64
  "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4",
 
68
  tokenizer = AutoTokenizer.from_pretrained(
69
  "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4"
70
  )

messages = [{"role": "user", "content": "Write a Python async web scraper with rate limiting."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=2048, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
71
  ```
72
 
73
  ## Hardware Requirements
74
 
75
+ - **Minimum VRAM:** ~57 GB for the weights (unified or dedicated memory), plus KV-cache headroom at long contexts
76
+ - **Tested on:** NVIDIA DGX Spark (GB10, 128 GB unified memory)
77
+ - **Recommended:** NVIDIA Blackwell GPUs (GB10, GB200, B200)
 
 
 
78
 
79
  ## Provenance
80
 
 
 
81
  ```json
82
  {
83
  "source_model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
84
  "quantization": "NVFP4",
85
  "tool": "nvidia-modelopt 0.41.0",
86
  "export_method": "save_pretrained_manual",
87
+ "calib_size": 512,
88
+ "calib_dataset": "synthetic-random",
89
+ "hardware": "NVIDIA GB10 (Blackwell)",
90
+ "elapsed_sec": 472
91
  }
92
  ```
93
 
94
  ## Limitations
95
 
96
+ - This quantization uses **synthetic calibration data** (random tokens) because the container runs in offline mode. Production-grade quantization with real calibration data (e.g., C4, RedPajama) may yield slightly better quality.
97
+ - The export uses the `save_pretrained` fallback rather than ModelOpt's native HF checkpoint exporter, since `Qwen3MoeExperts` is not yet in ModelOpt 0.41.0's export allowlist. The quantization math is identical; only the serialization path differs.
98
+ - MoE gate/router layers are preserved in original precision by design.
99
 
100
+ ## License
101
 
102
+ This model inherits the [Apache 2.0 license](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/LICENSE) from the base Qwen3-Coder-30B-A3B-Instruct model.
 
103
 
104
  ## Acknowledgments
105
 
106
+ - [Qwen Team](https://huggingface.co/Qwen) for the base model
107
+ - [NVIDIA](https://github.com/NVIDIA/TensorRT-Model-Optimizer) for the ModelOpt quantization toolkit
 
chat_template.jinja ADDED
@@ -0,0 +1,117 @@
1
+ {% macro render_extra_keys(json_dict, handled_keys) %}
2
+ {%- if json_dict is mapping %}
3
+ {%- for json_key in json_dict if json_key not in handled_keys %}
4
+ {%- if json_dict[json_key] is mapping or (json_dict[json_key] is sequence and json_dict[json_key] is not string) %}
5
+ {{- '\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | tojson | safe) ~ '</' ~ json_key ~ '>' }}
6
+ {%- else %}
7
+ {{-'\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | string) ~ '</' ~ json_key ~ '>' }}
8
+ {%- endif %}
9
+ {%- endfor %}
10
+ {%- endif %}
11
+ {% endmacro %}
12
+
13
+ {%- if messages[0]["role"] == "system" %}
14
+ {%- set system_message = messages[0]["content"] %}
15
+ {%- set loop_messages = messages[1:] %}
16
+ {%- else %}
17
+ {%- set loop_messages = messages %}
18
+ {%- endif %}
19
+
20
+ {%- if not tools is defined %}
21
+ {%- set tools = [] %}
22
+ {%- endif %}
23
+
24
+ {%- if system_message is defined %}
25
+ {{- "<|im_start|>system\n" + system_message }}
26
+ {%- else %}
27
+ {%- if tools is iterable and tools | length > 0 %}
28
+ {{- "<|im_start|>system\nYou are Qwen, a helpful AI assistant that can interact with a computer to solve tasks." }}
29
+ {%- endif %}
30
+ {%- endif %}
31
+ {%- if tools is iterable and tools | length > 0 %}
32
+ {{- "\n\n# Tools\n\nYou have access to the following functions:\n\n" }}
33
+ {{- "<tools>" }}
34
+ {%- for tool in tools %}
35
+ {%- if tool.function is defined %}
36
+ {%- set tool = tool.function %}
37
+ {%- endif %}
38
+ {{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
39
+ {%- if tool.description is defined %}
40
+ {{- '\n<description>' ~ (tool.description | trim) ~ '</description>' }}
41
+ {%- endif %}
42
+ {{- '\n<parameters>' }}
43
+ {%- if tool.parameters is defined and tool.parameters is mapping and tool.parameters.properties is defined and tool.parameters.properties is mapping %}
44
+ {%- for param_name, param_fields in tool.parameters.properties|items %}
45
+ {{- '\n<parameter>' }}
46
+ {{- '\n<name>' ~ param_name ~ '</name>' }}
47
+ {%- if param_fields.type is defined %}
48
+ {{- '\n<type>' ~ (param_fields.type | string) ~ '</type>' }}
49
+ {%- endif %}
50
+ {%- if param_fields.description is defined %}
51
+ {{- '\n<description>' ~ (param_fields.description | trim) ~ '</description>' }}
52
+ {%- endif %}
53
+ {%- set handled_keys = ['name', 'type', 'description'] %}
54
+ {{- render_extra_keys(param_fields, handled_keys) }}
55
+ {{- '\n</parameter>' }}
56
+ {%- endfor %}
57
+ {%- endif %}
58
+ {% set handled_keys = ['type', 'properties'] %}
59
+ {{- render_extra_keys(tool.parameters, handled_keys) }}
60
+ {{- '\n</parameters>' }}
61
+ {%- set handled_keys = ['type', 'name', 'description', 'parameters'] %}
62
+ {{- render_extra_keys(tool, handled_keys) }}
63
+ {{- '\n</function>' }}
64
+ {%- endfor %}
65
+ {{- "\n</tools>" }}
66
+ {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
67
+ {%- endif %}
68
+ {%- if system_message is defined %}
69
+ {{- '<|im_end|>\n' }}
70
+ {%- else %}
71
+ {%- if tools is iterable and tools | length > 0 %}
72
+ {{- '<|im_end|>\n' }}
73
+ {%- endif %}
74
+ {%- endif %}
75
+ {%- for message in loop_messages %}
76
+ {%- if message.role == "assistant" and message.tool_calls is defined and message.tool_calls is iterable and message.tool_calls | length > 0 %}
77
+ {{- '<|im_start|>' + message.role }}
78
+ {%- if message.content is defined and message.content is string and message.content | trim | length > 0 %}
79
+ {{- '\n' + message.content | trim + '\n' }}
80
+ {%- endif %}
81
+ {%- for tool_call in message.tool_calls %}
82
+ {%- if tool_call.function is defined %}
83
+ {%- set tool_call = tool_call.function %}
84
+ {%- endif %}
85
+ {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
86
+ {%- if tool_call.arguments is defined %}
87
+ {%- for args_name, args_value in tool_call.arguments|items %}
88
+ {{- '<parameter=' + args_name + '>\n' }}
89
+ {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
90
+ {{- args_value }}
91
+ {{- '\n</parameter>\n' }}
92
+ {%- endfor %}
93
+ {%- endif %}
94
+ {{- '</function>\n</tool_call>' }}
95
+ {%- endfor %}
96
+ {{- '<|im_end|>\n' }}
97
+ {%- elif message.role == "user" or message.role == "system" or message.role == "assistant" %}
98
+ {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
99
+ {%- elif message.role == "tool" %}
100
+ {%- if loop.previtem and loop.previtem.role != "tool" %}
101
+ {{- '<|im_start|>user\n' }}
102
+ {%- endif %}
103
+ {{- '<tool_response>\n' }}
104
+ {{- message.content }}
105
+ {{- '\n</tool_response>\n' }}
106
+ {%- if not loop.last and loop.nextitem.role != "tool" %}
107
+ {{- '<|im_end|>\n' }}
108
+ {%- elif loop.last %}
109
+ {{- '<|im_end|>\n' }}
110
+ {%- endif %}
111
+ {%- else %}
112
+ {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
113
+ {%- endif %}
114
+ {%- endfor %}
115
+ {%- if add_generation_prompt %}
116
+ {{- '<|im_start|>assistant\n' }}
117
+ {%- endif %}
config.json ADDED
@@ -0,0 +1,124 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen3MoeForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 151643,
8
+ "decoder_sparse_step": 1,
9
+ "dtype": "bfloat16",
10
+ "eos_token_id": 151645,
11
+ "head_dim": 128,
12
+ "hidden_act": "silu",
13
+ "hidden_size": 2048,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 6144,
16
+ "max_position_embeddings": 262144,
17
+ "max_window_layers": 48,
18
+ "mlp_only_layers": [],
19
+ "model_type": "qwen3_moe",
20
+ "moe_intermediate_size": 768,
21
+ "norm_topk_prob": true,
22
+ "num_attention_heads": 32,
23
+ "num_experts_per_tok": 8,
24
+ "num_hidden_layers": 48,
25
+ "num_key_value_heads": 4,
26
+ "num_local_experts": 128,
27
+ "output_router_logits": false,
28
+ "pad_token_id": null,
29
+ "rms_norm_eps": 1e-06,
30
+ "rope_parameters": {
31
+ "rope_theta": 10000000,
32
+ "rope_type": "default"
33
+ },
34
+ "router_aux_loss_coef": 0.001,
35
+ "sliding_window": null,
36
+ "tie_word_embeddings": false,
37
+ "transformers_version": "5.2.0",
38
+ "use_cache": true,
39
+ "use_sliding_window": false,
40
+ "vocab_size": 151936,
41
+ "quantization_config": {
42
+ "config_groups": {
43
+ "group_0": {
44
+ "input_activations": {
45
+ "dynamic": false,
46
+ "num_bits": 4,
47
+ "type": "float",
48
+ "group_size": 16
49
+ },
50
+ "weights": {
51
+ "dynamic": false,
52
+ "num_bits": 4,
53
+ "type": "float",
54
+ "group_size": 16
55
+ },
56
+ "targets": [
57
+ "Linear"
58
+ ]
59
+ }
60
+ },
61
+ "ignore": [
62
+ "lm_head",
63
+ "model.layers.0.mlp.gate",
64
+ "model.layers.1.mlp.gate",
65
+ "model.layers.10.mlp.gate",
66
+ "model.layers.11.mlp.gate",
67
+ "model.layers.12.mlp.gate",
68
+ "model.layers.13.mlp.gate",
69
+ "model.layers.14.mlp.gate",
70
+ "model.layers.15.mlp.gate",
71
+ "model.layers.16.mlp.gate",
72
+ "model.layers.17.mlp.gate",
73
+ "model.layers.18.mlp.gate",
74
+ "model.layers.19.mlp.gate",
75
+ "model.layers.2.mlp.gate",
76
+ "model.layers.20.mlp.gate",
77
+ "model.layers.21.mlp.gate",
78
+ "model.layers.22.mlp.gate",
79
+ "model.layers.23.mlp.gate",
80
+ "model.layers.24.mlp.gate",
81
+ "model.layers.25.mlp.gate",
82
+ "model.layers.26.mlp.gate",
83
+ "model.layers.27.mlp.gate",
84
+ "model.layers.28.mlp.gate",
85
+ "model.layers.29.mlp.gate",
86
+ "model.layers.3.mlp.gate",
87
+ "model.layers.30.mlp.gate",
88
+ "model.layers.31.mlp.gate",
89
+ "model.layers.32.mlp.gate",
90
+ "model.layers.33.mlp.gate",
91
+ "model.layers.34.mlp.gate",
92
+ "model.layers.35.mlp.gate",
93
+ "model.layers.36.mlp.gate",
94
+ "model.layers.37.mlp.gate",
95
+ "model.layers.38.mlp.gate",
96
+ "model.layers.39.mlp.gate",
97
+ "model.layers.4.mlp.gate",
98
+ "model.layers.40.mlp.gate",
99
+ "model.layers.41.mlp.gate",
100
+ "model.layers.42.mlp.gate",
101
+ "model.layers.43.mlp.gate",
102
+ "model.layers.44.mlp.gate",
103
+ "model.layers.45.mlp.gate",
104
+ "model.layers.46.mlp.gate",
105
+ "model.layers.47.mlp.gate",
106
+ "model.layers.5.mlp.gate",
107
+ "model.layers.6.mlp.gate",
108
+ "model.layers.7.mlp.gate",
109
+ "model.layers.8.mlp.gate",
110
+ "model.layers.9.mlp.gate"
111
+ ],
112
+ "quant_algo": "NVFP4",
113
+ "kv_cache_scheme": {
114
+ "dynamic": false,
115
+ "num_bits": 8,
116
+ "type": "float"
117
+ },
118
+ "producer": {
119
+ "name": "modelopt",
120
+ "version": "spark-maker-v3"
121
+ },
122
+ "quant_method": "modelopt"
123
+ }
124
+ }
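The `ignore` list in the `quantization_config` above follows a mechanical pattern: `lm_head` plus the `mlp.gate` router module of each of the 48 decoder layers. A short sketch reconstructs it:

```python
# Rebuild config.json's "ignore" list: lm_head plus the MoE router
# ("mlp.gate") of every decoder layer.
num_hidden_layers = 48  # "num_hidden_layers" in config.json

ignore = ["lm_head"] + [
    f"model.layers.{i}.mlp.gate" for i in range(num_hidden_layers)
]
print(len(ignore))  # 49 modules left in original precision
```

config.json lists the same 49 entries in lexicographic order; the set of excluded modules is identical.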
generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "do_sample": true,
3
+ "eos_token_id": [
4
+ 151645,
5
+ 151643
6
+ ],
7
+ "pad_token_id": 151643,
8
+ "repetition_penalty": 1.05,
9
+ "temperature": 0.7,
10
+ "top_k": 20,
11
+ "top_p": 0.8,
12
+ "transformers_version": "5.2.0"
13
+ }
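The sampling defaults above interact: logits are divided by `temperature` (0.7), the `top_k` (20) most probable tokens are kept, and then only the smallest nucleus reaching `top_p` (0.8) cumulative probability survives. A standalone sketch of that filtering (illustrative only, not the transformers implementation):

```python
import math

def filter_logits(logits, top_k=20, top_p=0.8, temperature=0.7):
    """Return the token ids that survive top-k then top-p (nucleus) filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                  # stable softmax
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top-k: keep the k most probable token ids
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability >= top_p
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

# A toy 6-token vocabulary with one dominant logit: the nucleus
# collapses to a single candidate.
print(filter_logits([5.0, 1.0, 1.0, 0.5, 0.2, 0.1]))
```

A sharply peaked distribution collapses to one surviving candidate, while a flat one keeps many; that is the intended effect of the 0.8 nucleus cutoff.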
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0a18eec4121f8a8b2995c10d70a143156a44c0ef3e8492278c00a4cfd83faa2c
3
+ size 4983536328
model-00002-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:12bc6f12aee5c3d3e687e08a2a1db718a65a9748bf453d7a0c6e7504faafd9e3
3
+ size 4985162232
model-00003-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9952f65e81af77e1a32ae9c7b6b3bf2636c655177d66a8b9826ad90aace63375
3
+ size 4985162624
model-00004-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5d5582d3e191ef8b7218f13096f758d5d5f648471564556e7bc2cb1bd9522bf2
3
+ size 4985163840
model-00005-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:90ce43163ec9f20758fc580c1341c4c33fd4881f5727e28ce44e8687087abb48
3
+ size 4985163840
model-00006-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b3868292be9a82e9981bf9f135af80393bf266eecfc02aadc95ad61481053b94
3
+ size 4985163840
model-00007-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:37f09a9120da23b618182c14559a9938efeca6cf7e30f7005b362aed7de32a50
3
+ size 4985163840
model-00008-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2a86027d8cf9cca275245be7a50ac9737af34d393fe88706fb423b3bc0cc78a7
3
+ size 4985163840
model-00009-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f36d603c3c2331ad6734af98dae4acf0482a440e37dd46f20ebcd46c8b35b822
3
+ size 4985163840
model-00010-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a20cbb84a72429236a85764af1c507f454cf867a32f3a24e2fb58ef4dfa3fff6
3
+ size 4985163840
model-00011-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eacf05ef40177d0a204c65bb1c2f2a05c33640fc10a12ee8e167663df1b67a6c
3
+ size 4985163840
model-00012-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dcdb40014518a76790006917c070a371280a6440dd2165911eac25b0377f2d4e
3
+ size 4985163840
model-00013-of-00013.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:64ae03adf7887f50a8358c85eba191f42005bf577faecccbd376dfff88d1e98f
3
+ size 1246290440
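The thirteen LFS pointers above determine the on-disk footprint; summing their `size` fields reproduces the ~57 GB figure quoted in the model card:

```python
# Shard sizes (bytes) copied from the safetensors LFS pointers above.
shard_sizes = [
    4983536328, 4985162232, 4985162624,
    *[4985163840] * 9,   # shards 4-12 are identical in size
    1246290440,          # final shard
]

total = sum(shard_sizes)
print(f"{total} bytes = {total / 2**30:.1f} GiB")
```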
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
spark_quantizer_provenance.json ADDED
@@ -0,0 +1,11 @@
1
+ {
2
+ "source_model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
3
+ "quantization": "NVFP4",
4
+ "tool": "nvidia-modelopt",
5
+ "export_method": "save_pretrained_manual",
6
+ "calib_size": 512,
7
+ "calib_dataset": "synthetic-random",
8
+ "hardware": "NVIDIA GB10 (Blackwell)",
9
+ "offload_used": true,
10
+ "elapsed_sec": 472
11
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "backend": "tokenizers",
4
+ "bos_token": null,
5
+ "clean_up_tokenization_spaces": false,
6
+ "eos_token": "<|im_end|>",
7
+ "errors": "replace",
8
+ "extra_special_tokens": [
9
+ "<|im_start|>",
10
+ "<|im_end|>",
11
+ "<|object_ref_start|>",
12
+ "<|object_ref_end|>",
13
+ "<|box_start|>",
14
+ "<|box_end|>",
15
+ "<|quad_start|>",
16
+ "<|quad_end|>",
17
+ "<|vision_start|>",
18
+ "<|vision_end|>",
19
+ "<|vision_pad|>",
20
+ "<|image_pad|>",
21
+ "<|video_pad|>"
22
+ ],
23
+ "is_local": false,
24
+ "model_max_length": 1048576,
25
+ "pad_token": "<|endoftext|>",
26
+ "split_special_tokens": false,
27
+ "tokenizer_class": "Qwen2Tokenizer",
28
+ "unk_token": null
29
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff