festr2 committed on
Commit
9ccaedd
·
verified ·
1 Parent(s): 72ed061

Add files using upload-large-folder tool

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full change set.

Files changed (50)
  1. .gitattributes +2 -0
  2. README.md +96 -0
  3. chat_template.jinja +86 -0
  4. config.json +276 -0
  5. generation_config.json +74 -0
  6. hf_quant_config.json +121 -0
  7. input_scales.safetensors +3 -0
  8. model-00003-of-00082.safetensors +3 -0
  9. model-00005-of-00082.safetensors +3 -0
  10. model-00010-of-00082.safetensors +3 -0
  11. model-00011-of-00082.safetensors +3 -0
  12. model-00012-of-00082.safetensors +3 -0
  13. model-00014-of-00082.safetensors +3 -0
  14. model-00015-of-00082.safetensors +3 -0
  15. model-00016-of-00082.safetensors +3 -0
  16. model-00017-of-00082.safetensors +3 -0
  17. model-00020-of-00082.safetensors +3 -0
  18. model-00022-of-00082.safetensors +3 -0
  19. model-00023-of-00082.safetensors +3 -0
  20. model-00024-of-00082.safetensors +3 -0
  21. model-00026-of-00082.safetensors +3 -0
  22. model-00027-of-00082.safetensors +3 -0
  23. model-00028-of-00082.safetensors +3 -0
  24. model-00030-of-00082.safetensors +3 -0
  25. model-00031-of-00082.safetensors +3 -0
  26. model-00033-of-00082.safetensors +3 -0
  27. model-00035-of-00082.safetensors +3 -0
  28. model-00042-of-00082.safetensors +3 -0
  29. model-00043-of-00082.safetensors +3 -0
  30. model-00045-of-00082.safetensors +3 -0
  31. model-00046-of-00082.safetensors +3 -0
  32. model-00048-of-00082.safetensors +3 -0
  33. model-00049-of-00082.safetensors +3 -0
  34. model-00050-of-00082.safetensors +3 -0
  35. model-00058-of-00082.safetensors +3 -0
  36. model-00059-of-00082.safetensors +3 -0
  37. model-00062-of-00082.safetensors +3 -0
  38. model-00065-of-00082.safetensors +3 -0
  39. model-00066-of-00082.safetensors +3 -0
  40. model-00067-of-00082.safetensors +3 -0
  41. model-00068-of-00082.safetensors +3 -0
  42. model-00070-of-00082.safetensors +3 -0
  43. model-00074-of-00082.safetensors +3 -0
  44. model-00077-of-00082.safetensors +3 -0
  45. model-00079-of-00082.safetensors +3 -0
  46. model-00080-of-00082.safetensors +3 -0
  47. model-00081-of-00082.safetensors +3 -0
  48. model-00082-of-00082.safetensors +3 -0
  49. model.safetensors.index.json +3 -0
  50. tokenizer.json +3 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,96 @@
---
base_model:
- zai-org/GLM-5
license: mit
tags:
- mtp
- speculative-decoding
---

## Model Description

**GLM-5-NVFP4-MTP** is an NVFP4-quantized version of [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5) with **Multi-Token Prediction (MTP) weights restored**, enabling speculative decoding with vLLM and SGLang.

This is based on [lukealonso/GLM-5-NVFP4](https://huggingface.co/lukealonso/GLM-5-NVFP4) — a 744B-parameter Mixture-of-Experts language model with 40B active parameters, 256 experts per MoE layer (8 activated per token), and DeepSeek Sparse Attention (DSA).

Quantized directly from the full BF16 checkpoint ([zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)), *not the FP8 release*, to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer).

### MTP Layer Addition

The original [lukealonso/GLM-5-NVFP4](https://huggingface.co/lukealonso/GLM-5-NVFP4) declares `"num_nextn_predict_layers": 1` in its config but ships without the MTP layer weights (layer 78). This repo fixes that by extracting the MTP layer directly from the original BF16 model ([zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)).

**What was done:**
- Extracted all 791 tensors for `model.layers.78.*` from the BF16 model (shards 271–274 of 282) and saved them as `mtp.safetensors` (~19 GB, full BF16 precision)
- Updated `model.safetensors.index.json` to map all 791 layer-78 tensors to `mtp.safetensors`
- Added `model.layers.78.*` glob patterns to `quantization_config.ignore` in both `config.json` and `hf_quant_config.json` so the NVFP4 dequantizer skips the MTP layer
- All other weights (layers 0–77, embeddings, lm_head) remain unchanged from the original NVFP4 quantization
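The index rewrite step above can be sketched as follows. This is an illustrative helper, not the exact script used for this repo; the toy `index` dict stands in for the real `model.safetensors.index.json`, and the source-shard filename is hypothetical:

```python
import json

def route_mtp_keys(index: dict, mtp_file: str = "mtp.safetensors") -> dict:
    """Point every layer-78 (MTP) tensor at the dedicated BF16 MTP shard."""
    weight_map = dict(index["weight_map"])  # shallow copy, original untouched
    for name in weight_map:
        if name.startswith("model.layers.78."):
            weight_map[name] = mtp_file
    return {**index, "weight_map": weight_map}

# Toy stand-in for model.safetensors.index.json (shard names hypothetical)
index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.layers.77.mlp.experts.0.gate_proj.weight": "model-00077-of-00082.safetensors",
        "model.layers.78.eh_proj.weight": "model-00271-of-00282.safetensors",
        "model.layers.78.enorm.weight": "model-00271-of-00282.safetensors",
    },
}
patched = route_mtp_keys(index)
print(json.dumps(patched["weight_map"], indent=2))
```

Only the layer-78 entries are remapped; every other tensor keeps its original shard assignment.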

The MTP layer contains:
- Special MTP components: `eh_proj`, `enorm`, `hnorm`, `shared_head.norm`
- A complete transformer block: self-attention + MoE MLP (256 experts)

### What's quantized

Only the MoE expert MLP layers (gate, up, and down projections) in layers 0–77 are quantized to NVFP4. Attention layers are kept in BF16, and the MTP layer (layer 78) is entirely BF16. Because expert weights make up the vast majority of parameters in an MoE architecture, this still yields significant memory savings.

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on far more samples than is typical, ensuring broad expert coverage through natural routing alone.
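Rough arithmetic for the NVFP4 footprint (standard NVFP4 layout assumed, ignoring the small per-tensor global scale): 4 bits per value plus one 8-bit FP8 scale shared by each 16-element block.

```python
# NVFP4 storage cost: 4-bit values + one FP8 (8-bit) scale per block of 16 elements
bits_per_weight = 4 + 8 / 16          # 4.5 bits amortized per weight
bf16_bits = 16
compression = bf16_bits / bits_per_weight
print(f"{bits_per_weight} bits/weight, {compression:.2f}x smaller than BF16")
```

So the quantized expert weights take roughly 3.6x less memory than their BF16 originals, which is where most of the savings for this model come from.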

### Calibration dataset

Three calibration passes were run:

1. **Coding pass** — Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts.
2. **Broad pass** — Large-scale diverse samples drawn from WildChat and LMSYS-Chat, covering real user conversations across a wide range of topics and languages.
3. **Deep pass** — Long-context samples (>8K tokens) from coding and diverse sources, exercising deep-sequence expert activation patterns.

The resulting scales were merged via element-wise max across all three calibration runs.
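The element-wise max merge can be sketched like this. For simplicity the example uses one scalar amax per tensor; the real ModelOpt scales are blockwise tensors, where the same idea applies element-wise (e.g. `torch.maximum`). Names and values here are purely illustrative:

```python
def merge_amax(runs: list[dict]) -> dict:
    """Merge per-tensor amax values from several calibration runs by taking the max."""
    merged: dict = {}
    for run in runs:
        for name, amax in run.items():
            merged[name] = max(merged.get(name, float("-inf")), amax)
    return merged

# Hypothetical amax observations from the three calibration passes
coding = {"experts.0.gate_proj": 1.8, "experts.1.gate_proj": 0.9}
broad  = {"experts.0.gate_proj": 1.2, "experts.1.gate_proj": 2.4}
deep   = {"experts.0.gate_proj": 2.1, "experts.1.gate_proj": 1.0}

merged = merge_amax([coding, broad, deep])
print(merged)
```

Taking the max keeps each quantization range wide enough to cover the largest activation any pass observed, so no single pass's distribution clips another's.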

### How to Run

NVFP4 requires Blackwell GPUs (RTX 5090, RTX Pro 6000, B100, B200, etc.). Even quantized, this is a huge model — tested on 8x RTX Pro 6000 Blackwell (96 GB each, 768 GB total).

If you experience NCCL hangs with P2P, make sure you have `iommu=pt` (and `amd_iommu=pt` on AMD platforms) in your kernel command line.
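A quick pre-flight check for that kernel parameter (an illustrative snippet, not part of this repo; it only reads `/proc/cmdline`):

```shell
# Check whether the kernel was booted with iommu=pt
if grep -qw 'iommu=pt' /proc/cmdline 2>/dev/null; then
    IOMMU_STATUS="present"
else
    IOMMU_STATUS="missing"
fi
echo "iommu=pt: ${IOMMU_STATUS}"
```

If it reports `missing`, add the parameter to your bootloader configuration (e.g. the kernel command line in GRUB) and reboot before multi-GPU runs.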

#### SGLang (with MTP speculative decoding)

```bash
export NCCL_IB_DISABLE=1
export NCCL_P2P_LEVEL=PHB
export NCCL_ALLOC_P2P_NET_LL_BUFFERS=1
export NCCL_MIN_NCHANNELS=8
export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1

python3 -m sglang.launch_server \
  --model festr2/GLM-5-NVFP4-MTP \
  --served-model-name glm-5 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --trust-remote-code \
  --tp 8 \
  --mem-fraction-static 0.95 \
  --max-running-requests 8 \
  --kv-cache-dtype fp8_e4m3 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --disable-custom-all-reduce \
  --enable-flashinfer-allreduce-fusion \
  --speculative-algorithm mtp \
  --num-speculative-tokens 1 \
  --host 0.0.0.0 \
  --port 8000
```

#### vLLM (with MTP speculative decoding)

```bash
vllm serve festr2/GLM-5-NVFP4-MTP \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
```

### Credits

- Original model: [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)
- NVFP4 quantization: [lukealonso/GLM-5-NVFP4](https://huggingface.co/lukealonso/GLM-5-NVFP4)
- MTP layer restoration: layer 78 extracted from the original BF16 weights
chat_template.jinja ADDED
@@ -0,0 +1,86 @@
[gMASK]<sop>
{%- if tools -%}
<|system|>
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{% for tool in tools %}
{{ tool | tojson(ensure_ascii=False) }}
{% endfor %}
</tools>

For each function call, output the function name and arguments within the following XML format:
<tool_call>{function-name}<arg_key>{arg-key-1}</arg_key><arg_value>{arg-value-1}</arg_value><arg_key>{arg-key-2}</arg_key><arg_value>{arg-value-2}</arg_value>...</tool_call>{%- endif -%}
{%- macro visible_text(content) -%}
{%- if content is string -%}
{{- content }}
{%- elif content is iterable and content is not mapping -%}
{%- for item in content -%}
{%- if item is mapping and item.type == 'text' -%}
{{- item.text }}
{%- elif item is string -%}
{{- item }}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{- content }}
{%- endif -%}
{%- endmacro -%}
{%- set ns = namespace(last_user_index=-1) %}
{%- for m in messages %}
{%- if m.role == 'user' %}
{% set ns.last_user_index = loop.index0 -%}
{%- endif %}
{%- endfor %}
{% for m in messages %}
{%- if m.role == 'user' -%}<|user|>{{ visible_text(m.content) }}
{%- elif m.role == 'assistant' -%}
<|assistant|>
{%- set reasoning_content = '' %}
{%- set content = visible_text(m.content) %}
{%- if m.reasoning_content is string %}
{%- set reasoning_content = m.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}
{{ '<think>' + reasoning_content.strip() + '</think>'}}
{%- else -%}
{{ '</think>' }}
{%- endif -%}
{%- if content.strip() -%}
{{ content.strip() }}
{%- endif -%}
{% if m.tool_calls %}
{% for tc in m.tool_calls %}
{%- if tc.function %}
{%- set tc = tc.function %}
{%- endif %}
{{- '<tool_call>' + tc.name -}}
{% set _args = tc.arguments %}{% for k, v in _args.items() %}<arg_key>{{ k }}</arg_key><arg_value>{{ v | tojson(ensure_ascii=False) if v is not string else v }}</arg_value>{% endfor %}</tool_call>{% endfor %}
{% endif %}
{%- elif m.role == 'tool' -%}
{%- if m.content is string -%}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|observation|>' }}
{%- endif %}
{{- '<tool_response>' }}
{{- m.content }}
{{- '</tool_response>' }}
{%- else -%}
<|observation|>{% for tr in m.content %}
<tool_response>{{ tr.output if tr.output is defined else tr }}</tool_response>{% endfor -%}
{% endif -%}
{%- elif m.role == 'system' -%}
<|system|>{{ visible_text(m.content) }}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
<|assistant|>{{- '</think>' if (enable_thinking is defined and not enable_thinking) else '<think>' -}}
{%- endif -%}
config.json ADDED
@@ -0,0 +1,276 @@
{
  "vocab_size": 154880,
  "max_position_embeddings": 202752,
  "hidden_size": 6144,
  "intermediate_size": 12288,
  "num_hidden_layers": 78,
  "mlp_layer_types": [
    "dense",
    "dense",
    "dense",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse",
    "sparse"
  ],
  "moe_intermediate_size": 2048,
  "num_attention_heads": 64,
  "n_shared_experts": 1,
  "n_routed_experts": 256,
  "routed_scaling_factor": 2.5,
  "kv_lora_rank": 512,
  "q_lora_rank": 2048,
  "qk_rope_head_dim": 64,
  "v_head_dim": 256,
  "qk_nope_head_dim": 192,
  "qk_head_dim": 256,
  "head_dim": 64,
  "n_group": 1,
  "topk_group": 1,
  "num_experts_per_tok": 8,
  "norm_topk_prob": true,
  "rope_interleave": true,
  "num_key_value_heads": 64,
  "hidden_act": "silu",
  "initializer_range": 0.02,
  "index_topk": 2048,
  "rms_norm_eps": 1e-05,
  "use_cache": true,
  "attention_bias": false,
  "attention_dropout": 0.0,
  "rope_parameters": {
    "rope_theta": 1000000,
    "rope_type": "default"
  },
  "pad_token_id": 154820,
  "bos_token_id": 0,
  "eos_token_id": [
    154820,
    154827,
    154829
  ],
  "tie_word_embeddings": false,
  "return_dict": true,
  "output_hidden_states": false,
  "dtype": "bfloat16",
  "chunk_size_feed_forward": 0,
  "is_encoder_decoder": false,
  "architectures": [
    "GlmMoeDsaForCausalLM"
  ],
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "problem_type": null,
  "_name_or_path": "zai-org/GLM-5",
  "transformers_version": "5.2.0.dev0",
  "ep_size": 1,
  "first_k_dense_replace": 3,
  "index_head_dim": 128,
  "index_n_heads": 32,
  "indexer_rope_interleave": true,
  "moe_layer_freq": 1,
  "model_type": "glm_moe_dsa",
  "num_nextn_predict_layers": 1,
  "pretraining_tp": 1,
  "scoring_func": "sigmoid",
  "topk_method": "noaux_tc",
  "output_attentions": false,
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "input_activations": {
          "dynamic": false,
          "num_bits": 4,
          "type": "float",
          "group_size": 16
        },
        "weights": {
          "dynamic": false,
          "num_bits": 4,
          "type": "float",
          "group_size": 16
        },
        "targets": [
          "Linear"
        ]
      }
    },
    "ignore": [
      "lm_head",
      "model.layers.0.self_attn*",
      "model.layers.1.self_attn*",
      "model.layers.10.self_attn*",
      "model.layers.11.self_attn*",
      "model.layers.12.self_attn*",
      "model.layers.13.self_attn*",
      "model.layers.14.self_attn*",
      "model.layers.15.self_attn*",
      "model.layers.16.self_attn*",
      "model.layers.17.self_attn*",
      "model.layers.18.self_attn*",
      "model.layers.19.self_attn*",
      "model.layers.2.self_attn*",
      "model.layers.20.self_attn*",
      "model.layers.21.self_attn*",
      "model.layers.22.self_attn*",
      "model.layers.23.self_attn*",
      "model.layers.24.self_attn*",
      "model.layers.25.self_attn*",
      "model.layers.26.self_attn*",
      "model.layers.27.self_attn*",
      "model.layers.28.self_attn*",
      "model.layers.29.self_attn*",
      "model.layers.3.self_attn*",
      "model.layers.30.self_attn*",
      "model.layers.31.self_attn*",
      "model.layers.32.self_attn*",
      "model.layers.33.self_attn*",
      "model.layers.34.self_attn*",
      "model.layers.35.self_attn*",
      "model.layers.36.self_attn*",
      "model.layers.37.self_attn*",
      "model.layers.38.self_attn*",
      "model.layers.39.self_attn*",
      "model.layers.4.self_attn*",
      "model.layers.40.self_attn*",
      "model.layers.41.self_attn*",
      "model.layers.42.self_attn*",
      "model.layers.43.self_attn*",
      "model.layers.44.self_attn*",
      "model.layers.45.self_attn*",
      "model.layers.46.self_attn*",
      "model.layers.47.self_attn*",
      "model.layers.48.self_attn*",
      "model.layers.49.self_attn*",
      "model.layers.5.self_attn*",
      "model.layers.50.self_attn*",
      "model.layers.51.self_attn*",
      "model.layers.52.self_attn*",
      "model.layers.53.self_attn*",
      "model.layers.54.self_attn*",
      "model.layers.55.self_attn*",
      "model.layers.56.self_attn*",
      "model.layers.57.self_attn*",
      "model.layers.58.self_attn*",
      "model.layers.59.self_attn*",
      "model.layers.6.self_attn*",
      "model.layers.60.self_attn*",
      "model.layers.61.self_attn*",
      "model.layers.62.self_attn*",
      "model.layers.63.self_attn*",
      "model.layers.64.self_attn*",
      "model.layers.65.self_attn*",
      "model.layers.66.self_attn*",
      "model.layers.67.self_attn*",
      "model.layers.68.self_attn*",
      "model.layers.69.self_attn*",
      "model.layers.7.self_attn*",
      "model.layers.70.self_attn*",
      "model.layers.71.self_attn*",
      "model.layers.72.self_attn*",
      "model.layers.73.self_attn*",
      "model.layers.74.self_attn*",
      "model.layers.75.self_attn*",
      "model.layers.76.self_attn*",
      "model.layers.77.self_attn*",
      "model.layers.78.eh_proj*",
      "model.layers.78.enorm*",
      "model.layers.78.hnorm*",
      "model.layers.78.input_layernorm*",
      "model.layers.78.mlp*",
      "model.layers.78.post_attention_layernorm*",
      "model.layers.78.self_attn*",
      "model.layers.78.shared_head*",
      "model.layers.8.self_attn*",
      "model.layers.9.self_attn*"
    ],
    "quant_algo": "NVFP4",
    "kv_cache_scheme": {
      "dynamic": false,
      "num_bits": 8,
      "type": "float"
    },
    "producer": {
      "name": "modelopt",
      "version": "0.39.0.dev290+gf9d9a71de.d20260214"
    },
    "quant_method": "modelopt"
  }
}
generation_config.json ADDED
@@ -0,0 +1,74 @@
{
  "max_length": null,
  "max_new_tokens": null,
  "min_length": null,
  "min_new_tokens": null,
  "early_stopping": null,
  "max_time": null,
  "stop_strings": null,
  "do_sample": null,
  "num_beams": null,
  "use_cache": true,
  "cache_implementation": null,
  "cache_config": null,
  "temperature": null,
  "top_k": null,
  "top_p": null,
  "min_p": null,
  "top_h": null,
  "typical_p": null,
  "epsilon_cutoff": null,
  "eta_cutoff": null,
  "repetition_penalty": null,
  "encoder_repetition_penalty": null,
  "length_penalty": null,
  "no_repeat_ngram_size": null,
  "bad_words_ids": null,
  "renormalize_logits": null,
  "forced_bos_token_id": null,
  "forced_eos_token_id": null,
  "remove_invalid_values": null,
  "exponential_decay_length_penalty": null,
  "suppress_tokens": null,
  "begin_suppress_tokens": null,
  "sequence_bias": null,
  "token_healing": null,
  "guidance_scale": null,
  "watermarking_config": null,
  "num_return_sequences": null,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_scores": null,
  "output_logits": null,
  "return_dict_in_generate": null,
  "pad_token_id": 154820,
  "bos_token_id": 0,
  "eos_token_id": [
    154820,
    154827,
    154829
  ],
  "encoder_no_repeat_ngram_size": null,
  "decoder_start_token_id": null,
  "is_assistant": null,
  "num_assistant_tokens": null,
  "num_assistant_tokens_schedule": null,
  "assistant_confidence_threshold": null,
  "prompt_lookup_num_tokens": null,
  "max_matching_ngram_size": null,
  "assistant_early_exit": null,
  "assistant_lookbehind": null,
  "target_lookbehind": null,
  "compile_config": null,
  "disable_compile": null,
  "low_memory": null,
  "penalty_alpha": null,
  "dola_layers": null,
  "diversity_penalty": null,
  "num_beam_groups": null,
  "constraints": null,
  "force_words_ids": null,
  "prefill_chunk_size": null,
  "_from_model_config": true,
  "transformers_version": "5.2.0.dev0"
}
hf_quant_config.json ADDED
@@ -0,0 +1,121 @@
{
  "config_groups": {
    "group_0": {
      "input_activations": {
        "dynamic": false,
        "num_bits": 4,
        "type": "float",
        "group_size": 16
      },
      "weights": {
        "dynamic": false,
        "num_bits": 4,
        "type": "float",
        "group_size": 16
      },
      "targets": [
        "Linear"
      ]
    }
  },
  "ignore": [
    "lm_head",
    "model.layers.0.self_attn*",
    "model.layers.1.self_attn*",
    "model.layers.10.self_attn*",
    "model.layers.11.self_attn*",
    "model.layers.12.self_attn*",
    "model.layers.13.self_attn*",
    "model.layers.14.self_attn*",
    "model.layers.15.self_attn*",
    "model.layers.16.self_attn*",
    "model.layers.17.self_attn*",
    "model.layers.18.self_attn*",
    "model.layers.19.self_attn*",
    "model.layers.2.self_attn*",
    "model.layers.20.self_attn*",
    "model.layers.21.self_attn*",
    "model.layers.22.self_attn*",
    "model.layers.23.self_attn*",
    "model.layers.24.self_attn*",
    "model.layers.25.self_attn*",
    "model.layers.26.self_attn*",
    "model.layers.27.self_attn*",
    "model.layers.28.self_attn*",
    "model.layers.29.self_attn*",
    "model.layers.3.self_attn*",
    "model.layers.30.self_attn*",
    "model.layers.31.self_attn*",
    "model.layers.32.self_attn*",
    "model.layers.33.self_attn*",
    "model.layers.34.self_attn*",
    "model.layers.35.self_attn*",
    "model.layers.36.self_attn*",
    "model.layers.37.self_attn*",
    "model.layers.38.self_attn*",
    "model.layers.39.self_attn*",
    "model.layers.4.self_attn*",
    "model.layers.40.self_attn*",
    "model.layers.41.self_attn*",
    "model.layers.42.self_attn*",
    "model.layers.43.self_attn*",
    "model.layers.44.self_attn*",
    "model.layers.45.self_attn*",
    "model.layers.46.self_attn*",
    "model.layers.47.self_attn*",
    "model.layers.48.self_attn*",
    "model.layers.49.self_attn*",
    "model.layers.5.self_attn*",
    "model.layers.50.self_attn*",
    "model.layers.51.self_attn*",
    "model.layers.52.self_attn*",
    "model.layers.53.self_attn*",
    "model.layers.54.self_attn*",
    "model.layers.55.self_attn*",
    "model.layers.56.self_attn*",
    "model.layers.57.self_attn*",
    "model.layers.58.self_attn*",
    "model.layers.59.self_attn*",
    "model.layers.6.self_attn*",
    "model.layers.60.self_attn*",
    "model.layers.61.self_attn*",
    "model.layers.62.self_attn*",
    "model.layers.63.self_attn*",
    "model.layers.64.self_attn*",
    "model.layers.65.self_attn*",
    "model.layers.66.self_attn*",
    "model.layers.67.self_attn*",
    "model.layers.68.self_attn*",
    "model.layers.69.self_attn*",
    "model.layers.7.self_attn*",
    "model.layers.70.self_attn*",
    "model.layers.71.self_attn*",
    "model.layers.72.self_attn*",
    "model.layers.73.self_attn*",
    "model.layers.74.self_attn*",
    "model.layers.75.self_attn*",
    "model.layers.76.self_attn*",
    "model.layers.77.self_attn*",
    "model.layers.78.eh_proj*",
    "model.layers.78.enorm*",
    "model.layers.78.hnorm*",
    "model.layers.78.input_layernorm*",
    "model.layers.78.mlp*",
    "model.layers.78.post_attention_layernorm*",
    "model.layers.78.self_attn*",
    "model.layers.78.shared_head*",
    "model.layers.8.self_attn*",
    "model.layers.9.self_attn*"
  ],
  "quant_algo": "NVFP4",
  "kv_cache_scheme": {
    "dynamic": false,
    "num_bits": 8,
    "type": "float"
  },
  "producer": {
    "name": "modelopt",
    "version": "0.39.0.dev290+gf9d9a71de.d20260214"
  },
  "quant_method": "modelopt"
}
input_scales.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3dc72ddf5e0a470c46bad790020c5bd48b0c75c3f12afaa688a9ed23bef30656
size 6700728
model-00003-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:31a4f8568520c2243295633e5652272d86356102b0f842457a7c852a81168bd5
size 5370445012
model-00005-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1debaedd04b77f1a42bd61c2f565b3087c84299969bbf1b6fd585cdb7907f265
size 5370445220
model-00010-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:07e0a19471da0515e042f386aa0d5cf23d768209ae04dd3d75f074b09fcf006d
size 5369253496
model-00011-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fa750340797b11cbbc56dfc6f41c8a1277459c559c3a36847f78896422c722c6
size 5373591496
model-00012-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e44e99405d81bd4e6410668c6e40426d64bfd87ba5b0fbbf9384c95ebc6e984a
size 5370447164
model-00014-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9552d2c38b93b44624663b6250077f0fe3460eeb44c2270fa01e6991ff1b0432
size 5370447100
model-00015-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7569d4539b3e236654dc28ebeba18761066236f6d960e58257374c4c0207cf79
size 5370447156
model-00016-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9c4aad9b9b1f1a786ae5a5bca7730b63c05f9afa21d2b562b88f2b5f5c50fb58
size 5370447156
model-00017-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3370803228952f375f3bd9e958ffc3605269cac200a836678bb01cb7014bdea5
size 5370447212
model-00020-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bf2bad486df10c4f91d21a0e7f4724c019cd7bdd948b2ed28c3b19927b9ac496
size 5370446780
model-00022-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f92d9c2d9994a97c023aa92e24eff9866778a28abf832182b14d077bdade509d
size 5370446972
model-00023-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:70abd450ad1bdd6f8ed13dbb34916a89d20992f06f1ab5c80e156a2d59a70223
size 5369540220
model-00024-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:da3b2a3034106ad4824468ec20d382e74f68fa36eed974b1faa23aeeacb3941b
size 5373305076
model-00026-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1620bc371c7107d320d48e1e00b085fbb9b954f36e4be387f78961298fd305a2
size 5370447108
model-00027-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b92b91e2db32518d6556e77a1eaea14d00cdf7d4e59423a297cd3f4646b2c72c
size 5370447132
model-00028-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:da6f18b3b4d8eb97ccd55faba99bd7ce4cdd5661ca4610f62a483682052cf0ad
size 5370447148
model-00030-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2992a773fb12e3fc5fdfaef7122234561e51d12ad178b2ba6e81436c331c2f97
size 5370447172
model-00031-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5af20934e85a52715bdff6c39f6c5bc534cd53bd57ada772be8909e634851989
size 5370447340
model-00033-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:82754607522ae6ef8019e4b929e1a7ba433775382e6c3aede342572c5b3a1204
size 5370446780
model-00035-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:527ffea841f460c78cd1ef414ab638b7a4da8a7224cac3a301c315e04d27a7b9
size 5370446956
model-00042-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:388b591778853ce156a7ba5dd950e4bc5c080e9fecc4d64b0ae7a03ba2d6a557
size 5370447156
model-00043-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eeebcebacaafbc9d7790ef2b0053c5cd0078393cebd0b7903672a5caa0e08270
size 5370447156
model-00045-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3c702dfcf632c7fa67eb71bab94894d7e8879599e91652b948ac6e0e93b65270
size 5370447364
model-00046-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:130ef0e80ad96d9a6ec0841f53af68abc5a39fd5dd2d086b9513528c593e3115
size 5370446820
model-00048-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:589265f505d8ecf84e848782e4b6cdcfe3653a38a3bc6d642ed77caa00d89777
size 5370446860
model-00049-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:be50e922b74c33334cdec93a10ff4172d6508bada39ced8e085a220dadc8e1df
size 5370446972
model-00050-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8b908325b29e44afcab5f8cf228108da0457d8901a96eec89e0947157c9ce61c
size 5369253496
model-00058-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:298e5b03b88cc7453754489118e850ae5e28e658d9a82e66e360a799c6227a2d
size 5370447340
model-00059-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bed696bcf98029fd4f8c945743a5887d5faa24a52f66c03dd65b9056378abecd
size 5370446980
model-00062-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6269d6956ab08eb64fdae9110e9efcd074d1b3fd33cbc038bb225c478454d4be
size 5370446972
model-00065-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:272a0de8407ba91a31340797285f793b7800b872dffb66256d5f725bac4eb67f
size 5370447188
model-00066-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:24be4bab71a064a462d2ac9a5c87eb669c42f258fe6256f266e26545f510fc5f
size 5370447108
model-00067-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:673c901ca97241d24d87cdd436802ac68522f5aa2eb510db7100111414d5ebb5
size 5370447132
model-00068-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1f5d4b951d93fc6cc47fe1205cf71bf706c311eed668c812b9c1e3a4cf74ef29
size 5370447148
model-00070-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:42ddf71da81e1cf7d5332c5a0d5abc8b0c92e58c991330cd478f348278930f52
size 5370447172
model-00074-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:38d0c5e6ca04e97aa0b94dcc0e9b0a78aa8668cfa195f56f2327d293c6f941de
size 5370446780
model-00077-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7136d2f01d39a2a38a4d3eda9c7c8be45667a9682e2401fe6856b0f23ce159da
size 5369121124
model-00079-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cfb19bdc09c4042909a359b229626d48020673187b2ed22ed4b9201d358a224a
size 5370447132
model-00080-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:03c89b0cee28f2df86ae0f040c1f2b98961c2fed148d316f2b768064df8ca100
size 5370447132
model-00081-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e2cb8eb8160a3757cc2ec9cd676653e78f57674c229897624d55b56797d93d28
size 5370447116
model-00082-of-00082.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2c972e6fad24d956a14a33aedae33e19bc02d57105888af0d4a6fa40cea34076
size 5671884736
model.safetensors.index.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d6b05371bcc7dd3383aa6064edbe9886498aeb397979d6e25eaa4888cd4f3f77
size 21814298
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:19e773648cb4e65de8660ea6365e10acca112d42a854923df93db4a6f333a82d
size 20217442