mconcat committed on
Commit d32ea7b · verified · 1 Parent(s): 8194183

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,201 @@
---
license: apache-2.0
base_model: arcee-ai/Trinity-Large-Base
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---

# Trinity-Large-Base-NVFP4

NVFP4-quantized version of [arcee-ai/Trinity-Large-Base](https://huggingface.co/arcee-ai/Trinity-Large-Base) for deployment on NVIDIA Blackwell GPUs.

## Model Details

| | |
|---|---|
| **Base model** | [arcee-ai/Trinity-Large-Base](https://huggingface.co/arcee-ai/Trinity-Large-Base) |
| **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
| **Parameters** | 398B total, ~13B active per token |
| **Layers** | 60 (6 dense + 54 MoE) |
| **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
| **Hidden size** | 3072 |
| **MoE intermediate size** | 3072 per expert |
| **Dense intermediate size** | 12,288 |
| **Attention** | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| **Context length** | 8,192 tokens |
| **Vocabulary** | 200,192 tokens |

## Quantization

| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MLP/expert weights only (`gate_proj`, `up_proj`, `down_proj` in dense and MoE layers) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| **Source precision** | BF16 |

### Compression

| Format | Size |
|--------|------|
| BF16 (original) | 796 GB |
| **NVFP4 (this model)** | **216 GB** |

3.7x compression.

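The quoted ratio follows directly from the table; a quick arithmetic check:

```python
bf16_gb = 796   # original BF16 checkpoint size from the table
nvfp4_gb = 216  # this NVFP4 model

ratio = bf16_gb / nvfp4_gb
print(f"{ratio:.1f}x compression")  # 3.7x
```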
## Running with vLLM

[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.

### Requirements

- **VRAM**: ~216 GB of model weights. A single GPU with ≥224 GB VRAM can load it directly; smaller setups require multi-GPU and/or CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, you need sufficient system RAM for pinned memory (the offload value × number of GPUs, plus ~40 GB of model-loading overhead).

### Installation

```bash
pip install "vllm>=0.15.1"
```

### Environment Variables

Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0
```

### Single-GPU (≥224 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    max_model_len=4096,
    enforce_eager=True,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

### Multi-GPU with Pipeline Parallelism

For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:

```python
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    pipeline_parallel_size=2,  # number of GPUs
    cpu_offload_gb=30,         # GB of weights to offload per GPU
    max_model_len=512,
    max_num_seqs=256,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

**Tuning tips:**
- `cpu_offload_gb` is **per GPU** — total pinned memory = `cpu_offload_gb × pipeline_parallel_size`. Ensure this fits in system RAM alongside the OS and model-loading workspace (~40 GB).
- For **heterogeneous GPU setups** (different VRAM sizes), set `VLLM_PP_LAYER_PARTITION` to control how many of the 60 layers each GPU gets. For example, `export VLLM_PP_LAYER_PARTITION="32,14,14"` for a 3-GPU setup where the first GPU has ~3x the VRAM.
- Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6-59 are MoE. Distribute layers so that `(layer_weights - cpu_offload_gb)` fits comfortably on each GPU with room for KV cache and overhead.
- `max_num_seqs` may need to be lowered for GPUs with ≤32 GB VRAM. The sampler warmup allocates `max_num_seqs × vocab_size × 8 bytes` of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
- Start with a low `max_model_len` (e.g., 512) and increase once loading succeeds.

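The per-layer sizes above can be turned into a rough per-stage weight budget when choosing a `VLLM_PP_LAYER_PARTITION` split. This is a hypothetical helper (not part of vLLM), using the approximate NVFP4 layer sizes quoted in the tips:

```python
# Approximate NVFP4 layer weights from the tuning tips above.
DENSE_GB, MOE_GB = 0.14, 3.9
NUM_DENSE = 6  # layers 0-5 are dense; layers 6-59 are MoE

def partition_weights(partition):
    """GB of layer weights each pipeline stage would hold."""
    budgets, start = [], 0
    for n in partition:
        gb = sum(DENSE_GB if i < NUM_DENSE else MOE_GB
                 for i in range(start, start + n))
        budgets.append(round(gb, 1))
        start += n
    return budgets

# The 3-GPU example split "32,14,14" from the tips:
print(partition_weights([32, 14, 14]))  # [102.2, 54.6, 54.6]
```

The first stage absorbs all six small dense layers plus 26 MoE layers, which is why it needs roughly 3x the VRAM of the other two stages before offloading is applied.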
### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/Trinity-Large-Base-NVFP4 \
    --quantization modelopt \
    --max-model-len 4096 \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --port 8000
```

For multi-GPU serving, add `--pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256` as needed.

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mconcat/Trinity-Large-Base-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```

## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **vLLM quantization flag**: Use `--quantization modelopt` (not `modelopt_fp4`). vLLM auto-detects the NVFP4 algorithm from the config.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default FlashInfer backend performs a `reorder_w1w3_to_w3w1` operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
- **vLLM cpu_offload_gb + V1 engine**: As of vLLM 0.15.x, using `cpu_offload_gb` with the V1 engine may trigger an assertion error in `may_reinitialize_input_batch` (`gpu_model_runner.py`). If you encounter `AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled`, it can be safely patched by converting the assertion to a warning. See [vLLM issue #18298](https://github.com/vllm-project/vllm/issues/18298) for status.
- **HuggingFace Transformers**: While `transformers >= 5.0` recognizes the `AfmoeForCausalLM` architecture, it does **not** support the ModelOpt NVFP4 weight format for inference. Use vLLM instead.
- **TensorRT-LLM**: As of February 2026, TensorRT-LLM does not support the `AfmoeForCausalLM` architecture.

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and lm_head remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)

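The effect of this recipe is visible in the `exclude_modules` list shipped in `hf_quant_config.json`. A minimal sketch of how such patterns classify module names, using Python's `fnmatch`-style globbing — the per-layer patterns are collapsed to `*` wildcards here for brevity, and ModelOpt's actual matcher may differ:

```python
from fnmatch import fnmatch

# Collapsed stand-ins for the per-layer patterns in hf_quant_config.json.
EXCLUDE = ["lm_head", "model.layers.*.self_attn*", "model.layers.*.mlp.router*"]

def is_quantized(name):
    """True if a module's weights would be stored in NVFP4 (not excluded)."""
    return not any(fnmatch(name, pat) for pat in EXCLUDE)

print(is_quantized("model.layers.10.mlp.experts.3.gate_proj"))  # True
print(is_quantized("model.layers.10.self_attn.q_proj"))         # False
print(is_quantized("model.layers.10.mlp.router.gate"))          # False
print(is_quantized("lm_head"))                                  # False
```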
### Calibration Data

| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |

## Files

| File | Description |
|------|-------------|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43-50 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |

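The shard byte counts from the LFS pointers in this repo sum to the ~216 GB figure quoted under Compression (GiB, strictly speaking):

```python
# Byte sizes taken from the five safetensors LFS pointers below.
shards = [49_979_822_160, 50_001_038_716, 50_004_196_600,
          50_000_068_080, 31_524_413_620]

total_gib = sum(shards) / 2**30
print(f"{total_gib:.0f} GiB")  # 216 GiB
```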
## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized

## License

Same license as the base model: [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Large-Base).
chat_template.jinja ADDED
@@ -0,0 +1 @@
{{ bos_token }}{% for message in messages %}{{ message['content'] }}{% endfor %}
config.json ADDED
@@ -0,0 +1,259 @@
{
  "architectures": [
    "AfmoeForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_afmoe.AfmoeConfig",
    "AutoModel": "modeling_afmoe.AfmoeModel",
    "AutoModelForCausalLM": "modeling_afmoe.AfmoeForCausalLM"
  },
  "bos_token_id": null,
  "dtype": "bfloat16",
  "eos_token_id": null,
  "global_attn_every_n_layers": 4,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 12288,
  "layer_types": [
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention"
  ],
  "load_balance_coeff": 5e-05,
  "max_position_embeddings": 262144,
  "model_type": "afmoe",
  "moe_intermediate_size": 3072,
  "mup_enabled": true,
  "n_group": 1,
  "num_attention_heads": 48,
  "num_dense_layers": 6,
  "num_expert_groups": 1,
  "num_experts": 256,
  "num_experts_per_tok": 4,
  "num_hidden_layers": 60,
  "num_key_value_heads": 8,
  "num_limited_groups": 1,
  "num_shared_experts": 1,
  "pad_token_id": null,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "rope_theta": 10000.0,
    "rope_type": "default"
  },
  "rope_theta": 10000,
  "route_norm": true,
  "route_scale": 2.448,
  "score_func": "sigmoid",
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "topk_group": 1,
  "transformers_version": "5.1.0",
  "use_cache": true,
  "use_grouped_mm": true,
  "vocab_size": 200192,
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "input_activations": {
          "dynamic": false,
          "num_bits": 4,
          "type": "float",
          "group_size": 16
        },
        "weights": {
          "dynamic": false,
          "num_bits": 4,
          "type": "float",
          "group_size": 16
        },
        "targets": [
          "Linear"
        ]
      }
    },
    "ignore": [
      "lm_head",
      "model.layers.0.self_attn*",
      "model.layers.1.self_attn*",
      "model.layers.10.mlp.router*",
      "model.layers.10.self_attn*",
      "model.layers.11.mlp.router*",
      "model.layers.11.self_attn*",
      "model.layers.12.mlp.router*",
      "model.layers.12.self_attn*",
      "model.layers.13.mlp.router*",
      "model.layers.13.self_attn*",
      "model.layers.14.mlp.router*",
      "model.layers.14.self_attn*",
      "model.layers.15.mlp.router*",
      "model.layers.15.self_attn*",
      "model.layers.16.mlp.router*",
      "model.layers.16.self_attn*",
      "model.layers.17.mlp.router*",
      "model.layers.17.self_attn*",
      "model.layers.18.mlp.router*",
      "model.layers.18.self_attn*",
      "model.layers.19.mlp.router*",
      "model.layers.19.self_attn*",
      "model.layers.2.self_attn*",
      "model.layers.20.mlp.router*",
      "model.layers.20.self_attn*",
      "model.layers.21.mlp.router*",
      "model.layers.21.self_attn*",
      "model.layers.22.mlp.router*",
      "model.layers.22.self_attn*",
      "model.layers.23.mlp.router*",
      "model.layers.23.self_attn*",
      "model.layers.24.mlp.router*",
      "model.layers.24.self_attn*",
      "model.layers.25.mlp.router*",
      "model.layers.25.self_attn*",
      "model.layers.26.mlp.router*",
      "model.layers.26.self_attn*",
      "model.layers.27.mlp.router*",
      "model.layers.27.self_attn*",
      "model.layers.28.mlp.router*",
      "model.layers.28.self_attn*",
      "model.layers.29.mlp.router*",
      "model.layers.29.self_attn*",
      "model.layers.3.self_attn*",
      "model.layers.30.mlp.router*",
      "model.layers.30.self_attn*",
      "model.layers.31.mlp.router*",
      "model.layers.31.self_attn*",
      "model.layers.32.mlp.router*",
      "model.layers.32.self_attn*",
      "model.layers.33.mlp.router*",
      "model.layers.33.self_attn*",
      "model.layers.34.mlp.router*",
      "model.layers.34.self_attn*",
      "model.layers.35.mlp.router*",
      "model.layers.35.self_attn*",
      "model.layers.36.mlp.router*",
      "model.layers.36.self_attn*",
      "model.layers.37.mlp.router*",
      "model.layers.37.self_attn*",
      "model.layers.38.mlp.router*",
      "model.layers.38.self_attn*",
      "model.layers.39.mlp.router*",
      "model.layers.39.self_attn*",
      "model.layers.4.self_attn*",
      "model.layers.40.mlp.router*",
      "model.layers.40.self_attn*",
      "model.layers.41.mlp.router*",
      "model.layers.41.self_attn*",
      "model.layers.42.mlp.router*",
      "model.layers.42.self_attn*",
      "model.layers.43.mlp.router*",
      "model.layers.43.self_attn*",
      "model.layers.44.mlp.router*",
      "model.layers.44.self_attn*",
      "model.layers.45.mlp.router*",
      "model.layers.45.self_attn*",
      "model.layers.46.mlp.router*",
      "model.layers.46.self_attn*",
      "model.layers.47.mlp.router*",
      "model.layers.47.self_attn*",
      "model.layers.48.mlp.router*",
      "model.layers.48.self_attn*",
      "model.layers.49.mlp.router*",
      "model.layers.49.self_attn*",
      "model.layers.5.self_attn*",
      "model.layers.50.mlp.router*",
      "model.layers.50.self_attn*",
      "model.layers.51.mlp.router*",
      "model.layers.51.self_attn*",
      "model.layers.52.mlp.router*",
      "model.layers.52.self_attn*",
      "model.layers.53.mlp.router*",
      "model.layers.53.self_attn*",
      "model.layers.54.mlp.router*",
      "model.layers.54.self_attn*",
      "model.layers.55.mlp.router*",
      "model.layers.55.self_attn*",
      "model.layers.56.mlp.router*",
      "model.layers.56.self_attn*",
      "model.layers.57.mlp.router*",
      "model.layers.57.self_attn*",
      "model.layers.58.mlp.router*",
      "model.layers.58.self_attn*",
      "model.layers.59.mlp.router*",
      "model.layers.59.self_attn*",
      "model.layers.6.mlp.router*",
      "model.layers.6.self_attn*",
      "model.layers.7.mlp.router*",
      "model.layers.7.self_attn*",
      "model.layers.8.mlp.router*",
      "model.layers.8.self_attn*",
      "model.layers.9.mlp.router*",
      "model.layers.9.self_attn*"
    ],
    "quant_algo": "NVFP4",
    "producer": {
      "name": "modelopt",
      "version": "0.41.0"
    },
    "quant_method": "modelopt"
  }
}
generation_config.json ADDED
@@ -0,0 +1,5 @@
{
  "_from_model_config": true,
  "transformers_version": "5.1.0",
  "use_cache": true
}
hf_quant_config.json ADDED
@@ -0,0 +1,128 @@
{
  "producer": {
    "name": "modelopt",
    "version": "0.41.0"
  },
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": null,
    "group_size": 16,
    "exclude_modules": [
      "lm_head",
      "model.layers.0.self_attn*",
      "model.layers.1.self_attn*",
      "model.layers.10.mlp.router*",
      "model.layers.10.self_attn*",
      "model.layers.11.mlp.router*",
      "model.layers.11.self_attn*",
      "model.layers.12.mlp.router*",
      "model.layers.12.self_attn*",
      "model.layers.13.mlp.router*",
      "model.layers.13.self_attn*",
      "model.layers.14.mlp.router*",
      "model.layers.14.self_attn*",
      "model.layers.15.mlp.router*",
      "model.layers.15.self_attn*",
      "model.layers.16.mlp.router*",
      "model.layers.16.self_attn*",
      "model.layers.17.mlp.router*",
      "model.layers.17.self_attn*",
      "model.layers.18.mlp.router*",
      "model.layers.18.self_attn*",
      "model.layers.19.mlp.router*",
      "model.layers.19.self_attn*",
      "model.layers.2.self_attn*",
      "model.layers.20.mlp.router*",
      "model.layers.20.self_attn*",
      "model.layers.21.mlp.router*",
      "model.layers.21.self_attn*",
      "model.layers.22.mlp.router*",
      "model.layers.22.self_attn*",
      "model.layers.23.mlp.router*",
      "model.layers.23.self_attn*",
      "model.layers.24.mlp.router*",
      "model.layers.24.self_attn*",
      "model.layers.25.mlp.router*",
      "model.layers.25.self_attn*",
      "model.layers.26.mlp.router*",
      "model.layers.26.self_attn*",
      "model.layers.27.mlp.router*",
      "model.layers.27.self_attn*",
      "model.layers.28.mlp.router*",
      "model.layers.28.self_attn*",
      "model.layers.29.mlp.router*",
      "model.layers.29.self_attn*",
      "model.layers.3.self_attn*",
      "model.layers.30.mlp.router*",
      "model.layers.30.self_attn*",
      "model.layers.31.mlp.router*",
      "model.layers.31.self_attn*",
      "model.layers.32.mlp.router*",
      "model.layers.32.self_attn*",
      "model.layers.33.mlp.router*",
      "model.layers.33.self_attn*",
      "model.layers.34.mlp.router*",
      "model.layers.34.self_attn*",
      "model.layers.35.mlp.router*",
      "model.layers.35.self_attn*",
      "model.layers.36.mlp.router*",
      "model.layers.36.self_attn*",
      "model.layers.37.mlp.router*",
      "model.layers.37.self_attn*",
      "model.layers.38.mlp.router*",
      "model.layers.38.self_attn*",
      "model.layers.39.mlp.router*",
      "model.layers.39.self_attn*",
      "model.layers.4.self_attn*",
      "model.layers.40.mlp.router*",
      "model.layers.40.self_attn*",
      "model.layers.41.mlp.router*",
      "model.layers.41.self_attn*",
      "model.layers.42.mlp.router*",
      "model.layers.42.self_attn*",
      "model.layers.43.mlp.router*",
      "model.layers.43.self_attn*",
      "model.layers.44.mlp.router*",
      "model.layers.44.self_attn*",
      "model.layers.45.mlp.router*",
      "model.layers.45.self_attn*",
      "model.layers.46.mlp.router*",
      "model.layers.46.self_attn*",
      "model.layers.47.mlp.router*",
      "model.layers.47.self_attn*",
      "model.layers.48.mlp.router*",
      "model.layers.48.self_attn*",
      "model.layers.49.mlp.router*",
      "model.layers.49.self_attn*",
      "model.layers.5.self_attn*",
      "model.layers.50.mlp.router*",
      "model.layers.50.self_attn*",
      "model.layers.51.mlp.router*",
      "model.layers.51.self_attn*",
      "model.layers.52.mlp.router*",
      "model.layers.52.self_attn*",
      "model.layers.53.mlp.router*",
      "model.layers.53.self_attn*",
      "model.layers.54.mlp.router*",
      "model.layers.54.self_attn*",
      "model.layers.55.mlp.router*",
      "model.layers.55.self_attn*",
      "model.layers.56.mlp.router*",
      "model.layers.56.self_attn*",
      "model.layers.57.mlp.router*",
      "model.layers.57.self_attn*",
      "model.layers.58.mlp.router*",
      "model.layers.58.self_attn*",
      "model.layers.59.mlp.router*",
      "model.layers.59.self_attn*",
      "model.layers.6.mlp.router*",
      "model.layers.6.self_attn*",
      "model.layers.7.mlp.router*",
      "model.layers.7.self_attn*",
      "model.layers.8.mlp.router*",
      "model.layers.8.self_attn*",
      "model.layers.9.mlp.router*",
      "model.layers.9.self_attn*"
    ]
  }
}
model-00001-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:db35b3d8988eb946cd7cfcb9204045fae258fba140d4a931a3165339496691b8
size 49979822160
model-00002-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:883fbf2ae549904e1846abf853528cb3324dde9f271fd55796f9cb926e32b0db
size 50001038716
model-00003-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:46bd4b62b3ee2f41ff12fb46b9f55357299e93e8a15d6c4f6b46b850006ecfe0
size 50004196600
model-00004-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e5e80dc1906eccbd97192fefff4f6f7175c145a042c4775dc4833df0a47fba0f
size 50000068080
model-00005-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:411e83d2c0a4e96ca9b0556fe3ab405e2c64dcc957b50b156028151d1efce548
size 31524413620
model.safetensors.index.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1a626457320b5a5245cb6e8e4113dc4c4ff697c1feeae9334e6ab0432d0f2073
size 15989867
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:55b56b3b89ba5a5f70ebff957c435c8501da7ae994b0684683511f5e94b674a8
size 14614977
tokenizer_config.json ADDED
@@ -0,0 +1,12 @@
{
  "add_prefix_space": null,
  "backend": "tokenizers",
  "bos_token": "<|begin_of_text|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "is_local": false,
  "model_max_length": 65536,
  "pad_token": "<|pad|>",
  "tokenizer_class": "TokenizersBackend",
  "use_default_system_prompt": false
}