GadflyII committed on
Commit a3ad40a · verified · 1 Parent(s): 8b152a1

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +180 -62

README.md CHANGED
@@ -1,92 +1,210 @@
  ---
- license: other
- license_name: glm-4
- license_link: https://huggingface.co/THUDM/glm-4-9b/blob/main/LICENSE
- base_model: THUDM/GLM-4-9B-0414
  tags:
- - glm
- - moe
- - nvfp4
- - quantized
- - vllm
  library_name: transformers
  ---

- # GLM-4.7-Flash-MTP-NVFP4
-
- NVFP4 (Native FP4) quantized version of [THUDM/GLM-4-9B-0414](https://huggingface.co/THUDM/GLM-4-9B-0414) with MTP (Multi-Token Prediction) layers preserved in BF16.
-
- ## Model Details
-
- - **Base Model**: THUDM/GLM-4-9B-0414 (GLM-4.7-Flash)
- - **Architecture**: Glm4MoeLiteForCausalLM (MoE with 64 experts, top-4 routing)
- - **Quantization**: NVFP4 with FP8 scales, block size 16
- - **Size**: 20.9 GB (3.0x compression from 62.4 GB BF16)
- - **MTP Layers**: Preserved in BF16 for speculative decoding compatibility
-
- ## Quantization Details
-
- | Component | Precision | Notes |
- |-----------|-----------|-------|
- | MLP/FFN layers | NVFP4 | 4-bit weights, 4-bit activations |
- | Attention (self_attn) | BF16 | MLA architecture preserved |
- | MTP layers (eh_proj, shared_head) | BF16 | Speculative decoding compatible |
- | Embeddings | BF16 | Not quantized |
- | Gates | BF16 | Router gates preserved |
-
- ### Calibration Settings
- - **Samples**: 512 (from wikitext)
- - **Sequence Length**: 4096
- - **Strategy**: tensor_group with group_size=16
-
- ## Benchmark Results
-
- ### MMLU-Pro Accuracy
-
- | Model | Accuracy |
- |-------|----------|
- | BF16 (baseline) | 24.83% |
- | NVFP4-v2 (this model) | 23.91% |
-
  ### MTP Acceptance Rate
- - **BF16**: 60% acceptance, 1.60 mean accepted length
- - **NVFP4-v2**: 63% acceptance, 1.63 mean accepted length
-
- MTP quality is preserved after quantization.
-
- ### Performance Note
-
- MTP speculative decoding currently shows overhead rather than speedup due to missing torch.compile support for the MTP drafter model. For best throughput, run without MTP enabled.
-
- ## Usage with vLLM
-
  ```bash
- # Standard inference (recommended for performance)
- VLLM_ATTENTION_BACKEND=TRITON_MLA python -m vllm.entrypoints.openai.api_server \
-     --model GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
-     --trust-remote-code \
-     --max-model-len 4096 \
-     --gpu-memory-utilization 0.95

  # With MTP speculative decoding (experimental)
- VLLM_ATTENTION_BACKEND=TRITON_MLA python -m vllm.entrypoints.openai.api_server \
-     --model GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
-     --trust-remote-code \
-     --max-model-len 4096 \
-     --gpu-memory-utilization 0.90 \
-     --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
  ```

- ## Requirements
-
- - vLLM with NVFP4 support (v0.8.0+)
- - NVIDIA GPU with FP4 support (Blackwell/Ada Lovelace with appropriate kernels)
- - transformers >= 5.0.0
-
- ## License
-
- This model inherits the [GLM-4 License](https://huggingface.co/THUDM/glm-4-9b/blob/main/LICENSE) from the base model.
-
- ## Acknowledgments
-
- Quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor) with compressed-tensors format.
 
  ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ base_model: zai-org/GLM-4.7-Flash
  tags:
+ - moe
+ - nvfp4
+ - quantized
+ - vllm
+ - glm
+ - 30b
+ - mtp
+ - speculative-decoding
  library_name: transformers
+ pipeline_tag: text-generation
  ---
+ # Note: If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try my vLLM fork to resolve P2P / TP=2 issues (pending PR into upstream):
+ https://github.com/Gadflyii/vllm/tree/main

+ # GLM-4.7-Flash-MTP-NVFP4 (Mixed Precision with MTP Support)

+ This is a **mixed-precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model. This version preserves the **MTP (Multi-Token Prediction) layers in BF16** for speculative decoding compatibility.

+ ## What's Different from GLM-4.7-Flash-NVFP4?

+ | Feature | GLM-4.7-Flash-NVFP4 | **This Model** |
+ |---------|---------------------|----------------|
+ | MTP Layers | Quantized (broken) | **BF16 (working)** |
+ | MTP Speculative Decoding | ❌ Not supported | ✅ Supported |
+ | Calibration Samples | 128 | **512** |
+ | Calibration Seq Length | 2048 | **4096** |
+ | MMLU-Pro Accuracy | 23.56% | **23.91%** |

+ ## Quantization Strategy

+ This model uses **mixed precision** to preserve accuracy and MTP functionality:

+ | Component | Precision | Rationale |
+ |-----------|-----------|-----------|
+ | MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
+ | Dense MLP | FP4 (E2M1) | First-layer dense MLP |
+ | **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are quantization-sensitive |
+ | **MTP Layers** | **BF16** | `eh_proj`, `shared_head.head` for speculative decoding |
+ | Norms, Gates, Embeddings | BF16 | Standard practice |

+ ## Performance

+ | Metric | BF16 | NVFP4-v1 | **This Model** |
+ |--------|------|----------|----------------|
+ | MMLU-Pro | 24.83% | 23.56% | **23.91%** |
+ | Size | 62.4 GB | 20.4 GB | **20.9 GB** |
+ | Compression | 1x | 3.1x | **3.0x** |
+ | Accuracy Loss (pts) | - | -1.27 | **-0.92** |
+ | MTP Working | ✅ | ❌ | ✅ |
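The ~3.0x figure is consistent with the NVFP4 layout described below: each block of 16 weights stores 4-bit values plus one shared 8-bit (FP8) scale, i.e. 4 + 8/16 = 4.5 effective bits per quantized weight versus 16 for BF16. A quick arithmetic sanity check using only numbers from this card:

```python
# Effective bits per NVFP4 weight: 4-bit E2M1 value plus one FP8 scale
# shared across each block of 16 weights.
BLOCK_SIZE = 16
bits_per_weight = 4 + 8 / BLOCK_SIZE  # 4.5 bits

# Compression of the quantized tensors alone, relative to 16-bit BF16:
fp4_tensor_compression = 16 / bits_per_weight  # ~3.56x

# The full checkpoint compresses less because attention, MTP layers,
# embeddings, norms, and gates stay in BF16:
overall_compression = round(62.4 / 20.9, 1)  # GB BF16 / GB quantized
print(fp4_tensor_compression, overall_compression)
```

The gap between ~3.56x (FP4 tensors only) and 3.0x (whole checkpoint) is the cost of the BF16-retained components.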

  ### MTP Acceptance Rate

+ | Model | Acceptance Rate | Mean Accepted Length |
+ |-------|-----------------|----------------------|
+ | BF16 (baseline) | 60% | 1.60 |
+ | **This Model** | **63%** | **1.63** |
+
+ MTP quality is preserved (actually slightly improved) after quantization.
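With a single speculative token, mean accepted length follows directly from the acceptance rate: every decode step emits one verified token, plus the drafted token whenever it is accepted. A quick check of the numbers above:

```python
# With num_speculative_tokens=1:
#   mean_accepted_length = 1 + acceptance_rate
for name, acceptance in [("BF16", 0.60), ("NVFP4 (this model)", 0.63)]:
    print(name, 1 + acceptance)  # 1.60 and 1.63, matching the table
```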
+
+ ### MTP Performance Note
+
+ MTP speculative decoding currently shows overhead rather than speedup due to missing `torch.compile` support for the MTP drafter model in vLLM. For best throughput, run without MTP enabled until this is resolved upstream.
+
+ | Configuration | Tokens/sec | Recommendation |
+ |---------------|------------|----------------|
+ | Without MTP | 78.1 tok/s | ✅ **Use this** |
+ | With MTP (1 token) | 64.7 tok/s | ❌ |
+ | With MTP (2 tokens) | 56.8 tok/s | ❌ |
+ | With MTP (4 tokens) | 44.5 tok/s | ❌ |
+
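The table makes the overhead concrete: with one speculative token each step yields 1.63 tokens on average, so MTP only wins if a draft-plus-verify step costs less than 1.63x a plain decode step. Plugging in the measured throughputs (all numbers from the tables above):

```python
base_toks = 78.1        # tok/s without MTP
mtp_toks = 64.7         # tok/s with 1 speculative token
mean_accepted = 1.63    # tokens produced per MTP step

base_step_ms = 1000 / base_toks                 # plain decode step
mtp_step_ms = mean_accepted * 1000 / mtp_toks   # draft + verify step

# Relative cost of one MTP step; must be below mean_accepted (1.63) to win:
print(round(mtp_step_ms / base_step_ms, 2))  # ~1.97
```

An uncompiled drafter makes each MTP step cost nearly 2x a plain step while producing only 1.63 tokens, which is exactly the slowdown the table shows.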
+ ## Usage
+
+ ### Requirements
+
+ - **vLLM**: 0.8.0+ (for compressed-tensors NVFP4 support)
+ - **transformers**: 5.0.0+ (for the `glm4_moe_lite` architecture)
+ - **GPU**: NVIDIA GPU with FP4 support (native FP4 tensor cores on Blackwell; Ada Lovelace/Hopper depend on appropriate fallback kernels)
+
+ ### Installation

+ ```bash
+ pip install "vllm>=0.8.0"
+ pip install git+https://github.com/huggingface/transformers.git
+ ```
+
+ ### Inference with vLLM (Recommended)

+ ```python
+ from vllm import LLM, SamplingParams
+
+ model = LLM(
+     "GadflyII/GLM-4.7-Flash-MTP-NVFP4",
+     tensor_parallel_size=1,
+     max_model_len=4096,
+     trust_remote_code=True,
+     gpu_memory_utilization=0.90,
+ )
+
+ params = SamplingParams(temperature=0.7, max_tokens=512)
+ outputs = model.generate(["Explain quantum computing in simple terms."], params)
+ print(outputs[0].outputs[0].text)
+ ```

+ ### Serving with vLLM

  ```bash
+ # Standard serving (recommended for performance)
+ VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
+     --tensor-parallel-size 1 \
+     --max-model-len 4096 \
+     --trust-remote-code \
+     --gpu-memory-utilization 0.90

  # With MTP speculative decoding (experimental)
+ VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
+     --tensor-parallel-size 1 \
+     --max-model-len 4096 \
+     --trust-remote-code \
+     --gpu-memory-utilization 0.90 \
+     --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
  ```
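Once a server is up, it exposes the standard OpenAI-compatible API. A minimal sketch of a chat-completions request body (the endpoint path and port 8000 are vLLM defaults; adjust if you pass `--port`):

```python
import json

# Request body for the vLLM OpenAI-compatible server, e.g.
# POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "GadflyII/GLM-4.7-Flash-MTP-NVFP4",
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
}
body = json.dumps(payload)
print(body)
```

Send it with any HTTP client, e.g. `curl -H "Content-Type: application/json" -d @- http://localhost:8000/v1/chat/completions`.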

+ ## Model Details

+ - **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
+ - **Architecture**: `Glm4MoeLiteForCausalLM`
+ - **Parameters**: 30B total, 3B active per token (30B-A3B)
+ - **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
+ - **Layers**: 47 (with 1 MTP layer)
+ - **Context Length**: 202,752 tokens (max)
+ - **Languages**: English, Chinese

+ ## Quantization Details

+ - **Format**: compressed-tensors (NVFP4)
+ - **Block Size**: 16
+ - **Scale Format**: FP8 (E4M3)
+ - **Calibration**: 512 samples from the wikitext dataset
+ - **Calibration Sequence Length**: 4096
+ - **Full Expert Calibration**: All 64 experts calibrated per sample
+
+ ### Tensors by Precision
+
+ | Precision | Count | Description |
+ |-----------|-------|-------------|
+ | NVFP4 | 9,168 | MLP/FFN weights |
+ | BF16 | 240 | Attention weights (MLA) |
+ | BF16 | 2 | MTP layers (`eh_proj`, `shared_head.head`) |
+
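To make the block format concrete, here is a pure-Python sketch of NVFP4-style fake quantization: each block of 16 weights is scaled so its largest magnitude maps to the largest E2M1 value (±6), then every weight snaps to the nearest representable FP4 value. This illustrates the numeric grid only; a real kernel also quantizes the per-block scale itself to FP8 E4M3, which is omitted here.

```python
# Magnitudes representable in E2M1 (FP4): 0, 0.5, 1, 1.5, 2, 3, 4, 6
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(weights, block_size=16):
    """Fake-quantize one block of weights to E2M1 with a shared scale."""
    assert len(weights) == block_size
    amax = max(abs(w) for w in weights)
    scale = amax / 6.0 if amax > 0 else 1.0  # map max |w| to E2M1 max (6)
    out = []
    for w in weights:
        # Snap |w|/scale to the nearest E2M1 magnitude, then restore sign.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(mag * scale if w >= 0 else -mag * scale)
    return out

block = [0.01 * i - 0.08 for i in range(16)]  # toy weights in [-0.08, 0.07]
deq = quantize_block(block)
max_err = max(abs(a - b) for a, b in zip(block, deq))
print(max_err)  # bounded by half the coarsest E2M1 step times the scale
```

Because the grid is non-uniform (steps of 0.5 near zero, 2.0 near the top), the largest rounding error lands on mid-range values, which is why calibration data matters for choosing scales.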
+ ## Evaluation
+
+ ### MMLU-Pro Overall Results
+
+ | Model | Accuracy | Correct | Total |
+ |-------|----------|---------|-------|
+ | BF16 (baseline) | 24.83% | 2988 | 12032 |
+ | NVFP4-v1 | 23.56% | 2835 | 12032 |
+ | **This Model** | **23.91%** | **2877** | 12032 |
+
+ ### MMLU-Pro by Category
+
+ | Category | BF16 | This Model | Difference |
+ |----------|------|------------|------------|
+ | Social Sciences | 32.70% | 31.26% | -1.44% |
+ | Other | 31.57% | 29.85% | -1.72% |
+ | Humanities | 23.78% | 22.82% | -0.96% |
+ | STEM | 19.94% | 19.48% | -0.46% |
+
+ ### MMLU-Pro by Subject
+
+ | Subject | BF16 | This Model | Difference |
+ |---------|------|------------|------------|
+ | Biology | 50.35% | 48.12% | -2.23% |
+ | Psychology | 44.99% | 41.23% | -3.76% |
+ | History | 33.60% | 34.12% | +0.52% |
+ | Health | 35.21% | 34.11% | -1.10% |
+ | Economics | 36.37% | 33.06% | -3.31% |
+ | Philosophy | 31.46% | 29.26% | -2.20% |
+ | Other | 28.35% | 26.08% | -2.27% |
+ | Computer Science | 26.10% | 21.95% | -4.15% |
+ | Business | 16.35% | 19.26% | +2.91% |
+ | Law | 16.89% | 15.99% | -0.90% |
+ | Math | 14.06% | 14.73% | +0.67% |
+ | Physics | 15.32% | 15.24% | -0.08% |
+ | Engineering | 16.00% | 14.96% | -1.04% |
+ | Chemistry | 14.13% | 14.84% | +0.71% |
+
+ ## Citation
+
+ If you use this model, please cite the original GLM-4.7-Flash:
+
+ ```bibtex
+ @misc{glm4flash2025,
+     title={GLM-4.7-Flash},
+     author={Zhipu AI},
+     year={2025},
+     howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
+ }
+ ```

+ ## License

+ This model inherits the Apache 2.0 license from the base model.