---
license: apache-2.0
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
tags:
- moe
- mxfp4
- quantized
- vllm
- glm
- 30b
library_name: transformers
pipeline_tag: text-generation
---
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50 / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (PR into upstream pending).
# Note: If you are running this MXFP4 model on SM120 GPUs, you will also need my fork until the PR is merged upstream; note that the MXFP4 path is significantly slower than NVFP4 on SM120.

https://github.com/Gadflyii/vllm/tree/main

# GLM-4.7-Flash MXFP4

This is an **MXFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total parameters, 3B active) Mixture-of-Experts model.

## Quantization Strategy

This model uses the **MXFP4 (Microscaling FP4)** format with the Marlin backend for inference. The MoE experts were quantized with calibration (128 samples, 2048 max sequence length).

| Component | Precision | Rationale |
|-----------|-----------|-----------|
| MLP Experts (gate_up, down) | MXFP4 (E2M1) | 64 routed experts, 4 active per token |
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive to quantization |
| Dense MLP | BF16 | The first layer uses a dense (non-MoE) MLP |
| Norms, Gates, Embeddings | BF16 | Standard practice |

### MXFP4 vs NVFP4

| Property | MXFP4 | NVFP4 |
|----------|-------|-------|
| Weight Format | E2M1 (4-bit) | E2M1 (4-bit) |
| Scale Format | E8M0 (power-of-2) | FP8 (E4M3) |
| Block Size | 32 | 16 |
| Backend | Marlin | FlashInfer/Cutlass |
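The key practical difference between the two formats is the scale encoding. Per the OCP Microscaling (MX) spec, an E8M0 scale is a single biased-exponent byte with no sign and no mantissa, so every block scale is an exact power of two. A minimal decoding sketch (illustrative only, not taken from any kernel):

```python
def decode_e8m0(scale_byte: int) -> float:
    """Decode an E8M0 block scale per the OCP Microscaling (MX) spec.

    The byte is a pure biased exponent (bias 127): value = 2**(byte - 127).
    There is no sign or mantissa bit; 0xFF encodes NaN.
    """
    if not 0 <= scale_byte <= 255:
        raise ValueError("E8M0 scale must fit in one byte")
    if scale_byte == 255:
        return float("nan")
    return 2.0 ** (scale_byte - 127)
```

NVFP4's E4M3 scales add a mantissa, giving finer-grained scale values at the cost of a much narrower exponent range.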

## Performance

| Metric | BF16 | **This Model** |
|--------|------|----------------|
| MMLU-Pro | 24.83% | **25.86%** |
| Size | 62.4 GB | **20.8 GB** |
| Compression | 1x | **3.0x** |
| Accuracy Δ | - | **+1.03%** |
| Throughput | 92.4 q/s | **138.7 q/s** |

## Usage

### Requirements

- **vLLM**: 0.14.0+ (for MXFP4 Marlin backend support)
- **transformers**: 5.0.0+ (for `glm4_moe_lite` architecture)
- **GPU**: NVIDIA GPU with compute capability 8.0+ (Ampere/Hopper/Blackwell)

### Installation

```bash
pip install "vllm>=0.14.0"  # quote the specifier so the shell doesn't treat > as redirection
pip install git+https://github.com/huggingface/transformers.git
```

### Inference with vLLM

```python
import os
os.environ["VLLM_MXFP4_USE_MARLIN"] = "1"

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-MXFP4",
    tensor_parallel_size=1,
    max_model_len=65536,  # Can go up to 202K with sufficient VRAM
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

# Note: Do NOT use repetition_penalty > 1.05; it causes degradation in long outputs
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```

### Serving with vLLM

```bash
VLLM_MXFP4_USE_MARLIN=1 vllm serve GadflyII/GLM-4.7-Flash-MXFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 65536 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90
```

### Chat Completions API

```python
import requests

payload = {
    "model": "GadflyII/GLM-4.7-Flash-MXFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    # Disable thinking mode for direct responses:
    "chat_template_kwargs": {"enable_thinking": False}
    # Or enable thinking for reasoning tasks:
    # "chat_template_kwargs": {"enable_thinking": True}
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```

## Important Usage Notes

### Sampling Parameters

| Parameter | Recommended | Avoid | Reason |
|-----------|-------------|-------|--------|
| `temperature` | 0.3-0.7 | - | Standard range |
| `top_p` | 0.9-0.95 | - | Standard range |
| `repetition_penalty` | None or ≤1.05 | >1.05 | High values cause word-salad at long outputs |
| `max_tokens` | Up to 10,000+ | - | Model handles long generation well |

### Thinking Mode

This model supports a "thinking" mode where it shows its reasoning process:

- **`enable_thinking: True`** - Model outputs its reasoning process before the answer (good for math, coding, complex reasoning)
- **`enable_thinking: False`** - Model outputs the answer directly (good for chat, simple Q&A)

The model thinks in English when given English prompts.

## Model Details

- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Architecture**: `Glm4MoeLiteForCausalLM`
- **Parameters**: 30B total, 3B active per token (30B-A3B)
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
- **Layers**: 47
- **Context Length**: 202,752 tokens (max)
- **Languages**: English, Chinese

## Quantization Details

- **Format**: MXFP4 (Microscaling FP4)
- **Weight Format**: E2M1 (4-bit floating point, range ±6.0)
- **Scale Format**: E8M0 (8-bit power-of-2 scales)
- **Block Size**: 32
- **Calibration**: 128 samples from neuralmagic/calibration dataset
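To make the format concrete, here is a hedged NumPy sketch of MXFP4 block quantization (not the actual conversion code used for this checkpoint): each block of 32 weights shares one power-of-two scale, and each weight is rounded to the nearest point on the E2M1 grid.

```python
import numpy as np

# All non-negative E2M1 values (1 sign, 2 exponent, 1 mantissa bit); range is ±6.0
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_dequant(x, block_size=32):
    """Round-trip a 1-D array through MXFP4: quantize, then dequantize.

    Each block of `block_size` values shares one E8M0 (power-of-two) scale,
    chosen so the largest magnitude in the block lands within ±6.0.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        max_abs = np.abs(block).max()
        if max_abs == 0.0:
            continue  # all-zero block: scale is irrelevant
        # Smallest power-of-two scale that keeps the block inside the E2M1 range
        scale = 2.0 ** np.ceil(np.log2(max_abs / 6.0))
        scaled = block / scale
        # Round each magnitude to the nearest E2M1 grid point, keep the sign
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
        out[start:start + block_size] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out
```

Real kernels store the packed 4-bit codes and one scale byte per block; this round-trip only illustrates the numerics.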

## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total | Throughput |
|-------|----------|---------|-------|------------|
| BF16 (baseline) | 24.83% | 2988 | 12032 | 92.4 q/s |
| **MXFP4 (this model)** | **25.86%** | **3112** | 12032 | **138.7 q/s** |
| Difference | **+1.03%** | +124 | - | **+50%** |

### MMLU-Pro by Category

| Category | BF16 | MXFP4 | Δ |
|----------|------|-------|---|
| Social Sciences | 32.70% | **34.68%** | +1.98% |
| Other | 31.57% | **32.84%** | +1.27% |
| Humanities | 23.78% | 23.78% | 0.00% |
| STEM | 19.94% | **20.86%** | +0.92% |

### MMLU-Pro by Subject (All 14 Subjects)

| Subject | BF16 | MXFP4 | Δ | Questions |
|---------|------|-------|---|-----------|
| Biology | 50.35% | **52.16%** | +1.81% | 717 |
| Psychology | 44.99% | **47.74%** | +2.75% | 798 |
| Economics | 36.37% | **38.27%** | +1.90% | 844 |
| Health | 35.21% | **36.31%** | +1.10% | 818 |
| History | **33.60%** | 32.28% | -1.32% | 381 |
| Philosophy | 31.46% | **31.86%** | +0.40% | 499 |
| Other | 28.35% | **29.76%** | +1.41% | 924 |
| Computer Science | **26.10%** | 25.85% | -0.25% | 410 |
| Business | 16.35% | **17.62%** | +1.27% | 789 |
| Law | 16.89% | **17.17%** | +0.28% | 1101 |
| Physics | 15.32% | **16.17%** | +0.85% | 1299 |
| Engineering | **16.00%** | 15.58% | -0.42% | 969 |
| Math | 14.06% | **15.54%** | +1.48% | 1351 |
| Chemistry | 14.13% | **15.46%** | +1.33% | 1132 |


## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```

## License

This model inherits the Apache 2.0 license from the base model.