---
license: apache-2.0
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
tags:
- moe
- nvfp4
- quantized
- vllm
- glm
- 30b
library_name: transformers
pipeline_tag: text-generation
---
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50 / RTX Pro), try my vLLM fork, which resolves P2P / TP=2 issues (upstream PR pending): https://github.com/Gadflyii/vllm/tree/main

# GLM-4.7-Flash NVFP4 (Mixed Precision)

This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.

## Quantization Strategy

This model was produced with custom quantization and calibration scripts (128 samples, 2048 max sequence length, the neuralmagic/calibration dataset, all 64 experts calibrated) based on NVIDIA's approach for DeepSeek-V3. It uses **mixed precision** to preserve accuracy:

| Component | Precision | Rationale |
|-----------|-----------|-----------|
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | First transformer layer uses a dense MLP instead of routed experts |
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive |
| Norms, Gates, Embeddings | BF16 | Standard practice |
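
For reference, here is a minimal sketch of what a comparable mixed-precision recipe could look like with llm-compressor. This is not the author's actual script: the `NVFP4` preset scheme, the regex ignore patterns, and the dataset handling are assumptions that would need checking against the `Glm4MoeLiteForCausalLM` module names.

```python
# Minimal sketch of a comparable mixed-precision NVFP4 recipe (llm-compressor).
# NOT the author's actual script; the scheme name, ignore regexes, and dataset
# handling are assumptions and may need adjusting for Glm4MoeLiteForCausalLM.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",          # FP4 (E2M1) weights, FP8 (E4M3) scales, block size 16
    ignore=[
        "lm_head",
        "re:.*self_attn.*",  # keep the sensitive MLA projections in BF16
        "re:.*gate$",        # keep router gates in BF16
    ],
)

oneshot(
    model="zai-org/GLM-4.7-Flash",
    dataset="neuralmagic/calibration",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)
```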

## Performance

| Metric | BF16 | Uniform FP4 | **This Model** |
|--------|------|-------------|----------------|
| MMLU-Pro | 24.83% | 16.84% | **23.55%** |
| Size | 62.4 GB | 18.9 GB | **20.4 GB** |
| Compression | 1x | 3.3x | **3.1x** |
| Accuracy Loss | - | -8.0% | **-1.3%** |


## Usage

### Requirements

- **vLLM**: 0.14.0+ (for compressed-tensors NVFP4 support)
- **transformers**: 5.0.0+ (for `glm4_moe_lite` architecture)
- **GPU**: NVIDIA GPU with native FP4 tensor cores (Blackwell); Hopper and Ada Lovelace lack FP4 tensor cores, though vLLM can often run NVFP4 checkpoints on them via fallback kernels

### Installation

```bash
pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git
```

### Inference with vLLM

```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.85,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```

### Serving with vLLM

```bash
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code
```
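
The server exposes an OpenAI-compatible API (port 8000 by default), so it can be queried with any OpenAI client or plain `curl`:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "GadflyII/GLM-4.7-Flash-NVFP4",
        "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
        "temperature": 0.7,
        "max_tokens": 512
    }'
```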

## Model Details

- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Architecture**: `Glm4MoeLiteForCausalLM`
- **Parameters**: 30B total, 3B active per token (30B-A3B)
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
- **Layers**: 47
- **Context Length**: 202,752 tokens (max)
- **Languages**: English, Chinese
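
These values can be sanity-checked straight from the model config. A quick sketch; the exact MoE attribute names are assumptions (hence the `getattr` fallbacks), since `glm4_moe_lite` may name them differently:

```python
from transformers import AutoConfig

# Load the config and print the architecture details listed above.
cfg = AutoConfig.from_pretrained("GadflyII/GLM-4.7-Flash-NVFP4", trust_remote_code=True)
print(cfg.architectures)             # expected: ['Glm4MoeLiteForCausalLM']
print(cfg.num_hidden_layers)         # expected: 47
print(cfg.max_position_embeddings)   # expected: 202752

# MoE fields: the attribute names below are guesses modeled on DeepSeek-style configs.
print(getattr(cfg, "n_routed_experts", None))     # expected: 64
print(getattr(cfg, "num_experts_per_tok", None))  # expected: 4
print(getattr(cfg, "n_shared_experts", None))     # expected: 1
```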

## Quantization Details

- **Format**: compressed-tensors (NVFP4)
- **Block Size**: 16
- **Scale Format**: FP8 (E4M3)
- **Calibration**: 128 samples from neuralmagic/calibration dataset
- **Full Expert Calibration**: all 64 experts are activated and calibrated on every sample
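
To make the format concrete: every group of 16 weights shares one FP8 (E4M3) scale, and each weight is stored as a 4-bit E2M1 value (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 plus a sign bit). A minimal NumPy sketch of the round-trip math, for illustration only (real NVFP4 kernels also apply a per-tensor global scale, omitted here):

```python
import numpy as np

# The 16 representable E2M1 (FP4) values: signed {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([E2M1, -E2M1])

def quantize_block(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block of 16 weights to nearest E2M1 with a shared scale."""
    assert w.size == 16                        # block size used by this model
    scale = max(np.abs(w).max() / 6.0, 1e-12)  # map block max onto E2M1's max (6.0)
    idx = np.abs(w[:, None] / scale - E2M1_GRID[None, :]).argmin(axis=1)
    return E2M1_GRID[idx], scale               # scale is stored as FP8 E4M3 on disk

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

w = np.random.randn(16).astype(np.float32)
q, s = quantize_block(w)
print(np.abs(w - dequantize_block(q, s)).max())  # worst-case error in this block
```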

## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total |
|-------|----------|---------|-------|
| **BF16 (baseline)** | **24.83%** | 2988 | 12032 |
| **NVFP4 (this model)** | **23.55%** | 2834 | 12032 |
| **Difference** | **-1.28%** | -154 | - |

### MMLU-Pro by Category

| Category | BF16 | NVFP4 | Difference |
|----------|------|-------|------------|
| Social Sciences | 32.70% | 31.43% | -1.27% |
| Other | 31.57% | 30.08% | -1.49% |
| Humanities | 23.78% | 22.56% | -1.22% |
| STEM | 19.94% | 18.70% | -1.24% |

### MMLU-Pro by Subject

| Subject | BF16 | NVFP4 | Difference |
|---------|------|-------|------------|
| Biology | 50.35% | 47.42% | -2.93% |
| Psychology | 44.99% | 42.48% | -2.51% |
| Economics | 36.37% | 34.48% | -1.89% |
| Health | 35.21% | 34.84% | -0.37% |
| History | 33.60% | 30.71% | -2.89% |
| Philosophy | 31.46% | 30.06% | -1.40% |
| Other | 28.35% | 25.87% | -2.48% |
| Computer Science | 26.10% | 21.46% | -4.64% |
| Business | 16.35% | 16.98% | +0.63% |
| Law | 16.89% | 16.35% | -0.54% |
| Engineering | 16.00% | 14.04% | -1.96% |
| Physics | 15.32% | 14.70% | -0.62% |
| Math | 14.06% | 14.29% | +0.23% |
| Chemistry | 14.13% | 13.34% | -0.79% |
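
The card does not state the evaluation harness; the totals above match the full 12,032-question MMLU-Pro test set. A typical way to reproduce such numbers (a sketch assuming lm-evaluation-harness with its vLLM backend, not the author's confirmed setup):

```bash
# Hypothetical reproduction command; flags may need adjusting for your setup.
lm_eval --model vllm \
    --model_args pretrained=GadflyII/GLM-4.7-Flash-NVFP4,trust_remote_code=True,max_model_len=4096 \
    --tasks mmlu_pro \
    --batch_size auto
```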

## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```

## License

This model inherits the Apache 2.0 license from the base model.