--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
base_model: zai-org/GLM-4.7-Flash |
|
|
tags: |
|
|
- moe |
|
|
- nvfp4 |
|
|
- quantized |
|
|
- vllm |
|
|
- glm |
|
|
- 30b |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50-series / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (pending PR into upstream):
|
|
https://github.com/Gadflyii/vllm/tree/main |
|
|
|
|
|
# GLM-4.7-Flash NVFP4 (Mixed Precision) |
|
|
|
|
|
This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model. |
|
|
|
|
|
## Quantization Strategy |
|
|
|
|
|
This model was produced with custom quantization and calibration scripts (128 samples, 2048 max sequence length, the neuralmagic/calibration dataset, all 64 experts) based on NVIDIA's approach for DeepSeek-V3. It uses **mixed precision** to preserve accuracy:
|
|
|
|
|
| Component | Precision | Rationale | |
|
|
|-----------|-----------|-----------| |
|
|
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token | |
|
|
| Dense MLP | FP4 (E2M1) | First layer dense MLP | |
|
|
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive to quantization |
|
|
| Norms, Gates, Embeddings | BF16 | Standard practice | |
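
The snippet below is a minimal sketch of how such a recipe can be expressed with `llmcompressor` (the library behind the compressed-tensors format). It is not the exact script used for this checkpoint; the `ignore` patterns and dataset handling are illustrative assumptions.

```python
# Sketch only: approximates the mixed-precision NVFP4 recipe described above.
# Module-name patterns and the scheme choice are assumptions, not the exact
# configuration used to produce this checkpoint.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",            # FP4 (E2M1) weights, block size 16, FP8 (E4M3) scales
    ignore=[
        "lm_head",
        "re:.*self_attn.*",    # keep MLA attention projections in BF16
        "re:.*gate$",          # keep router gates in BF16
    ],
)

oneshot(
    model="zai-org/GLM-4.7-Flash",
    dataset="neuralmagic/calibration",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)
```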
|
|
|
|
|
## Performance |
|
|
|
|
|
| Metric | BF16 | Uniform FP4 | **This Model** | |
|
|
|--------|------|-------------|----------------| |
|
|
| MMLU-Pro | 24.83% | 16.84% | **23.55%** | |
|
|
| Size | 62.4 GB | 18.9 GB | **20.4 GB** | |
|
|
| Compression | 1x | 3.3x | **3.1x** | |
|
|
| Accuracy Loss | - | -8.0% | **-1.3%** | |
|
|
|
|
|
|
|
|
## Usage |
|
|
|
|
|
### Requirements |
|
|
|
|
|
- **vLLM**: 0.14.0+ (for compressed-tensors NVFP4 support) |
|
|
- **transformers**: 5.0.0+ (for `glm4_moe_lite` architecture) |
|
|
- **GPU**: NVIDIA GPU with FP4 support; FP4 tensor cores are native to Blackwell, while Hopper and Ada Lovelace GPUs run the model through vLLM's weight-dequantizing fallback kernels (see the environment check below)
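
A quick way to verify the environment against the requirements above (version thresholds follow the list; compute capability 10.x/12.x indicates Blackwell with native FP4):

```python
# Quick environment check for the requirements listed above.
from importlib.metadata import version
import torch

print("vllm:", version("vllm"))                  # want >= 0.14.0
print("transformers:", version("transformers"))  # want >= 5.0.0
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")    # 10.x/12.x = Blackwell (native FP4)
```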
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
pip install "vllm>=0.14.0"
# transformers 5.0.0 (glm4_moe_lite) may not be on PyPI yet; install from source:
pip install git+https://github.com/huggingface/transformers.git
```
|
|
|
|
|
### Inference with vLLM |
|
|
|
|
|
```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.85,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```
|
|
|
|
|
### Serving with vLLM |
|
|
|
|
|
```bash
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code
```
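
The server exposes an OpenAI-compatible API. A minimal client example, assuming the default host and port (`localhost:8000`):

```python
# Minimal client for the OpenAI-compatible endpoint started above.
# Assumes the default host/port; any api_key string works for a local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-NVFP4",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```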
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) |
|
|
- **Architecture**: `Glm4MoeLiteForCausalLM` |
|
|
- **Parameters**: 30B total, 3B active per token (30B-A3B) |
|
|
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert |
|
|
- **Layers**: 47 |
|
|
- **Context Length**: 202,752 tokens (max) |
|
|
- **Languages**: English, Chinese |
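
These details can be sanity-checked against the checkpoint's config (field names beyond `architectures` are standard `transformers` config attributes):

```python
# Verify architecture and size-related fields from the published config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("GadflyII/GLM-4.7-Flash-NVFP4", trust_remote_code=True)
print(cfg.architectures)            # ["Glm4MoeLiteForCausalLM"]
print(cfg.num_hidden_layers)        # 47
print(cfg.max_position_embeddings)  # context length
```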
|
|
|
|
|
## Quantization Details |
|
|
|
|
|
- **Format**: compressed-tensors (NVFP4) |
|
|
- **Block Size**: 16 |
|
|
- **Scale Format**: FP8 (E4M3) |
|
|
- **Calibration**: 128 samples from neuralmagic/calibration dataset |
|
|
- **Full Expert Calibration**: All 64 experts calibrated per sample |
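
With a block size of 16, each FP4 tensor stores a 4-bit value per weight plus one 8-bit FP8 scale per 16-weight block, i.e. about 4 + 8/16 = 4.5 bits per weight. The back-of-the-envelope estimate below shows how that lines up with the reported checkpoint size; the BF16 fraction is an assumed figure, not a measured one:

```python
# Back-of-the-envelope size estimate for the mixed-precision checkpoint.
# The BF16 share (attention, norms, gates, embeddings) is an assumed estimate.
TOTAL_PARAMS = 30e9
BF16_SHARE = 0.10                      # assumed fraction kept in BF16

fp4_bits = 4 + 8 / 16                  # 4-bit weights + one FP8 scale per 16-block
fp4_bytes = TOTAL_PARAMS * (1 - BF16_SHARE) * fp4_bits / 8
bf16_bytes = TOTAL_PARAMS * BF16_SHARE * 2

print(f"~{(fp4_bytes + bf16_bytes) / 1e9:.1f} GB")  # ~21.2 GB, near the reported 20.4 GB
```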
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### MMLU-Pro Overall Results |
|
|
|
|
|
| Model | Accuracy | Correct | Total | |
|
|
|-------|----------|---------|-------| |
|
|
| **BF16 (baseline)** | **24.83%** | 2988 | 12032 | |
|
|
| **NVFP4 (this model)** | **23.55%** | 2834 | 12032 | |
|
|
| **Difference** | **-1.28%** | -154 | - | |
|
|
|
|
|
### MMLU-Pro by Category |
|
|
|
|
|
| Category | BF16 | NVFP4 | Difference | |
|
|
|----------|------|-------|------------| |
|
|
| Social Sciences | 32.70% | 31.43% | -1.27% | |
|
|
| Other | 31.57% | 30.08% | -1.49% | |
|
|
| Humanities | 23.78% | 22.56% | -1.22% | |
|
|
| STEM | 19.94% | 18.70% | -1.24% | |
|
|
|
|
|
### MMLU-Pro by Subject |
|
|
|
|
|
| Subject | BF16 | NVFP4 | Difference | |
|
|
|---------|------|-------|------------| |
|
|
| Biology | 50.35% | 47.42% | -2.93% | |
|
|
| Psychology | 44.99% | 42.48% | -2.51% | |
|
|
| Economics | 36.37% | 34.48% | -1.89% | |
|
|
| Health | 35.21% | 34.84% | -0.37% | |
|
|
| History | 33.60% | 30.71% | -2.89% | |
|
|
| Philosophy | 31.46% | 30.06% | -1.40% | |
|
|
| Other | 28.35% | 25.87% | -2.48% | |
|
|
| Computer Science | 26.10% | 21.46% | -4.64% | |
|
|
| Business | 16.35% | 16.98% | +0.63% | |
|
|
| Law | 16.89% | 16.35% | -0.54% | |
|
|
| Engineering | 16.00% | 14.04% | -1.96% | |
|
|
| Physics | 15.32% | 14.70% | -0.62% | |
|
|
| Math | 14.06% | 14.29% | +0.23% | |
|
|
| Chemistry | 14.13% | 13.34% | -0.79% | |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite the original GLM-4.7-Flash: |
|
|
|
|
|
```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
This model inherits the Apache 2.0 license from the base model. |
|
|
|