| | --- |
| | license: apache-2.0 |
| | language: |
| | - en |
| | - zh |
| | base_model: zai-org/GLM-4.7-Flash |
| | tags: |
| | - moe |
| | - nvfp4 |
| | - quantized |
| | - vllm |
| | - glm |
| | - 30b |
| | - mtp |
| | - speculative-decoding |
| | library_name: transformers |
| | pipeline_tag: text-generation |
| | --- |
| | # Note: If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try my vLLM fork to resolve P2P / TP=2 issues (Pending PR into upstream). |
| | https://github.com/Gadflyii/vllm/tree/main |
| |
|
| | # GLM-4.7-Flash-MTP-NVFP4 (Mixed Precision with MTP in BF16) |
| |
|
| | This is a **mixed precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model. This version preserves **MTP (Multi-Token Prediction) layers in BF16** for speculative decoding compatibility. |
| |
|
| | ## What's Different from GLM-4.7-Flash-NVFP4? |
| |
|
| | | Feature | GLM-4.7-Flash-NVFP4 | **This Model** | |
| | |---------|---------------------|----------------| |
| | | MTP Layers | NVFP4 | BF16 | |
| | | Calibration Samples | 128 | **512** | |
| | | Calibration Seq Length | 2048 | **4096** | |
| | | MMLU-Pro Accuracy | 23.56% | **23.91%** | |
| |
|
| | ## Quantization Strategy |
| |
|
| | This model uses **mixed precision** to preserve accuracy and MTP functionality: |
| |
|
| | | Component | Precision | Rationale | |
| | |-----------|-----------|-----------| |
| | | MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token | |
| | | Dense MLP | FP4 (E2M1) | First layer dense MLP | |
| | | **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive | |
| | | **MTP Layers** | **BF16** | `eh_proj`, `shared_head.head` for speculative decoding | |
| | | Norms, Gates, Embeddings | BF16 | Standard practice | |
| |
|
| | ## Performance |
| |
|
| | | Metric | BF16 | NVFP4 | **This Model** | |
| | |--------|------|----------|----------------| |
| | | MMLU-Pro | 24.83% | 23.56% | **23.91%** | |
| | | Size | 62.4 GB | 20.4 GB | **20.9 GB** | |
| | | Compression | 1x | 3.1x | **3.0x** | |
| | | Accuracy Loss | - | -1.27% | **-0.92%** | |
| |
|
| | ### MTP Acceptance Rate |
| |
|
| | | Model | Acceptance Rate | Mean Accepted Length | |
| | |-------|-----------------|----------------------| |
| | | BF16 (baseline) | 60% | 1.60 | |
| | | **This Model** | **63%** | **1.63** | |
| |
|
| | MTP quality is preserved (actually slightly improved) after quantization. |
| |
|
| | ### MTP Performance Note |
| |
|
| | MTP speculative decoding currently shows overhead rather than speedup due to missing `torch.compile` support for the MTP drafter model in vLLM. For best throughput, run without MTP enabled until this is resolved upstream. |
| |
|
| | | Configuration | Tokens/sec | |
| | |---------------|------------| |
| | | Without MTP | 78.1 tok/s | |
| | | With MTP (1 token) | 64.7 tok/s | |
| | | With MTP (2 tokens) | 56.8 tok/s | |
| | | With MTP (4 tokens) | 44.5 tok/s | |
| |
|
| | ## Usage |
| |
|
| | ### Requirements |
| |
|
| | - **vLLM**: 0.8.0+ (for compressed-tensors NVFP4 support) |
| | - **transformers**: 5.0.0+ (for `glm4_moe_lite` architecture) |
| | - **GPU**: NVIDIA GPU with FP4 tensor core support (Blackwell, Hopper, Ada Lovelace) |
| |
|
| | ### Installation |
| |
|
| | ```bash |
| | pip install vllm>=0.8.0 |
| | pip install git+https://github.com/huggingface/transformers.git |
| | ``` |
| |
|
| | ### Inference with vLLM (Recommended) |
| |
|
| | ```python |
| | from vllm import LLM, SamplingParams |
| | |
| | model = LLM( |
| | "GadflyII/GLM-4.7-Flash-MTP-NVFP4", |
| | tensor_parallel_size=1, |
| | max_model_len=4096, |
| | trust_remote_code=True, |
| | gpu_memory_utilization=0.90, |
| | ) |
| | |
| | params = SamplingParams(temperature=0.7, max_tokens=512) |
| | outputs = model.generate(["Explain quantum computing in simple terms."], params) |
| | print(outputs[0].outputs[0].text) |
| | ``` |
| |
|
| | ### Serving with vLLM |
| |
|
| | ```bash |
| | # Standard serving (recommended for performance) |
| | VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \ |
| | --tensor-parallel-size 1 \ |
| | --max-model-len 4096 \ |
| | --trust-remote-code \ |
| | --gpu-memory-utilization 0.90 |
| | |
| | # With MTP speculative decoding (experimental) |
| | VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \ |
| | --tensor-parallel-size 1 \ |
| | --max-model-len 4096 \ |
| | --trust-remote-code \ |
| | --gpu-memory-utilization 0.90 \ |
| | --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' |
| | ``` |
| |
|
| | ## Model Details |
| |
|
| | - **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) |
| | - **Architecture**: `Glm4MoeLiteForCausalLM` |
| | - **Parameters**: 30B total, 3B active per token (30B-A3B) |
| | - **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert |
| | - **Layers**: 47 (with 1 MTP layer) |
| | - **Context Length**: 202,752 tokens (max) |
| | - **Languages**: English, Chinese |
| |
|
| | ## Quantization Details |
| |
|
| | - **Format**: compressed-tensors (NVFP4) |
| | - **Block Size**: 16 |
| | - **Scale Format**: FP8 (E4M3) |
| | - **Calibration**: 512 samples from wikitext dataset |
| | - **Calibration Sequence Length**: 4096 |
| | - **Full Expert Calibration**: All 64 experts calibrated per sample |
| |
|
| | ### Tensors by Precision |
| |
|
| | | Precision | Count | Description | |
| | |-----------|-------|-------------| |
| | | NVFP4 | 9,168 | MLP/FFN weights | |
| | | BF16 | 240 | Attention weights (MLA) | |
| | | BF16 | 2 | MTP layers (eh_proj, shared_head.head) | |
| |
|
| | ## Evaluation |
| |
|
| | ### MMLU-Pro Overall Results |
| |
|
| | | Model | Accuracy | Correct | Total | |
| | |-------|----------|---------|-------| |
| | | BF16 (baseline) | 24.83% | 2988 | 12032 | |
| | | NVFP4-v1 | 23.56% | 2835 | 12032 | |
| | | **This Model** | **23.91%** | **2877** | 12032 | |
| |
|
| | ### MMLU-Pro by Category |
| |
|
| | | Category | BF16 | This Model | Difference | |
| | |----------|------|------------|------------| |
| | | Social Sciences | 32.70% | 31.26% | -1.44% | |
| | | Other | 31.57% | 29.85% | -1.72% | |
| | | Humanities | 23.78% | 22.82% | -0.96% | |
| | | STEM | 19.94% | 19.48% | -0.46% | |
| |
|
| | ### MMLU-Pro by Subject |
| |
|
| | | Subject | BF16 | This Model | Difference | |
| | |---------|------|------------|------------| |
| | | Biology | 50.35% | 48.12% | -2.23% | |
| | | Psychology | 44.99% | 41.23% | -3.76% | |
| | | History | 33.60% | 34.12% | +0.52% | |
| | | Health | 35.21% | 34.11% | -1.10% | |
| | | Economics | 36.37% | 33.06% | -3.31% | |
| | | Philosophy | 31.46% | 29.26% | -2.20% | |
| | | Other | 28.35% | 26.08% | -2.27% | |
| | | Computer Science | 26.10% | 21.95% | -4.15% | |
| | | Business | 16.35% | 19.26% | +2.91% | |
| | | Law | 16.89% | 15.99% | -0.90% | |
| | | Math | 14.06% | 14.73% | +0.67% | |
| | | Physics | 15.32% | 15.24% | -0.08% | |
| | | Engineering | 16.00% | 14.96% | -1.04% | |
| | | Chemistry | 14.13% | 14.84% | +0.71% | |
| |
|
| | ## Citation |
| |
|
| | If you use this model, please cite the original GLM-4.7-Flash: |
| |
|
| | ```bibtex |
| | @misc{glm4flash2025, |
| | title={GLM-4.7-Flash}, |
| | author={Zhipu AI}, |
| | year={2025}, |
| | howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}} |
| | } |
| | ``` |
| |
|
| | ## License |
| |
|
| | This model inherits the Apache 2.0 license from the base model. |
| |
|