---
license: apache-2.0
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
tags:
- moe
- nvfp4
- quantized
- vllm
- glm
- 30b
- mtp
- speculative-decoding
library_name: transformers
pipeline_tag: text-generation
---

# Note

If you have a multi-GPU SM120 Blackwell system (RTX 50 / Pro), try my vLLM fork to resolve P2P / TP=2 issues (PR pending upstream): https://github.com/Gadflyii/vllm/tree/main

# GLM-4.7-Flash-MTP-NVFP4 (Mixed Precision with MTP in BF16)

This is a **mixed-precision NVFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model. This version preserves the **MTP (Multi-Token Prediction) layers in BF16** for speculative-decoding compatibility.

## What's Different from GLM-4.7-Flash-NVFP4?

| Feature | GLM-4.7-Flash-NVFP4 | **This Model** |
|---------|---------------------|----------------|
| MTP Layers | NVFP4 | **BF16** |
| Calibration Samples | 128 | **512** |
| Calibration Seq Length | 2048 | **4096** |
| MMLU-Pro Accuracy | 23.56% | **23.91%** |

## Quantization Strategy

This model uses **mixed precision** to preserve accuracy and MTP functionality:

| Component | Precision | Rationale |
|-----------|-----------|-----------|
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | Only the first layer uses a dense MLP |
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive to quantization |
| **MTP Layers** | **BF16** | `eh_proj`, `shared_head.head` are needed for speculative decoding |
| Norms, Gates, Embeddings | BF16 | Standard practice |
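For readers who want to reproduce a setup like this, the recipe can be sketched with [llm-compressor](https://github.com/vllm-project/llm-compressor). This is a hypothetical reconstruction, not the exact script used to produce this checkpoint: the `ignore` regexes are assumptions inferred from the tensor names listed on this card, and the calibration-data plumbing may need adapting to your llm-compressor version.

```python
# Hypothetical sketch of the mixed-precision NVFP4 recipe described above.
# NOT the exact script used for this checkpoint; the ignore regexes are
# assumptions based on the tensor names in this card.
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zai-org/GLM-4.7-Flash"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# 512 wikitext samples at 4096 tokens, matching the calibration settings on
# this card. (The card also calibrates all 64 experts per sample; recent
# llm-compressor releases ship MoE calibration helpers for that.)
ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
ds = ds.filter(lambda ex: len(ex["text"].strip()) > 0).shuffle(seed=42).select(range(512))
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=4096),
    remove_columns=ds.column_names,
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",  # FP4 (E2M1) weights with FP8 (E4M3) scales, block size 16
    ignore=[
        "lm_head",
        "re:.*self_attn.*",   # keep MLA attention projections in BF16
        "re:.*eh_proj.*",     # keep MTP layers in BF16
        "re:.*shared_head.*",
        "re:.*mlp\\.gate$",   # router gates stay in BF16
    ],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=4096,
    num_calibration_samples=512,
)

model.save_pretrained("GLM-4.7-Flash-MTP-NVFP4", save_compressed=True)
tokenizer.save_pretrained("GLM-4.7-Flash-MTP-NVFP4")
```

The `ignore` list is what makes this mixed precision: everything it matches stays in BF16, while the remaining `Linear` layers (the expert and dense MLPs) are compressed to NVFP4.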
## Performance

| Metric | BF16 | NVFP4 | **This Model** |
|--------|------|-------|----------------|
| MMLU-Pro | 24.83% | 23.56% | **23.91%** |
| Size | 62.4 GB | 20.4 GB | **20.9 GB** |
| Compression | 1x | 3.1x | **3.0x** |
| Accuracy Loss | - | -1.27% | **-0.92%** |

### MTP Acceptance Rate

| Model | Acceptance Rate | Mean Accepted Length |
|-------|-----------------|----------------------|
| BF16 (baseline) | 60% | 1.60 |
| **This Model** | **63%** | **1.63** |

With one speculative token per step, the mean accepted length is 1 + the acceptance rate. MTP quality is preserved (in fact, slightly improved) after quantization.

### MTP Performance Note

MTP speculative decoding currently shows overhead rather than speedup, because vLLM's MTP drafter model does not yet have `torch.compile` support. For best throughput, run without MTP until this is resolved upstream.

| Configuration | Tokens/sec |
|---------------|------------|
| Without MTP | 78.1 tok/s |
| With MTP (1 token) | 64.7 tok/s |
| With MTP (2 tokens) | 56.8 tok/s |
| With MTP (4 tokens) | 44.5 tok/s |

## Usage

### Requirements

- **vLLM**: 0.8.0+ (for compressed-tensors NVFP4 support)
- **transformers**: 5.0.0+ (for the `glm4_moe_lite` architecture)
- **GPU**: NVIDIA GPU with FP4 support (FP4 tensor cores are native to Blackwell; Hopper and Ada Lovelace run via fallback kernels)

### Installation

```bash
pip install "vllm>=0.8.0"
pip install git+https://github.com/huggingface/transformers.git
```

### Inference with vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-MTP-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```

### Serving with vLLM

```bash
# Standard serving (recommended for performance)
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90

# With MTP speculative decoding (experimental)
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```
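Once the server is running, any OpenAI-compatible client can query it. A minimal sketch, assuming the default listen address `http://localhost:8000` and the official `openai` Python package:

```python
# Minimal client for the vLLM server started above. Assumes the default
# listen address; vLLM's OpenAI-compatible endpoint ignores the API key,
# so any placeholder value works.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-MTP-NVFP4",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```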
## Model Details

- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Architecture**: `Glm4MoeLiteForCausalLM`
- **Parameters**: 30B total, 3B active per token (30B-A3B)
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
- **Layers**: 47 (with 1 MTP layer)
- **Context Length**: 202,752 tokens (max)
- **Languages**: English, Chinese

## Quantization Details

- **Format**: compressed-tensors (NVFP4)
- **Block Size**: 16
- **Scale Format**: FP8 (E4M3)
- **Calibration**: 512 samples from the wikitext dataset
- **Calibration Sequence Length**: 4096
- **Full Expert Calibration**: All 64 experts calibrated per sample

### Tensors by Precision

| Precision | Count | Description |
|-----------|-------|-------------|
| NVFP4 | 9,168 | MLP/FFN weights |
| BF16 | 240 | Attention weights (MLA) |
| BF16 | 2 | MTP layers (`eh_proj`, `shared_head.head`) |

## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total |
|-------|----------|---------|-------|
| BF16 (baseline) | 24.83% | 2988 | 12032 |
| NVFP4-v1 | 23.56% | 2835 | 12032 |
| **This Model** | **23.91%** | **2877** | 12032 |

### MMLU-Pro by Category

| Category | BF16 | This Model | Difference |
|----------|------|------------|------------|
| Social Sciences | 32.70% | 31.26% | -1.44% |
| Other | 31.57% | 29.85% | -1.72% |
| Humanities | 23.78% | 22.82% | -0.96% |
| STEM | 19.94% | 19.48% | -0.46% |

### MMLU-Pro by Subject

| Subject | BF16 | This Model | Difference |
|---------|------|------------|------------|
| Biology | 50.35% | 48.12% | -2.23% |
| Psychology | 44.99% | 41.23% | -3.76% |
| History | 33.60% | 34.12% | +0.52% |
| Health | 35.21% | 34.11% | -1.10% |
| Economics | 36.37% | 33.06% | -3.31% |
| Philosophy | 31.46% | 29.26% | -2.20% |
| Other | 28.35% | 26.08% | -2.27% |
| Computer Science | 26.10% | 21.95% | -4.15% |
| Business | 16.35% | 19.26% | +2.91% |
| Law | 16.89% | 15.99% | -0.90% |
| Math | 14.06% | 14.73% | +0.67% |
| Physics | 15.32% | 15.24% | -0.08% |
| Engineering | 16.00% | 14.96% | -1.04% |
| Chemistry | 14.13% | 14.84% | +0.71% |

## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```

## License

This model inherits the Apache 2.0 license from the base model.
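## Reproducing the Evaluation

This card does not state which harness produced the MMLU-Pro numbers above, so treat the following as a sketch rather than the actual setup: it assumes [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with its `mmlu_pro` task and vLLM backend, which should give comparable (not necessarily identical) figures.

```python
# Hypothetical reproduction sketch using lm-evaluation-harness; the harness,
# task name, and backend settings are assumptions, not the card's actual setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=GadflyII/GLM-4.7-Flash-MTP-NVFP4,"
        "trust_remote_code=True,max_model_len=4096,gpu_memory_utilization=0.90"
    ),
    tasks=["mmlu_pro"],
    batch_size="auto",
)
print(results["results"])
```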