---
license: apache-2.0
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
tags:
- moe
- mxfp4
- quantized
- vllm
- glm
- 30b
library_name: transformers
pipeline_tag: text-generation
---
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50 / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (PR into upstream pending).
# Note: If you are running this MXFP4 model on SM120 GPUs, you will also need my fork until the PR is merged upstream; note that the MXFP4 path is significantly slower than NVFP4 on SM120.

https://github.com/Gadflyii/vllm/tree/main

# GLM-4.7-Flash MXFP4

This is an **MXFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total parameters, 3B active) Mixture-of-Experts model.

## Quantization Strategy

This model uses the **MXFP4 (Microscaling FP4)** format with the Marlin backend for inference. The MoE experts were quantized with calibration (128 samples, 2048 max sequence length).

| Component | Precision | Rationale |
|-----------|-----------|-----------|
| MLP Experts (gate_up, down) | MXFP4 (E2M1) | 64 routed experts, 4 active per token |
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive to quantization |
| Dense MLP | BF16 | The first layer uses a dense (non-MoE) MLP |
| Norms, Gates, Embeddings | BF16 | Standard practice |

### MXFP4 vs NVFP4

| Property | MXFP4 | NVFP4 |
|----------|-------|-------|
| Weight Format | E2M1 (4-bit) | E2M1 (4-bit) |
| Scale Format | E8M0 (power-of-2) | FP8 (E4M3) |
| Block Size | 32 | 16 |
| Backend | Marlin | FlashInfer/Cutlass |
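The key practical difference between the two formats is the scale encoding. Per the OCP Microscaling (MX) spec, an E8M0 scale is a single biased-exponent byte with no sign and no mantissa, so every block scale is an exact power of two. A minimal decoding sketch (illustrative only, not taken from any kernel):

```python
def decode_e8m0(scale_byte: int) -> float:
    """Decode an E8M0 block scale per the OCP Microscaling (MX) spec.

    The byte is a pure biased exponent (bias 127): value = 2**(byte - 127).
    There is no sign or mantissa bit; 0xFF encodes NaN.
    """
    if not 0 <= scale_byte <= 255:
        raise ValueError("E8M0 scale must fit in one byte")
    if scale_byte == 255:
        return float("nan")
    return 2.0 ** (scale_byte - 127)
```

NVFP4's E4M3 scales add a mantissa, giving finer-grained scale values at the cost of a much narrower exponent range.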

## Performance

| Metric | BF16 | **This Model** |
|--------|------|----------------|
| MMLU-Pro | 24.83% | **25.86%** |
| Size | 62.4 GB | **20.8 GB** |
| Compression | 1x | **3.0x** |
| Accuracy Δ | - | **+1.03%** |
| Throughput | 92.4 q/s | **138.7 q/s** |

## Usage

### Requirements

- **vLLM**: 0.14.0+ (for MXFP4 Marlin backend support)
- **transformers**: 5.0.0+ (for `glm4_moe_lite` architecture)
- **GPU**: NVIDIA GPU with compute capability 8.0+ (Ampere/Hopper/Blackwell)

### Installation

```bash
pip install "vllm>=0.14.0"  # quote the specifier so the shell doesn't treat > as redirection
pip install git+https://github.com/huggingface/transformers.git
```

### Inference with vLLM

```python
import os
os.environ["VLLM_MXFP4_USE_MARLIN"] = "1"

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-MXFP4",
    tensor_parallel_size=1,
    max_model_len=65536,  # Can go up to 202K with sufficient VRAM
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

# Note: Do NOT use repetition_penalty > 1.05; it causes degradation in long outputs
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```

### Serving with vLLM

```bash
VLLM_MXFP4_USE_MARLIN=1 vllm serve GadflyII/GLM-4.7-Flash-MXFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 65536 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90
```

### Chat Completions API

```python
import requests

payload = {
    "model": "GadflyII/GLM-4.7-Flash-MXFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    # Disable thinking mode for direct responses:
    "chat_template_kwargs": {"enable_thinking": False}
    # Or enable thinking for reasoning tasks:
    # "chat_template_kwargs": {"enable_thinking": True}
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```

## Important Usage Notes

### Sampling Parameters

| Parameter | Recommended | Avoid | Reason |
|-----------|-------------|-------|--------|
| `temperature` | 0.3-0.7 | - | Standard range |
| `top_p` | 0.9-0.95 | - | Standard range |
| `repetition_penalty` | None or ≤1.05 | >1.05 | High values cause word-salad at long outputs |
| `max_tokens` | Up to 10,000+ | - | Model handles long generation well |

### Thinking Mode

This model supports a "thinking" mode where it shows its reasoning process:

- **`enable_thinking: True`** - Model outputs its reasoning process before the answer (good for math, coding, complex reasoning)
- **`enable_thinking: False`** - Model outputs the answer directly (good for chat, simple Q&A)

The model thinks in English when given English prompts.

## Model Details

- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Architecture**: `Glm4MoeLiteForCausalLM`
- **Parameters**: 30B total, 3B active per token (30B-A3B)
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
- **Layers**: 47
- **Context Length**: 202,752 tokens (max)
- **Languages**: English, Chinese

## Quantization Details

- **Format**: MXFP4 (Microscaling FP4)
- **Weight Format**: E2M1 (4-bit floating point, range ±6.0)
- **Scale Format**: E8M0 (8-bit power-of-2 scales)
- **Block Size**: 32
- **Calibration**: 128 samples from neuralmagic/calibration dataset
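To make the format concrete, here is a hedged NumPy sketch of MXFP4 block quantization (not the actual conversion code used for this checkpoint): each block of 32 weights shares one power-of-two scale, and each weight is rounded to the nearest point on the E2M1 grid.

```python
import numpy as np

# All non-negative E2M1 values (1 sign, 2 exponent, 1 mantissa bit); range is ±6.0
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_dequant(x, block_size=32):
    """Round-trip a 1-D array through MXFP4: quantize, then dequantize.

    Each block of `block_size` values shares one E8M0 (power-of-two) scale,
    chosen so the largest magnitude in the block lands within ±6.0.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        max_abs = np.abs(block).max()
        if max_abs == 0.0:
            continue  # all-zero block: scale is irrelevant
        # Smallest power-of-two scale that keeps the block inside the E2M1 range
        scale = 2.0 ** np.ceil(np.log2(max_abs / 6.0))
        scaled = block / scale
        # Round each magnitude to the nearest E2M1 grid point, keep the sign
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
        out[start:start + block_size] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out
```

Real kernels store the packed 4-bit codes and one scale byte per block; this round-trip only illustrates the numerics.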

## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total | Throughput |
|-------|----------|---------|-------|------------|
| BF16 (baseline) | 24.83% | 2988 | 12032 | 92.4 q/s |
| **MXFP4 (this model)** | **25.86%** | **3112** | 12032 | **138.7 q/s** |
| Difference | **+1.03%** | +124 | - | **+50%** |

### MMLU-Pro by Category

| Category | BF16 | MXFP4 | Δ |
|----------|------|-------|---|
| Social Sciences | 32.70% | **34.68%** | +1.98% |
| Other | 31.57% | **32.84%** | +1.27% |
| Humanities | 23.78% | 23.78% | 0.00% |
| STEM | 19.94% | **20.86%** | +0.92% |

### MMLU-Pro by Subject (All 14 Subjects)

| Subject | BF16 | MXFP4 | Δ | Questions |
|---------|------|-------|---|-----------|
| Biology | 50.35% | **52.16%** | +1.81% | 717 |
| Psychology | 44.99% | **47.74%** | +2.75% | 798 |
| Economics | 36.37% | **38.27%** | +1.90% | 844 |
| Health | 35.21% | **36.31%** | +1.10% | 818 |
| History | **33.60%** | 32.28% | -1.32% | 381 |
| Philosophy | 31.46% | **31.86%** | +0.40% | 499 |
| Other | 28.35% | **29.76%** | +1.41% | 924 |
| Computer Science | **26.10%** | 25.85% | -0.25% | 410 |
| Business | 16.35% | **17.62%** | +1.27% | 789 |
| Law | 16.89% | **17.17%** | +0.28% | 1101 |
| Physics | 15.32% | **16.17%** | +0.85% | 1299 |
| Engineering | **16.00%** | 15.58% | -0.42% | 969 |
| Math | 14.06% | **15.54%** | +1.48% | 1351 |
| Chemistry | 14.13% | **15.46%** | +1.33% | 1132 |


## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```

## License

This model inherits the Apache 2.0 license from the base model.