---
license: apache-2.0
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
tags:
- moe
- mxfp4
- quantized
- vllm
- glm
- 30b
library_name: transformers
pipeline_tag: text-generation
---
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try my vLLM fork to resolve P2P / TP=2 issues (PR into upstream pending).
# Note: If you are running this MXFP4 model on SM120 GPUs, you will also need my fork until the PR is merged upstream; be aware that MXFP4 is significantly slower than NVFP4 on these GPUs.

https://github.com/Gadflyii/vllm/tree/main

# GLM-4.7-Flash MXFP4

This is an **MXFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.

## Quantization Strategy

This model uses the **MXFP4 (Microscaling FP4)** format with the Marlin backend for inference. Custom calibrated quantization (128 samples, max sequence length 2048) was applied to the MoE experts.

| Component | Precision | Rationale |
|-----------|-----------|-----------|
| MLP Experts (gate_up, down) | MXFP4 (E2M1) | 64 routed experts, 4 active per token |
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are sensitive to quantization |
| Dense MLP | BF16 | The first layer uses a dense MLP |
| Norms, Gates, Embeddings | BF16 | Standard practice |

### MXFP4 vs NVFP4

| Property | MXFP4 | NVFP4 |
|----------|-------|-------|
| Weight Format | E2M1 (4-bit) | E2M1 (4-bit) |
| Scale Format | E8M0 (power-of-2) | FP8 (E4M3) |
| Block Size | 32 | 16 |
| Backend | Marlin | FlashInfer/Cutlass |
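
The block sizes above imply slightly different storage overheads, since each block carries one 8-bit scale alongside its 4-bit weight codes. A back-of-envelope sketch (NVFP4 additionally carries a per-tensor scale, omitted here):

```python
# Effective storage cost per weight: the 4-bit code plus one 8-bit
# block scale amortized across the block (illustrative arithmetic only).
def bits_per_weight(code_bits: int, scale_bits: int, block_size: int) -> float:
    return code_bits + scale_bits / block_size

mxfp4 = bits_per_weight(4, 8, 32)  # E8M0 scale shared by 32 weights
nvfp4 = bits_per_weight(4, 8, 16)  # FP8 (E4M3) scale shared by 16 weights

print(f"MXFP4: {mxfp4} bits/weight")  # 4.25
print(f"NVFP4: {nvfp4} bits/weight")  # 4.5
```

The larger MXFP4 block halves the scale overhead relative to NVFP4, at the cost of a coarser (power-of-2) scale granularity.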
## Performance

| Metric | BF16 | **This Model** |
|--------|------|----------------|
| MMLU-Pro | 24.83% | **25.86%** |
| Size | 62.4 GB | **20.8 GB** |
| Compression | 1x | **3.0x** |
| Accuracy Δ | - | **+1.03%** |
| Throughput | 92.4 q/s | **138.7 q/s** |

## Usage

### Requirements

- **vLLM**: 0.14.0+ (for MXFP4 Marlin backend support)
- **transformers**: 5.0.0+ (for the `glm4_moe_lite` architecture)
- **GPU**: NVIDIA GPU with compute capability 8.0+ (Ampere/Hopper/Blackwell)

### Installation

```bash
pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git
```

### Inference with vLLM

```python
import os
os.environ["VLLM_MXFP4_USE_MARLIN"] = "1"

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-MXFP4",
    tensor_parallel_size=1,
    max_model_len=65536,  # can go up to 202K with sufficient VRAM
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

# Note: do NOT use repetition_penalty > 1.05; it causes degradation at long outputs
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```

### Serving with vLLM

```bash
VLLM_MXFP4_USE_MARLIN=1 vllm serve GadflyII/GLM-4.7-Flash-MXFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 65536 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90
```

### Chat Completions API

```python
import requests

payload = {
    "model": "GadflyII/GLM-4.7-Flash-MXFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    # Disable thinking mode for direct responses:
    "chat_template_kwargs": {"enable_thinking": False},
    # Or enable thinking for reasoning tasks:
    # "chat_template_kwargs": {"enable_thinking": True},
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```

## Important Usage Notes

### Sampling Parameters

| Parameter | Recommended | Avoid | Reason |
|-----------|-------------|-------|--------|
| `temperature` | 0.3-0.7 | - | Standard range |
| `top_p` | 0.9-0.95 | - | Standard range |
| `repetition_penalty` | None or ≤1.05 | >1.05 | Higher values cause incoherent output at long generation lengths |
| `max_tokens` | Up to 10,000+ | - | Model handles long generation well |
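
As an illustration, the recommendations above could be checked with a small helper before building `SamplingParams` (this function is hypothetical, not part of vLLM):

```python
# Hypothetical helper: flag sampling settings that fall outside
# the recommended ranges from the table above.
def check_sampling(temperature, top_p, repetition_penalty=None):
    warnings = []
    if not 0.3 <= temperature <= 0.7:
        warnings.append("temperature outside recommended 0.3-0.7")
    if not 0.9 <= top_p <= 0.95:
        warnings.append("top_p outside recommended 0.9-0.95")
    if repetition_penalty is not None and repetition_penalty > 1.05:
        warnings.append("repetition_penalty > 1.05 can degrade long outputs")
    return warnings

print(check_sampling(0.7, 0.9))       # [] -> settings are fine
print(check_sampling(1.0, 0.9, 1.2))  # two warnings
```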
### Thinking Mode

This model supports a "thinking" mode in which it shows its reasoning process:

- **`enable_thinking: True`** - the model outputs its reasoning before the answer (good for math, coding, complex reasoning)
- **`enable_thinking: False`** - the model outputs the answer directly (good for chat, simple Q&A)

The model thinks in English when given English prompts.

## Model Details

- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Architecture**: `Glm4MoeLiteForCausalLM`
- **Parameters**: 30B total, 3B active per token (30B-A3B)
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
- **Layers**: 47
- **Context Length**: 202,752 tokens (max)
- **Languages**: English, Chinese

## Quantization Details

- **Format**: MXFP4 (Microscaling FP4)
- **Weight Format**: E2M1 (4-bit floating point, range ±6.0)
- **Scale Format**: E8M0 (8-bit power-of-2 scales)
- **Block Size**: 32
- **Calibration**: 128 samples from the neuralmagic/calibration dataset
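
The scheme above can be illustrated with a toy NumPy roundtrip for a single 32-value block. This is a simplified sketch (nearest-value rounding on decoded floats), not the packed layout or rounding mode used by the actual Marlin kernels:

```python
import numpy as np

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(block):
    """Quantize one block to an E8M0 (power-of-2) scale plus E2M1 values."""
    amax = float(np.abs(block).max())
    # Pick the power-of-2 scale that fits the largest magnitude under 6.0
    exp = int(np.ceil(np.log2(amax / 6.0))) if amax > 0 else 0
    scale = 2.0 ** exp
    scaled = block / scale
    # Round each magnitude to the nearest representable E2M1 value
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]), axis=1)
    codes = np.sign(scaled) * E2M1_GRID[idx]
    return codes, scale

def dequantize_block_mxfp4(codes, scale):
    return codes * scale

block = np.linspace(-1.0, 1.0, 32)  # toy weights for one 32-value block
codes, scale = quantize_block_mxfp4(block)
recon = dequantize_block_mxfp4(codes, scale)
print(scale)                                   # power-of-2 block scale
print(float(np.max(np.abs(recon - block))))    # worst-case rounding error
```

Because the scale is constrained to a power of two, it needs only the 8-bit E8M0 exponent, which is what keeps MXFP4's per-block overhead at a single byte.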
## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total | Throughput |
|-------|----------|---------|-------|------------|
| BF16 (baseline) | 24.83% | 2988 | 12032 | 92.4 q/s |
| **MXFP4 (this model)** | **25.86%** | **3112** | 12032 | **138.7 q/s** |
| Difference | **+1.03%** | +124 | - | **+50%** |

### MMLU-Pro by Category

| Category | BF16 | MXFP4 | Δ |
|----------|------|-------|---|
| Social Sciences | 32.70% | **34.68%** | +1.98% |
| Other | 31.57% | **32.84%** | +1.27% |
| Humanities | 23.78% | 23.78% | 0.00% |
| STEM | 19.94% | **20.86%** | +0.92% |

### MMLU-Pro by Subject (All 14 Subjects)

| Subject | BF16 | MXFP4 | Δ | Questions |
|---------|------|-------|---|-----------|
| Biology | 50.35% | **52.16%** | +1.81% | 717 |
| Psychology | 44.99% | **47.74%** | +2.75% | 798 |
| Economics | 36.37% | **38.27%** | +1.90% | 844 |
| Health | 35.21% | **36.31%** | +1.10% | 818 |
| History | **33.60%** | 32.28% | -1.32% | 381 |
| Philosophy | 31.46% | **31.86%** | +0.40% | 499 |
| Other | 28.35% | **29.76%** | +1.41% | 924 |
| Computer Science | **26.10%** | 25.85% | -0.25% | 410 |
| Business | 16.35% | **17.62%** | +1.27% | 789 |
| Law | 16.89% | **17.17%** | +0.28% | 1101 |
| Physics | 15.32% | **16.17%** | +0.85% | 1299 |
| Engineering | **16.00%** | 15.58% | -0.42% | 969 |
| Math | 14.06% | **15.54%** | +1.48% | 1351 |
| Chemistry | 14.13% | **15.46%** | +1.33% | 1132 |

## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```

## License

This model inherits the Apache 2.0 license from the base model.