Note: If you have a multi-GPU SM120 Blackwell system (RTX 50-series / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (upstream PR pending).

https://github.com/Gadflyii/vllm/tree/main

GLM-4.7-Flash MXFP4

This is an MXFP4 quantization of zai-org/GLM-4.7-Flash, a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.

Quantization Strategy

This model uses the MXFP4 (Microscaling FP4) format with the Marlin backend for inference. Custom calibrated quantization (128 samples, 2048 max sequence length) was applied to the MoE experts.

| Component | Precision | Rationale |
|---|---|---|
| MoE expert MLPs (gate_up, down) | MXFP4 (E2M1) | 64 routed experts, 4 active per token |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive to quantization |
| Dense MLP | BF16 | The first layer uses a dense MLP, not experts |
| Norms, gates, embeddings | BF16 | Standard practice |
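
To verify which modules are quantized and which stay in BF16, you can inspect the checkpoint's quantization metadata. This is a quick check, assuming the repo ships a standard quantization_config in its config.json:

from transformers import AutoConfig

# Loads only config.json; no weights are downloaded
cfg = AutoConfig.from_pretrained("GadflyII/GLM-4.7-Flash-MXFP4", trust_remote_code=True)

# Lists the quant format and the modules excluded from quantization (kept in BF16)
print(cfg.quantization_config)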

MXFP4 vs NVFP4

| Property | MXFP4 | NVFP4 |
|---|---|---|
| Weight format | E2M1 (4-bit) | E2M1 (4-bit) |
| Scale format | E8M0 (power-of-2) | FP8 (E4M3) |
| Block size | 32 | 16 |
| Backend | Marlin | FlashInfer/CUTLASS |
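
To make the format concrete, here is a minimal NumPy sketch of MXFP4 block quantization: each block of 32 weights shares one power-of-2 (E8M0) scale, and each weight is rounded to the nearest value on the E2M1 grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). This is an illustrative model of the scheme, not the Marlin kernel's exact rounding behavior:

import numpy as np

# E2M1 representable magnitudes (sign is handled separately)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(w):
    """Quantize one block of 32 weights: shared E8M0 scale + E2M1 values."""
    assert w.size == 32
    # E8M0 scale: power of two chosen so the largest |w| maps into [3, 6]
    amax = np.abs(w).max()
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
    # Round each scaled magnitude to the nearest point on the E2M1 grid
    scaled = w / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q * scale  # dequantized values

block = np.random.randn(32).astype(np.float32)
print("max abs error:", np.abs(block - mxfp4_quantize_block(block)).max())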

Performance

| Metric | BF16 | This Model |
|---|---|---|
| MMLU-Pro | 24.83% | 25.86% |
| Size | 62.4 GB | 20.8 GB |
| Compression | 1x | 3.0x |
| Accuracy Δ | - | +1.03 pp |
| Throughput | 92.4 q/s | 138.7 q/s |

Usage

Requirements

  • vLLM: 0.14.0+ (for MXFP4 Marlin backend support)
  • transformers: 5.0.0+ (for glm4_moe_lite architecture)
  • GPU: NVIDIA GPU with compute capability 8.0+ (Ampere/Hopper/Blackwell)

Installation

pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git
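
A quick sanity check that both packages meet the requirements above:

import vllm, transformers

print("vLLM:", vllm.__version__)
print("transformers:", transformers.__version__)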

Inference with vLLM

import os
os.environ["VLLM_MXFP4_USE_MARLIN"] = "1"

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-MXFP4",
    tensor_parallel_size=1,
    max_model_len=65536,  # Can go up to 202K with sufficient VRAM
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

# Note: Do NOT use repetition_penalty > 1.05, it causes degradation at long outputs
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)

Serving with vLLM

VLLM_MXFP4_USE_MARLIN=1 vllm serve GadflyII/GLM-4.7-Flash-MXFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 65536 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90
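
Once the server is up, you can confirm it is exposing the model via the standard OpenAI-compatible models endpoint:

import requests

# Lists the models the vLLM server is currently serving
print(requests.get("http://localhost:8000/v1/models").json())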

Chat Completions API

import requests

payload = {
    "model": "GadflyII/GLM-4.7-Flash-MXFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    # Disable thinking mode for direct responses:
    "chat_template_kwargs": {"enable_thinking": False}
    # Or enable thinking for reasoning tasks:
    # "chat_template_kwargs": {"enable_thinking": True}
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
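
The same request can be made with the official openai Python client, since vLLM's server is OpenAI-compatible; chat_template_kwargs is passed through extra_body. The api_key value below is a placeholder, as vLLM does not require one by default:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-MXFP4",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
    temperature=0.7,
    # vLLM forwards extra_body fields such as chat_template_kwargs
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)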

Important Usage Notes

Sampling Parameters

| Parameter | Recommended | Avoid | Reason |
|---|---|---|---|
| temperature | 0.3-0.7 | - | Standard range |
| top_p | 0.9-0.95 | - | Standard range |
| repetition_penalty | None or ≤1.05 | >1.05 | High values cause word salad in long outputs |
| max_tokens | Up to 10,000+ | - | Model handles long generation well |
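
Putting those recommendations together (example values picked from the table above; repetition_penalty is deliberately left at its default of 1.0):

from vllm import SamplingParams

# Recommended settings; do not raise repetition_penalty above 1.05
params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)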

Thinking Mode

This model supports a "thinking" mode where it shows its reasoning process:

  • enable_thinking: True - Model outputs its reasoning process before the answer (good for math, coding, complex reasoning)
  • enable_thinking: False - Model outputs the answer directly (good for chat, simple Q&A)

The model thinks in English when given English prompts.
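
If you build prompts yourself with transformers, the same switch can be passed to the chat template. This is a sketch; it assumes the repo's template accepts an enable_thinking kwarg, as the server examples above suggest:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("GadflyII/GLM-4.7-Flash-MXFP4", trust_remote_code=True)

# Extra kwargs are forwarded to the chat template; enable_thinking toggles the reasoning trace
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(prompt)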

Model Details

  • Base Model: zai-org/GLM-4.7-Flash
  • Architecture: Glm4MoeLiteForCausalLM
  • Parameters: 30B total, 3B active per token (30B-A3B)
  • MoE Configuration: 64 routed experts, 4 active, 1 shared expert
  • Layers: 47
  • Context Length: 202,752 tokens (max)
  • Languages: English, Chinese

Quantization Details

  • Format: MXFP4 (Microscaling FP4)
  • Weight Format: E2M1 (4-bit floating point, range ±6.0)
  • Scale Format: E8M0 (8-bit power-of-2 scales)
  • Block Size: 32
  • Calibration: 128 samples from neuralmagic/calibration dataset
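
As a back-of-envelope check on the 20.8 GB figure: experts cost 4 bits per weight plus one 8-bit E8M0 scale per 32-weight block, and everything else stays in BF16. The ~28B/31B expert-parameter split below is an assumption inferred from the reported sizes, not a published number:

total_params = 31.2e9    # ~62.4 GB at 2 bytes/param (BF16 baseline)
expert_params = 28.3e9   # assumed share of params in routed experts

bits_per_expert_weight = 4 + 8 / 32  # 4-bit E2M1 + one E8M0 scale per 32 weights
size_gb = (expert_params * bits_per_expert_weight / 8
           + (total_params - expert_params) * 2) / 1e9
print(f"{size_gb:.1f} GB")  # ~20.8 GB, matching the reported size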

Evaluation

MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total | Throughput |
|---|---|---|---|---|
| BF16 (baseline) | 24.83% | 2988 | 12032 | 92.4 q/s |
| MXFP4 (this model) | 25.86% | 3112 | 12032 | 138.7 q/s |
| Difference | +1.03 pp | +124 | - | +50% |

MMLU-Pro by Category

| Category | BF16 | MXFP4 | Δ |
|---|---|---|---|
| Social Sciences | 32.70% | 34.68% | +1.98 pp |
| Other | 31.57% | 32.84% | +1.27 pp |
| Humanities | 23.78% | 23.78% | 0.00 pp |
| STEM | 19.94% | 20.86% | +0.92 pp |

MMLU-Pro by Subject (All 14 Subjects)

| Subject | BF16 | MXFP4 | Δ | Questions |
|---|---|---|---|---|
| Biology | 50.35% | 52.16% | +1.81 pp | 717 |
| Psychology | 44.99% | 47.74% | +2.75 pp | 798 |
| Economics | 36.37% | 38.27% | +1.90 pp | 844 |
| Health | 35.21% | 36.31% | +1.10 pp | 818 |
| History | 33.60% | 32.28% | -1.32 pp | 381 |
| Philosophy | 31.46% | 31.86% | +0.40 pp | 499 |
| Other | 28.35% | 29.76% | +1.41 pp | 924 |
| Computer Science | 26.10% | 25.85% | -0.25 pp | 410 |
| Business | 16.35% | 17.62% | +1.27 pp | 789 |
| Law | 16.89% | 17.17% | +0.28 pp | 1101 |
| Physics | 15.32% | 16.17% | +0.85 pp | 1299 |
| Engineering | 16.00% | 15.58% | -0.42 pp | 969 |
| Math | 14.06% | 15.54% | +1.48 pp | 1351 |
| Chemistry | 14.13% | 15.46% | +1.33 pp | 1132 |

Citation

If you use this model, please cite the original GLM-4.7-Flash:

@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}

License

This model inherits the Apache 2.0 license from the base model.
