---
license: apache-2.0
language:
- en
- zh
base_model: zai-org/GLM-4.7-Flash
tags:
- moe
- mxfp4
- quantized
- vllm
- glm
- 30b
library_name: transformers
pipeline_tag: text-generation
---

> **Note:** If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try my vLLM fork to resolve P2P / TP=2 issues (pending PR into upstream).
>
> **Note:** If you are running this MXFP4 model on SM120 GPUs, you will also need to use my fork until the PR into upstream is merged; note that MXFP4 is significantly slower than NVFP4 there.
>
> https://github.com/Gadflyii/vllm/tree/main

# GLM-4.7-Flash MXFP4

This is an **MXFP4 quantization** of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.

## Quantization Strategy

This model uses the **MXFP4 (Microscaling FP4)** format with the Marlin backend for inference. Custom quantization with calibration (128 samples, 2048 max sequence length) is applied to the MoE experts.

| Component | Precision | Rationale |
|-----------|-----------|-----------|
| MoE experts (gate_up, down) | MXFP4 (E2M1) | 64 routed experts, 4 active per token |
| **Attention (MLA)** | **BF16** | Low-rank compressed Q/KV projections are quantization-sensitive |
| Dense MLP | BF16 | First layer uses a dense MLP |
| Norms, gates, embeddings | BF16 | Standard practice |

### MXFP4 vs NVFP4

| Property | MXFP4 | NVFP4 |
|----------|-------|-------|
| Weight format | E2M1 (4-bit) | E2M1 (4-bit) |
| Scale format | E8M0 (power-of-2) | FP8 (E4M3) |
| Block size | 32 | 16 |
| Backend | Marlin | FlashInfer/Cutlass |

## Performance

| Metric | BF16 | **This Model** |
|--------|------|----------------|
| MMLU-Pro | 24.83% | **25.86%** |
| Size | 62.4 GB | **20.8 GB** |
| Compression | 1x | **3.0x** |
| Accuracy Δ | - | **+1.03%** |
| Throughput | 92.4 q/s | **138.7 q/s** |

## Usage

### Requirements

- **vLLM**: 0.14.0+ (for MXFP4 Marlin backend support)
- **transformers**: 5.0.0+ (for the `glm4_moe_lite` architecture)
- **GPU**: NVIDIA GPU with compute capability 8.0+
  (Ampere/Hopper/Blackwell)

### Installation

```bash
pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git
```

### Inference with vLLM

```python
import os

# Select the Marlin MXFP4 backend before vLLM is imported
os.environ["VLLM_MXFP4_USE_MARLIN"] = "1"

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-MXFP4",
    tensor_parallel_size=1,
    max_model_len=65536,  # can go up to 202K with sufficient VRAM
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

# Note: do NOT use repetition_penalty > 1.05; it causes degradation at long outputs
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```

### Serving with vLLM

```bash
VLLM_MXFP4_USE_MARLIN=1 vllm serve GadflyII/GLM-4.7-Flash-MXFP4 \
  --tensor-parallel-size 1 \
  --max-model-len 65536 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90
```

### Chat Completions API

```python
import requests

payload = {
    "model": "GadflyII/GLM-4.7-Flash-MXFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    # Disable thinking mode for direct responses:
    "chat_template_kwargs": {"enable_thinking": False},
    # Or enable thinking for reasoning tasks:
    # "chat_template_kwargs": {"enable_thinking": True},
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
```

## Important Usage Notes

### Sampling Parameters

| Parameter | Recommended | Avoid | Reason |
|-----------|-------------|-------|--------|
| `temperature` | 0.3-0.7 | - | Standard range |
| `top_p` | 0.9-0.95 | - | Standard range |
| `repetition_penalty` | None or ≤1.05 | >1.05 | High values cause word salad at long outputs |
| `max_tokens` | Up to 10,000+ | - | Model handles long generation well |

### Thinking Mode

This model supports a "thinking" mode in which it shows its reasoning process:

- **`enable_thinking: True`** - Model
  outputs its reasoning process before the answer (good for math, coding, and complex reasoning)
- **`enable_thinking: False`** - Model outputs the answer directly (good for chat and simple Q&A)

The model thinks in English when given English prompts.

## Model Details

- **Base Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Architecture**: `Glm4MoeLiteForCausalLM`
- **Parameters**: 30B total, 3B active per token (30B-A3B)
- **MoE Configuration**: 64 routed experts, 4 active, 1 shared expert
- **Layers**: 47
- **Context Length**: 202,752 tokens (max)
- **Languages**: English, Chinese

## Quantization Details

- **Format**: MXFP4 (Microscaling FP4)
- **Weight Format**: E2M1 (4-bit floating point, range ±6.0)
- **Scale Format**: E8M0 (8-bit power-of-2 scales)
- **Block Size**: 32
- **Calibration**: 128 samples from the neuralmagic/calibration dataset

## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total | Throughput |
|-------|----------|---------|-------|------------|
| BF16 (baseline) | 24.83% | 2988 | 12032 | 92.4 q/s |
| **MXFP4 (this model)** | **25.86%** | **3112** | 12032 | **138.7 q/s** |
| Difference | **+1.03%** | +124 | - | **+50%** |

### MMLU-Pro by Category

| Category | BF16 | MXFP4 | Δ |
|----------|------|-------|---|
| Social Sciences | 32.70% | **34.68%** | +1.98% |
| Other | 31.57% | **32.84%** | +1.27% |
| Humanities | 23.78% | 23.78% | 0.00% |
| STEM | 19.94% | **20.86%** | +0.92% |

### MMLU-Pro by Subject (All 14 Subjects)

| Subject | BF16 | MXFP4 | Δ | Questions |
|---------|------|-------|---|-----------|
| Biology | 50.35% | **52.16%** | +1.81% | 717 |
| Psychology | 44.99% | **47.74%** | +2.75% | 798 |
| Economics | 36.37% | **38.27%** | +1.90% | 844 |
| Health | 35.21% | **36.31%** | +1.10% | 818 |
| History | **33.60%** | 32.28% | -1.32% | 381 |
| Philosophy | 31.46% | **31.86%** | +0.40% | 499 |
| Other | 28.35% | **29.76%** | +1.41% | 924 |
| Computer Science | **26.10%** | 25.85% | -0.25% | 410 |
| Business | 16.35% | **17.62%** | +1.27% | 789 |
| Law | 16.89% | **17.17%** | +0.28% | 1101 |
| Physics | 15.32% | **16.17%** | +0.85% | 1299 |
| Engineering | **16.00%** | 15.58% | -0.42% | 969 |
| Math | 14.06% | **15.54%** | +1.48% | 1351 |
| Chemistry | 14.13% | **15.46%** | +1.33% | 1132 |

## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```

## License

This model inherits the Apache 2.0 license from the base model.
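## Appendix: MXFP4 Round-Trip Sketch

For intuition, the MXFP4 layout described above (E2M1 values sharing one power-of-two E8M0 scale per block of 32) can be sketched in NumPy. This is an illustrative toy, not the Marlin kernel or the quantizer used to produce these weights: `quantize_mxfp4_block` and its scale-selection rule are assumptions for demonstration, and real kernels pack two 4-bit codes per byte with the 8-bit scale exponents stored separately.

```python
import numpy as np

# E2M1 representable magnitudes; sign is handled separately (range is ±6.0)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(w, block=32):
    """Toy MXFP4 round-trip: one power-of-two (E8M0) scale per block of 32
    values, each value snapped to the nearest E2M1 grid point."""
    w = np.asarray(w, dtype=np.float64)
    assert w.size % block == 0, "pad to a multiple of the block size"
    out = np.empty_like(w)
    for i in range(0, w.size, block):
        blk = w[i:i + block]
        amax = np.abs(blk).max()
        # E8M0 scale: smallest power of two that maps the block into [-6, 6]
        exp = 0 if amax == 0.0 else int(np.ceil(np.log2(amax / 6.0)))
        scale = 2.0 ** exp
        scaled = blk / scale
        # Nearest-magnitude lookup into the E2M1 grid
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[i:i + block] = np.sign(scaled) * FP4_GRID[idx] * scale
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
wq = quantize_mxfp4_block(w)
rel_err = np.linalg.norm(w - wq) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The storage arithmetic behind the 3.0x figure: 4 bits per weight plus one 8-bit scale per 32 weights is 4.25 bits/weight, roughly 3.8x smaller than BF16's 16 bits per quantized tensor; the whole-model ratio lands at 3.0x because attention, the dense MLP, norms, and embeddings stay in BF16.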