---
license: apache-2.0
language:
  - en
  - zh
base_model: zai-org/GLM-4.7-Flash
tags:
  - moe
  - nvfp4
  - quantized
  - vllm
  - glm
  - 30b
library_name: transformers
pipeline_tag: text-generation
---

> **Note:** If you have a multi-GPU Blackwell system (RTX 50 / RTX Pro) and run into P2P or `TP=2` issues, try my vLLM fork (PR into main pending): https://github.com/Gadflyii/vllm/tree/main

# GLM-4.7-Flash NVFP4 (Mixed Precision)

This is a mixed-precision NVFP4 quantization of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total parameters, 3B active per token) Mixture-of-Experts model.

## Quantization Strategy

This model was produced with custom quantization and calibration scripts (128 samples, 2048 max sequence length, the neuralmagic/calibration dataset, all 64 experts calibrated) based on NVIDIA's approach for DeepSeek-V3. It uses mixed precision to preserve accuracy:

| Component | Precision | Rationale |
|---|---|---|
| MLP experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | First-layer dense MLP |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive |
| Norms, gates, embeddings | BF16 | Standard practice |
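
The actual scripts here are custom, but the compressed-tensors output format is what the llm-compressor library emits, so a comparable mixed-precision recipe can be sketched with its `oneshot` API and `NVFP4` scheme. The sample count, sequence length, and dataset below mirror this card; the module-name patterns, dataset config, and everything else are assumptions, not the card's real pipeline:

```python
# Illustrative sketch only -- the card's actual scripts are custom.
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM

MODEL_ID = "zai-org/GLM-4.7-Flash"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)

# Quantize Linear weights to NVFP4, leaving the sensitive MLA attention
# projections, router gates, and lm_head in BF16. The name patterns are
# assumptions -- check the checkpoint for the real module names.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*self_attn.*", "re:.*gate$"],
)

oneshot(
    model=model,
    # Config/split names for the calibration set may differ.
    dataset=load_dataset("neuralmagic/calibration", name="LLM", split="train"),
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)

# The card additionally routes calibration through all 64 experts per
# sample, which needs a custom router patch not shown here.
model.save_pretrained("GLM-4.7-Flash-NVFP4", save_compressed=True)
```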

## Performance

| Metric | BF16 | Uniform FP4 | This Model |
|---|---|---|---|
| MMLU-Pro | 24.83% | 16.84% | 23.55% |
| Size | 62.4 GB | 18.9 GB | 20.4 GB |
| Compression | 1x | 3.3x | 3.1x |
| Accuracy loss | - | -8.0% | -1.3% |

## Usage

### Requirements

- **vLLM:** 0.14.0+ (for compressed-tensors NVFP4 support)
- **transformers:** 5.0.0+ (for the `glm4_moe_lite` architecture)
- **GPU:** NVIDIA GPU with FP4 support. Only Blackwell has native FP4 tensor cores; Hopper and Ada Lovelace go through vLLM's fallback kernels (see the check below).
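
A quick way to check what your GPU offers, as a minimal sketch (the thresholds are the standard CUDA compute capabilities):

```python
import torch

# Datacenter Blackwell (B100/B200) reports SM 10.x; consumer Blackwell
# (RTX 50 / RTX Pro) reports SM 12.x. Older architectures have no native
# FP4 tensor cores and rely on fallback kernels.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: sm_{major}{minor}")
print("Native FP4 tensor cores:", major >= 10)
```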

### Installation

```bash
pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git
```

### Inference with vLLM

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint on a single GPU.
model = LLM(
    "GadflyII/GLM-4.7-Flash-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.85,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```

### Serving with vLLM

```bash
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code
```
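
The server exposes vLLM's OpenAI-compatible API (default port 8000), so any OpenAI client can query it; for example:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the api_key is required by the
# client but ignored by a default vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-NVFP4",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```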

## Model Details

- **Base Model:** zai-org/GLM-4.7-Flash
- **Architecture:** Glm4MoeLiteForCausalLM
- **Parameters:** 30B total, 3B active per token (30B-A3B)
- **MoE Configuration:** 64 routed experts, 4 active, 1 shared expert
- **Layers:** 47
- **Context Length:** 202,752 tokens (max)
- **Languages:** English, Chinese

## Quantization Details

- **Format:** compressed-tensors (NVFP4)
- **Block Size:** 16
- **Scale Format:** FP8 (E4M3)
- **Calibration:** 128 samples from the neuralmagic/calibration dataset
- **Full Expert Calibration:** all 64 experts calibrated per sample
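
To make the format concrete: FP4 E2M1 can only represent the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, so each 16-value block is rescaled so its largest magnitude lands on 6, the block scale is stored in FP8 (E4M3), and each value snaps to the nearest grid point. A toy sketch of the encode step (illustration only; real kernels pack two FP4 codes per byte and quantize the scale itself to E4M3, which is skipped here):

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block: np.ndarray):
    """Quantize one 16-element block to signed E2M1 values + a block scale."""
    assert block.size == 16
    scale = float(np.abs(block).max()) / 6.0  # map the max onto E2M1's top value
    if scale == 0.0:
        scale = 1.0
    scaled = block / scale
    # Snap each magnitude to the nearest E2M1 grid point, keeping the sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], scale

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
q, s = quantize_block_nvfp4(w)
print("max abs error:", np.abs(w - q * s).max())
```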

## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total |
|---|---|---|---|
| BF16 (baseline) | 24.83% | 2988 | 12032 |
| NVFP4 (this model) | 23.55% | 2834 | 12032 |
| Difference | -1.28% | -154 | - |
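
The card doesn't say which harness produced these numbers. One common way to run MMLU-Pro is EleutherAI's lm-evaluation-harness with its vLLM backend; a hedged sketch (the `mmlu_pro` task name is the harness's, the model args mirror the vLLM settings above, and this is not necessarily how the card's figures were measured):

```python
import lm_eval  # pip install "lm_eval[vllm]"

# Assumed reproduction path, not the card's verified setup.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=GadflyII/GLM-4.7-Flash-NVFP4,"
        "tensor_parallel_size=1,max_model_len=4096,trust_remote_code=True"
    ),
    tasks=["mmlu_pro"],
    batch_size="auto",
)
print(results["results"])
```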

### MMLU-Pro by Category

| Category | BF16 | NVFP4 | Difference |
|---|---|---|---|
| Social Sciences | 32.70% | 31.43% | -1.27% |
| Other | 31.57% | 30.08% | -1.49% |
| Humanities | 23.78% | 22.56% | -1.22% |
| STEM | 19.94% | 18.70% | -1.24% |

### MMLU-Pro by Subject

| Subject | BF16 | NVFP4 | Difference |
|---|---|---|---|
| Biology | 50.35% | 47.42% | -2.93% |
| Psychology | 44.99% | 42.48% | -2.51% |
| Economics | 36.37% | 34.48% | -1.89% |
| Health | 35.21% | 34.84% | -0.37% |
| History | 33.60% | 30.71% | -2.89% |
| Philosophy | 31.46% | 30.06% | -1.40% |
| Other | 28.35% | 25.87% | -2.48% |
| Computer Science | 26.10% | 21.46% | -4.64% |
| Business | 16.35% | 16.98% | +0.63% |
| Law | 16.89% | 16.35% | -0.54% |
| Engineering | 16.00% | 14.04% | -1.96% |
| Physics | 15.32% | 14.70% | -0.62% |
| Math | 14.06% | 14.29% | +0.23% |
| Chemistry | 14.13% | 13.34% | -0.79% |

## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```

## License

This model inherits the Apache 2.0 license from the base model.