---
license: apache-2.0
language:
  - en
  - zh
base_model: zai-org/GLM-4.7-Flash
tags:
  - moe
  - nvfp4
  - quantized
  - vllm
  - glm
  - 30b
library_name: transformers
pipeline_tag: text-generation
---

> **Note:** If you have a multi-GPU Blackwell system (RTX 50 / RTX Pro) and run into P2P or `TP=2` issues, try my vLLM fork (PR into main pending): https://github.com/Gadflyii/vllm/tree/main

# GLM-4.7-Flash NVFP4 (Mixed Precision)

This is a mixed-precision NVFP4 quantization of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-A3B (30B total parameters, 3B active per token) Mixture-of-Experts model.

## Quantization Strategy

This model was produced with custom quantization and calibration scripts (128 samples, 2048 max sequence length, the neuralmagic/calibration dataset, all 64 experts calibrated) based on NVIDIA's approach for DeepSeek-V3. It uses mixed precision to preserve accuracy:

| Component | Precision | Rationale |
|---|---|---|
| MLP experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | First-layer dense MLP |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive |
| Norms, gates, embeddings | BF16 | Standard practice |
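
The actual scripts here are custom, but the compressed-tensors output format is what the llm-compressor library emits, so a comparable mixed-precision recipe can be sketched with its `oneshot` API and `NVFP4` scheme. The sample count, sequence length, and dataset below mirror this card; the module-name patterns, dataset config, and everything else are assumptions, not the card's real pipeline:

```python
# Illustrative sketch only -- the card's actual scripts are custom.
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM

MODEL_ID = "zai-org/GLM-4.7-Flash"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)

# Quantize Linear weights to NVFP4, leaving the sensitive MLA attention
# projections, router gates, and lm_head in BF16. The name patterns are
# assumptions -- check the checkpoint for the real module names.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*self_attn.*", "re:.*gate$"],
)

oneshot(
    model=model,
    # Config/split names for the calibration set may differ.
    dataset=load_dataset("neuralmagic/calibration", name="LLM", split="train"),
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)

# The card additionally routes calibration through all 64 experts per
# sample, which needs a custom router patch not shown here.
model.save_pretrained("GLM-4.7-Flash-NVFP4", save_compressed=True)
```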

## Performance

| Metric | BF16 | Uniform FP4 | This Model |
|---|---|---|---|
| MMLU-Pro | 24.83% | 16.84% | 23.55% |
| Size | 62.4 GB | 18.9 GB | 20.4 GB |
| Compression | 1x | 3.3x | 3.1x |
| Accuracy loss | - | -8.0% | -1.3% |

## Usage

### Requirements

- **vLLM:** 0.14.0+ (for compressed-tensors NVFP4 support)
- **transformers:** 5.0.0+ (for the `glm4_moe_lite` architecture)
- **GPU:** NVIDIA GPU with FP4 support. Only Blackwell has native FP4 tensor cores; Hopper and Ada Lovelace go through vLLM's fallback kernels (see the check below).
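
A quick way to check what your GPU offers, as a minimal sketch (the thresholds are the standard CUDA compute capabilities):

```python
import torch

# Datacenter Blackwell (B100/B200) reports SM 10.x; consumer Blackwell
# (RTX 50 / RTX Pro) reports SM 12.x. Older architectures have no native
# FP4 tensor cores and rely on fallback kernels.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: sm_{major}{minor}")
print("Native FP4 tensor cores:", major >= 10)
```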

### Installation

```bash
pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git
```

### Inference with vLLM

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint on a single GPU.
model = LLM(
    "GadflyII/GLM-4.7-Flash-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.85,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
```

### Serving with vLLM

```bash
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code
```
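
The server exposes vLLM's OpenAI-compatible API (default port 8000), so any OpenAI client can query it; for example:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the api_key is required by the
# client but ignored by a default vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-NVFP4",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```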

## Model Details

- **Base Model:** zai-org/GLM-4.7-Flash
- **Architecture:** Glm4MoeLiteForCausalLM
- **Parameters:** 30B total, 3B active per token (30B-A3B)
- **MoE Configuration:** 64 routed experts, 4 active, 1 shared expert
- **Layers:** 47
- **Context Length:** 202,752 tokens (max)
- **Languages:** English, Chinese

## Quantization Details

- **Format:** compressed-tensors (NVFP4)
- **Block Size:** 16
- **Scale Format:** FP8 (E4M3)
- **Calibration:** 128 samples from the neuralmagic/calibration dataset
- **Full Expert Calibration:** all 64 experts calibrated per sample
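
To make the format concrete: FP4 E2M1 can only represent the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, so each 16-value block is rescaled so its largest magnitude lands on 6, the block scale is stored in FP8 (E4M3), and each value snaps to the nearest grid point. A toy sketch of the encode step (illustration only; real kernels pack two FP4 codes per byte and quantize the scale itself to E4M3, which is skipped here):

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block: np.ndarray):
    """Quantize one 16-element block to signed E2M1 values + a block scale."""
    assert block.size == 16
    scale = float(np.abs(block).max()) / 6.0  # map the max onto E2M1's top value
    if scale == 0.0:
        scale = 1.0
    scaled = block / scale
    # Snap each magnitude to the nearest E2M1 grid point, keeping the sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], scale

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
q, s = quantize_block_nvfp4(w)
print("max abs error:", np.abs(w - q * s).max())
```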

## Evaluation

### MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total |
|---|---|---|---|
| BF16 (baseline) | 24.83% | 2988 | 12032 |
| NVFP4 (this model) | 23.55% | 2834 | 12032 |
| Difference | -1.28% | -154 | - |
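
The card doesn't say which harness produced these numbers. One common way to run MMLU-Pro is EleutherAI's lm-evaluation-harness with its vLLM backend; a hedged sketch (the `mmlu_pro` task name is the harness's, the model args mirror the vLLM settings above, and this is not necessarily how the card's figures were measured):

```python
import lm_eval  # pip install "lm_eval[vllm]"

# Assumed reproduction path, not the card's verified setup.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=GadflyII/GLM-4.7-Flash-NVFP4,"
        "tensor_parallel_size=1,max_model_len=4096,trust_remote_code=True"
    ),
    tasks=["mmlu_pro"],
    batch_size="auto",
)
print(results["results"])
```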

### MMLU-Pro by Category

| Category | BF16 | NVFP4 | Difference |
|---|---|---|---|
| Social Sciences | 32.70% | 31.43% | -1.27% |
| Other | 31.57% | 30.08% | -1.49% |
| Humanities | 23.78% | 22.56% | -1.22% |
| STEM | 19.94% | 18.70% | -1.24% |

### MMLU-Pro by Subject

| Subject | BF16 | NVFP4 | Difference |
|---|---|---|---|
| Biology | 50.35% | 47.42% | -2.93% |
| Psychology | 44.99% | 42.48% | -2.51% |
| Economics | 36.37% | 34.48% | -1.89% |
| Health | 35.21% | 34.84% | -0.37% |
| History | 33.60% | 30.71% | -2.89% |
| Philosophy | 31.46% | 30.06% | -1.40% |
| Other | 28.35% | 25.87% | -2.48% |
| Computer Science | 26.10% | 21.46% | -4.64% |
| Business | 16.35% | 16.98% | +0.63% |
| Law | 16.89% | 16.35% | -0.54% |
| Engineering | 16.00% | 14.04% | -1.96% |
| Physics | 15.32% | 14.70% | -0.62% |
| Math | 14.06% | 14.29% | +0.23% |
| Chemistry | 14.13% | 13.34% | -0.79% |

## Citation

If you use this model, please cite the original GLM-4.7-Flash:

```bibtex
@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
```

## License

This model inherits the Apache 2.0 license from the base model.