Note: If you have a multi-GPU SM120 Blackwell system (RTX 50/Pro), try my vLLM fork to resolve P2P / TP=2 issues (a PR into upstream is pending):
https://github.com/Gadflyii/vllm/tree/main
GLM-4.7-Flash-MTP-NVFP4 (Mixed Precision with MTP in BF16)
This is a mixed-precision NVFP4 quantization of zai-org/GLM-4.7-Flash, a 30B-A3B (30B total, 3B active) Mixture-of-Experts model. This version keeps the MTP (Multi-Token Prediction) layers in BF16 so that they remain usable for speculative decoding.
What's Different from GLM-4.7-Flash-NVFP4?
| Feature | GLM-4.7-Flash-NVFP4 | This Model |
|---|---|---|
| MTP Layers | NVFP4 | BF16 |
| Calibration Samples | 128 | 512 |
| Calibration Seq Length | 2048 | 4096 |
| MMLU-Pro Accuracy | 23.56% | 23.91% |
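To put the calibration change in perspective, the token budget can be estimated as samples times sequence length. This is a rough sketch: it assumes every sample fills the full sequence length, so the figures are upper bounds, and the helper name `calibration_tokens` is only illustrative.

```python
# Upper bound on calibration tokens: samples * sequence length.
# Assumption: every sample is padded or truncated to the full sequence length.
def calibration_tokens(num_samples: int, seq_len: int) -> int:
    return num_samples * seq_len

previous = calibration_tokens(128, 2048)  # GLM-4.7-Flash-NVFP4
current = calibration_tokens(512, 4096)   # this model

print(f"previous: {previous:,} tokens")
print(f"current:  {current:,} tokens")
print(f"ratio:    {current / previous:.0f}x")
```

Under that assumption, this model saw up to eight times as many calibration tokens as the earlier quantization.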
Quantization Strategy
This model uses mixed precision to preserve accuracy and MTP functionality:
| Component | Precision | Rationale |
|---|---|---|
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | First layer dense MLP |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive |
| MTP Layers | BF16 | eh_proj, shared_head.head for speculative decoding |
| Norms, Gates, Embeddings | BF16 | Standard practice |
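The table above can be read as a set of name-based rules. The sketch below is only an illustration of that reading: the tensor names and substring markers are hypothetical (apart from `eh_proj` and `shared_head`, which come from the table), and this is not the recipe that produced the checkpoint.

```python
# Illustrative only: tensor names and substring rules are assumptions,
# not the actual quantization recipe used for this checkpoint.
BF16_MARKERS = ("self_attn", "eh_proj", "shared_head", "norm", ".mlp.gate.", "embed")

def precision_for(tensor_name: str) -> str:
    """Map a (hypothetical) tensor name to the precision the table assigns it."""
    if any(marker in tensor_name for marker in BF16_MARKERS):
        return "BF16"
    if ".mlp." in tensor_name:
        return "NVFP4"  # routed experts and the dense first-layer MLP
    return "BF16"       # default to the safer, higher precision

examples = [
    "model.layers.3.mlp.experts.7.down_proj.weight",  # routed expert
    "model.layers.0.mlp.down_proj.weight",            # dense MLP
    "model.layers.3.mlp.gate.weight",                 # router gate
    "model.layers.3.self_attn.o_proj.weight",         # attention
    "model.layers.47.eh_proj.weight",                 # MTP projection
]
for name in examples:
    print(f"{precision_for(name):5s}  {name}")
```

The router gate is matched as `.mlp.gate.` so that expert `gate_proj` weights still fall through to the NVFP4 rule.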
Performance
| Metric | BF16 | NVFP4 | This Model |
|---|---|---|---|
| MMLU-Pro | 24.83% | 23.56% | 23.91% |
| Size | 62.4 GB | 20.4 GB | 20.9 GB |
| Compression | 1x | 3.1x | 3.0x |
| Accuracy Change vs. BF16 (absolute) | - | -1.27% | -0.92% |
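The compression and accuracy rows follow directly from the size and MMLU-Pro rows. A small sketch of that arithmetic, using only the numbers in the table (the helper name `summarize` is illustrative):

```python
BF16_SIZE_GB = 62.4
BF16_MMLU_PRO = 24.83

variants = {
    "NVFP4": {"size_gb": 20.4, "mmlu_pro": 23.56},
    "This Model": {"size_gb": 20.9, "mmlu_pro": 23.91},
}

def summarize(size_gb: float, mmlu_pro: float):
    """Return (compression ratio, accuracy change in percentage points)."""
    compression = BF16_SIZE_GB / size_gb
    delta_pp = mmlu_pro - BF16_MMLU_PRO
    return round(compression, 1), round(delta_pp, 2)

for name, values in variants.items():
    compression, delta_pp = summarize(**values)
    print(f"{name}: {compression}x smaller, {delta_pp:+.2f} points vs. BF16")
```

Note that the accuracy change is an absolute difference in percentage points, not a relative drop.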
MTP Acceptance Rate
| Model | Acceptance Rate | Mean Accepted Length |
|---|---|---|
| BF16 (baseline) | 60% | 1.60 |
| This Model | 63% | 1.63 |
MTP draft quality is preserved after quantization: the acceptance rate is 63%, slightly higher than the 60% measured for the BF16 baseline.
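The two columns in the table are consistent with the usual speculative decoding estimate, in which each step yields the verified token plus however many draft tokens are accepted. The sketch below assumes that each draft token is accepted independently with the same probability and that the measurements used one speculative token; both are simplifying assumptions, not statements about how vLLM reports the metric.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per decoding step: 1 + a + a^2 + ... + a^k.

    Assumes independent, identically likely acceptance for every draft token.
    """
    return sum(acceptance_rate ** i for i in range(num_draft_tokens + 1))

for label, rate in [("BF16 (baseline)", 0.60), ("This Model", 0.63)]:
    print(f"{label}: {expected_tokens_per_step(rate, 1):.2f} tokens per step")
```

With one draft token the estimate reduces to 1 plus the acceptance rate, which matches the 1.60 and 1.63 figures above.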
MTP Performance Note
MTP speculative decoding is currently slower than plain decoding because vLLM lacks torch.compile support for the MTP drafter model. For best throughput, run without MTP enabled until this is resolved upstream.
| Configuration | Tokens/sec |
|---|---|
| Without MTP | 78.1 tok/s |
| With MTP (1 token) | 64.7 tok/s |
| With MTP (2 tokens) | 56.8 tok/s |
| With MTP (4 tokens) | 44.5 tok/s |
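To make the overhead easier to compare, the throughput figures above can be expressed relative to the run without MTP. This sketch uses only the numbers in the table; the function name is illustrative.

```python
BASELINE_TOK_S = 78.1  # without MTP

# num_speculative_tokens -> measured tokens per second
mtp_runs = {1: 64.7, 2: 56.8, 4: 44.5}

def relative_change_percent(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Throughput change relative to the baseline, in percent."""
    return 100.0 * (tok_s / baseline - 1.0)

for k, tok_s in mtp_runs.items():
    change = relative_change_percent(tok_s)
    print(f"{k} speculative token(s): {change:+.1f}% vs. no MTP")
```

Every configuration comes out negative, and the slowdown grows with the number of speculative tokens.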
Usage
Requirements
- vLLM: 0.8.0+ (for compressed-tensors NVFP4 support)
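- transformers: 5.0.0+ (for the `glm4_moe_lite` architecture)
- GPU: NVIDIA GPU on which vLLM can run NVFP4 (Blackwell has native FP4 tensor cores; Hopper and Ada Lovelace do not, so they depend on any fallback kernels vLLM provides and will not see the same speedup)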
Installation
pip install "vllm>=0.8.0"
pip install git+https://github.com/huggingface/transformers.git
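After installing, it can help to confirm that the environment meets the minimum versions listed under Requirements. The following is a minimal sketch using only the standard library; the version parser is deliberately simple and treats pre-release builds such as `5.0.0.dev0` as satisfying `5.0.0`, which is an assumption made for convenience.

```python
from importlib import metadata

# Minimum versions taken from the requirements list.
REQUIRED = {"vllm": (0, 8, 0), "transformers": (5, 0, 0)}

def parse_version(text: str) -> tuple:
    """Turn '0.8.5.post1' or '5.0.0.dev0' into a comparable tuple of ints."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts[:3])

def check_requirements() -> dict:
    """Return {package: 'ok' | 'too old (x.y.z)' | 'not installed'}."""
    report = {}
    for package, minimum in REQUIRED.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        ok = parse_version(installed) >= minimum
        report[package] = "ok" if ok else f"too old ({installed})"
    return report

for package, status in check_requirements().items():
    print(f"{package}: {status}")
```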
Inference with vLLM (Recommended)
from vllm import LLM, SamplingParams
model = LLM(
"GadflyII/GLM-4.7-Flash-MTP-NVFP4",
tensor_parallel_size=1,
max_model_len=4096,
trust_remote_code=True,
gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)
Serving with vLLM
# Standard serving (recommended for performance)
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.90
# With MTP speculative decoding (experimental)
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve GadflyII/GLM-4.7-Flash-MTP-NVFP4 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
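Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on `localhost:8000` (vLLM's default when no `--port` is given), and the helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

MODEL = "GadflyII/GLM-4.7-Flash-MTP-NVFP4"
BASE_URL = "http://localhost:8000"  # assumption: default port, no --port override

def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is the capital of France?"))` sends a single request. Keep `max_tokens` plus the prompt length within the `--max-model-len` value used when starting the server.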
Model Details
- Base Model: zai-org/GLM-4.7-Flash
- Architecture: `Glm4MoeLiteForCausalLM`
- Parameters: 30B total, 3B active per token (30B-A3B)
- MoE Configuration: 64 routed experts, 4 active, 1 shared expert
- Layers: 47 (with 1 MTP layer)
- Context Length: 202,752 tokens (max)
- Languages: English, Chinese
Quantization Details
- Format: compressed-tensors (NVFP4)
- Block Size: 16
- Scale Format: FP8 (E4M3)
- Calibration: 512 samples from wikitext dataset
- Calibration Sequence Length: 4096
- Full Expert Calibration: All 64 experts calibrated per sample
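The block size and scale format above describe block-scaled 4-bit quantization: each group of 16 weights shares one scale, and each weight is stored as an FP4 (E2M1) value. The toy sketch below illustrates the idea in plain Python. It is a simplification under stated assumptions: it uses round-to-nearest, keeps the block scale as a Python float instead of rounding it to FP8 (E4M3), and omits any per-tensor scale, so it is not the kernel or the quantizer used for this checkpoint.

```python
# All non-negative values representable in FP4 (E2M1); negatives mirror these.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16

def quantize_block(block):
    """Return (scale, codes), where each code is a signed E2M1 value."""
    assert len(block) == BLOCK_SIZE
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 value (6.0).
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

block = [0.01 * i - 0.08 for i in range(BLOCK_SIZE)]  # toy weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# 4-bit value plus one 8-bit scale shared by 16 weights.
bits_per_weight = 4 + 8 / BLOCK_SIZE

print(f"scale={scale:.5f}  max_error={max_error:.5f}  bits/weight={bits_per_weight}")
```

The 4.5 bits per weight applies only to the quantized tensors. Attention, MTP, norm, gate, and embedding tensors stay in BF16, which is one reason the checkpoint is larger than a uniform 4.5-bit estimate would suggest.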
Tensors by Precision
| Precision | Count | Description |
|---|---|---|
| NVFP4 | 9,168 | MLP/FFN weights |
| BF16 | 240 | Attention weights (MLA) |
| BF16 | 2 | MTP layers (eh_proj, shared_head.head) |
Evaluation
MMLU-Pro Overall Results
| Model | Accuracy | Correct | Total |
|---|---|---|---|
| BF16 (baseline) | 24.83% | 2988 | 12032 |
| GLM-4.7-Flash-NVFP4 (NVFP4-v1) | 23.56% | 2835 | 12032 |
| This Model | 23.91% | 2877 | 12032 |
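The accuracy column can be reproduced from the correct and total counts, and the same counts give a rough sense of sampling noise. The sketch below uses the binomial standard error, which assumes questions are answered independently. That is a simplification, since all three models are scored on the same questions.

```python
import math

TOTAL = 12032
correct = {"BF16 (baseline)": 2988, "NVFP4-v1": 2835, "This Model": 2877}

def accuracy_with_stderr(num_correct: int, total: int):
    """Return (accuracy in percent, binomial standard error in percentage points)."""
    p = num_correct / total
    stderr = math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * stderr

for name, n in correct.items():
    acc, se = accuracy_with_stderr(n, TOTAL)
    print(f"{name}: {acc:.2f}% (standard error about {se:.2f} points)")
```

Under this estimate the standard error is a little under 0.4 points per model, so the 0.35-point gap between the two quantized variants is best read as indicative, not conclusive.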
MMLU-Pro by Category
| Category | BF16 | This Model | Difference |
|---|---|---|---|
| Social Sciences | 32.70% | 31.26% | -1.44% |
| Other | 31.57% | 29.85% | -1.72% |
| Humanities | 23.78% | 22.82% | -0.96% |
| STEM | 19.94% | 19.48% | -0.46% |
MMLU-Pro by Subject
| Subject | BF16 | This Model | Difference |
|---|---|---|---|
| Biology | 50.35% | 48.12% | -2.23% |
| Psychology | 44.99% | 41.23% | -3.76% |
| History | 33.60% | 34.12% | +0.52% |
| Health | 35.21% | 34.11% | -1.10% |
| Economics | 36.37% | 33.06% | -3.31% |
| Philosophy | 31.46% | 29.26% | -2.20% |
| Other | 28.35% | 26.08% | -2.27% |
| Computer Science | 26.10% | 21.95% | -4.15% |
| Business | 16.35% | 19.26% | +2.91% |
| Law | 16.89% | 15.99% | -0.90% |
| Math | 14.06% | 14.73% | +0.67% |
| Physics | 15.32% | 15.24% | -0.08% |
| Engineering | 16.00% | 14.96% | -1.04% |
| Chemistry | 14.13% | 14.84% | +0.71% |
Citation
If you use this model, please cite the original GLM-4.7-Flash:
@misc{glm4flash2025,
title={GLM-4.7-Flash},
author={Zhipu AI},
year={2025},
howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}
License
This model inherits the Apache 2.0 license from the base model.