Note: If you have a multi-GPU SM120 Blackwell system (RTX 50-series / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (upstream PR pending).

https://github.com/Gadflyii/vllm/tree/main

GLM-4.7-Flash MXFP4

This is an MXFP4 quantization of zai-org/GLM-4.7-Flash, a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.

Quantization Strategy

This model uses the MXFP4 (Microscaling FP4) format with the Marlin backend for inference. Custom calibrated quantization (128 samples, 2048 max sequence length) was applied to the MoE experts.

| Component | Precision | Rationale |
|---|---|---|
| MoE expert MLPs (gate_up, down) | MXFP4 (E2M1) | 64 routed experts, 4 active per token |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive to quantization |
| Dense MLP | BF16 | The first layer uses a dense MLP, not experts |
| Norms, gates, embeddings | BF16 | Standard practice |
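
To verify which modules are quantized and which stay in BF16, you can inspect the checkpoint's quantization metadata. This is a quick check, assuming the repo ships a standard quantization_config in its config.json:

from transformers import AutoConfig

# Loads only config.json; no weights are downloaded
cfg = AutoConfig.from_pretrained("GadflyII/GLM-4.7-Flash-MXFP4", trust_remote_code=True)

# Lists the quant format and the modules excluded from quantization (kept in BF16)
print(cfg.quantization_config)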

MXFP4 vs NVFP4

| Property | MXFP4 | NVFP4 |
|---|---|---|
| Weight format | E2M1 (4-bit) | E2M1 (4-bit) |
| Scale format | E8M0 (power-of-2) | FP8 (E4M3) |
| Block size | 32 | 16 |
| Backend | Marlin | FlashInfer/CUTLASS |
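
To make the format concrete, here is a minimal NumPy sketch of MXFP4 block quantization: each block of 32 weights shares one power-of-2 (E8M0) scale, and each weight is rounded to the nearest value on the E2M1 grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). This is an illustrative model of the scheme, not the Marlin kernel's exact rounding behavior:

import numpy as np

# E2M1 representable magnitudes (sign is handled separately)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(w):
    """Quantize one block of 32 weights: shared E8M0 scale + E2M1 values."""
    assert w.size == 32
    # E8M0 scale: power of two chosen so the largest |w| maps into [3, 6]
    amax = np.abs(w).max()
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
    # Round each scaled magnitude to the nearest point on the E2M1 grid
    scaled = w / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q * scale  # dequantized values

block = np.random.randn(32).astype(np.float32)
print("max abs error:", np.abs(block - mxfp4_quantize_block(block)).max())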

Performance

| Metric | BF16 | This Model |
|---|---|---|
| MMLU-Pro | 24.83% | 25.86% |
| Size | 62.4 GB | 20.8 GB |
| Compression | 1x | 3.0x |
| Accuracy Δ | - | +1.03 pp |
| Throughput | 92.4 q/s | 138.7 q/s |

Usage

Requirements

  • vLLM: 0.14.0+ (for MXFP4 Marlin backend support)
  • transformers: 5.0.0+ (for glm4_moe_lite architecture)
  • GPU: NVIDIA GPU with compute capability 8.0+ (Ampere/Hopper/Blackwell)

Installation

pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git
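
A quick sanity check that both packages meet the requirements above:

import vllm, transformers

print("vLLM:", vllm.__version__)
print("transformers:", transformers.__version__)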

Inference with vLLM

import os
os.environ["VLLM_MXFP4_USE_MARLIN"] = "1"

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-MXFP4",
    tensor_parallel_size=1,
    max_model_len=65536,  # Can go up to 202K with sufficient VRAM
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
)

# Note: Do NOT use repetition_penalty > 1.05, it causes degradation at long outputs
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)

Serving with vLLM

VLLM_MXFP4_USE_MARLIN=1 vllm serve GadflyII/GLM-4.7-Flash-MXFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 65536 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90
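
Once the server is up, you can confirm it is exposing the model via the standard OpenAI-compatible models endpoint:

import requests

# Lists the models the vLLM server is currently serving
print(requests.get("http://localhost:8000/v1/models").json())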

Chat Completions API

import requests

payload = {
    "model": "GadflyII/GLM-4.7-Flash-MXFP4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
    "temperature": 0.7,
    # Disable thinking mode for direct responses:
    "chat_template_kwargs": {"enable_thinking": False}
    # Or enable thinking for reasoning tasks:
    # "chat_template_kwargs": {"enable_thinking": True}
}
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])
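
The same request can be made with the official openai Python client, since vLLM's server is OpenAI-compatible; chat_template_kwargs is passed through extra_body. The api_key value below is a placeholder, as vLLM does not require one by default:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-MXFP4",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024,
    temperature=0.7,
    # vLLM forwards extra_body fields such as chat_template_kwargs
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)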

Important Usage Notes

Sampling Parameters

| Parameter | Recommended | Avoid | Reason |
|---|---|---|---|
| temperature | 0.3-0.7 | - | Standard range |
| top_p | 0.9-0.95 | - | Standard range |
| repetition_penalty | None or ≤1.05 | >1.05 | High values cause word salad in long outputs |
| max_tokens | Up to 10,000+ | - | Model handles long generation well |
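
Putting those recommendations together (example values picked from the table above; repetition_penalty is deliberately left at its default of 1.0):

from vllm import SamplingParams

# Recommended settings; do not raise repetition_penalty above 1.05
params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)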

Thinking Mode

This model supports a "thinking" mode where it shows its reasoning process:

  • enable_thinking: True - Model outputs its reasoning process before the answer (good for math, coding, complex reasoning)
  • enable_thinking: False - Model outputs the answer directly (good for chat, simple Q&A)

The model thinks in English when given English prompts.
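
If you build prompts yourself with transformers, the same switch can be passed to the chat template. This is a sketch; it assumes the repo's template accepts an enable_thinking kwarg, as the server examples above suggest:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("GadflyII/GLM-4.7-Flash-MXFP4", trust_remote_code=True)

# Extra kwargs are forwarded to the chat template; enable_thinking toggles the reasoning trace
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(prompt)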

Model Details

  • Base Model: zai-org/GLM-4.7-Flash
  • Architecture: Glm4MoeLiteForCausalLM
  • Parameters: 30B total, 3B active per token (30B-A3B)
  • MoE Configuration: 64 routed experts, 4 active, 1 shared expert
  • Layers: 47
  • Context Length: 202,752 tokens (max)
  • Languages: English, Chinese

Quantization Details

  • Format: MXFP4 (Microscaling FP4)
  • Weight Format: E2M1 (4-bit floating point, range ±6.0)
  • Scale Format: E8M0 (8-bit power-of-2 scales)
  • Block Size: 32
  • Calibration: 128 samples from neuralmagic/calibration dataset
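
As a back-of-envelope check on the 20.8 GB figure: experts cost 4 bits per weight plus one 8-bit E8M0 scale per 32-weight block, and everything else stays in BF16. The ~28B/31B expert-parameter split below is an assumption inferred from the reported sizes, not a published number:

total_params = 31.2e9    # ~62.4 GB at 2 bytes/param (BF16 baseline)
expert_params = 28.3e9   # assumed share of params in routed experts

bits_per_expert_weight = 4 + 8 / 32  # 4-bit E2M1 + one E8M0 scale per 32 weights
size_gb = (expert_params * bits_per_expert_weight / 8
           + (total_params - expert_params) * 2) / 1e9
print(f"{size_gb:.1f} GB")  # ~20.8 GB, matching the reported size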

Evaluation

MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total | Throughput |
|---|---|---|---|---|
| BF16 (baseline) | 24.83% | 2988 | 12032 | 92.4 q/s |
| MXFP4 (this model) | 25.86% | 3112 | 12032 | 138.7 q/s |
| Difference | +1.03 pp | +124 | - | +50% |

MMLU-Pro by Category

| Category | BF16 | MXFP4 | Δ |
|---|---|---|---|
| Social Sciences | 32.70% | 34.68% | +1.98 pp |
| Other | 31.57% | 32.84% | +1.27 pp |
| Humanities | 23.78% | 23.78% | 0.00 pp |
| STEM | 19.94% | 20.86% | +0.92 pp |

MMLU-Pro by Subject (All 14 Subjects)

| Subject | BF16 | MXFP4 | Δ | Questions |
|---|---|---|---|---|
| Biology | 50.35% | 52.16% | +1.81 pp | 717 |
| Psychology | 44.99% | 47.74% | +2.75 pp | 798 |
| Economics | 36.37% | 38.27% | +1.90 pp | 844 |
| Health | 35.21% | 36.31% | +1.10 pp | 818 |
| History | 33.60% | 32.28% | -1.32 pp | 381 |
| Philosophy | 31.46% | 31.86% | +0.40 pp | 499 |
| Other | 28.35% | 29.76% | +1.41 pp | 924 |
| Computer Science | 26.10% | 25.85% | -0.25 pp | 410 |
| Business | 16.35% | 17.62% | +1.27 pp | 789 |
| Law | 16.89% | 17.17% | +0.28 pp | 1101 |
| Physics | 15.32% | 16.17% | +0.85 pp | 1299 |
| Engineering | 16.00% | 15.58% | -0.42 pp | 969 |
| Math | 14.06% | 15.54% | +1.48 pp | 1351 |
| Chemistry | 14.13% | 15.46% | +1.33 pp | 1132 |

Citation

If you use this model, please cite the original GLM-4.7-Flash:

@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}

License

This model inherits the Apache 2.0 license from the base model.
