Instructions for using mconcat/GLM-5.1-FP8-Dynamic with libraries and local apps.
- Libraries
- Transformers
How to use mconcat/GLM-5.1-FP8-Dynamic with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mconcat/GLM-5.1-FP8-Dynamic")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mconcat/GLM-5.1-FP8-Dynamic")
model = AutoModelForCausalLM.from_pretrained("mconcat/GLM-5.1-FP8-Dynamic")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Local Apps
- vLLM
How to use mconcat/GLM-5.1-FP8-Dynamic with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "mconcat/GLM-5.1-FP8-Dynamic"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "mconcat/GLM-5.1-FP8-Dynamic",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
- SGLang
How to use mconcat/GLM-5.1-FP8-Dynamic with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "mconcat/GLM-5.1-FP8-Dynamic" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "mconcat/GLM-5.1-FP8-Dynamic",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "mconcat/GLM-5.1-FP8-Dynamic" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "mconcat/GLM-5.1-FP8-Dynamic",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
- Docker Model Runner
How to use mconcat/GLM-5.1-FP8-Dynamic with Docker Model Runner:
```bash
docker model run hf.co/mconcat/GLM-5.1-FP8-Dynamic
```
GLM-5.1-FP8-Dynamic
FP8 dynamic quantized version of zai-org/GLM-5.1.
This checkpoint preserves the GLM-5.1 MoE + MLA + DSA architecture from the BF16 source, with all Linear weights quantized to FP8 E4M3 for ~2x compression.
Quantization Strategy
Per-channel FP8 E4M3 weight quantization with dynamic per-token activation scaling:
| Precision | Layers |
|---|---|
| FP8 E4M3 | All Linear weights: MLA projections, MLP gate/up/down, expert projections, DSA indexer |
| BF16 | lm_head, embed_tokens, MoE router gates, norms |
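To make the scheme concrete, here is a minimal PyTorch sketch of per-channel FP8 E4M3 weight quantization (one dequantization scale per output channel). It is illustrative only and is not the pipeline that produced this checkpoint.

```python
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_weight_per_channel(w_bf16: torch.Tensor):
    """Quantize a [out_features, in_features] Linear weight to FP8 E4M3
    with one dequantization scale per output channel."""
    amax = w_bf16.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)   # [out, 1]
    scale = amax.float() / FP8_E4M3_MAX
    w_fp8 = (w_bf16.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

# Example on a hidden_size-sized projection (hidden_size=6144 per the config below)
w = torch.randn(6144, 6144, dtype=torch.bfloat16)
w_fp8, w_scale = quantize_weight_per_channel(w)
# Dequantized reference: w_fp8.to(torch.bfloat16) * w_scale is approximately equal to w
```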
Architecture match with the BF16 source:
- model_type=glm_moe_dsa
- 78 layers (3 dense + 75 MoE, first_k_dense_replace=3)
- n_routed_experts=256, num_experts_per_tok=8, n_shared_experts=1
- max_position_embeddings=202752
- hidden_size=6144, moe_intermediate_size=2048
- vocab_size=154880
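If you want to verify these fields against the uploaded config, a quick sketch using transformers AutoConfig; the attribute names are assumed to mirror the keys listed above.

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mconcat/GLM-5.1-FP8-Dynamic", trust_remote_code=True)
for key in ("model_type", "num_hidden_layers", "first_k_dense_replace",
            "n_routed_experts", "num_experts_per_tok", "n_shared_experts",
            "max_position_embeddings", "hidden_size", "moe_intermediate_size",
            "vocab_size"):
    # getattr with a default so missing or renamed keys just print "n/a"
    print(key, getattr(cfg, key, "n/a"))
```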
Calibration
- 512 self-calibration samples generated from GLM-5.1 via OpenRouter (top-tier provider routing)
- 8 diverse categories: math, code, logic, analysis, creative writing, general knowledge, agentic/tool-calling, Korean
- Activation statistics collected layer-by-layer for per-channel FP8 scale computation
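A minimal sketch of the hook-based activation-statistics pattern this describes, using standard PyTorch forward pre-hooks on Linear modules; the actual calibration pipeline is custom and layer-by-layer, so treat this as an illustration of the idea only.

```python
import torch
from collections import defaultdict

stats = defaultdict(lambda: None)  # module name -> per-channel activation amax

def make_hook(name):
    def hook(module, args):
        x = args[0].detach()
        # Reduce over all leading dims, keeping the input-channel dim.
        amax = x.abs().amax(dim=tuple(range(x.dim() - 1)))
        stats[name] = amax if stats[name] is None else torch.maximum(stats[name], amax)
    return hook

def attach_hooks(model):
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_pre_hook(make_hook(name)))
    return handles

# Run the calibration prompts through the model with hooks attached; afterwards
# stats[name] holds the per-channel activation maxima used to pick FP8 scales.
```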
Usage
SGLang
```bash
python3 -m sglang.launch_server --model mconcat/GLM-5.1-FP8-Dynamic \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code \
  --mem-fraction-static 0.80
```
vLLM
```bash
vllm serve mconcat/GLM-5.1-FP8-Dynamic \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code
```
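Once either server is running, any OpenAI-compatible client can talk to it. A small example with the openai Python package, assuming vLLM's default port 8000 (use 30000 for the SGLang command above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mconcat/GLM-5.1-FP8-Dynamic",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```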
Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.19.0 | Yes | Requires glm_moe_dsa + compressed-tensors support |
| SGLang >= 0.5.10 | Yes | Requires GLM-5.1 architecture support |
| transformers >= 5.4.0 | Yes | Direct loading with device_map="auto" |
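For the direct transformers path noted in the table, a minimal loading sketch; device_map="auto" shards the weights across all visible GPUs, and trust_remote_code is an assumption here (custom glm_moe_dsa architecture) rather than a documented requirement.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mconcat/GLM-5.1-FP8-Dynamic", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "mconcat/GLM-5.1-FP8-Dynamic",
    device_map="auto",          # shard across available GPUs
    trust_remote_code=True,
)
```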
Notes
- This is a 754B MoE model (~40B active per token). Requires multi-GPU setup for inference (8x 80GB+ GPUs recommended).
- FP8 E4M3 provides ~2x compression over BF16 with minimal quality degradation.
- Compatible with Hopper (SM90) and Blackwell GPUs.
- Dynamic activation scaling: activation scales are computed at inference time, not baked into the checkpoint (see the sketch after this list).
- GLM-5.1 does not ship MTP weights despite num_nextn_predict_layers=1 in the config.
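To illustrate the dynamic-scaling note, a sketch of per-token FP8 activation quantization as a runtime would perform it on the fly; vLLM and SGLang handle this internally, so this is purely explanatory.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_activations_per_token(x: torch.Tensor):
    """x: [num_tokens, hidden] -> (FP8 tensor, per-token scales computed at inference time)."""
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)   # [num_tokens, 1]
    scale = amax.float() / FP8_MAX
    x_fp8 = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 6144, dtype=torch.bfloat16)   # 4 tokens of hidden_size 6144
x_fp8, x_scale = quantize_activations_per_token(x)
```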
Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)
If running on Blackwell workstation GPUs (SM 12.0), vLLM 0.19.0 requires patches for FlashMLA sparse attention support:
```bash
# Patch 1: FlashMLA ops - add SM120 to the sparse support check
FLASHMLA_OPS=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/ops/flashmla.py'))") && \
sed -i 's/is_device_capability_family(90)\s*or current_platform.is_device_capability_family(100)/is_device_capability_family(90) or current_platform.is_device_capability_family(100) or current_platform.is_device_capability_family(120)/' "$FLASHMLA_OPS"

# Patch 2: FlashMLA sparse backend - add SM120 to the capability check
FLASHMLA_SPARSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla_sparse.py'))") && \
sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_SPARSE"

# Patch 3: FlashMLA dense backend (if it exists)
FLASHMLA_DENSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla.py'))") && \
sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_DENSE" 2>/dev/null || true
```
These patches add SM120 (Blackwell workstation) to the supported compute capability list for GLM-5.1's DSA sparse attention.
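To confirm the patched gate will actually match your GPU, you can check the reported compute capability; this is just a sanity check, not part of the patches.

```python
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: SM {major}.{minor}")
# The patched checks accept major 9 (Hopper), 10 (Blackwell data center), and 12 (Blackwell workstation).
assert major in (9, 10, 12), "GPU not covered by the DSA sparse attention capability check"
```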
Quantization Process
- Tool: Custom layer-by-layer pipeline with native torch.float8_e4m3fn dtype
- Hardware: Single NVIDIA RTX PRO 6000 Blackwell (96 GB), processed one layer at a time
- Time: ~319 minutes for 78 layers
- Calibration: 256 samples, per-module activation statistics with MoE expert input hooks
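A pseudocode-level sketch of that layer-by-layer flow: one decoder layer at a time is moved to the GPU, its Linear weights are quantized, and the layer is offloaded again so the full model never has to fit in 96 GB. load_layer, save_quantized, and quantize_weight_per_channel are hypothetical placeholders, not functions shipped with this repository.

```python
import torch

def quantize_checkpoint(num_layers: int = 78):
    for i in range(num_layers):
        layer = load_layer(i).to("cuda")                    # hypothetical per-layer loader
        for name, module in layer.named_modules():
            if isinstance(module, torch.nn.Linear):
                w_fp8, scale = quantize_weight_per_channel(module.weight.data)
                save_quantized(i, name, w_fp8, scale)       # hypothetical shard writer
        layer.to("cpu")
        torch.cuda.empty_cache()                            # free VRAM before the next layer
```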