Upload README.md with huggingface_hub

9ef68c3 verified 1 day ago

4.32 kB

	---
	language:
	- en
	- zh
	- ko
	- ja
	license: mit
	base_model: zai-org/GLM-5.1
	tags:
	- glm
	- glm-5.1
	- moe
	- quantized
	- fp8
	- float8

	pipeline_tag: text-generation
	library_name: transformers
	model_name: GLM-5.1-FP8-Dynamic
	quantized_by: mconcat
	---

	# GLM-5.1-FP8-Dynamic

	FP8 dynamic quantized version of [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1).

	This checkpoint preserves the GLM-5.1 MoE + MLA + DSA architecture from the BF16 source, with all Linear weights quantized to FP8 E4M3 for ~2x compression.

	## Quantization Strategy

	Per-channel FP8 E4M3 weight quantization with dynamic per-token activation scaling:

	\| Precision \| Layers \|
	\|-----------\|--------\|
	\| FP8 E4M3 \| All Linear weights: MLA projections, MLP gate/up/down, expert projections, DSA indexer \|
	\| BF16 \| `lm_head`, `embed_tokens`, MoE router gates, norms \|

	Architecture match with the BF16 source:

	- `model_type=glm_moe_dsa`
	- `78` layers (3 dense + 75 MoE, `first_k_dense_replace=3`)
	- `n_routed_experts=256`, `num_experts_per_tok=8`, `n_shared_experts=1`

	- `max_position_embeddings=202752`
	- `hidden_size=6144`, `moe_intermediate_size=2048`
	- `vocab_size=154880`

	## Calibration

	- 512 self-calibration samples generated from GLM-5.1 via OpenRouter (top-tier provider routing)
	- 8 diverse categories: math, code, logic, analysis, creative writing, general knowledge, agentic/tool-calling, Korean
	- Activation statistics collected layer-by-layer for per-channel FP8 scale computation

	## Usage

	### SGLang

	```bash
	python3 -m sglang.launch_server --model mconcat/GLM-5.1-FP8-Dynamic \
	--tensor-parallel-size 8 \
	--dtype bfloat16 \
	--trust-remote-code \
	--mem-fraction-static 0.80
	```

	### vLLM

	```bash
	vllm serve mconcat/GLM-5.1-FP8-Dynamic \
	--tensor-parallel-size 8 \
	--dtype bfloat16 \
	--trust-remote-code
	```


	## Compatibility

	\| Framework \| Supported \| Notes \|
	\|-----------\|-----------\|-------\|
	\| vLLM >= 0.19.0 \| Yes \| Requires `glm_moe_dsa` + compressed-tensors support \|
	\| SGLang >= 0.5.10 \| Yes \| Requires GLM-5.1 architecture support \|
	\| transformers >= 5.4.0 \| Yes \| Direct loading with `device_map="auto"` \|

	## Notes

	- This is a 754B MoE model (~40B active per token). Requires multi-GPU setup for inference (8x 80GB+ GPUs recommended).
	- FP8 E4M3 provides ~2x compression over BF16 with minimal quality degradation.
	- Compatible with Hopper (SM90) and Blackwell GPUs.
	- Dynamic activation scaling — scales computed at inference time, not baked into the checkpoint.
	- GLM-5.1 does not ship MTP weights despite `num_nextn_predict_layers=1` in config.

	## Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)

	If running on Blackwell workstation GPUs (SM 12.0), vLLM 0.19.0 requires patches for FlashMLA sparse attention support:

	```bash
	# Patch 1: FlashMLA ops - add SM120 to sparse support check
	FLASHMLA_OPS=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/ops/flashmla.py'))") && \
	sed -i 's/is_device_capability_family(90)\s*or current_platform.is_device_capability_family(100)/is_device_capability_family(90) or current_platform.is_device_capability_family(100) or current_platform.is_device_capability_family(120)/' "$FLASHMLA_OPS"

	# Patch 2: FlashMLA sparse backend - add SM12 to capability check
	FLASHMLA_SPARSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla_sparse.py'))") && \
	sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_SPARSE"

	# Patch 3: FlashMLA dense backend (if exists)
	FLASHMLA_DENSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla.py'))") && \
	sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_DENSE" 2>/dev/null \|\| true
	```

	These patches add SM120 (Blackwell workstation) to the supported compute capability list for GLM-5.1's DSA sparse attention.

	## Quantization Process

	- Tool: Custom layer-by-layer pipeline with native `torch.float8_e4m3fn` dtype
	- Hardware: Single NVIDIA RTX PRO 6000 Blackwell (96 GB), processed one layer at a time
	- Time: ~319 minutes for 78 layers
	- Calibration: 256 samples, per-module activation statistics with MoE expert input hooks