## Notes

- This is a 754B MoE model (~40B active per token). Requires a multi-GPU setup for inference (8x 80GB+ GPUs recommended).
- FP8 E4M3 provides ~2x compression over BF16 with minimal quality degradation.
- Compatible with Hopper (SM90) and Blackwell GPUs.
- Dynamic activation scaling: scales are computed at inference time, not baked into the checkpoint.
- GLM-5.1 does not ship MTP weights despite `num_nextn_predict_layers=1` in config.
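The dynamic-scaling bullet above can be sketched in a few lines. This is a pure-Python illustration with hypothetical helper names (`dynamic_fp8_scale` is not part of this repo's pipeline), and it shows only the runtime scale computation; real FP8 casting also rounds mantissas onto the E4M3 grid.

```python
# Minimal sketch of dynamic per-tensor FP8 E4M3 activation scaling.
# Illustrative names only; not this repo's actual quantization code.

FP8_E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def dynamic_fp8_scale(values):
    """Derive the scale from the live tensor's absolute maximum at inference time."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0

def quantize_dequantize(values):
    """Scale into the FP8 range, clamp, then scale back (rounding omitted)."""
    scale = dynamic_fp8_scale(values)
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return [x * scale for x in q], scale

acts = [0.5, -2.0, 3.75, 896.0]
deq, scale = quantize_dequantize(acts)
print(scale)  # 2.0: an amax of 896 maps exactly onto the 448 FP8 ceiling
```

Because the scale tracks each tensor's observed amax, no activation scales need to be stored in the checkpoint.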

## Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)

If running on Blackwell workstation GPUs (SM 12.0), vLLM 0.19.0 requires the following patches to enable FlashMLA sparse attention support:

```bash
# Patch 1: FlashMLA ops - add SM120 to the sparse support check
FLASHMLA_OPS=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/ops/flashmla.py'))") && \
sed -i 's/is_device_capability_family(90)\s*or current_platform.is_device_capability_family(100)/is_device_capability_family(90) or current_platform.is_device_capability_family(100) or current_platform.is_device_capability_family(120)/' "$FLASHMLA_OPS"

# Patch 2: FlashMLA sparse backend - add SM120 to the capability check
FLASHMLA_SPARSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla_sparse.py'))") && \
sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_SPARSE"

# Patch 3: FlashMLA dense backend (patched only if the file exists)
FLASHMLA_DENSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla.py'))") && \
sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_DENSE" 2>/dev/null || true
```

These patches add SM120 (Blackwell workstation) to the supported compute capability list for GLM-5.1's DSA sparse attention.
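As a standalone illustration of what the patches change, here is the shape of the compute-capability gate the `sed` commands widen. The names mirror vLLM's check, but this is a self-contained sketch, not vLLM code.

```python
# Sketch of the compute-capability gate the patches widen (not vLLM code).
from collections import namedtuple

DeviceCapability = namedtuple("DeviceCapability", ["major", "minor"])

def flashmla_sparse_supported(capability):
    # Unpatched vLLM 0.19.0 allows only majors 9 (Hopper SM90) and 10
    # (Blackwell datacenter); the patch adds major 12 (Blackwell workstation).
    return capability.major in [9, 10, 12]

print(flashmla_sparse_supported(DeviceCapability(12, 0)))  # True
```

Without the added `12`, this check rejects SM 12.0 GPUs and vLLM falls back to an unsupported-attention error for GLM-5.1's sparse attention path.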

## Quantization Process

- **Tool**: Custom layer-by-layer pipeline with native `torch.float8_e4m3fn` dtype