## Notes

- This is a 754B MoE model (~40B active per token). Requires a multi-GPU setup for inference (8x 80GB+ GPUs recommended).
- FP8 E4M3 provides ~2x compression over BF16 with minimal quality degradation.
- Compatible with Hopper (SM90) and Blackwell GPUs.
- Dynamic activation scaling: scales are computed at inference time, not baked into the checkpoint.
- GLM-5.1 does not ship MTP weights despite `num_nextn_predict_layers=1` in config.
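The dynamic-scaling bullet above can be sketched in a few lines. This is a pure-Python illustration with hypothetical helper names (`dynamic_fp8_scale` is not part of this repo's pipeline), and it shows only the runtime scale computation; real FP8 casting also rounds mantissas onto the E4M3 grid.

```python
# Minimal sketch of dynamic per-tensor FP8 E4M3 activation scaling.
# Illustrative names only; not this repo's actual quantization code.

FP8_E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def dynamic_fp8_scale(values):
    """Derive the scale from the live tensor's absolute maximum at inference time."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0

def quantize_dequantize(values):
    """Scale into the FP8 range, clamp, then scale back (rounding omitted)."""
    scale = dynamic_fp8_scale(values)
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return [x * scale for x in q], scale

acts = [0.5, -2.0, 3.75, 896.0]
deq, scale = quantize_dequantize(acts)
print(scale)  # 2.0: an amax of 896 maps exactly onto the 448 FP8 ceiling
```

Because the scale tracks each tensor's observed amax, no activation scales need to be stored in the checkpoint.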

## Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)

If running on Blackwell workstation GPUs (SM 12.0), vLLM 0.19.0 requires the following patches to enable FlashMLA sparse attention support:

```bash
# Patch 1: FlashMLA ops - add SM120 to the sparse support check
FLASHMLA_OPS=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/ops/flashmla.py'))") && \
sed -i 's/is_device_capability_family(90)\s*or current_platform.is_device_capability_family(100)/is_device_capability_family(90) or current_platform.is_device_capability_family(100) or current_platform.is_device_capability_family(120)/' "$FLASHMLA_OPS"

# Patch 2: FlashMLA sparse backend - add SM120 to the capability check
FLASHMLA_SPARSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla_sparse.py'))") && \
sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_SPARSE"

# Patch 3: FlashMLA dense backend (patched only if the file exists)
FLASHMLA_DENSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla.py'))") && \
sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_DENSE" 2>/dev/null || true
```

These patches add SM120 (Blackwell workstation) to the supported compute capability list for GLM-5.1's DSA sparse attention.
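As a standalone illustration of what the patches change, here is the shape of the compute-capability gate the `sed` commands widen. The names mirror vLLM's check, but this is a self-contained sketch, not vLLM code.

```python
# Sketch of the compute-capability gate the patches widen (not vLLM code).
from collections import namedtuple

DeviceCapability = namedtuple("DeviceCapability", ["major", "minor"])

def flashmla_sparse_supported(capability):
    # Unpatched vLLM 0.19.0 allows only majors 9 (Hopper SM90) and 10
    # (Blackwell datacenter); the patch adds major 12 (Blackwell workstation).
    return capability.major in [9, 10, 12]

print(flashmla_sparse_supported(DeviceCapability(12, 0)))  # True
```

Without the added `12`, this check rejects SM 12.0 GPUs and vLLM falls back to an unsupported-attention error for GLM-5.1's sparse attention path.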

## Quantization Process

- **Tool**: Custom layer-by-layer pipeline with native `torch.float8_e4m3fn` dtype