mconcat committed · Commit 9ef68c3 · verified · 1 Parent(s): 40b97bf

Upload README.md with huggingface_hub

Files changed (1): README.md (+22 -2)
README.md CHANGED
@@ -83,12 +83,32 @@ vllm serve mconcat/GLM-5.1-FP8-Dynamic \
 
 ## Notes
 
-- This is a 754B MoE model (~40B active per token). Requires multi-GPU setup for inference.
+- This is a 754B MoE model (~40B active per token). Requires a multi-GPU setup for inference (8x 80GB+ GPUs recommended).
 - FP8 E4M3 provides ~2x compression over BF16 with minimal quality degradation.
-- Compatible with Hopper (SM90) and Blackwell (SM100+) GPUs.
+- Compatible with Hopper (SM90) and Blackwell GPUs.
 - Dynamic activation scaling — scales computed at inference time, not baked into the checkpoint.
 - GLM-5.1 does not ship MTP weights despite `num_nextn_predict_layers=1` in config.
 
+## Blackwell SM120 Patch (RTX PRO 6000 / Workstation GPUs)
+
+If running on Blackwell workstation GPUs (SM 12.0), vLLM 0.19.0 requires patches for FlashMLA sparse attention support:
+
+```bash
+# Patch 1: FlashMLA ops - add SM120 to the sparse support check
+FLASHMLA_OPS=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/ops/flashmla.py'))") && \
+sed -i 's/is_device_capability_family(90)\s*or current_platform.is_device_capability_family(100)/is_device_capability_family(90) or current_platform.is_device_capability_family(100) or current_platform.is_device_capability_family(120)/' "$FLASHMLA_OPS"
+
+# Patch 2: FlashMLA sparse backend - add SM120 to the capability check
+FLASHMLA_SPARSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla_sparse.py'))") && \
+sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_SPARSE"
+
+# Patch 3: FlashMLA dense backend (if it exists)
+FLASHMLA_DENSE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/attention/backends/mla/flashmla.py'))") && \
+sed -i 's/return capability.major in \[9, 10\]/return capability.major in [9, 10, 12]/' "$FLASHMLA_DENSE" 2>/dev/null || true
+```
+
+These patches add SM120 (Blackwell workstation) to the supported compute capability list for GLM-5.1's DSA sparse attention.
+
 ## Quantization Process
 
 - **Tool**: Custom layer-by-layer pipeline with native `torch.float8_e4m3fn` dtype
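
The effect of the sed patches above can be sketched as a simple gate over the CUDA compute capability major version. This is a minimal illustration, not vLLM's actual code; it only mirrors `capability.major in [9, 10]` becoming `[9, 10, 12]` after the edits:

```python
# Sketch of the compute-capability gate the patches relax (not vLLM's real code).
# Major 9 = Hopper (SM90), 10 = Blackwell datacenter (SM100+),
# 12 = Blackwell workstation (SM120, e.g. RTX PRO 6000).

def flashmla_sparse_supported(major: int, patched: bool = False) -> bool:
    allowed = [9, 10, 12] if patched else [9, 10]
    return major in allowed

# An RTX PRO 6000 reports compute capability 12.0:
print(flashmla_sparse_supported(12))                # False - stock check rejects SM120
print(flashmla_sparse_supported(12, patched=True))  # True - patched check accepts it
```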
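
The "dynamic activation scaling" note can also be illustrated with a short sketch: the per-tensor scale is derived from the live activations at inference time, which is why no activation scales are stored in the checkpoint. The helper below is hypothetical (not the pipeline's code); 448 is the largest finite value of `torch.float8_e4m3fn`:

```python
# Hypothetical sketch of dynamic FP8 E4M3 activation scaling.
# The scale is recomputed from each activation tensor at runtime,
# so it is never baked into the checkpoint.
E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3

def dynamic_scale(activations):
    """Per-tensor scale mapping the observed absolute max onto the E4M3 range."""
    amax = max(abs(v) for v in activations)
    return max(amax, 1e-12) / E4M3_MAX

acts = [0.53, -3.2, 7.1, -896.0]       # one activation tensor, flattened
scale = dynamic_scale(acts)            # 896 / 448 = 2.0
quantized = [v / scale for v in acts]  # values now fit the E4M3 range

print(scale)                           # 2.0
print(max(abs(v) for v in quantized))  # 448.0
```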