Update README.md

#10
Opened by UnicornChan
Files changed (1)
  1. README.md +55 -3
README.md CHANGED
### Prepare environment

vLLM, SGLang, KTransformers, and xLLM all support local deployment of GLM-5. A simple deployment guide is provided here.

+ vLLM

Using Docker:

```shell
docker pull vllm/vllm-openai:nightly
```

or using pip:
 
...

```shell
docker pull lmsysorg/sglang:glm5-blackwell # For Blackwell GPU
```

+ KTransformers (SGLang + KT-Kernel, CPU-GPU heterogeneous inference)

Install SGLang:

```bash
git clone https://github.com/kvcache-ai/sglang.git
cd sglang
pip install -e "python[all]"
```

Install KT-Kernel:

```bash
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive
cd kt-kernel && ./install.sh
```
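After both installs, it is worth confirming that the Python environment actually sees the packages before moving on. A minimal sketch — the top-level module names `sglang` and `kt_kernel` are assumptions about what the two installs provide, so adjust them if your build differs:

```python
import importlib.util

def check_modules(names):
    """Map each module name to whether it can be imported from this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# "sglang" and "kt_kernel" are assumed top-level module names.
print(check_modules(["sglang", "kt_kernel"]))
```

If either entry comes back `False`, re-run the corresponding install step before attempting deployment.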
### Deploy

+ vLLM
 
...

```shell
  --mem-fraction-static 0.85 \
  --served-model-name glm-5-fp8
```

Check the [sglang cookbook](https://cookbook.sglang.io/autoregressive/GLM/GLM-5) for more details.
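The deployed server exposes an OpenAI-compatible API, so any standard client works. A stdlib-only sketch of a chat-completions request, assuming the server listens on SGLang's default port 30000 and serves the model under the name `glm-5-fp8` from the command above:

```python
import json
from urllib import request

# OpenAI-style chat-completions payload; "glm-5-fp8" matches --served-model-name.
payload = {
    "model": "glm-5-fp8",
    "messages": [{"role": "user", "content": "Introduce GLM-5 in one sentence."}],
    "max_tokens": 64,
}

req = request.Request(
    "http://localhost:30000/v1/chat/completions",  # assumed host and default port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the server running, send the request and print the reply:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Swap in the host, port, and served model name from your own launch command.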
+ KTransformers (SGLang + KT-Kernel)

```bash
export PYTORCH_ALLOC_CONF=expandable_segments:True
export SGLANG_ENABLE_JIT_DEEPGEMM=0

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model zai-org/GLM-5 \
  --kt-weight-path /path/to/GLM-5 \
  --kt-cpuinfer 96 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 10 \
  --kt-method BF16 \
  --kt-gpu-prefill-token-threshold 1024 \
  --kt-enable-dynamic-expert-update \
  --kt-expert-placement-strategy uniform \
  --trust-remote-code \
  --mem-fraction-static 0.75 \
  --served-model-name GLM5 \
  --enable-mixed-chunk \
  --tensor-parallel-size 8 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --chunked-prefill-size 16384 \
  --max-running-requests 4 \
  --max-total-tokens 128000 \
  --attention-backend flashinfer \
  --watchdog-timeout 3000
```

Check the [KTransformers GLM-5 tutorial](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/GLM-5-Tutorial.md) for more details.
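Loading GLM-5 weights for CPU-GPU heterogeneous inference can take several minutes, so it helps to poll the server before sending traffic. A stdlib sketch, assuming the server above is on port 30000 and that the SGLang runtime answers `GET /health` (check your SGLang version if the route differs):

```python
import time
from urllib import request, error

def wait_for_server(base_url="http://localhost:30000", timeout_s=600, interval_s=5):
    """Poll the server's health route until it answers 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with request.urlopen(base_url + "/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval_s)
    return False

# Call wait_for_server() after launching; send requests only once it returns True.
```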
+ xLLM and other Ascend NPU

Please check the deployment guide [here](https://github.com/zai-org/GLM-5/blob/main/example/ascend.md).