Possible to run on six RTX Pro 6000 Blackwell with vLLM or SGLang?
Is it possible to run this model with 6 RTX Pro 6000 Blackwell GPUs? I think it should generally work with a combination of --tensor-parallel-size 2 and --pipeline-parallel-size 3, but I am not sure.
would be interesting to see if it does
Not on the latest vllm:
NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the SupportsPP interface.
For me, the following works now with 6 RTX Pro 6000 Blackwell in the current "glm5-blackwell" SGLang Docker image:
```shell
python -m sglang.launch_server \
  --model-path $MODEL_PATH \
  --served-model-name $MODEL_NAME \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --tp 2 \
  --pp-size 3 \
  --mem-fraction-static 0.95 \
  --max-running-requests 8 \
  --kv-cache-dtype bf16 \
  --quantization modelopt_fp4 \
  --attention-backend flashinfer \
  --moe-runner-backend flashinfer_cutlass \
  --host 0.0.0.0 \
  --port $HTTP_PORT_SGLANG
```
--kv-cache-dtype bf16 is important; without it, the server crashes after the first generated token.
I get about 31 tokens/s generation speed. I did not test other --attention-backend and --moe-runner-backend options.
Maybe that can be added to the Readme?
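The token rates quoted in this thread come from the SGLang log output. For a client-side cross-check, a minimal sketch of the arithmetic (the commented request code assumes an OpenAI-compatible endpoint; the host, port, and model name are placeholders for your setup):

```python
def tokens_per_second(n_tokens: int, start: float, end: float) -> float:
    """Average decode speed over a finished generation stream."""
    elapsed = end - start
    return n_tokens / elapsed if elapsed > 0 else 0.0

if __name__ == "__main__":
    # With the numbers reported above, ~31 tok/s corresponds to
    # roughly 310 tokens generated in 10 seconds.
    print(tokens_per_second(310, 0.0, 10.0))  # 31.0

    # Against a real server you would time a streaming request, e.g.:
    #   import time
    #   from openai import OpenAI
    #   client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
    #   start = time.monotonic()
    #   n = 0
    #   for chunk in client.chat.completions.create(
    #           model="glm-5", stream=True,
    #           messages=[{"role": "user", "content": "Hello"}]):
    #       if chunk.choices and chunk.choices[0].delta.content:
    #           n += 1  # roughly one chunk per token
    #   print(tokens_per_second(n, start, time.monotonic()))
```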
Wonder if we can run Kimi 2.5 like that too
I don't know, but I think six 96 GB GPUs are too small for Kimi K2.5 at int4/NVFP4. To find out whether you can use your number of GPUs in tensor parallel mode, check whether the number of attention heads in config.json is divisible by the number of GPUs. GLM-5 and Kimi K2.5 both have 64 attention heads, so 6 doesn't work, but 2 does, which is why we additionally have to use pipeline parallelism. Pipeline parallelism is mostly used for inference across multiple servers and is probably less tested for new models, so the chance that you are running a broken code path may be higher.
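The head-count check above is easy to script. A minimal sketch (it assumes the Hugging Face config key is named `num_attention_heads`, which is common but worth verifying for your checkpoint):

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int) -> list[int]:
    """Return the tensor-parallel sizes up to max_gpus that evenly
    divide the attention head count."""
    return [tp for tp in range(1, max_gpus + 1)
            if num_attention_heads % tp == 0]

if __name__ == "__main__":
    # GLM-5 and Kimi K2.5 both report 64 attention heads.
    heads = 64
    print(valid_tp_sizes(heads, 8))  # [1, 2, 4, 8]
    # 6 is absent, so with six GPUs you need e.g. --tp 2 --pp-size 3.

    # With a downloaded checkpoint you would read the value instead:
    #   import json
    #   heads = json.load(open("config.json"))["num_attention_heads"]
```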
I had the problem that after some good output, only garbage was produced. I have therefore updated to the latest "dev-cu13" Docker image of SGLang and set the recommended parameters for SWE-bench Verified in the LiteLLM proxy (which I have in front of SGLang):
```yaml
api_key: "randomKey12345"
temperature: 0.7
top_p: 1.0
max_tokens: 32768
# min_p: 0.01  # see below
```
Setting these parameters seems to have fixed the problem.
Edit: I found that setting min_p != 0 has an impact on generation speed. Without min_p, I now get around 31 tokens/s according to the SGLang output; with min_p = 0.01, it is only around 28.
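A plausible explanation for the small slowdown is that min_p adds a per-step filtering pass over the token distribution. A pure-Python illustration of the min_p semantics (tokens below min_p times the top probability are dropped, then the rest is renormalized); this is only a sketch of the idea, not SGLang's actual kernel, and it uses min_p = 0.2 rather than 0.01 so the effect is visible:

```python
def min_p_filter(probs: list[float], min_p: float) -> list[float]:
    """Zero out tokens whose probability is below min_p * max(probs),
    then renormalize the survivors. Illustration only, not SGLang's
    implementation."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]
    # With min_p = 0.2 the threshold is 0.2 * 0.5 = 0.1, so only the
    # 0.05 token is dropped before renormalization.
    print(min_p_filter(probs, 0.2))
```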
Edit 2: At the moment, I have downgraded to GLM-4.7. After many good output tokens, SGLang was crashing with a CUDA exception. Enabling --enable-nan-detection prevents the crash and instead shows "NaN in the logits", and the output from that point on is garbage. The problem is easy to reproduce, e.g. by using opencode once around 50k tokens have been used. I don't know what the cause is; hopefully it is just an SGLang bug that gets fixed so I can try again in a few weeks.
Here are my findings with running 6x cards:
- SGLang does not support PP with speculative decoding (which is a big loss):
  `AssertionError: Pipeline parallelism is not compatible with overlap schedule, speculative decoding, mixed chunked prefill.`
- You need to increase the memory to a minimum of `--mem-fraction-static=0.94`.
- I used festr's Docker image: https://github.com/voipmonitor/rtx6kpro/blob/master/models/glm5.md
- It runs stably without speculative decoding at 24 tokens/second for a single request, so it is very slow.
It is not worth running this on 6 cards. Upgrade to 8 and you can get 100 tokens per second.