Possible to run on six RTX Pro 6000 Blackwell with vLLM or SGLang?
Is it possible to run this model with 6 RTX Pro 6000 Blackwell GPUs? I think it should generally work with a combination of --tensor-parallel-size 2 and --pipeline-parallel-size 3, but I am not sure.
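For reference, this is roughly the vLLM invocation I have in mind ($MODEL_PATH being a placeholder for the local model directory):
vllm serve $MODEL_PATH --tensor-parallel-size 2 --pipeline-parallel-size 3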
would be interesting to see if it does
Not on the latest vllm:
NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the SupportsPP interface.
For me, the following now works with six RTX Pro 6000 Blackwell GPUs in the current "glm5-blackwell" SGLang Docker image:
python -m sglang.launch_server --model-path $MODEL_PATH --served-model-name $MODEL_NAME --reasoning-parser glm45 --tool-call-parser glm47 --tp 2 --pp-size 3 --mem-fraction-static 0.95 --max-running-requests 8 --kv-cache-dtype bf16 --quantization modelopt_fp4 --attention-backend flashinfer --moe-runner-backend flashinfer_cutlass --host 0.0.0.0 --port $HTTP_PORT_SGLANG
--kv-cache-dtype bf16 is important; otherwise it crashes after the first generated token.
I get 31 tokens/s generation speed. I did not test other --attention-backend and --moe-runner-backend options.
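In case it is useful: a quick way to smoke-test the server is its OpenAI-compatible chat endpoint, using the same $HTTP_PORT_SGLANG and $MODEL_NAME as in the launch command above (adjust the host if the server runs on another machine):
curl http://localhost:$HTTP_PORT_SGLANG/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "'"$MODEL_NAME"'", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 64}'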
Maybe that can be added to the Readme?
Wonder if we can run Kimi 2.5 like that too
I don't know, but I think six 96 GB GPUs are too small for Kimi K2.5 at int4/NVFP4. To find out whether you can use your number of GPUs in tensor parallel mode, check whether the number of attention heads in config.json is divisible by the number of GPUs. GLM-5 and Kimi K2.5 both have 64 attention heads, so 6 doesn't work but 2 does, which is why we additionally have to use pipeline parallelism. Pipeline parallelism is mostly used for inference across multiple servers and is probably less tested for new models, so the chance that you are running a broken code path could be higher.
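A quick way to check this is a one-liner run in the model directory; it assumes the config key is num_attention_heads, as it is for GLM- and DeepSeek-style configs:
python3 -c "import json; h = json.load(open('config.json'))['num_attention_heads']; print(h, 'divisible by 6:', h % 6 == 0, 'divisible by 2:', h % 2 == 0)"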
I had the problem that after some good output, only garbage was produced. I therefore updated to the latest "dev-cu13" SGLang Docker image and set the recommended parameters for SWE-bench Verified in the LiteLLM proxy (which I have in front of SGLang):
api_key: "randomKey12345"
temperature: 0.7
top_p: 1.0
max_tokens: 32768
# min_p: 0.01 # see below
Setting these parameters seems to have fixed the problem.
Edit: I found that setting min_p != 0 has an impact on generation speed. Without min_p, I now get around 31 tokens/s according to the SGLang output; with min_p = 0.01, it is only around 28.