Less Context Length than Expected (600k)

#57
by Forcewithme - opened

Deploying with lmsysorg/sglang:glm5-hopper on 8×H20-3e (141G) GPUs, with the official command:

python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp-size 8 \
  --tool-call-parser glm47  \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.85 \
  --served-model-name glm-5-fp8
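
If the server is silently capping the context, it may help to pin the context length explicitly at launch and then confirm what the running server actually reports. Note the `--context-length` flag and the `/get_server_info` endpoint below are assumptions based on recent sglang releases; verify against `--help` inside your image.

```shell
# Assumption: recent sglang versions accept --context-length to override the
# limit derived from the model's config.json; check --help in your image.
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp-size 8 \
  --context-length 200000 \
  --served-model-name glm-5-fp8

# Then check what the running server actually advertises (endpoint name is
# an assumption; recent sglang builds expose /get_server_info):
curl http://localhost:30000/get_server_info
```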

I found that as soon as the prefill tokens reach 600k, the server returns empty content, indicating the context length exceeds the limit. But it shouldn't.

On the same machine with sglang 0.5.9, qwen3.5-397b, kimi-k2.5, and minimax-m2.5 can all reach their maximum context lengths (196k and 256k).


Deploying the service with the official image and launch command, I found it only supports up to a 600k context; once 600k is reached, the content field in the response comes back empty. In my experience this usually indicates insufficient GPU memory. However, we deployed qwen3.5-397b, kimi-k2.5, and minimax-m2.5 on the same hardware and all of them reach their full context lengths; kimi-k2.5 is even a 1T-parameter model.

So I don't understand why locally deployed GLM-5 only supports up to 600k. I'd appreciate an answer from the maintainers or the community.
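
For reference, a back-of-envelope sketch of why long prefills can exhaust GPU memory: the KV cache grows linearly with context length. All model dimensions below are hypothetical placeholders, not GLM-5's actual architecture; substitute the real values from the model's config.json.

```python
# Back-of-envelope KV-cache sizing. All dimensions here are hypothetical
# placeholders, NOT GLM-5's real config -- substitute the values from the
# model's config.json (num_hidden_layers, num_key_value_heads, head_dim).

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    # Factor of 2 covers the K and V tensors stored at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical example: 80 layers, 8 KV heads, head_dim 128, FP8 (1 byte).
per_token = kv_cache_bytes_per_token(80, 8, 128, 1)  # 163840 bytes per token

# At a 600k-token prefill the cache alone would need ~92 GiB across the
# tensor-parallel group, before weights and activations are counted.
total_gib = per_token * 600_000 / 2**30
print(f"{per_token} B/token, ~{total_gib:.1f} GiB at 600k tokens")
```

Numbers like these are why an empty response at a specific prefill length often points at the KV-cache budget (`--mem-fraction-static`) rather than the model's nominal context limit.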

I also found a new image on Docker Hub: docker pull lmsysorg/sglang:glm5-hopper-patched. What is this image used for?

GLM-5 only supports a 200K context.

ZHANGYUXUAN-zR changed discussion status to closed

GLM-5 only supports a 200K context.

It's a typo: it only supports 60k, not 600k.


Typo on my part; in my tests I found it only supports up to 60k.
