Observing (no content) response from the model randomly
#18
by shivamashtikar - opened
Observing `(no content)` randomly in responses from the model, appearing in between text, reasoning, and tool calls when running the model with vLLM. Observed similar behavior with sglang, where tool calls themselves were not working.
```json
{
  "role": "assistant",
  "content": [
    {"type": "text", "text": "(no content)"},
    {"type": "tool_use", "id": "functions.Read:2", "name": "Read", "input": {"file_path": "/Users/shivam.ashtikar/workspace/opencode/README.md"}},
    {"type": "tool_use", "id": "functions.Bash:3", "name": "Bash", "input": {"command": "ls -la /Users/shivam.ashtikar/workspace/opencode/packages", "description": "List packages directory structure"}},
    {"type": "tool_use", "id": "functions.Read:4", "name": "Read", "input": {"file_path": "/Users/shivam.ashtikar/workspace/opencode/package.json"}, "cache_control": {"type": "ephemeral"}}
  ]
}
```
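As a client-side workaround until a proper parser fix lands, one can strip these placeholder text blocks before persisting or replaying the conversation. A minimal sketch, assuming the block-list message shape shown above; the helper name is hypothetical and not part of any library:

```python
def strip_no_content_blocks(message: dict) -> dict:
    """Return a copy of an assistant message with '(no content)' text blocks removed.

    Assumes `content` is a list of typed blocks as in the example above;
    string content and other block types are passed through unchanged.
    """
    content = message.get("content")
    if not isinstance(content, list):
        return message  # plain-string content: nothing to filter
    cleaned = [
        block
        for block in content
        if not (block.get("type") == "text" and block.get("text") == "(no content)")
    ]
    return {**message, "content": cleaned}
```

Unlike patching the server, this drops only the literal placeholder blocks on the client side and leaves tool calls and real text intact.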
Here is the code modification I had to make in vLLM to get rid of the `(no content)` tokens, though it leads to data loss:
https://github.com/vllm-project/vllm/pull/33248
Sharing the vLLM configuration here too:
```shell
.venv/bin/vllm serve moonshotai/Kimi-K2.5 \
--host 0.0.0.0 \
--port 8000 \
--chat-template ./chat_template.jinja \
--tokenizer-mode auto \
--mm-encoder-tp-mode data \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--gpu-memory-utilization 0.9 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--max-num-seqs 64 \
--trust-remote-code \
--safetensors-load-strategy eager \
--decode-context-parallel-size 8 \
--served-model-name kimi-k2-5 \
--cudagraph-metrics \
--enable-mfu-metrics \
--kv-cache-metrics \
--kv-cache-metrics-sample 0.05 \
--max-cudagraph-capture-size 1024 \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--enable-chunked-prefill \
--enable-prefix-caching \
--override-generation-config '{"temperature": 1, "top_p": 0.95, "repetition_penalty": 1.05, "top_k": 25, "max_new_tokens": 32384}'
```