🐛 Bug Report: Truncated JSON / Unterminated String in Tool Calling with `kimi-k2.6` + EAGLE3 Speculative Decoding
Summary
When serving kimi-k2.6 with the kimi-k2.6-eagle3 draft model (EAGLE3 speculative decoding / MTP) on 8× H20 using vLLM v0.19.1, the model frequently generates corrupted/truncated JSON during tool-calling turns. This causes downstream tool parsers (e.g., OpenCode's bash tool) to fail with JSON Parse error: Unterminated string.
The issue does not reproduce with kimi-k2.5 + kimi-k2.5-eagle3 on the exact same vLLM v0.19.1 image and identical hardware. This isolates the bug to the k2.6 model family (or its EAGLE3 draft model) under speculative decoding.
Note: We also verified the issue persists on the
deepseekv4-cu129pre-release image, but the primary regression is observed on the stablev0.19.1image.
Environment
- Hardware: 8× NVIDIA H20
- Serving Engine:
vllm/vllm-openai:v0.19.1 - Base Model (Fail):
/model_files/kimi-k2.6 - Draft Model (Fail):
/model_files/kimi-k2.6-eagle3(EAGLE3) - Base Model (Pass):
/model_files/Kimi-K2.5 - Draft Model (Pass):
/model_files/Kimi-K2.5-Eagle3 - Tensor Parallelism: 8
- Client / Agent: OpenCode (coding agents with tool calling)
Reproduction — Failing Case (k2.6 + EAGLE3) on vLLM v0.19.1
docker run --rm -it --name vllm-kimi-k2.6 \
--gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
-p 8000:8000 \
-v /model_files:/model_files \
vllm/vllm-openai:v0.19.1 \
--model /model_files/kimi-k2.6 \
--tensor-parallel-size 8 \
--served-model-name /model_files/h20/Kimi-K2.5 \
--compilation-config '{"pass_config": {"fuse_allreduce_rms": true}}' \
--speculative-config '{"model": "/model_files/kimi-k2.6-eagle3", "method": "eagle3", "num_speculative_tokens": 3}' \
--gpu-memory-utilization 0.90 \
--max-num-batched-tokens 16384 \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--mm-processor-cache-gb 128 \
--max-num-seqs 25 \
--enable-chunked-prefill \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--trust-remote-code \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--safetensors-load-strategy=prefetch
Result: Tool calling fails with corrupted JSON output.
Control Experiment — Passing Case (k2.5 + EAGLE3) on vLLM v0.19.1
docker run --rm -it --name vllm-kimi-mtp \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--gpus all \
-e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
-p 8000:8000 \
-v /ssd1:/model_files \
vllm/vllm-openai:v0.19.1 \
--model /model_files/Kimi-K2.5 \
--tensor-parallel-size 8 \
--served-model-name /model_files/h20/Kimi-K2.5 \
--compilation_config.pass_config.fuse_allreduce_rms true \
--speculative-config '{"model": "/model_files/Kimi-K2.5-Eagle3", "method": "eagle3", "num_speculative_tokens": 3}' \
--gpu-memory-utilization 0.92 \
--max-num-batched-tokens 8192 \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--mm-processor-cache-gb 128 \
--enable-chunked-prefill \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--trust-remote-code \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--safetensors-load-strategy=prefetch
Result: Tool calling works correctly; JSON is well-formed and complete.
Observed Error
During tool invocation, the model output is truncated inside a string value, causing the parser to fail:
⚙ invalid [tool=bash, error=Invalid input for tool bash: JSON parsing failed:
Text: {"command": "bash ~/.claude/skills/web-access/scripts/check-deps.sh", "description": "检查 CDP 依赖环境.
Error message: JSON Parse error: Unterminated string]
Key observation: The truncation occurs mid-Unicode-string (inside a Chinese text value). The closing quote is missing, suggesting the token stream was cut at an invalid byte boundary rather than at a valid JSON boundary.
Analysis & Hypothesis
The only meaningful differences between the Pass and Fail configurations are:
- Base model:
k2.5→k2.6 - Draft model:
k2.5-eagle3→k2.6-eagle3
Everything else — vLLM version (v0.19.1), hardware (8× H20), TP size, EAGLE3 method, and tool-calling flags — is held constant. This strongly suggests the issue is not a generic vLLM speculative-decoding regression, but rather specific to the kimi-k2.6 model (or its EAGLE3 draft model).
We also reproduced the failure on the deepseekv4-cu129 pre-release image, indicating the bug survives across vLLM versions.
Possible cause: EAGLE3 performs token-level speculation. If the draft model and target model disagree on token boundaries—particularly around multi-byte UTF-8 characters (e.g., Chinese text)—the verification/rollback mechanism may truncate the output stream at an invalid byte offset. The k2.6 tokenizer or output distribution may have changed in a way that amplifies this boundary misalignment compared to k2.5.
Requests for Maintainers / Model Authors
- Has
kimi-k2.6-eagle3been internally validated with structured generation / tool calling / JSON mode under EAGLE3 speculative decoding? - Have you observed similar truncation issues where speculative decoding cuts UTF-8 strings mid-character?
- Is there a recommended generation config or vLLM flag (e.g., adjusting
--num-speculative-tokens, disabling chunked prefill, changingacceptance_sampler) to mitigate this until a fix is available? - Could there be a mismatch between the k2.6 base tokenizer and the k2.6-eagle3 draft tokenizer/vocabulary that breaks the speculative path?
Additional Context
- The issue is reliable (not flaky) when using k2.6 + eagle3 with coding agents.
- Disabling speculative decoding (removing
--speculative-config) prevents the issue, confirming the bug is triggered by the MTP path.
Labels: bug, speculative-decoding, eagle3, tool-calling, k2.6
Could you give a try to https://huggingface.co/lightseekorg/kimi-k2.6-eagle3-mla and see if that resolves your issue? Thanks!
Hi @rogerwyf , thanks for the follow-up!
I can confirm that switching to the kimi-k2.6-eagle3-mla draft model resolves the issue.
Could you give a try to https://huggingface.co/lightseekorg/kimi-k2.6-eagle3-mla and see if that resolves your issue? Thanks!
Hi @rogerwyf , thanks for the follow-up!
I can confirm that switching to the kimi-k2.6-eagle3-mla draft model resolves the issue.