🐛 Bug Report: Truncated JSON / Unterminated String in Tool Calling with `kimi-k2.6` + EAGLE3 Speculative Decoding

by bencswong2 - opened Apr 26

Apr 26

Summary

When serving kimi-k2.6 with the kimi-k2.6-eagle3 draft model (EAGLE3 speculative decoding / MTP) on 8× H20 using vLLM v0.19.1, the model frequently generates corrupted/truncated JSON during tool-calling turns. This causes downstream tool parsers (e.g., OpenCode's bash tool) to fail with JSON Parse error: Unterminated string.

The issue does not reproduce with kimi-k2.5 + kimi-k2.5-eagle3 on the exact same vLLM v0.19.1 image and identical hardware. This isolates the bug to the k2.6 model family (or its EAGLE3 draft model) under speculative decoding.

Note: We also verified the issue persists on the deepseekv4-cu129 pre-release image, but the primary regression is observed on the stable v0.19.1 image.

Environment

Hardware: 8× NVIDIA H20
Serving Engine: vllm/vllm-openai:v0.19.1
Base Model (Fail): /model_files/kimi-k2.6
Draft Model (Fail): /model_files/kimi-k2.6-eagle3 (EAGLE3)
Base Model (Pass): /model_files/Kimi-K2.5
Draft Model (Pass): /model_files/Kimi-K2.5-Eagle3
Tensor Parallelism: 8
Client / Agent: OpenCode (coding agents with tool calling)

Reproduction — Failing Case (k2.6 + EAGLE3) on vLLM v0.19.1

docker run --rm -it --name vllm-kimi-k2.6 \
    --gpus all \
    --ipc=host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
    -p 8000:8000 \
    -v /model_files:/model_files \
    vllm/vllm-openai:v0.19.1 \
    --model /model_files/kimi-k2.6 \
    --tensor-parallel-size 8 \
    --served-model-name /model_files/h20/Kimi-K2.5 \
    --compilation-config '{"pass_config": {"fuse_allreduce_rms": true}}' \
    --speculative-config '{"model": "/model_files/kimi-k2.6-eagle3", "method": "eagle3", "num_speculative_tokens": 3}' \
    --gpu-memory-utilization 0.90 \
    --max-num-batched-tokens 16384 \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm \
    --mm-processor-cache-gb 128 \
    --max-num-seqs 25 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --safetensors-load-strategy=prefetch

Result: Tool calling fails with corrupted JSON output.

Control Experiment — Passing Case (k2.5 + EAGLE3) on vLLM v0.19.1

docker run --rm -it --name vllm-kimi-mtp \
    --ipc=host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --gpus all \
    -e VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
    -p 8000:8000 \
    -v /ssd1:/model_files \
    vllm/vllm-openai:v0.19.1 \
    --model /model_files/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --served-model-name /model_files/h20/Kimi-K2.5 \
    --compilation_config.pass_config.fuse_allreduce_rms true \
    --speculative-config '{"model": "/model_files/Kimi-K2.5-Eagle3", "method": "eagle3", "num_speculative_tokens": 3}' \
    --gpu-memory-utilization 0.92 \
    --max-num-batched-tokens 8192 \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm \
    --mm-processor-cache-gb 128 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --safetensors-load-strategy=prefetch

Result: Tool calling works correctly; JSON is well-formed and complete.

Observed Error

During tool invocation, the model output is truncated inside a string value, causing the parser to fail:

⚙ invalid [tool=bash, error=Invalid input for tool bash: JSON parsing failed: 
Text: {"command": "bash ~/.claude/skills/web-access/scripts/check-deps.sh", "description": "检查 CDP 依赖环境.
Error message: JSON Parse error: Unterminated string]

Key observation: The truncation occurs mid-Unicode-string (inside a Chinese text value). The closing quote is missing, suggesting the token stream was cut at an invalid byte boundary rather than at a valid JSON boundary.

Analysis & Hypothesis

The only meaningful differences between the Pass and Fail configurations are:

Base model: k2.5 → k2.6
Draft model: k2.5-eagle3 → k2.6-eagle3

Everything else — vLLM version (v0.19.1), hardware (8× H20), TP size, EAGLE3 method, and tool-calling flags — is held constant. This strongly suggests the issue is not a generic vLLM speculative-decoding regression, but rather specific to the kimi-k2.6 model (or its EAGLE3 draft model).

We also reproduced the failure on the deepseekv4-cu129 pre-release image, indicating the bug survives across vLLM versions.

Possible cause: EAGLE3 performs token-level speculation. If the draft model and target model disagree on token boundaries—particularly around multi-byte UTF-8 characters (e.g., Chinese text)—the verification/rollback mechanism may truncate the output stream at an invalid byte offset. The k2.6 tokenizer or output distribution may have changed in a way that amplifies this boundary misalignment compared to k2.5.

Requests for Maintainers / Model Authors

Has kimi-k2.6-eagle3 been internally validated with structured generation / tool calling / JSON mode under EAGLE3 speculative decoding?
Have you observed similar truncation issues where speculative decoding cuts UTF-8 strings mid-character?
Is there a recommended generation config or vLLM flag (e.g., adjusting --num-speculative-tokens, disabling chunked prefill, changing acceptance_sampler) to mitigate this until a fix is available?
Could there be a mismatch between the k2.6 base tokenizer and the k2.6-eagle3 draft tokenizer/vocabulary that breaks the speculative path?

Additional Context

The issue is reliable (not flaky) when using k2.6 + eagle3 with coding agents.
Disabling speculative decoding (removing --speculative-config) prevents the issue, confirming the bug is triggered by the MTP path.

Labels: bug, speculative-decoding, eagle3, tool-calling, k2.6

rogerwyf

Apr 30

Could you give a try to https://huggingface.co/lightseekorg/kimi-k2.6-eagle3-mla and see if that resolves your issue? Thanks!

bencswong2

May 5

Hi @rogerwyf , thanks for the follow-up!
I can confirm that switching to the kimi-k2.6-eagle3-mla draft model resolves the issue.

bencswong2

May 5

Could you give a try to https://huggingface.co/lightseekorg/kimi-k2.6-eagle3-mla and see if that resolves your issue? Thanks!

Hi @rogerwyf , thanks for the follow-up!
I can confirm that switching to the kimi-k2.6-eagle3-mla draft model resolves the issue.

lightseek changed discussion status to closed May 27

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment