Deterministic ! token collapse at ~14,200 input-token threshold (reproducible under greedy decode)

#13
by bionexus-gunhopark - opened

Summary

Running moonshotai/Kimi-K2.6 via vLLM with the official deploy recipe, the model deterministically enters a degenerate loop that emits only the ! token (ASCII 0x21) — starting from the first reasoning token — once the input token count crosses ~14,200. The threshold is sharp: a 12-token difference flips the outcome. It is not a function of raw bytes, content, or specific words — plain-text inputs up to 60 KB (9,432 tokens) complete cleanly.

Reproduces under temperature=0 greedy decode, so this is a logit-level collapse, not a sampling pathology.

Environment

Model moonshotai/Kimi-K2.6 (native INT4, compressed-tensors, group_size=32)
Hardware 8× NVIDIA B200 (180 GB each), single node
vLLM 0.19.1 (your manually-verified version). Also reproduced on nightly 0.19.2rc1.dev.
transformers 4.57.6 (in your >=4.57.1, <5.0.0 range)
torch 2.11.0+cu130
TP 8, max_model_len 131072 (also tried 204800 — same result)
KV cache dtype auto (BF16)

Server command — exactly your recipe:

vllm serve moonshotai/Kimi-K2.6 \
  --trust-remote-code --tensor-parallel-size 8 \
  --tool-call-parser kimi_k2 --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 --mm-encoder-tp-mode data

Observation: the threshold is ~14,200 input tokens

Tokenizing every reproducer through Kimi's own chat template (tokenizer.apply_chat_template(..., tools=...)), the degeneration is cleanly separated from non-degeneration by input-token count regardless of content:

payload input tokens body (KB) result
60 KB plain English prose, no tools 9,432 60 ✓ clean
Claude-Code-style system + 28 tools + small user + brief prior tool turn, with Claude refs scrubbed 14,188 67 ✓ clean
same, +12 tokens in the system prompt 14,200 67 ✗ degenerate
same, +80 tokens 14,279 68 ✗ degenerate
same, +100 tokens 14,294 68 ✗ degenerate

A 12-token increment flips the outcome. We verified this is content-independent by injecting seven different 500-char payloads into the same structure at the same final token count — "You are Claude/GPT/Gemini/Kimi/Grok/Mistral/...", cooking instructions, weather descriptions — all seven degenerate identically. So this is not a safety guardrail, not a distillation-trace evasion, not tied to any specific keyword.

Every streaming response past the threshold looks like:

data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}
data: {"choices":[{"index":0,"delta":{"reasoning":" "}}]}
data: {"choices":[{"index":0,"delta":{"reasoning":"!"}}]}
data: {"choices":[{"index":0,"delta":{"reasoning":"!"}}]}
...                       (continues until max_tokens fires)

The ! (token 0x21) is the same every time, on every degenerating run (~30 samples).

What was ruled out

  • Sampling: reproduces identically at temperature=0 (greedy), top_p=0.95, top_p=1.0.
  • Reasoning-history format: reproduces with and without reasoning_content preserved on prior assistant turns.
  • Tool-result payload specifics: stripping injected guardrail/system-reminder blocks, JSON-escape variants — no change.
  • vLLM version: reproduces on 0.19.1 (the manually-verified build) and on the nightly 0.19.2rc1.dev wheel.
  • transformers version: verified in the stated >=4.57.1, <5.0.0 range (4.57.6).
  • Download integrity: all 64 safetensor shards' content-addressed hashes match the K2.6 remote (zero mismatches, checked via huggingface_hub.model_info(..., files_metadata=True)); index.json total_size matches on-disk (595.2 GB).
  • Pure prompt length without tools: a single-turn 60 KB plain-text user message tokenizes to 9,432 tokens and completes cleanly.
  • Content: at the same token count, replacing Claude/Anthropic/OpenClaude substrings, or injecting any other 500-char text — "You are GPT/Gemini/Kimi/Grok/Mistral/…", cooking recipes, weather descriptions — all produce identically degenerate output. Not a content filter.
  • --enable-chunked-prefill: the degeneration was observed both before this flag was added to our config and after. Not the cause.

Our deployment has not used --enable-prefix-caching or --kv-cache-dtype fp8 at any point during observations of this issue, so those are simply not variables here.

Hypotheses

The sharpness (12 tokens) and determinism (greedy-reproducible, same ! token every time) of the threshold strongly suggests a numerical-precision issue in the INT4 pack-quantized forward pass at a specific prefill shape — the MLA attention path and/or compressed-tensors INT4 dequant saturating at ~14,200-token inputs. We don't have the instrumentation to localize it further from the client side.

Questions

  1. Can Moonshot reproduce this with the official recipe on an 8×B200 at ~14,200-token prompts (any content — e.g. Claude-Code-style system + ~28 tool definitions, or equivalent)?
  2. Is the INT4 pack-quantized checkpoint known to have any prefill-length-dependent numerical instability in the MLA attention or MoE routing paths?
  3. Would a non-INT4 release (BF16 or FP8) of K2.6 be available for deployments that regularly exceed this threshold in agentic workflows?

Closing — withdrawing this report

Closing this out. Further testing showed the degeneration threshold is not stable at a fixed input-token count: the initial ~14,200-token boundary observed on a freshly-started server drifts downward as the server accumulates state over time. Prompts that were cleanly handled at 9,432 tokens hours earlier began degenerating at the same input under identical sampling params, without any config change.

That means my earlier framing ("deterministic collapse at a specific token count") is not strictly correct — the real behavior appears to be a server-side cumulative state issue (prefix cache, KV allocator fragmentation, MoE routing imbalance, or similar) that interacts with INT4 weights, not a pure quantization-vs-prefill-shape bug reproducible from a cold start.

Without a reliable way to reproduce from a known-good server state in my current setup, I can't give Moonshot a clean repro. Rather than leave a partially-correct report in the open tracker, I'm closing it. If we re-run this against a fresh server and can still reliably trigger a fixed-threshold collapse, I'll file a new, narrower report with a minimal reproducer.

Thanks to anyone who took a look.

bionexus-gunhopark changed discussion title from Deterministic degeneration into repeated ! tokens at ~15K input-token threshold (reproducible under greedy decode) to Deterministic ! token collapse at ~14,200 input-token threshold (reproducible under greedy decode)
bionexus-gunhopark changed discussion status to closed

Sign up or log in to comment