Deterministic ! token collapse at ~14,200 input-token threshold (reproducible under greedy decode)

#13, opened Apr 21 by bionexus-gunhopark (edited Apr 21)
Summary

When moonshotai/Kimi-K2.6 is served via vLLM with the official deploy recipe, the model deterministically enters a degenerate loop that emits only the ! character (ASCII 0x21), starting from the first reasoning token, once the input token count crosses ~14,200. The threshold is sharp: a 12-token difference flips the outcome. It does not appear to be a function of raw bytes, content, or specific words: plain-text inputs up to 60 KB (9,432 tokens) complete cleanly.

The collapse reproduces under temperature=0 greedy decode, so this is a logit-level collapse, not a sampling pathology.

Environment


Model	`moonshotai/Kimi-K2.6` (native INT4, compressed-tensors, group_size=32)
Hardware	8× NVIDIA B200 (180 GB each), single node
vLLM	0.19.1 (your manually-verified version). Also reproduced on nightly 0.19.2rc1.dev.
transformers	4.57.6 (in your `>=4.57.1, <5.0.0` range)
torch	2.11.0+cu130
TP	8, `max_model_len` 131072 (also tried 204800 — same result)
KV cache dtype	`auto` (BF16)

Server command — exactly your recipe:

vllm serve moonshotai/Kimi-K2.6 \
  --trust-remote-code --tensor-parallel-size 8 \
  --tool-call-parser kimi_k2 --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 --mm-encoder-tp-mode data
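
For anyone trying to reproduce, a request against this server with greedy decoding might be built roughly as below. This is a minimal sketch, not the exact client used for the report: the helper names, the `max_tokens` value, and the base URL (vLLM's default port 8000) are assumptions.

```python
import json
import urllib.request

MODEL_ID = "moonshotai/Kimi-K2.6"


def build_greedy_request(messages, tools=None, max_tokens=256):
    """Build an OpenAI-compatible chat request body for greedy decoding.

    max_tokens=256 is an arbitrary illustrative cap, not a value from the report.
    """
    body = {
        "model": MODEL_ID,
        "messages": messages,
        "temperature": 0,  # greedy decode, as described in the report
        "max_tokens": max_tokens,
        "stream": True,  # stream so the first reasoning tokens are visible
    }
    if tools:
        body["tools"] = tools
    return body


def post_chat(body, base_url="http://localhost:8000"):
    """POST the body to the chat completions endpoint and return the response.

    base_url is an assumption (vLLM's default port); adjust for your deployment.
    The caller iterates over the returned response line by line.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(request)
```

Keeping the body construction separate from the network call makes it easy to log the exact payload alongside each result.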

Observation: the threshold is ~14,200 input tokens

After tokenizing every reproducer through Kimi's own chat template (tokenizer.apply_chat_template(..., tools=...)), degenerate and clean runs separate cleanly by input-token count, regardless of content:

payload	input tokens	body (KB)	result
60 KB plain English prose, no tools	9,432	60	✓ clean
Claude-Code-style system + 28 tools + small user + brief prior tool turn, with Claude refs scrubbed	14,188	67	✓ clean
same, +12 tokens in the system prompt	14,200	67	✗ degenerate
same, +80 tokens	14,279	68	✗ degenerate
same, +100 tokens	14,294	68	✗ degenerate

A 12-token increment flips the outcome. We verified this is content-independent by injecting seven different 500-char payloads into the same structure at the same final token count — "You are Claude/GPT/Gemini/Kimi/Grok/Mistral/...", cooking instructions, weather descriptions — all seven degenerate identically. So this is not a safety guardrail, not a distillation-trace evasion, not tied to any specific keyword.
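
The input-token counts in the table above could be measured along these lines. This is a sketch under assumptions: `count_input_tokens` and `load_kimi_tokenizer` are hypothetical helper names, and passing `add_generation_prompt=True` is a guess at how the prompt is rendered for serving, so counts may differ slightly from the ones reported.

```python
def count_input_tokens(tokenizer, messages, tools=None):
    """Count prompt tokens after rendering messages (and tools) through the chat template."""
    out = tokenizer.apply_chat_template(
        messages,
        tools=tools,
        add_generation_prompt=True,  # assumption: count the prompt as the server would see it
        tokenize=True,
    )
    # Depending on the transformers version, the result is either a flat list
    # of token ids or a dict-like object holding them under "input_ids".
    ids = out["input_ids"] if hasattr(out, "keys") else out
    return len(ids)


def load_kimi_tokenizer():
    """Load the model's own tokenizer (needs transformers installed and network access)."""
    from transformers import AutoTokenizer  # imported lazily so the counter has no hard dependency

    return AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.6", trust_remote_code=True)
```

Usage would look like `count_input_tokens(load_kimi_tokenizer(), messages, tools=tools)`, with `messages` and `tools` being the same objects sent in the request body.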

Every streaming response past the threshold looks like:

data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}
data: {"choices":[{"index":0,"delta":{"reasoning":" "}}]}
data: {"choices":[{"index":0,"delta":{"reasoning":"!"}}]}
data: {"choices":[{"index":0,"delta":{"reasoning":"!"}}]}
...                       (continues until max_tokens fires)

The emitted character is always ! (ASCII 0x21), on every degenerating run (~30 samples).
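
A stream like the one above can be flagged automatically. The following is a minimal sketch, assuming the deltas carry a `reasoning` field as in the sample; the function names and the `min_run` cutoff of 32 are hypothetical choices, not part of the original test harness.

```python
import json


def reasoning_deltas(sse_lines):
    """Yield the reasoning text from each 'data: {...}' line of a streamed response."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as elided log output
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("reasoning")
            if text:
                yield text


def looks_degenerate(sse_lines, char="!", min_run=32):
    """Return True once min_run consecutive reasoning characters are all `char`.

    Whitespace-only deltas are ignored; any other visible text resets the run.
    min_run=32 is an arbitrary illustrative cutoff.
    """
    run = 0
    for text in reasoning_deltas(sse_lines):
        visible = text.strip()
        if not visible:
            continue
        if set(visible) == {char}:
            run += len(visible)
            if run >= min_run:
                return True
        else:
            run = 0
    return False
```

Stopping at the first long run avoids waiting for max_tokens to fire on every degenerate request.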

What was ruled out

Sampling: reproduces identically at temperature=0 (greedy), top_p=0.95, top_p=1.0.
Reasoning-history format: reproduces with and without reasoning_content preserved on prior assistant turns.
Tool-result payload specifics: stripping injected guardrail/system-reminder blocks, JSON-escape variants — no change.
vLLM version: reproduces on 0.19.1 (the manually-verified build) and on the nightly 0.19.2rc1.dev wheel.
transformers version: verified in the stated >=4.57.1, <5.0.0 range (4.57.6).
Download integrity: all 64 safetensor shards' content-addressed hashes match the K2.6 remote (zero mismatches, checked via huggingface_hub.model_info(..., files_metadata=True)); index.json total_size matches on-disk (595.2 GB).
Pure prompt length without tools: a single-turn 60 KB plain-text user message tokenizes to 9,432 tokens and completes cleanly.
Content: at the same token count, replacing Claude/Anthropic/OpenClaude substrings, or injecting any other 500-char text — "You are GPT/Gemini/Kimi/Grok/Mistral/…", cooking recipes, weather descriptions — all produce identically degenerate output. Not a content filter.
--enable-chunked-prefill: the degeneration was observed both before this flag was added to our config and after. Not the cause.

Our deployment has not used --enable-prefix-caching or --kv-cache-dtype fp8 at any point during observations of this issue, so those are simply not variables here.
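
The download-integrity check in the list above can be approximated with the sketch below. It assumes an `expected` mapping of filename to SHA-256 hex digest has already been built from the per-file metadata returned by huggingface_hub.model_info(..., files_metadata=True); how that mapping is extracted is not shown here, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-GB shards never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(local_dir, expected):
    """Return the sorted filenames that are missing locally or whose hash differs.

    `expected` maps a filename relative to local_dir to its SHA-256 hex digest.
    """
    bad = []
    for name, want in sorted(expected.items()):
        path = Path(local_dir) / name
        if not path.is_file() or sha256_of(path) != want:
            bad.append(name)
    return bad
```

An empty list from `find_mismatches` corresponds to the "zero mismatches" result reported above.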

Hypotheses

The sharpness (12 tokens) and determinism (greedy-reproducible, same ! token every time) of the threshold strongly suggest a numerical-precision issue in the INT4 pack-quantized forward pass at a specific prefill shape, for example the MLA attention path and/or the compressed-tensors INT4 dequant saturating at ~14,200-token inputs. We don't have the instrumentation to localize it further from the client side.

Questions

Can Moonshot reproduce this with the official recipe on an 8×B200 at ~14,200-token prompts (any content — e.g. Claude-Code-style system + ~28 tool definitions, or equivalent)?
Is the INT4 pack-quantized checkpoint known to have any prefill-length-dependent numerical instability in the MLA attention or MoE routing paths?
Would a non-INT4 release (BF16 or FP8) of K2.6 be available for deployments that regularly exceed this threshold in agentic workflows?

Closing — withdrawing this report

Closing this out. Further testing showed the degeneration threshold is not stable at a fixed input-token count: the initial ~14,200-token boundary observed on a freshly-started server drifts downward as the server accumulates state over time. Prompts that were cleanly handled at 9,432 tokens hours earlier began degenerating at the same input under identical sampling params, without any config change.

That means my earlier framing ("deterministic collapse at a specific token count") is not strictly correct — the real behavior appears to be a server-side cumulative state issue (prefix cache, KV allocator fragmentation, MoE routing imbalance, or similar) that interacts with INT4 weights, not a pure quantization-vs-prefill-shape bug reproducible from a cold start.

Without a reliable way to reproduce from a known-good server state in my current setup, I can't give Moonshot a clean repro. Rather than leave a partially-correct report in the open tracker, I'm closing it. If we re-run this against a fresh server and can still reliably trigger a fixed-threshold collapse, I'll file a new, narrower report with a minimal reproducer.

Thanks to anyone who took a look.

bionexus-gunhopark changed discussion title from "Deterministic degeneration into repeated ! tokens at ~15K input-token threshold (reproducible under greedy decode)" to "Deterministic ! token collapse at ~14,200 input-token threshold (reproducible under greedy decode)" Apr 21

bionexus-gunhopark changed discussion status to closed Apr 21
