Deterministic ! token collapse at ~14,200 input-token threshold (reproducible under greedy decode)
Summary
Running moonshotai/Kimi-K2.6 via vLLM with the official deploy recipe, the model deterministically enters a degenerate loop that emits only the ! token (ASCII 0x21) — starting from the first reasoning token — once the input token count crosses ~14,200. The threshold is sharp: a 12-token difference flips the outcome. It is not a function of raw bytes, content, or specific words — plain-text inputs up to 60 KB (9,432 tokens) complete cleanly.
Reproduces under temperature=0 greedy decode, so this is a logit-level collapse, not a sampling pathology.
Environment
| Component | Details |
|---|---|
| Model | moonshotai/Kimi-K2.6 (native INT4, compressed-tensors, group_size=32) |
| Hardware | 8× NVIDIA B200 (180 GB each), single node |
| vLLM | 0.19.1 (your manually-verified version). Also reproduced on nightly 0.19.2rc1.dev. |
| transformers | 4.57.6 (in your >=4.57.1, <5.0.0 range) |
| torch | 2.11.0+cu130 |
| TP | 8, max_model_len 131072 (also tried 204800, same result) |
| KV cache dtype | auto (BF16) |
Server command — exactly your recipe:
vllm serve moonshotai/Kimi-K2.6 \
--trust-remote-code --tensor-parallel-size 8 \
--tool-call-parser kimi_k2 --enable-auto-tool-choice \
--reasoning-parser kimi_k2 --mm-encoder-tp-mode data
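For reference, a request against this server under greedy decode can be sketched as below. This is a minimal illustration, not the exact client used for the report: the helper name `build_chat_request`, the `max_tokens` value, and the `base_url` (vLLM's default port 8000) are assumptions.

```python
import json
import urllib.request


def build_chat_request(messages, tools=None,
                       base_url="http://localhost:8000", max_tokens=256):
    """Build a streaming, greedy-decode chat-completions request (hypothetical helper)."""
    body = {
        "model": "moonshotai/Kimi-K2.6",
        "messages": messages,
        "temperature": 0,          # greedy decode, as used in the report
        "max_tokens": max_tokens,  # assumed cap so a degenerate run terminates
        "stream": True,
    }
    if tools:
        body["tools"] = tools
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` and iterating over the response yields the `data: ...` stream lines one at a time; that step needs a running server and is left out here.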
Observation: the threshold is ~14,200 input tokens
Tokenizing every reproducer through Kimi's own chat template (tokenizer.apply_chat_template(..., tools=...)), the degeneration is cleanly separated from non-degeneration by input-token count regardless of content:
| payload | input tokens | body (KB) | result |
|---|---|---|---|
| 60 KB plain English prose, no tools | 9,432 | 60 | ✓ clean |
| Claude-Code-style system + 28 tools + small user + brief prior tool turn, with Claude refs scrubbed | 14,188 | 67 | ✓ clean |
| same, +12 tokens in the system prompt | 14,200 | 67 | ✗ degenerate |
| same, +80 tokens | 14,279 | 68 | ✗ degenerate |
| same, +100 tokens | 14,294 | 68 | ✗ degenerate |
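The input-token column can be reproduced with a small helper along these lines. It is a sketch under assumptions: `count_input_tokens` is a hypothetical name, and `add_generation_prompt=True` is assumed to match what the server renders before generation. The `tools=` argument is the one mentioned above.

```python
def count_input_tokens(tokenizer, messages, tools=None):
    """Return the prompt length in tokens after chat-template rendering."""
    encoded = tokenizer.apply_chat_template(
        messages,
        tools=tools,
        add_generation_prompt=True,  # assumption: count the prompt as the server sees it
        tokenize=True,
    )
    # Depending on the transformers version and arguments, the result may be a
    # plain list of ids or a mapping with an "input_ids" entry; accept both.
    if hasattr(encoded, "keys"):
        encoded = encoded["input_ids"]
    return len(encoded)


# Usage (downloads the tokenizer; requires transformers):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.6", trust_remote_code=True)
#   print(count_input_tokens(tok, messages, tools=tools))
```

Counting through the chat template rather than on the raw request body matters here because tool definitions and role markers add tokens that a byte count does not reflect.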
A 12-token increment flips the outcome. We verified this is content-independent by injecting seven different 500-char payloads into the same structure at the same final token count — "You are Claude/GPT/Gemini/Kimi/Grok/Mistral/...", cooking instructions, weather descriptions — all seven degenerate identically. So this is not a safety guardrail, not a distillation-trace evasion, not tied to any specific keyword.
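The bracket between the last clean and first degenerate run can be narrowed by bisection. The sketch below assumes the outcome is monotonic in token count and that server state does not change between probes; `degenerates_at` is a hypothetical callable that builds a prompt of exactly `n` input tokens, sends it, and reports whether the response degenerated.

```python
def find_threshold(degenerates_at, lo, hi):
    """Return the smallest n in (lo, hi] for which degenerates_at(n) is True.

    Assumes degenerates_at(lo) is False, degenerates_at(hi) is True, and that
    every count below the threshold is clean and every count at or above it
    degenerates.
    """
    if degenerates_at(lo) or not degenerates_at(hi):
        raise ValueError("expected a clean run at lo and a degenerate run at hi")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if degenerates_at(mid):
            hi = mid  # threshold is at or below mid
        else:
            lo = mid  # threshold is above mid
    return hi
```

Starting from the 14,188 and 14,200 rows of the table, this needs only a handful of probes. If the monotonicity assumption does not hold, the returned value is only one clean/degenerate transition point, not a true threshold.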
Every streaming response past the threshold looks like:
data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}
data: {"choices":[{"index":0,"delta":{"reasoning":" "}}]}
data: {"choices":[{"index":0,"delta":{"reasoning":"!"}}]}
data: {"choices":[{"index":0,"delta":{"reasoning":"!"}}]}
... (continues until max_tokens fires)
The ! character (ASCII 0x21) is the only output on every degenerating run (~30 samples).
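A stream like the one above can be classified automatically. The following is a minimal sketch: the function names and the `min_run` cutoff of 20 consecutive `!` characters are assumptions chosen for illustration, and the `reasoning` delta field is taken from the sample output shown.

```python
import json


def reasoning_deltas(sse_lines):
    """Yield reasoning text fragments from 'data: {...}' stream lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON chunk
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("reasoning")
            if text:
                yield text


def looks_degenerate(sse_lines, min_run=20):
    """True if the reasoning stream contains min_run consecutive '!' characters."""
    run = 0
    for fragment in reasoning_deltas(sse_lines):
        for ch in fragment:
            run = run + 1 if ch == "!" else 0
            if run >= min_run:
                return True
    return False
```

Counting a consecutive run rather than any occurrence of `!` avoids flagging ordinary reasoning text that happens to contain exclamation marks.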
What was ruled out
- Sampling: reproduces identically at temperature=0 (greedy), top_p=0.95, top_p=1.0.
- Reasoning-history format: reproduces with and without reasoning_content preserved on prior assistant turns.
- Tool-result payload specifics: stripping injected guardrail/system-reminder blocks and trying JSON-escape variants made no difference.
- vLLM version: reproduces on 0.19.1 (the manually-verified build) and on the nightly 0.19.2rc1.dev wheel.
- transformers version: verified in the stated >=4.57.1, <5.0.0 range (4.57.6).
- Download integrity: all 64 safetensors shards' content-addressed hashes match the K2.6 remote (zero mismatches, checked via huggingface_hub.model_info(..., files_metadata=True)); index.json total_size matches on-disk (595.2 GB).
- Pure prompt length without tools: a single-turn 60 KB plain-text user message tokenizes to 9,432 tokens and completes cleanly.
- Content: at the same token count, replacing Claude/Anthropic/OpenClaude substrings, or injecting any other 500-char text ("You are GPT/Gemini/Kimi/Grok/Mistral/…", cooking recipes, weather descriptions), produces identically degenerate output. Not a content filter.
- --enable-chunked-prefill: the degeneration was observed both before this flag was added to our config and after. Not the cause.
Our deployment has not used --enable-prefix-caching or --kv-cache-dtype fp8 at any point during observations of this issue, so those are simply not variables here.
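The download-integrity check in the list above amounts to hashing each local shard and comparing against the hashes published for the repository. A minimal local-side sketch follows; the `expected` mapping of filename to SHA-256 hex digest is assumed to have been built beforehand from the file metadata, and that extraction step is omitted. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(local_dir, expected):
    """Return {filename: reason} for files that are missing or hash differently."""
    problems = {}
    for name, want in expected.items():
        path = Path(local_dir) / name
        if not path.is_file():
            problems[name] = "missing"
        elif sha256_of_file(path) != want:
            problems[name] = "hash mismatch"
    return problems
```

An empty result corresponds to the "zero mismatches" outcome reported above.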
Hypotheses
The sharpness (12 tokens) and determinism (greedy-reproducible, same ! token every time) of the threshold strongly suggest a numerical-precision issue in the INT4 pack-quantized forward pass at a specific prefill shape: the MLA attention path and/or the compressed-tensors INT4 dequantization saturating at ~14,200-token inputs. We don't have the instrumentation to localize it further from the client side.
Questions
- Can Moonshot reproduce this with the official recipe on an 8×B200 at ~14,200-token prompts (any content — e.g. Claude-Code-style system + ~28 tool definitions, or equivalent)?
- Is the INT4 pack-quantized checkpoint known to have any prefill-length-dependent numerical instability in the MLA attention or MoE routing paths?
- Would a non-INT4 release (BF16 or FP8) of K2.6 be available for deployments that regularly exceed this threshold in agentic workflows?
Closing — withdrawing this report
Closing this out. Further testing showed the degeneration threshold is not stable at a fixed input-token count: the initial ~14,200-token boundary observed on a freshly-started server drifts downward as the server accumulates state over time. Prompts that were cleanly handled at 9,432 tokens hours earlier began degenerating at the same input under identical sampling params, without any config change.
That means my earlier framing ("deterministic collapse at a specific token count") is not strictly correct — the real behavior appears to be a server-side cumulative state issue (prefix cache, KV allocator fragmentation, MoE routing imbalance, or similar) that interacts with INT4 weights, not a pure quantization-vs-prefill-shape bug reproducible from a cold start.
Without a reliable way to reproduce from a known-good server state in my current setup, I can't give Moonshot a clean repro. Rather than leave a partially-correct report in the open tracker, I'm closing it. If we re-run this against a fresh server and can still reliably trigger a fixed-threshold collapse, I'll file a new, narrower report with a minimal reproducer.
Thanks to anyone who took a look.