Instructions to use moonshotai/Kimi-K2.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use moonshotai/Kimi-K2.5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="moonshotai/Kimi-K2.5", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True, dtype="auto")

vLLM

How to use moonshotai/Kimi-K2.5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "moonshotai/Kimi-K2.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/moonshotai/Kimi-K2.5

SGLang

How to use moonshotai/Kimi-K2.5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moonshotai/Kimi-K2.5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moonshotai/Kimi-K2.5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use moonshotai/Kimi-K2.5 with Docker Model Runner:
```
docker model run hf.co/moonshotai/Kimi-K2.5
```

Chat Template Logic Issue: Ambiguous Default Thinking Mode

#27

by QIN2DIM - opened Jan 29

Discussion

QIN2DIM

Jan 29

Problem

When the thinking variable is not explicitly defined by the client, the template defaults to thinking mode ON, causing OpenAI-compatible clients to be unable to distinguish reasoning_content from regular content.
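One way for a client to avoid depending on the template default is to define `thinking` explicitly in every request. The sketch below only builds the request body. It assumes the serving stack forwards a `chat_template_kwargs` field to the Jinja chat template, as recent vLLM and SGLang releases are understood to do, so verify this against your server version. The helper name `build_chat_payload` is hypothetical.

```python
import json


def build_chat_payload(prompt: str, thinking: bool) -> dict:
    """Build an OpenAI-style chat request body that sets `thinking` explicitly.

    Assumption: the server passes `chat_template_kwargs` through to the
    chat template, so `thinking` is always defined when it renders.
    """
    return {
        "model": "moonshotai/Kimi-K2.5",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"thinking": thinking},
    }


payload = build_chat_payload("What is 2 + 2?", thinking=False)
print(json.dumps(payload, indent=2))
```

With `thinking` always present, the client knows up front whether to expect reasoning in the response.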

Root Cause

In the add_generation_prompt section:

{%- if thinking is defined and thinking is false -%}
<think></think>
{%- else -%}
<think>
{%- endif -%}

- When thinking=false: outputs <think></think> ✓
- When thinking=true OR undefined: outputs <think> (unclosed) ✗

When thinking is undefined:

1. The template ends with <think>.
2. The model generates: reasoning_content</think>actual_content
3. The client receives the merged output but cannot determine whether thinking was enabled.
4. </think> appears concatenated with the content, without clear separation.
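The branch quoted above can be mirrored in plain Python to make the three cases explicit. This is a minimal sketch of that single `if`/`else`, not of the full template; `None` stands in for an undefined `thinking`, and the function name `generation_prefix` is hypothetical.

```python
from typing import Optional


def generation_prefix(thinking: Optional[bool]) -> str:
    """Mirror the quoted template branch; None models an undefined variable."""
    # `thinking is defined and thinking is false` in the template
    if thinking is not None and thinking is False:
        return "<think></think>"
    # Both `true` and undefined fall through to the unclosed tag
    return "<think>"


for value in (True, False, None):
    print(f"thinking={value!r} -> {generation_prefix(value)}")
```

Only an explicit `False` produces the closed tag pair; an undefined value behaves exactly like `True`.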

Fix

Invert the default behavior to thinking OFF when undefined:

-{%- if thinking is defined and thinking is false -%}
-<think></think>
-{%- else -%}
+{%- if thinking is defined and thinking is true -%}
 <think>
+{%- else -%}
+<think></think>
 {%- endif -%}

Behavior After Fix

| `thinking` value | Output | Meaning |
|---|---|---|
| `true` | `<think>` | Expect model to generate reasoning |
| `false` | `<think></think>` | No reasoning |
| undefined | `<think></think>` | No reasoning (safe default) |

This ensures clients can reliably parse responses by checking for <think></think> (thinking off) vs <think>content...</think> (thinking on).

courage17340

Moonshot AI org Jan 30

Hi, we set thinking mode as default to make it compatible with our official API behavior. The problem you describe is likely a bug in the reasoning parser. For example, sglang fixed a similar issue recently: https://github.com/sgl-project/sglang/pull/17901
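To illustrate what a reasoning parser has to handle here, the following is a minimal sketch of splitting raw output when the opening `<think>` was emitted by the prompt rather than by the model. It is not the vLLM or SGLang parser; the function name and the `prompt_opened_think` flag are hypothetical.

```python
from typing import Optional, Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(
    output: str, prompt_opened_think: bool = True
) -> Tuple[Optional[str], str]:
    """Split raw model output into (reasoning_content, content)."""
    opened = prompt_opened_think or output.startswith(THINK_OPEN)
    # Drop a leading opening tag if the model repeated it
    text = output[len(THINK_OPEN):] if output.startswith(THINK_OPEN) else output
    reasoning, sep, content = text.partition(THINK_CLOSE)
    if sep:
        # Closing tag found: everything before it is reasoning
        return (reasoning.strip() or None), content.lstrip()
    if opened:
        # Output ended before the closing tag: all of it is reasoning so far
        return (text.strip() or None), ""
    # Thinking was off and no tags appeared: plain content
    return None, output


print(split_reasoning("reasoning_content</think>actual_content"))
```

The parser needs to know whether the prompt ended with an open `<think>`, which is why the flag defaults to the template's default of thinking on.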

QIN2DIM

Jan 30

It seems to be true. 🤯

QIN2DIM changed discussion status to closed Jan 30

hongyu05

Feb 11

That's not a problem, but the suffix index seems wrong.

The current chat_template includes:

{# split all messages into history & suffix, reasoning_content in suffix should be reserved.#}
{%- set hist_msgs = messages[:ns.last_non_tool_call_assistant_msg+1] -%}
{%- set suffix_msgs = messages[ns.last_non_tool_call_assistant_msg+1:] -%}

But it should be:

{# split all messages into history & suffix, reasoning_content in suffix should be reserved.#}
{%- set hist_msgs = messages[:ns.last_non_tool_call_assistant_msg] -%}
{%- set suffix_msgs = messages[ns.last_non_tool_call_assistant_msg:] -%}

Jinxiang01

Apr 8

There is another related issue with the hist/suffix split: when all assistant messages have tool_calls (common in multi-turn tool-call conversations), last_non_tool_call_assistant_msg stays at -1, causing all messages to become suffix_msgs. This makes reasoning_content from every historical turn accumulate in the prompt, eventually causing the model to degenerate into repetitive output after ~10-18 rounds.

Fix: add a fallback after the existing loop — when no non-tool-call assistant is found, split at the last assistant message:

{%- if ns.last_non_tool_call_assistant_msg == -1 -%}
{%- for idx in range(messages|length-1, -1, -1) -%}
{%- if messages[idx]['role'] == 'assistant' -%}
{%- set ns.last_non_tool_call_assistant_msg = idx - 1 -%}
{%- break -%}
{%- endif -%}
{%- endfor -%}
{%- endif -%}

This ensures that only the latest turn's reasoning_content is preserved while reasoning_content from older turns is cleared, matching the template's intended behavior for hist_msgs.
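The effect described in this post can be reproduced with ordinary list slicing. The sketch below assumes the template finds the index with a forward scan that starts at -1, which matches the description here but is not copied from the template. The `use_fallback` branch mirrors the proposed Jinja fallback; the function name and sample messages are hypothetical.

```python
def find_split_index(messages, use_fallback=False):
    """Index of the last assistant message without tool calls, or -1."""
    last = -1
    for idx, msg in enumerate(messages):
        if msg["role"] == "assistant" and not msg.get("tool_calls"):
            last = idx
    if last == -1 and use_fallback:
        # Proposed fallback: split just before the last assistant message
        for idx in range(len(messages) - 1, -1, -1):
            if messages[idx]["role"] == "assistant":
                last = idx - 1
                break
    return last


messages = [
    {"role": "user", "content": "Find and summarize the report."},
    {"role": "assistant", "tool_calls": [{"name": "search"}], "reasoning_content": "r1"},
    {"role": "tool", "content": "report.pdf"},
    {"role": "assistant", "tool_calls": [{"name": "read"}], "reasoning_content": "r2"},
    {"role": "tool", "content": "report text"},
]

idx = find_split_index(messages)
hist, suffix = messages[: idx + 1], messages[idx + 1 :]
print(idx, len(hist), len(suffix))

idx_fb = find_split_index(messages, use_fallback=True)
hist_fb, suffix_fb = messages[: idx_fb + 1], messages[idx_fb + 1 :]
print(idx_fb, len(hist_fb), len(suffix_fb))
```

Without the fallback every message lands in the suffix, so both `r1` and `r2` are kept; with it, only the last assistant message and what follows remain in the suffix.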

bigmoyan

Moonshot AI org Apr 8

> That's not a problem, but the suffix index seems wrong.
>
> The current chat_template includes:
>
> {# split all messages into history & suffix, reasoning_content in suffix should be reserved.#}
> {%- set hist_msgs = messages[:ns.last_non_tool_call_assistant_msg+1] -%}
> {%- set suffix_msgs = messages[ns.last_non_tool_call_assistant_msg+1:] -%}
>
> But it should be:
>
> {# split all messages into history & suffix, reasoning_content in suffix should be reserved.#}
> {%- set hist_msgs = messages[:ns.last_non_tool_call_assistant_msg] -%}
> {%- set suffix_msgs = messages[ns.last_non_tool_call_assistant_msg:] -%}

Could you clarify where the bug is?

An assistant message without tool calls marks the end of a conversation turn and should be part of the history messages.
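A short Python sketch of the two slices shows where the turn boundary falls with the current `+1` indexing. The index scan is an assumption about how the template derives `last_non_tool_call_assistant_msg`; the helper name and the sample conversation are hypothetical.

```python
def last_non_tool_call_assistant(messages):
    """Index of the last assistant message without tool calls, or -1."""
    last = -1
    for idx, msg in enumerate(messages):
        if msg["role"] == "assistant" and not msg.get("tool_calls"):
            last = idx
    return last


messages = [
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant", "content": "", "tool_calls": [{"name": "get_weather"}]},
    {"role": "tool", "content": "18C"},
    {"role": "assistant", "content": "It is 18C in Paris."},  # ends the first turn
    {"role": "user", "content": "And in Rome?"},
    {"role": "assistant", "content": "", "tool_calls": [{"name": "get_weather"}]},
    {"role": "tool", "content": "24C"},
]

idx = last_non_tool_call_assistant(messages)

# Same slices as the current template: the closing answer stays in history
hist_msgs = messages[: idx + 1]
suffix_msgs = messages[idx + 1 :]

print(idx)
print([m["role"] for m in hist_msgs])
print([m["role"] for m in suffix_msgs])
```

With `+1`, the final answer of the finished turn is the last history message and the suffix starts at the next user message.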

bigmoyan

Moonshot AI org Apr 8

> There is another related issue with the hist/suffix split: when all assistant messages have tool_calls (common in multi-turn tool-call conversations), last_non_tool_call_assistant_msg stays at -1, causing all messages to become suffix_msgs. This makes reasoning_content from every historical turn accumulate in the prompt, eventually causing the model to degenerate into repetitive output after ~10-18 rounds.
>
> Fix: add a fallback after the existing loop. When no non-tool-call assistant is found, split at the last assistant message:
>
> {%- if ns.last_non_tool_call_assistant_msg == -1 -%}
> {%- for idx in range(messages|length-1, -1, -1) -%}
> {%- if messages[idx]['role'] == 'assistant' -%}
> {%- set ns.last_non_tool_call_assistant_msg = idx - 1 -%}
> {%- break -%}
> {%- endif -%}
> {%- endfor -%}
> {%- endif -%}
>
> This ensures that only the latest turn's reasoning_content is preserved while reasoning_content from older turns is cleared, matching the template's intended behavior for hist_msgs.

When every assistant message contains tool calls, the multi-step conversation is continuing, and retaining all thinking content is the intended behavior by design.
I'm not sure if an excessive amount of thinking content in the prompt would degrade model performance (would it?), and I'm uncertain whether setting a maximum limit on the number of reserved thinking content entries would help. You're welcome to experiment with such limits, but preserving the complete thinking content for all tool calls remains the expected behavior.
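For anyone who wants to run that experiment, one possible client-side approach is to cap how many assistant messages keep their `reasoning_content` before the request is sent. This is a hypothetical preprocessing step, not part of the template and not the expected behavior described above. It also assumes the template tolerates assistant messages with no `reasoning_content` key, which should be checked before relying on it.

```python
import copy


def cap_reasoning(messages, max_kept):
    """Return a copy where only the newest `max_kept` assistant messages
    keep their reasoning_content; older ones have the key removed."""
    trimmed = copy.deepcopy(messages)
    kept = 0
    # Walk from newest to oldest so the most recent reasoning survives
    for msg in reversed(trimmed):
        if msg.get("role") != "assistant" or "reasoning_content" not in msg:
            continue
        if kept < max_kept:
            kept += 1
        else:
            msg.pop("reasoning_content")
    return trimmed


history = [
    {"role": "assistant", "tool_calls": [{"name": "a"}], "reasoning_content": "r1"},
    {"role": "tool", "content": "x"},
    {"role": "assistant", "tool_calls": [{"name": "b"}], "reasoning_content": "r2"},
    {"role": "tool", "content": "y"},
    {"role": "assistant", "tool_calls": [{"name": "c"}], "reasoning_content": "r3"},
]

capped = cap_reasoning(history, max_kept=2)
print([m.get("reasoning_content") for m in capped if m["role"] == "assistant"])
```

The original list is left untouched, so capped and uncapped prompts can be compared on the same conversation.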
