Can the BrowseComp results be reproduced?
First, my appreciation and thanks to the Moonshot AI team for their series of works 👍
Second, I'd like to ask: since the release of kimi-k2-thinking, and now with kimi-k2.5, has anyone reproduced the official reported BrowseComp results? For dpsk/miromind/zai/tongyi I have reproduced the reported results exactly.
Third, here is the series of reproduction attempts I have made, including but not limited to:
Toolsets
1. Toolsets aligned with the tool format of the officially released kimi-k2-thinking trajectories
- search (serper)
- open (jina, or jina with an LLM pattern)
- python
2. GPT-OSS toolsets
- search (serper)
- open
- find
- python
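The two toolsets above can be sketched as OpenAI-style function schemas. This is only an illustration: the exact names, descriptions, and parameters are my assumptions and must be copied from the officially released trajectories when aligning formats.

```python
# Hypothetical tool schemas for the K2-thinking-style toolset.
# Names/parameters are placeholders; align them with the released trajectories.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Web search (e.g., via the Serper API); returns titles, URLs, snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

OPEN_TOOL = {
    "type": "function",
    "function": {
        "name": "open",
        "description": "Fetch a page as plain text (e.g., via Jina Reader).",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}

PYTHON_TOOL = {
    "type": "function",
    "function": {
        "name": "python",
        "description": "Execute Python code and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}

TOOLSET = [SEARCH_TOOL, OPEN_TOOL, PYTHON_TOOL]
```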
Strategy
1. End-to-end test (w/o context management)
Setting:
- 256k context / temperature 1.0 / top-p 0.95
- Fully aligned with the K2 released trajectory format & observation template
- Aligned with the GPT-OSS format
Results:
- K2-thinking BrowseComp ≈ 45
- K2.5 BrowseComp ≈ 45
2. With context management
Setting:
- keep the most recent 3 / 5 / 10 tool observations
Results:
- K2-thinking BrowseComp ≈ 53
- K2.5: not yet tested with context management
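The keep-recent-N setting above can be sketched as a simple message filter over an OpenAI-style chat history (a minimal sketch; the actual context manager and placeholder text may differ):

```python
def keep_recent_observations(messages, n=5, placeholder="[observation truncated]"):
    """Replace all but the last n tool observations with a short placeholder.

    `messages` is an OpenAI-style chat history; only entries with
    role == "tool" count as observations. System/user/assistant turns
    are kept verbatim so the reasoning chain stays intact.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_truncate = set(tool_indices[:-n]) if n > 0 else set(tool_indices)
    return [
        {**m, "content": placeholder} if i in to_truncate else m
        for i, m in enumerate(messages)
    ]
```

Calling this before every model invocation keeps the context bounded while still letting the model see which tool calls it already made.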
Then, what I'd like to know is:
Can the BrowseComp End2End results be reproduced?
I have a few hypotheses:
- K2-Thinking & K2.5 generalize poorly across BrowseComp formats and must be strictly aligned with the training format
- Benefits of in-house tool servers
- The 60.6 on BrowseComp was obtained with a keep-recent-x-turns tool-observation strategy, or with a strategy like kimi-k2-thinking's "clean all tool observations after reaching max context length" (the most likely option)
- ...
Finally, once again, my appreciation and thanks to the Moonshot AI team for their series of works 👍
Hey @Aldrich-x , are you able or willing to share which search agent framework you are using to try and reproduce these results? Is there a GitHub repo?
An extremely simple ReAct framework—just align with the observation format used by each provider.
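For context, the core of such a framework fits in a few lines. Below is a minimal sketch assuming an OpenAI-compatible tool-calling message format; `call_model` and the tool callables are placeholders you would wire to your own endpoint and tool servers, and the plain-string observation is only a stand-in for each provider's exact observation template.

```python
import json

def react_loop(call_model, tools, messages, max_steps=50):
    """Minimal ReAct driver: call the model, execute requested tools,
    append observations, repeat until the model answers without tools.

    call_model(messages) -> assistant message dict (OpenAI chat format).
    tools: mapping from tool name to a Python callable taking **kwargs.
    """
    for _ in range(max_steps):
        assistant = call_model(messages)
        messages.append(assistant)
        tool_calls = assistant.get("tool_calls") or []
        if not tool_calls:  # no tool call -> final answer
            return assistant.get("content", "")
        for call in tool_calls:
            fn = call["function"]
            observation = tools[fn["name"]](**json.loads(fn["arguments"]))
            # The observation template must match the provider's training
            # format exactly; a bare string is only a placeholder here.
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(observation),
            })
    return None  # step budget exhausted
```

Everything provider-specific (trajectory format, observation template) lives in how the tool messages are rendered, which is why aligning that format matters so much.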
Hi, thank you very much for your interest in our work. The k2t and k2.5 scores you reported are far below the normal range, which suggests there may be some huge mismatches. I recommend first aligning with our w/o ctx mgm results to verify whether there is a bug in the settings — for reference, k2.5 should be 60.6. Then you can try adding context management methods.
We disclosed the system prompt used for testing in the technical report. You can try aligning with it to encourage the model to perform deep search. https://arxiv.org/pdf/2602.02276 Page 25.
You can also try reducing the search chunk size, e.g., controlling 2k per search call result, so the model has more context available for reasoning even without context management techniques.
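Reducing the search chunk size as suggested could look like the following. This is a sketch under a crude assumption: token counts are approximated at ~4 characters per token, whereas a real setup would use the model's actual tokenizer.

```python
def truncate_result(text, max_tokens=2000, chars_per_token=4):
    """Truncate a search/open result to roughly `max_tokens` tokens.

    The chars-per-token ratio is only a rough heuristic; count tokens
    with the model's tokenizer in a real deployment.
    """
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    return text[:limit] + "\n...[truncated; open the URL for the full page]"
```

The trailing marker nudges the model to issue a follow-up call (e.g., a browse/visit tool) when it decides the page deserves a careful read.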
@dawnmsg , thank you for your great insights.
I just want to check my understanding of two things you mentioned in different discussions. In this comment you suggested “reducing the search chunk size … so the model has more context available for reasoning even without context management techniques.”
But in an earlier K2-Thinking discussion, you said “We don’t return a summary of the target pages — instead, we return the original text to the model and let the model browse the relevant information itself.”
I was wondering if you could clarify how these two points fit together. Specifically, when you reduce the search chunk size (for example to ~2k tokens per search call result), does that mean each call returns only a partial slice of the source text, and the model may not receive the full page text in-context unless it issues additional calls? Or does the system still return the full original text, just segmented differently?
Thanks again for any clarification you are able to provide on how to interpret this properly🙏
Hi, thank you for your reply. Yes, we still don't return summaries of the target pages; we return the web_url and the first n-thousand tokens of the result, truncated (e.g., 2k in this case). Since we have the browser_use tools, if K2.5 determines the information is useful and needs a careful read, it will call browser_visit to load the full page.
I hope these details help you reproduce our reported score. I’ll be happy to answer any questions if you run into any challenges.
@dawnmsg , thank you again for your help and continued great insights.
I’m currently struggling to reproduce the BrowseComp results with context management. In the K2.5 tech report it says that for BrowseComp “we adopt the same discard-all strategy proposed by DeepSeek, where all history is truncated once token thresholds are exceeded.”
I also looked at the DeepSeek-V3.2 paper, and as I understand it, discard-all there is described as “when the token usage exceeds 80% of the context window length, discard-all resets the context by discarding all previous tool call history”, where it seems a bit more specific about tool call history rather than all history?
Could you shed a bit more light on how discard-all is implemented in the K2.5 context manager? For example:
- In the case where "all history is truncated" once the 80% context threshold is triggered, up to which point is it truncated to? Is it only the system prompt and initial user message/question that are always retained? Are any other messages retained?
- And if the entire prior context (except prompt and question) is discarded, I’m having a hard time understanding how the model avoids simply re-trying the same search trajectories that previously failed, since it no longer has the context of what was already attempted?
Thanks again so much for any clarification you are able to provide on how to interpret this properly🙏
"Discard-all, which resets the context by discarding all previous tool call history (similar to the new context tool (Anthropic, 2025a))." Based on our understanding of the DeepSeek V3.2 paper and the paper it cites (the new context tool, Anthropic, 2025a), discard-all resets the context by discarding the reasoning associated with each tool call as well as the tool call results. It provides a new context with the initial question. In our implementation, we set an upper tool-call step threshold for discard-all to avoid endless retries in this case. We set it to 500.
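Put together, the behavior described in this thread could be sketched as follows. Assumptions labeled here, not confirmed in detail by the source: the 80% threshold comes from the DeepSeek-V3.2 description, the counter-never-resets semantics and the 500-call cap come from this thread, and token usage is approximated by character count via `count_tokens`.

```python
import json

def run_with_discard_all(call_model, tools, system, question,
                         context_window=262144, threshold=0.8,
                         max_tool_calls=500, count_tokens=len):
    """Discard-all context management, as discussed in this thread.

    When token usage exceeds `threshold` of the context window, the
    history is reset to just the system prompt and the initial question.
    The tool-call counter accumulates across resets; the run stops for
    good once it reaches `max_tool_calls`.
    """
    base = [{"role": "system", "content": system},
            {"role": "user", "content": question}]
    messages = list(base)
    total_tool_calls = 0  # never reset, even when the context is discarded
    while total_tool_calls < max_tool_calls:
        assistant = call_model(messages)
        messages.append(assistant)
        tool_calls = assistant.get("tool_calls") or []
        if not tool_calls:
            return assistant.get("content", "")
        for call in tool_calls:
            fn = call["function"]
            obs = tools[fn["name"]](**json.loads(fn["arguments"]))
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": str(obs)})
            total_tool_calls += 1
        used = sum(count_tokens(m.get("content") or "") for m in messages)
        if used > threshold * context_window:
            messages = list(base)  # discard reasoning + tool history
    return None  # accumulated tool-call budget exhausted
```

With `max_tool_calls=500` this matches the cap described above: each reset hands the model a fresh context, but the accumulated counter guarantees termination even if the model keeps retrying similar trajectories.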
@dawnmsg , thanks again for your insights which are really helpful towards community reproduction.
Just to confirm I’m interpreting this correctly: when you say "we set an upper tool-call step threshold for discard-all … to 500", you mean there’s a hard cap of 500 tool-invocation steps and that the tool-invocation step counter keeps accumulating across discard-all resets (it doesn’t reset when discard-all triggers), and the run is stopped once the total number of tool calls (the counter) reaches 500 to prevent endless looping. Is that the correct interpretation?
Thank you once again for any clarification you are able to provide on how to interpret this properly🙏
Yes, the tool call count will be accumulated.