Context Management Reproducibility?
Hi Moonshot team, thank you so much for continually open-sourcing such impressive models and sharing your research!
Just a question regarding reproducibility for the dev and research community: Footnote 2 in your blog post describes an HLE context management strategy where "only the latest round of tool messages is retained," while Footnote 3 references a "discard-all" strategy for BrowseComp.
Are the context management strategies you used for BrowseComp/HLE evaluation open-source, or do you plan to open-source them? If not, could you please elaborate on the HLE context management strategy or perhaps provide pseudo-code for how that specific retention logic works? Also, is this distinct from the BrowseComp "discard-all" strategy you alluded to?
Thanks again for your amazing work! 🙏
@courage17340 @teowu @bigeagle @bigmoyan @dawnmsg
UPDATE: I mocked up some of the possible diagram interpretations for the HLE context management you described. Please let me know if one of the diagrams accurately represents the context management strategy that was used 🙏 or if all my interpretations are wrong:
1st interpretation: (Retaining the chain of thought, but pruning bulky outputs) In this strategy, the full history of the Assistant's "Think" and "Tool Call" steps is preserved to maintain the reasoning chain, but the intermediate "Tool Results" (which often consume the most tokens) are omitted, except for the very latest round.
2nd interpretation: (Strict "Latest Round" window) In this strategy, all intermediate history, including both the Assistant's reasoning steps and the User's tool results, is discarded. The context is strictly limited to the original User Question and the single most recent round of Think & Tool Call/Result.
3rd interpretation: (Orphaned Result retention) In this strategy, the latest "Tool Result" is retained, but the specific "Tool Call" that generated it (and all prior history) is omitted. This effectively provides the model with the latest observation state without the immediate history of the action that caused it.
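To make the three interpretations concrete, here is a rough Python sketch of each pruning rule as I imagine it. The message schema (dicts with a `role` of `"user"` / `"assistant"` / `"tool"`) and all function names are my own assumptions for illustration, not anything from your blog post or evaluation code:

```python
# Assumed schema: a conversation is a list of dicts, where "assistant"
# messages carry Think + Tool Call steps and "tool" messages carry results.

def prune_keep_cot(messages):
    """1st interpretation: preserve all assistant Think/Tool Call steps,
    but drop every tool result except the latest one."""
    last_tool = max(
        (i for i, m in enumerate(messages) if m["role"] == "tool"),
        default=None,
    )
    return [
        m for i, m in enumerate(messages)
        if m["role"] != "tool" or i == last_tool
    ]

def prune_latest_round(messages):
    """2nd interpretation: keep only the original question plus the
    single most recent (assistant Think/Tool Call, tool result) round."""
    question = [m for m in messages if m["role"] == "user"][:1]
    return question + messages[-2:]

def prune_orphaned_result(messages):
    """3rd interpretation: keep the question and the latest tool result,
    dropping the tool call that produced it and all prior history."""
    question = [m for m in messages if m["role"] == "user"][:1]
    tools = [m for m in messages if m["role"] == "tool"]
    return question + tools[-1:]
```

For example, on a history of `[user, assistant, tool, assistant, tool]`, the first rule yields `[user, assistant, assistant, tool]`, the second `[user, assistant, tool]`, and the third `[user, tool]`.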
Thank you for pointing that out. The predefined threshold for HLE evaluation is 96k; once the task's context length exceeds this limit, Hide-Tool-Result Context Management is enabled.
Thanks for your reply. What about the token threshold for BrowseComp evaluation with context management? That point also doesn't seem to be clearly specified.
In the BrowseComp setting, we fully reproduce the discard-all logic proposed in the DeepSeek V3.2 report, so the threshold for triggering context management is 80% of the context limit.
@dawnmsg @Longhui98 , thank you for the helpful clarification about the specific token thresholds for HLE and BrowseComp evaluation.
Just to confirm the intended interpretation of Hide-Tool-Result: as in my 1st diagram interpretation above, the chain of thought (all Think and Tool Call steps) is fully preserved, but once the context length exceeds the threshold, only the most recent tool result is retained and older tool results are omitted to save space. Does that accurately reflect how the Hide-Tool-Result context management strategy works in practice?
Thanks again! 😊



