Any plan to open source the search agent framework?

#22

by CherryDurian - opened Nov 11, 2025

Discussion

CherryDurian

Nov 11, 2025

I’ve been trying to reproduce the BrowseComp and related results based on your description.
However, even with the same tool setup (search, browsing, and code tools), our in-house implementation scores much lower than the results reported in your benchmarks.

May I ask if there are any plans to open source the search agent framework (or at least a minimal reference version)?
It would be super helpful for the community to better understand and reproduce the results.

pandemo

Nov 11, 2025

I am wondering the same.

@CherryDurian

Was https://github.com/Alibaba-NLP/DeepResearch the search agent framework you used to try to reproduce K2's BrowseComp results? And did you use only three of its tools, or the full implementation, in your testing?

CherryDurian

Nov 12, 2025

Yep, exactly. I used the DeepResearch framework with just the three tools for my run.

pandemo

Nov 12, 2025

@CherryDurian

Ah, understood, thank you for clarifying. So if I understand correctly, your changes to the Tongyi repo were limiting it to the three tools you described earlier and swapping the default model for K2 Thinking? Did you change anything else relative to the default Tongyi implementation?

Also, did you use the official Kimi API or the OpenRouter API (which was having issues)? And, if I may ask, what BrowseComp results did you get?

Also, from what @dawnmsg mentioned in the discussion I started ( https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/5 ), effective context management seems to be essential. As far as I can tell, the default Tongyi inference implementation in their repo doesn’t currently include context management, though their open-sourced WebResummer ( https://github.com/Alibaba-NLP/DeepResearch/tree/main/WebAgent/WebResummer ) and the not-yet-open-sourced AgentFold appear to be Tongyi’s approaches for handling it. Then again, Moonshot's blog ( https://moonshotai.github.io/Kimi-K2/thinking.html ) describes what seems to be a much simpler approach: "When tool execution results cause the accumulated input to exceed the model's context limit (256k), we employ a simple context management strategy that hides all previous tool outputs."
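For anyone wiring this into their own harness, here is a minimal sketch of how the quoted strategy could look. It is not Moonshot's code. The OpenAI-style message dicts, the `HIDDEN` placeholder text, the rough 4-characters-per-token estimate, and the choice to keep the most recent tool output visible are all assumptions of mine; the blog only says that previous tool outputs are hidden once the 256k limit is exceeded.

```python
# Hypothetical placeholder; the blog does not say what replaces a hidden output.
HIDDEN = "[tool output hidden]"


def approx_tokens(messages):
    """Crude size estimate (about 4 characters per token); swap in a real tokenizer."""
    return sum(len(m.get("content") or "") for m in messages) // 4


def hide_previous_tool_outputs(messages, limit=256_000, count_tokens=approx_tokens):
    """Return a copy of `messages` with older tool outputs hidden when over `limit`.

    Assumption: the most recent tool output stays visible so the model can
    still act on it; every earlier tool message has its content replaced.
    """
    if count_tokens(messages) <= limit:
        return list(messages)

    tool_positions = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    latest = tool_positions[-1] if tool_positions else -1

    trimmed = []
    for i, m in enumerate(messages):
        if m.get("role") == "tool" and i != latest:
            m = {**m, "content": HIDDEN}  # copy, so the caller's trace is untouched
        trimmed.append(m)
    return trimmed
```

Calling `hide_previous_tool_outputs(history)` before each model request would leave short histories untouched and only rewrite tool messages once the estimate crosses the limit.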

CherryDurian

Nov 12, 2025

Yeah, that’s possible.

Interestingly, in the Kimi traces, most BrowseComp runs didn’t go over 70 tool calls, and we rarely saw context-length errors. So if context isn’t the issue, I suspect the summarizer in the visit tool implementation might be introducing some errors. I’ll check this further.

Of course, it could also be something with the Kimi model itself; finding the real reason needs more digging.
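A quick way to run this kind of check on saved traces is sketched below. It assumes each trace is a list of OpenAI-style message dicts in which assistant messages carry a `tool_calls` list; that storage format and the function names are my assumptions, not something the DeepResearch repo prescribes.

```python
def count_tool_calls(trace):
    """Count tool calls issued by the assistant in one trace."""
    return sum(
        len(m.get("tool_calls") or [])
        for m in trace
        if m.get("role") == "assistant"
    )


def summarize(traces, threshold=70):
    """Report how many traces exceed `threshold` tool calls."""
    counts = [count_tool_calls(t) for t in traces]
    return {
        "traces": len(counts),
        "max_calls": max(counts, default=0),
        "over_threshold": sum(c > threshold for c in counts),
    }
```

If `over_threshold` stays near zero, long tool-call chains are unlikely to explain the gap, which points back at the visit tool's summarizer or the model itself.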

pandemo

Nov 12, 2025

Yes, very good point that Tongyi's summarization within the visit tool might be a source of errors. Please do let the community know if you find anything that more closely reproduces the official results 🙏. Also, you might have seen this already, but https://github.com/prnake/kimi-deepresearch was shared in the other discussion thread and, according to its author, scores 50+ on BrowseComp. It might be of interest.

pandemo

Nov 12, 2025

Also, @CherryDurian, in the other discussion thread, when you said that Tongyi's implementation "could only get about 40% of the reported score", did you mean that on BrowseComp it achieved about:

40% of 60% (K2's official BrowseComp score) = 24%
or
40% (on BrowseComp)?
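To make the two readings concrete, here is the arithmetic as a tiny sketch. The 0.60 and 0.40 values are the figures quoted in this thread; the function name is just for illustration.

```python
def interpretations(fraction, official_score):
    """Return both readings of 'about `fraction` of the reported score'."""
    return {
        "relative": fraction * official_score,  # fraction of the official score
        "absolute": fraction,                   # the benchmark score itself
    }


readings = interpretations(0.40, 0.60)
print(f"relative: {readings['relative']:.0%}, absolute: {readings['absolute']:.0%}")
```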
