Any plan to open source the search agent framework?
I’ve been trying to reproduce the BrowseComp and related results based on your description.
However, even with the same tool setup (search, browsing, and code tools), our in-house implementation scores much lower than the results reported in your benchmarks.
May I ask if there are any plans to open source the search agent framework (or at least a minimal reference version)?
It would be super helpful for the community to better understand and reproduce the results.
I am wondering the same.
Was https://github.com/Alibaba-NLP/DeepResearch the search agent framework you used to try to reproduce K2's BrowseComp results? And did you use only 3 of its tools, or the full implementation, in your testing?
Yep, exactly. I used the DeepResearch framework and just the three tools for my run.
Ah, understood, thank you for clarifying. So if I understood correctly, the changes you made from the Tongyi repo involved limiting it to the 3 tools you described earlier, and swapping the default model for K2 Thinking? Were there any other changes you made compared to the default Tongyi implementation?
Also, did you use the official Kimi API or the OpenRouter API (which was having issues)? And, if I may ask, what BrowseComp results did you get?
Also, from what @dawnmsg mentioned in the discussion I started ( https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/5 ), effective context management seems to be essential. As far as I can tell, the default Tongyi inference implementation in their repo doesn’t currently include context management, though their open-sourced WebResummer ( https://github.com/Alibaba-NLP/DeepResearch/tree/main/WebAgent/WebResummer ) and the not-yet-open-sourced AgentFold appear to be Tongyi’s approaches for handling it. Then again, Moonshot's blog ( https://moonshotai.github.io/Kimi-K2/thinking.html ) describes what seems to be a much simpler approach: "When tool execution results cause the accumulated input to exceed the model's context limit (256k), we employ a simple context management strategy that hides all previous tool outputs."
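For anyone trying to replicate this, the quoted strategy could be sketched roughly as below. This is only a minimal illustration under stated assumptions, not Moonshot's actual implementation: the OpenAI-style message dicts, the `approx_tokens` character-based estimate, the placeholder string, and the choice to keep the most recent tool output visible are all hypothetical readings of the one-sentence description in the blog.

```python
CONTEXT_LIMIT = 256_000  # tokens; the 256k limit quoted from the blog
HIDDEN_PLACEHOLDER = "[tool output hidden]"  # hypothetical placeholder text


def approx_tokens(text):
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return len(text) // 4 + 1


def total_tokens(messages, count=approx_tokens):
    """Estimate the size of the accumulated input."""
    return sum(count(m.get("content") or "") for m in messages)


def hide_previous_tool_outputs(messages, limit=CONTEXT_LIMIT, count=approx_tokens):
    """Return a copy of `messages`; if the history exceeds `limit`,
    replace every tool output except the most recent one with a placeholder.

    Assumption: "previous" is read as "all tool outputs before the latest".
    """
    if total_tokens(messages, count) <= limit:
        return [dict(m) for m in messages]

    tool_positions = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    latest = tool_positions[-1] if tool_positions else None

    trimmed = []
    for i, m in enumerate(messages):
        if m.get("role") == "tool" and i != latest:
            trimmed.append({**m, "content": HIDDEN_PLACEHOLDER})
        else:
            trimmed.append(dict(m))
    return trimmed
```

In an agent loop this would run on the message history right before each model call. Swapping `approx_tokens` for the model's real tokenizer would make the limit check meaningful.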
Yeah, that’s possible.
Interestingly, in the Kimi traces, most BrowseComp runs stayed under 70 tool calls, and we rarely saw context-length errors. So if context isn’t the issue, I suspect the summarizer in the visit tool implementation might be introducing some errors. I’ll check this further.
Of course, it could also be something with the Kimi model itself. The real reason needs more digging.
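A check like the one described (how many tool calls each run makes, and how they split across tools) could be done along these lines. This is a rough sketch that assumes each trajectory is stored as a list of OpenAI-style chat messages, where assistant messages carry a `tool_calls` list. The function names and the default threshold of 70 are illustrative choices, and the actual trace format used above is not known.

```python
from collections import Counter


def count_tool_calls(messages):
    """Count tool calls in one trajectory of OpenAI-style chat messages."""
    return sum(
        len(m.get("tool_calls") or [])
        for m in messages
        if m.get("role") == "assistant"
    )


def calls_by_tool(messages):
    """Break the tool calls of one trajectory down by tool name."""
    names = Counter()
    for m in messages:
        for call in m.get("tool_calls") or []:
            names[call["function"]["name"]] += 1
    return names


def summarize_trajectories(trajectories, threshold=70):
    """Report per-run call counts and how many runs exceed `threshold`."""
    counts = [count_tool_calls(t) for t in trajectories]
    return {
        "counts": counts,
        "max": max(counts, default=0),
        "over_threshold": sum(1 for c in counts if c > threshold),
    }
```

Comparing `calls_by_tool` between correct and incorrect runs is one way to see whether failures cluster around the visit tool, which is where the summarizer sits.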
Yes, very good point that Tongyi's summarization within the visit tool might be a source of errors. Please do let the community know if you find anything that more closely reproduces the official results🙏. Also, you might have seen this already, but https://github.com/prnake/kimi-deepresearch was shared in the other discussion thread and, according to the author, scores 50+ on BrowseComp. Might be of interest.
Also, @CherryDurian , in the other discussion thread, when you said that Tongyi's implementation "could only get about 40% of the reported score", did you mean that on BrowseComp it could only achieve about:
- 40% of 60% (K2's official BrowseComp score) = 24%, or
- 40% (absolute, on BrowseComp)?
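To make the two readings concrete, here is the arithmetic as a tiny sketch. The 60% figure is the official score as cited in this thread, and the helper name is just for illustration.

```python
def relative_to_absolute(fraction_pct, reported_pct):
    """Convert 'X% of the reported score' into an absolute score in percent."""
    return fraction_pct * reported_pct / 100


reported = 60.0  # K2's official BrowseComp score as cited in this thread

reading_a = relative_to_absolute(40.0, reported)  # "40% of the reported score"
reading_b = 40.0                                  # "40% on BrowseComp" (absolute)

print(f"relative reading: {reading_a:.0f}%")
print(f"absolute reading: {reading_b:.0f}%")
```

The gap between the two readings is 16 percentage points, so which one was meant matters a lot for judging how far the reproduction is from the official number.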