Kimi-K2-Thinking: tool calls fail when self-hosting with vLLM
CUDA ARCH sm120 Driver Version: 580.95.05 CUDA Version: 13.0
OS Ubuntu 22.04 Linux 6.8.0-87-generic x86_64 x86_64
vllm 0.11.2.dev618+ga238cbd89.d20251206.cu130
8x NVIDIA RTX PRO 6000 Blackwell Workstation 96GB
vllm serve Kimi-K2-Thinking \
  --served-model-name llm_model \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 262144 \
  --port 9999
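To check whether the problem is in the server rather than in a particular client, the same endpoint can be queried directly with a tool-enabled request. The sketch below is only an illustration: the port and model name come from the command above, while the `read_file` tool and its schema are hypothetical and exist only for this repro.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9999/v1"  # --port 9999 from the serve command
MODEL = "llm_model"                    # --served-model-name from the serve command


def build_payload(prompt: str) -> dict:
    """Build a chat-completions request that offers one tool to the model."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    # Hypothetical tool, defined only for this repro.
                    "name": "read_file",
                    "description": "Read the first lines of a text file.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "path": {"type": "string"},
                            "n_lines": {"type": "integer"},
                        },
                        "required": ["path"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def send(payload: dict) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Show the request body that would be sent; no network access happens here.
print(json.dumps(build_payload("Read the first 50 lines of client.py"), indent=2))
```

With the server running, `send(build_payload(...))["choices"][0]["message"]` should carry the calls in `tool_calls`. If raw `<|tool_call_begin|>` markers show up in `content` instead, the parsing failed on the server side.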
Kimi CLI config:
{
"default_model": "",
"models": {"kimi": {
"provider": "openai",
"model": "llm_model",
"max_context_size": 160000
}},
"providers": {"openai": {
"type": "openai_legacy",
"base_url": "http://localhost:9999/v1",
"api_key": "your-secret-key"
}},
"loop_control": {
"max_steps_per_run": 100,
"max_retries_per_step": 3
},
"services": {}
}
kimi, version 0.61
kimi -m kimi
user@20251130_base_model✨ what is this repo about
╭─────────────────────────╮
│ what is this repo about │
╰─────────────────────────╯
• I'll help you understand what this repository is about. Let me explore the key files and structure.
• Used Shell (dir /b)
• Used ReadFile (client.py)
• Used ReadFile (solar-system.html)
• Used Shell (dir temperature_outputs_1)
• I'll help you understand what this repository is about. Let me explore the key files and structure.
  <|tool_call_begin|> functions.Shell:0 <|tool_call_argument_begin|> {"command": "dir /b", "timeout": 10} <|tool_call_end|>
  <|tool_call_begin|> functions.ReadFile:1 <|tool_call_argument_begin|> {"path": "D:\workspace\20251130_base_model\client.py", "n_lines": 50} <|tool_call_end|>
  <|tool_call_begin|> functions.ReadFile:2 <|tool_call_argument_begin|> {"path": "D:\workspace\20251130_base_model\solar-system.html", "n_lines": 50} <|tool_call_end|>
  <|tool_call_begin|> functions.Shell:3 <|tool_call_argument_begin|> {"command": "dir temperature_outputs_1", "timeout": 10} <|tool_call_end|>
• Let me look at the actual content of client.py (beyond the first 50 lines) to see what it does, and possibly look at some of the output files to understand the full context.
• Used Grep (def )
• Used ReadFile (temperature_outputs_1\output_temp_0.50.txt)
• Used ReadFile
Invalid arguments
• Let me look at the actual content of client.py (beyond the first 50 lines) to see what it does, and possibly look at some of the output files to understand the full context.
  <|tool_call_begin|> functions.Grep:4 <|tool_call_argument_begin|> {"pattern": "def ", "path": "D:\workspace\20251130_base_model\client.py", "output_mode": "content", "-n": true} <|tool_call_end|>
  <|tool_call_begin|> functions.ReadFile:5 <|tool_call_argument_begin|> {"path": "D:\workspace\20251130_base_model\temperature_outputs_1\output_temp_0.50.txt"} <|tool_call_end|>
  <|tool_call_begin|> functions.ReadFile:6 {"path": "D:\workspace\20251130_base_model\Data_2013.json", "n_lines": 30} <|tool_call_end|>
LLM provider error: Error code: 400 - {'error': {'message': "1 validation error for ValidatorIterator\n2.function.arguments\n Field required \n For further information visit https://errors.pydantic.dev/2.12/v/missing None", 'type': 'BadRequestError', 'param': None, 'code': 400}}
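One detail in the raw output of the second turn: the third call (`functions.ReadFile:6`) has no `<|tool_call_argument_begin|>` marker before its JSON, and the 400 error points at `2.function.arguments`, i.e. index 2. That would be consistent with the arguments of that call never being extracted, although this is an inference from the log and not something confirmed in the vLLM code. The sketch below is not vLLM's parser. It is a small standalone check, using only the marker strings visible in the log, that flags calls missing the argument marker. The sample string is abbreviated from the log, with shortened placeholder paths.

```python
import re

# Marker strings copied from the raw model output in this post.
CALL_BEGIN = "<|tool_call_begin|>"
ARG_BEGIN = "<|tool_call_argument_begin|>"
CALL_END = "<|tool_call_end|>"

# One tool-call section: everything between the begin and end markers.
SECTION_RE = re.compile(
    re.escape(CALL_BEGIN) + r"(.*?)" + re.escape(CALL_END), re.DOTALL
)


def find_malformed_calls(raw: str) -> list[tuple[int, str]]:
    """Return (index, call_id) for every section that lacks the argument marker."""
    malformed = []
    for index, match in enumerate(SECTION_RE.finditer(raw)):
        body = match.group(1)
        if ARG_BEGIN not in body:
            tokens = body.split()
            # The first whitespace-delimited token is the call id, if present.
            call_id = tokens[0] if tokens else ""
            malformed.append((index, call_id))
    return malformed


# Abbreviated from the second turn in the log; paths are placeholders.
raw_turn = (
    '<|tool_call_begin|> functions.Grep:4 <|tool_call_argument_begin|> '
    '{"pattern": "def "} <|tool_call_end|> '
    '<|tool_call_begin|> functions.ReadFile:5 <|tool_call_argument_begin|> '
    '{"path": "a.txt"} <|tool_call_end|> '
    '<|tool_call_begin|> functions.ReadFile:6 '
    '{"path": "b.json", "n_lines": 30} <|tool_call_end|>'
)

print(find_malformed_calls(raw_turn))  # [(2, 'functions.ReadFile:6')]
```

Running the same check on the first turn, where every call carries the marker, returns an empty list. So the malformed section lines up with the index reported in the validation error.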
The same issue occurs with Kilo Code and Cline.
Applying this PR fixed the issue for a while:
https://github.com/vllm-project/vllm/pull/24847
But now I am seeing a different problem: the output contains repeated "(no content)" tokens, for example:
Let me check all the places where webSearchEnabled is used to understand the logic flow. (no content) (no content) (no content) (no content) (no content) (no content) (no content) (no content) (no content) (no content) (no content) (no content) (no content) (no content)