Kimi-K2-Thinking: tool calls fail when self-hosting with vLLM

#40

by CHONGYOEYAT - opened Dec 8, 2025

Discussion

CHONGYOEYAT

Dec 8, 2025

CUDA ARCH sm120 Driver Version: 580.95.05 CUDA Version: 13.0
OS Ubuntu 22.04 Linux 6.8.0-87-generic x86_64 x86_64
vllm 0.11.2.dev618+ga238cbd89.d20251206.cu130
8x NVIDIA RTX PRO 6000 Blackwell Workstation 96GB

vllm serve Kimi-K2-Thinking \
    --served-model-name llm_model \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90 \
    --max-model-len 262144 \
    --port 9999
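
To check whether the failure depends on the client, a request with tools can be sent straight to the server started above. The following is a minimal sketch, not a confirmed reproduction: the base URL and model name come from the serve command (`--port 9999`, `--served-model-name llm_model`), while the `Shell` tool schema is a hypothetical stand-in modeled on the `command` and `timeout` arguments visible in the log below.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9999/v1"  # from --port 9999 in the serve command
MODEL = "llm_model"                    # from --served-model-name


def build_payload(prompt: str) -> dict:
    """Build an OpenAI-style chat request that offers one tool to the model."""
    # Hypothetical tool schema; the real Kimi CLI tool definitions may differ.
    shell_tool = {
        "type": "function",
        "function": {
            "name": "Shell",
            "description": "Run a shell command (illustrative schema only).",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string"},
                    "timeout": {"type": "integer"},
                },
                "required": ["command"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [shell_tool],
        "tool_choice": "auto",
    }


def send(payload: dict) -> dict:
    """POST the payload to /chat/completions and return the decoded JSON."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `send(build_payload("what is this repo about"))` returns the raw response; if `choices[0].message.content` still contains `<|tool_call_begin|>` markers, the tool-call text was not fully parsed on the server side.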

kimi cli
{
  "default_model": "",
  "models": {
    "kimi": {
      "provider": "openai",
      "model": "llm_model",
      "max_context_size": 160000
    }
  },
  "providers": {
    "openai": {
      "type": "openai_legacy",
      "base_url": "http://localhost:9999/v1",
      "api_key": "your-secret-key"
    }
  },
  "loop_control": {
    "max_steps_per_run": 100,
    "max_retries_per_step": 3
  },
  "services": {}
}

kimi, version 0.61

kimi -m kimi

user@20251130_base_model✨ what is this repo about
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ what is this repo about │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
• I'll help you understand what this repository is about. Let me explore the key files and structure.
• Used Shell (dir /b)
• Used ReadFile (client.py)
• Used ReadFile (solar-system.html)
• Used Shell (dir temperature_outputs_1)
• I'll help you understand what this repository is about. Let me explore the key files and structure.
<|tool_call_begin|> functions.Shell:0 <|tool_call_argument_begin|> {"command": "dir /b", "timeout": 10} <|tool_call_end|>
<|tool_call_begin|> functions.ReadFile:1 <|tool_call_argument_begin|> {"path": "D:\workspace\20251130_base_model\client.py", "n_lines": 50} <|tool_call_end|>
<|tool_call_begin|> functions.ReadFile:2 <|tool_call_argument_begin|> {"path": "D:\workspace\20251130_base_model\solar-system.html", "n_lines": 50} <|tool_call_end|>
<|tool_call_begin|> functions.Shell:3 <|tool_call_argument_begin|> {"command": "dir temperature_outputs_1", "timeout": 10} <|tool_call_end|>
• Let me look at the actual content of client.py (beyond the first 50 lines) to see what it does, and possibly look at some of the output files to understand the full context.
• Used Grep (def )
• Used ReadFile (temperature_outputs_1\output_temp_0.50.txt)
• Used ReadFile
Invalid arguments
• Let me look at the actual content of client.py (beyond the first 50 lines) to see what it does, and possibly look at some of the output files to understand the full context.
<|tool_call_begin|> functions.Grep:4 <|tool_call_argument_begin|> {"pattern": "def ", "path": "D:\workspace\20251130_base_model\client.py", "output_mode": "content", "-n": true} <|tool_call_end|>
<|tool_call_begin|> functions.ReadFile:5 <|tool_call_argument_begin|> {"path": "D:\workspace\20251130_base_model\temperature_outputs_1\output_temp_0.50.txt"} <|tool_call_end|>
<|tool_call_begin|> functions.ReadFile:6 {"path": "D:\workspace\20251130_base_model\Data_2013.json", "n_lines": 30} <|tool_call_end|>
LLM provider error: Error code: 400 - {'error': {'message': "1 validation error for ValidatorIterator\n2.function.arguments\n Field required \n For further information visit https://errors.pydantic.dev/2.12/v/missing None", 'type': 'BadRequestError', 'param': None, 'code': 400}}
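
In the failing turn above, the third call (`functions.ReadFile:6`) has no `<|tool_call_argument_begin|>` marker before its JSON, which would be consistent with the server reporting `function.arguments` as missing at index 2. That is an inference from the log, not a confirmed diagnosis. The sketch below scans raw output for calls in that state; the marker strings are copied from the log, while `split_tool_calls` is a hypothetical helper and not vLLM's `kimi_k2` parser.

```python
import re

# Marker strings as they appear in the raw output above.
CALL_BEGIN = "<|tool_call_begin|>"
CALL_END = "<|tool_call_end|>"
ARG_BEGIN = "<|tool_call_argument_begin|>"


def split_tool_calls(raw: str) -> list[dict]:
    """Split raw model text into tool calls and flag those missing ARG_BEGIN."""
    calls = []
    pattern = re.escape(CALL_BEGIN) + r"(.*?)" + re.escape(CALL_END)
    for body in re.findall(pattern, raw, flags=re.DOTALL):
        head, marker, args = body.partition(ARG_BEGIN)
        if marker:
            call_id, arguments = head.strip(), args.strip()
        else:
            # No argument marker: fall back to splitting at the first "{".
            head, brace, rest = body.partition("{")
            call_id = head.strip()
            arguments = (brace + rest).strip() if brace else None
        calls.append(
            {"id": call_id, "arguments": arguments, "well_formed": bool(marker)}
        )
    return calls


# Shortened sample in the same shape as the failing turn (paths abbreviated).
sample = (
    '<|tool_call_begin|> functions.Grep:4 <|tool_call_argument_begin|> '
    '{"pattern": "def "} <|tool_call_end|> '
    '<|tool_call_begin|> functions.ReadFile:6 '
    '{"path": "Data_2013.json", "n_lines": 30} <|tool_call_end|>'
)

for call in split_tool_calls(sample):
    print(call["id"], call["well_formed"])
```

On the sample, the `functions.Grep:4` call is reported as well formed and the `functions.ReadFile:6` call is not, matching the call that the CLI shows as "Invalid arguments".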

The same issue occurs with Kilo Code and Cline.

shivamashtikar

Dec 23, 2025

Using this PR fixed the issue for a while:
https://github.com/vllm-project/vllm/pull/24847

But now I'm facing a different issue, where the output contains repeated "(no content)" tokens:

Let me check all the places where webSearchEnabled is used to understand the logic flow.     (no content) (no content) (no content)  (no content)  (no content) (no content) (no content) (no content) (no content)   (no content)  (no content)   (no content)  (no content)  (no content)
