Observing (no content) response from the model randomly
#18
by shivamashtikar - opened
Observing `(no content)` randomly in responses from the model, appearing in between text, reasoning, and tool calls when running the model with vLLM. Observed similar behavior with sglang, where tool calls themselves were not working.
```json
{
  "role": "assistant",
  "content": [
    {"type": "text", "text": "(no content)"},
    {"type": "tool_use", "id": "functions.Read:2", "name": "Read", "input": {"file_path": "/Users/shivam.ashtikar/workspace/opencode/README.md"}},
    {"type": "tool_use", "id": "functions.Bash:3", "name": "Bash", "input": {"command": "ls -la /Users/shivam.ashtikar/workspace/opencode/packages", "description": "List packages directory structure"}},
    {"type": "tool_use", "id": "functions.Read:4", "name": "Read", "input": {"file_path": "/Users/shivam.ashtikar/workspace/opencode/package.json"}, "cache_control": {"type": "ephemeral"}}
  ]
}
```
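As a client-side workaround until a proper parser fix lands, one can strip these placeholder text blocks before persisting or replaying the conversation. A minimal sketch, assuming the block-list message shape shown above; the helper name is hypothetical and not part of any library:

```python
def strip_no_content_blocks(message: dict) -> dict:
    """Return a copy of an assistant message with '(no content)' text blocks removed.

    Assumes `content` is a list of typed blocks as in the example above;
    string content and other block types are passed through unchanged.
    """
    content = message.get("content")
    if not isinstance(content, list):
        return message  # plain-string content: nothing to filter
    cleaned = [
        block
        for block in content
        if not (block.get("type") == "text" and block.get("text") == "(no content)")
    ]
    return {**message, "content": cleaned}
```

Unlike patching the server, this drops only the literal placeholder blocks on the client side and leaves tool calls and real text intact.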
Here is the code modification I had to make in vLLM to get rid of the `(no content)` tokens, though it leads to data loss:
https://github.com/vllm-project/vllm/pull/33248
Sharing the vLLM configuration here too:
```shell
.venv/bin/vllm serve moonshotai/Kimi-K2.5 \
--host 0.0.0.0 \
--port 8000 \
--chat-template ./chat_template.jinja \
--tokenizer-mode auto \
--mm-encoder-tp-mode data \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--gpu-memory-utilization 0.9 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--max-num-seqs 64 \
--trust-remote-code \
--safetensors-load-strategy eager \
--decode-context-parallel-size 8 \
--served-model-name kimi-k2-5 \
--cudagraph-metrics \
--enable-mfu-metrics \
--kv-cache-metrics \
--kv-cache-metrics-sample 0.05 \
--max-cudagraph-capture-size 1024 \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--enable-chunked-prefill \
--enable-prefix-caching \
--override-generation-config '{"temperature": 1, "top_p": 0.95, "repetition_penalty": 1.05, "top_k": 25, "max_new_tokens": 32384}'
```