Instructions for using moonshotai/Kimi-K2.5 with libraries and local apps.

Libraries

How to use moonshotai/Kimi-K2.5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="moonshotai/Kimi-K2.5", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Run the pipeline on the chat messages and print the generated output
result = pipe(text=messages)
print(result)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use moonshotai/Kimi-K2.5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "moonshotai/Kimi-K2.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
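
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `send_chat_request` and the 600-second timeout are illustrative assumptions, not part of vLLM. It assumes the server started above is listening on `localhost:8000`.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(base_url: str, payload: dict, timeout: float = 600.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Same model, prompt, and image URL as the curl example above.
payload = build_chat_payload(
    "moonshotai/Kimi-K2.5",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Uncomment once the server is running:
# reply = send_chat_request("http://localhost:8000", payload)
```

Keeping payload construction separate from the network call makes it easy to point the same body at a different host or port.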

Use Docker

docker model run hf.co/moonshotai/Kimi-K2.5

SGLang

How to use moonshotai/Kimi-K2.5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moonshotai/Kimi-K2.5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
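
The server replies with an OpenAI-style chat completion object, where the generated text sits under `choices[0].message.content`. Below is a small sketch of pulling that text out; `extract_reply` is a hypothetical helper, and `sample_response` is a made-up body for illustration, not real model output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    content = choices[0].get("message", {}).get("content")
    if content is None:
        raise ValueError("first choice has no message content")
    return content


# Hypothetical response body, invented for illustration only.
sample_response = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "model": "moonshotai/Kimi-K2.5",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "A green statue stands on an island.",
            },
            "finish_reason": "stop",
        }
    ],
}

print(extract_reply(sample_response))
```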

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moonshotai/Kimi-K2.5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use moonshotai/Kimi-K2.5 with Docker Model Runner:
```
docker model run hf.co/moonshotai/Kimi-K2.5
```

Guide to run Kimi K2.5 locally on your device.

#19

by shimmyshimmer - opened Jan 28


shimmyshimmer

Jan 28

Hey guys, we made a guide to run the model locally. You'll need 240GB of RAM or unified memory for best results.

Note that VRAM is not required.
You can run it on a Mac with 256GB of unified memory, or on a machine with 256GB of RAM and no VRAM, at similar speeds.

You can even run it with much less memory (e.g. 80GB of RAM), since it will offload, but it'll be slower.

Guide: https://unsloth.ai/docs/models/kimi-k2.5
GGUFs to run: https://huggingface.co/unsloth/Kimi-K2.5-GGUF
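
As a rough sketch of the memory figures in the post above, the check below sorts a machine into one of three buckets. The 240GB and 80GB values come from the post. Treating 80GB as a lower cutoff is an assumption made for illustration, since the post only gives it as an example, and `describe_memory_fit` is a hypothetical helper.

```python
BEST_RESULTS_GB = 240     # "240GB RAM or unified memory for best results" (from the post)
OFFLOAD_EXAMPLE_GB = 80   # "e.g. 80GB RAM" (from the post; used as a cutoff by assumption)


def describe_memory_fit(available_gb: float) -> str:
    """Classify a machine by its RAM or unified memory, in GB."""
    if available_gb >= BEST_RESULTS_GB:
        return "best results"
    if available_gb >= OFFLOAD_EXAMPLE_GB:
        return "runs with offloading, slower"
    return "below the 80GB example"


for gb in (256, 128, 64):
    print(f"{gb}GB: {describe_memory_fit(gb)}")
```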

youhanasheriff

Jan 30 (edited Jan 30)

What's the quality of the output? Does it match the full model's quality in writing and in tool calling for agentic work?

ThanhNguyxn

Feb 2

Hi @youhanasheriff,

Great question! Here's what you should expect from the GGUF quantized versions:

Quality Expectations

| Quantization | Size | Quality Impact |
| --- | --- | --- |
| Q8_0 | ~530GB | Virtually identical to FP16 (<1% degradation) |
| Q6_K | ~400GB | Excellent quality, minimal loss |
| Q4_K_M | ~280GB | Good quality, slight degradation on complex tasks |
| Q3_K_M | ~210GB | Noticeable quality drop, still usable |
| Q2_K | ~150GB | Significant degradation, for testing only |
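
To make the table concrete, here is a minimal sketch that picks the largest quantization fitting entirely in a given memory budget. The sizes are the approximate figures quoted in the table above, not measured values, and `largest_quant_that_fits` is a hypothetical helper. It ignores KV cache, context length, and OS overhead, so real headroom will be smaller.

```python
# Approximate sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q8_0": 530,
    "Q6_K": 400,
    "Q4_K_M": 280,
    "Q3_K_M": 210,
    "Q2_K": 150,
}


def largest_quant_that_fits(memory_gb: float, sizes: dict = QUANT_SIZES_GB):
    """Return the name of the largest quantization whose size is <= memory_gb, or None."""
    fitting = {name: size for name, size in sizes.items() if size <= memory_gb}
    if not fitting:
        return None
    # The largest file that still fits is the highest-precision option in the table.
    return max(fitting, key=fitting.get)


for budget in (512, 256, 100):
    print(budget, largest_quant_that_fits(budget))
```

A budget below the smallest listed size returns `None`; running in that case relies on offloading and will be slower.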

For Agentic/Tool Calling

Tool calling and agentic tasks are more sensitive to quantization than general chat because:

Structured JSON output requires precise token prediction
Multi-step reasoning accumulates small errors
Code generation needs exact syntax
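
One way to see why structured output is fragile: a single dropped token can make a tool call unparseable. The sketch below validates a model-emitted tool call before it is executed. The `name`/`arguments` layout and the `get_weather` tool are assumptions chosen for illustration and are not Kimi-K2.5's actual tool-call format.

```python
import json


def parse_tool_call(raw: str, known_tools: set) -> dict:
    """Parse a JSON tool call and check its basic shape, raising ValueError on problems."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if name not in known_tools:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(call.get("arguments"), dict):
        raise ValueError("arguments must be a JSON object")
    return call


tools = {"get_weather"}  # hypothetical tool registry

good = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
broken = '{"name": "get_weather", "arguments": {"city": "Paris"}'  # one closing brace missing

print(parse_tool_call(good, tools)["arguments"])

try:
    parse_tool_call(broken, tools)
except ValueError as err:
    print("rejected:", err)
```

Counting how often calls are rejected across a fixed set of prompts gives a simple way to compare quantization levels for agentic use.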

Recommendations:

For serious agentic work: Q6_K or Q8_0
For casual use/testing: Q4_K_M works reasonably well
Avoid Q3 and below for tool calling

Reality Check

The full FP16/INT4 model on GPU clusters will always outperform GGUF on CPU/RAM, but for local experimentation and development, the Q6_K/Q8_0 quantizations are remarkably good.

The Unsloth team has done excellent work optimizing these quantizations specifically for Kimi-K2.5.

Hope this helps!
