Guide to run Kimi K2.5 locally on your device.
Hey guys, we made a guide to run the model locally. You'll need 240GB of RAM or unified memory for best results.
Note that VRAM is not required.
You can run it on a Mac with 256GB of unified memory, or at similar speeds on a machine with 256GB of RAM and no VRAM.
You can even run it with much less memory (e.g. 80GB RAM), since it'll offload, but it'll be slower.
Guide: https://unsloth.ai/docs/models/kimi-k2.5
GGUFs to run: https://huggingface.co/unsloth/Kimi-K2.5-GGUF
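If you only want one quantization rather than the whole repo, something like the sketch below might help. It uses `snapshot_download` from `huggingface_hub` with a glob filter. The helper names, the default `local_dir`, and the idea that each quant's files contain its tag in the filename are assumptions, so check the repo's file listing before relying on the pattern.

```python
def quant_patterns(quant_tag):
    """Build the glob pattern list used to filter files for one quant tag.

    Assumption: files for a quantization contain its tag somewhere in
    their path. Verify this against the actual repo file listing.
    """
    return [f"*{quant_tag}*"]


def download_quant(quant_tag, local_dir="Kimi-K2.5-GGUF"):
    """Download only the files matching quant_tag from the GGUF repo."""
    # Imported here so quant_patterns() works without the dependency installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="unsloth/Kimi-K2.5-GGUF",
        allow_patterns=quant_patterns(quant_tag),
        local_dir=local_dir,
    )
```

Calling `download_quant("<tag>")` with a tag copied from the repo would start the download. These files are hundreds of GB, so make sure the disk has room first.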
What's the quality of the output? Does it give the same quality in writing and tool calling for agentic work as the full model?
Hi @youhanasheriff,
Great question! Here's what you should expect from the GGUF quantized versions:
Quality Expectations
| Quantization | Size | Quality Impact |
|---|---|---|
| Q8_0 | ~530GB | Virtually identical to FP16 (<1% degradation) |
| Q6_K | ~400GB | Excellent quality, minimal loss |
| Q4_K_M | ~280GB | Good quality, slight degradation on complex tasks |
| Q3_K_M | ~210GB | Noticeable quality drop, still usable |
| Q2_K | ~150GB | Significant degradation, for testing only |
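To make the table a bit more actionable, here is a rough sketch that picks the largest quantization whose listed size fits entirely in a given memory budget. The sizes are copied from the table above and are approximate. The function name and the `headroom_gb` parameter are made up for illustration.

```python
# Approximate sizes in GB, copied from the table above (not measured here).
QUANT_SIZES_GB = {
    "Q8_0": 530,
    "Q6_K": 400,
    "Q4_K_M": 280,
    "Q3_K_M": 210,
    "Q2_K": 150,
}


def largest_quant_that_fits(memory_gb, headroom_gb=0):
    """Return the largest quant that fits in memory, or None if none do.

    headroom_gb is a hypothetical reserve for the OS and context cache.
    """
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    # The largest file that still fits is the least aggressive quantization.
    return max(fitting, key=fitting.get)


print(largest_quant_that_fits(256))  # Q3_K_M
```

This only answers "does it fit fully in memory". A quant that does not fit can still run with offloading, just slower.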
For Agentic/Tool Calling
Tool calling and agentic tasks are more sensitive to quantization than general chat because:
- Structured JSON output requires precise token prediction
- Multi-step reasoning accumulates small errors
- Code generation needs exact syntax
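One way to see this sensitivity in practice is to validate every tool call the model emits and count the failures per quantization. The sketch below assumes a simple `{"name": ..., "arguments": {...}}` JSON shape, which is a hypothetical schema for illustration and not necessarily the format Kimi-K2.5 uses.

```python
import json


def parse_tool_call(raw, allowed_tools):
    """Return (call, error): a dict and None on success, None and a message on failure.

    Assumes a hypothetical {"name": str, "arguments": object} schema.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        # A single wrong token (missing brace, stray quote) lands here.
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "top-level value is not an object"
    name = call.get("name")
    if not isinstance(name, str) or name not in allowed_tools:
        return None, f"unknown tool: {name!r}"
    if not isinstance(call.get("arguments"), dict):
        return None, "arguments must be an object"
    return call, None


tools = {"get_weather"}
good = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
bad = '{"name": "get_weather", "arguments": {"city": "Paris"}'  # missing closing brace
print(parse_tool_call(good, tools)[1])  # None
print(parse_tool_call(bad, tools)[0])   # None
```

Running the same prompts through two quantizations and comparing the error rate gives a quick, if crude, signal of how much structured output has degraded.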
Recommendations:
- For serious agentic work: Q6_K or Q8_0
- For casual use/testing: Q4_K_M works reasonably well
- Avoid Q3 and below for tool calling
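These recommendations can be written down as a small lookup, for example to warn in a script before launching an agent on a quant that is too aggressive. This is only a sketch: the use-case labels `"agentic"` and `"casual"` and the function name are invented here, and the thresholds just restate the bullets above.

```python
# Quantizations from the table, ordered from most to least aggressive.
QUANT_ORDER = ["Q2_K", "Q3_K_M", "Q4_K_M", "Q6_K", "Q8_0"]

# Minimum recommended quant per (hypothetical) use-case label.
MIN_QUANT = {"agentic": "Q6_K", "casual": "Q4_K_M"}


def is_recommended(quant, use_case):
    """True if quant is at or above the minimum suggested for use_case."""
    return QUANT_ORDER.index(quant) >= QUANT_ORDER.index(MIN_QUANT[use_case])


print(is_recommended("Q4_K_M", "agentic"))  # False
print(is_recommended("Q4_K_M", "casual"))   # True
```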
Reality Check
The full FP16/INT4 model on GPU clusters will always outperform GGUF on CPU/RAM, but for local experimentation and development, the Q6_K/Q8_0 quantizations are remarkably good.
The Unsloth team has done excellent work optimizing these quantizations specifically for Kimi-K2.5.
Hope this helps!
