performance of quantized models
I am looking at various quantized versions of K2.5 from unsloth, since my hardware can only hold UD-Q2_K_XL, UD-IQ3_XXS, or Q3_K_S. Is there any comparison between the quantized versions of Kimi K2.5 and other good open-source models like Qwen3? For example, Qwen3-235B-A22B-Instruct-2507 is about 470 GB, roughly the same size as Q3_K_S, but I am not sure whether Q3_K_S performs better than Qwen.
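One rough way to frame this size comparison is average bits per weight: file size in bits divided by total parameter count. A minimal sketch, assuming the 235B total parameters implied by the Qwen model name and, purely as an illustrative assumption, roughly 1T total parameters for Kimi K2.5 (check the model card for the real figure). The 470 GB size is the one quoted in the question.

```python
def bits_per_weight(file_size_gb: float, total_params: float) -> float:
    """Average bits stored per parameter, using decimal GB (1e9 bytes)."""
    return file_size_gb * 1e9 * 8 / total_params

QWEN3_PARAMS = 235e9          # taken from the model name (235B)
KIMI_PARAMS_ASSUMED = 1e12    # assumption for illustration, not a confirmed figure

qwen_bpw = bits_per_weight(470, QWEN3_PARAMS)
kimi_bpw = bits_per_weight(470, KIMI_PARAMS_ASSUMED)

print(f"235B model at 470 GB: {qwen_bpw:.2f} bits/weight")
print(f"~1T model at 470 GB:  {kimi_bpw:.2f} bits/weight")
```

Under those assumptions, equal file size means comparing a 16-bit 235B model against a sub-4-bit model about four times larger. The arithmetic says nothing about which one answers better; that still has to be measured.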
I've been out for a week due to life stuff, but I'm hoping to run some perplexity measurements on quants soon.
The best quant available is the "full size" Q4_X by AesSedai here: https://huggingface.co/AesSedai/Kimi-K2.5-GGUF which seems too big for your desired size. Aes' smaller quants should be quite good as well with compatibility with mainline llama.cpp
Hopefully I'll get some smaller ik_llama.cpp quants out soon, which will likely offer the best quality for a given footprint. You can see some earlier perplexity graphs for Kimi-K2-Thinking here: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF#quant-collection
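For anyone unfamiliar with the metric behind those graphs: perplexity is the exponential of the average negative log-likelihood per token, so lower is better, and a quant is usually judged by how close it stays to the unquantized baseline on the same text. A minimal sketch with made-up log-probabilities (hypothetical values, not measurements from any of these quants):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for the same text from two model files.
baseline_logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
quant_logprobs = [math.log(0.4), math.log(0.25), math.log(0.4)]

ppl_base = perplexity(baseline_logprobs)
ppl_quant = perplexity(quant_logprobs)

print(f"baseline PPL {ppl_base:.3f}, quant PPL {ppl_quant:.3f}")
```

The quant assigns lower probability to two of the three tokens, so its perplexity comes out higher than the baseline. The gap between the two numbers is what a perplexity graph plots against file size.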
In my own anecdotal experience, I prefer to use a more quantized larger model (e.g. DeepSeek-V3.2-Speciale or Kimi-K2.5) over a less quantized smaller model (e.g. Qwen3-235B).
> Aes' smaller quants should be quite good as well with compatibility with mainline llama.cpp
I had issues with the IQ2_XXS there. Very similar to what this guy is having with the IQ3_S: https://huggingface.co/AesSedai/Kimi-K2.5-GGUF/discussions/4
The unsloth UD-IQ2_XXS has been stable (with the q8_0 mmproj from Aes' repo), but the embedded template is dodgy, so I have to run it with the jukofyork fix here: https://huggingface.co/AesSedai/Kimi-K2.5-GGUF/discussions/1
> "full size" Q4_X by AesSedai here: https://huggingface.co/AesSedai/Kimi-K2.5-GGUF
This is probably the perfect way to run it right now, and has the fixed template baked in. I can't run it though -_-!
> some smaller ik_llama.cpp quants
😃
Thanks for the links, I'm trying to catch up on everything that happened in the past week hah...
I have some small quants trickling in now here: https://huggingface.co/ubergarm/Kimi-K2.5-GGUF
I'll get a perplexity graph going soon.
I haven't tried the mmproj stuff at all. I did use the most recently updated official chat template, but had some issues with pydantic-ai tool use. It works fine with the old Kimi-K2-Thinking chat template jinja, so I still need to test that more.
All my Kimi-K2.5 quants keep the active weights at full q8_0 and only smash the routed experts, so hopefully there is no looping etc.
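To see why a recipe like that can still shrink the file a lot, here is a rough size estimate. It reads "active weights" as the always-active tensors (attention, shared expert, dense layers), which is an interpretation. It also assumes, hypothetically, a ~1T-parameter model where routed experts hold about 97% of the parameters and are quantized to about 2.5 bits per weight. Those three numbers are placeholders, not the actual recipe. The 8.5 bits per weight for q8_0 follows from its block layout of 32 int8 values plus one fp16 scale.

```python
# q8_0 block: 32 int8 weights + one fp16 scale -> (32*8 + 16) / 32 bits per weight
Q8_0_BPW = (32 * 8 + 16) / 32

def mixed_quant_size_gb(total_params, routed_fraction, routed_bpw, other_bpw=Q8_0_BPW):
    """Estimated file size in decimal GB when routed experts and the
    remaining always-active tensors use different bit widths."""
    routed_bits = total_params * routed_fraction * routed_bpw
    other_bits = total_params * (1 - routed_fraction) * other_bpw
    return (routed_bits + other_bits) / 8 / 1e9

# Placeholder inputs (assumptions): 1T params, 97% in routed experts, 2.5 bpw experts.
size_gb = mixed_quant_size_gb(1e12, 0.97, 2.5)
print(f"estimated size: {size_gb:.0f} GB")
```

Because the routed experts dominate the parameter count in this sketch, keeping everything else at q8_0 adds only a few percent to the total while protecting the tensors that run on every token.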