Inconsistent results when using fp8?
vllm serve /share5/projects/llm/models/weight/Kimi-K2-Instruct-0905 \
--distributed-executor-backend ray \
--tensor-parallel-size 16 \
--host 0.0.0.0 --port 8080 \
--served-model-name kimi-k2-instruct-0905 \
--trust-remote-code \
--max-model-len 131072 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.95 \
--quantization fp8 \
--kv-cache-dtype fp8 \
--calculate-kv-scales \
--enable-auto-tool-choice \
--tool-call-parser kimi_k2
I served kimi-k2-instruct-0905 on 16 H100 GPUs. When I run inference against the endpoint, I get inconsistent results. Any clues? Is my hosting setup correct?
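One way to narrow this down is to relaunch the server with the fp8 options toggled one at a time and rerun the same prompt against each variant. The sketch below only rebuilds the command from this post. The helper name `build_command` and its two switches are hypothetical, and it assumes the server still starts when the fp8 flags are dropped, which is not verified here.

```python
import shlex

# Arguments copied from the launch command in this post, minus the fp8 options.
BASE_ARGS = [
    "vllm", "serve", "/share5/projects/llm/models/weight/Kimi-K2-Instruct-0905",
    "--distributed-executor-backend", "ray",
    "--tensor-parallel-size", "16",
    "--host", "0.0.0.0", "--port", "8080",
    "--served-model-name", "kimi-k2-instruct-0905",
    "--trust-remote-code",
    "--max-model-len", "131072",
    "--max-num-seqs", "4",
    "--gpu-memory-utilization", "0.95",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "kimi_k2",
]


def build_command(weight_fp8: bool = True, kv_cache_fp8: bool = True) -> str:
    """Return the launch command with the fp8 options switched on or off.

    Hypothetical helper for A/B testing; it does not start the server.
    """
    args = list(BASE_ARGS)
    if weight_fp8:
        args += ["--quantization", "fp8"]
    if kv_cache_fp8:
        # The two KV-cache flags are toggled together, as in the original command.
        args += ["--kv-cache-dtype", "fp8", "--calculate-kv-scales"]
    return shlex.join(args)


if __name__ == "__main__":
    # Variant to try first: keep weight quantization, drop the fp8 KV cache.
    print(build_command(kv_cache_fp8=False))
```

If the empty responses disappear in one variant, that points at the option that was removed. If they persist in all variants, the cause is probably elsewhere.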
The original prompt (11,701 tokens) consistently fails with kimi-k2:
5/5 attempts returned empty response
stop_reason: 163586 (appears to be an internal error code)
completion_tokens: 1 (only generates 1 token before stopping)
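As far as I understand vLLM's response fields, `stop_reason` carries the stop string or stop token ID that ended generation, so an integer such as 163586 may be a token ID rather than an error code. That would fit `completion_tokens: 1`, meaning the model emitted a stop token as its first token. The sketch below classifies a response on that assumption. The function name `summarize_choice` and the `sample` payload are hypothetical, and `sample` is built only from the values reported above, with `finish_reason` assumed to be `"stop"`.

```python
from typing import Any


def summarize_choice(response: dict[str, Any]) -> dict[str, Any]:
    """Summarize the first choice of an OpenAI-style chat completion response."""
    choice = response["choices"][0]
    content = (choice.get("message") or {}).get("content") or ""
    stop_reason = choice.get("stop_reason")
    return {
        "empty": content.strip() == "",
        "finish_reason": choice.get("finish_reason"),
        "stop_reason": stop_reason,
        # An integer stop_reason is treated here as a stop token ID (assumption).
        "stopped_on_token_id": isinstance(stop_reason, int),
        "completion_tokens": (response.get("usage") or {}).get("completion_tokens"),
    }


# Hypothetical payload mirroring the failing case described in the post.
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": ""},
            "finish_reason": "stop",  # assumed; not reported in the post
            "stop_reason": 163586,
        }
    ],
    "usage": {"prompt_tokens": 11701, "completion_tokens": 1},
}

if __name__ == "__main__":
    print(summarize_choice(sample))
```

Decoding ID 163586 with the model's tokenizer would confirm whether it is an end-of-turn token.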
Comparison:
| Prompt Type | Tokens | kimi-k2 | Claude | GPT-5 |
| Simple (same question) | 136 | ✅ Works | ✅ Works | ✅ Works |
| Original complex | 11,701 | ❌ Empty | ✅ Works | ✅ Works |
Investigation with prompt length:
======================================================================
FINDING KIMI-K2 TOKEN THRESHOLD (8K-15K Range)
✅ 1,299 prompt tokens | completion: 12 | The meeting lasts 35 minutes and 25 seconds.
✅ 2,499 prompt tokens | completion: 35 | The meeting lasts 35 minutes and 25 seconds, c
✅ 3,699 prompt tokens | completion: 22 | The meeting lasts 35 minutes and 25 seconds, i
✅ 4,899 prompt tokens | completion: 35 | The meeting lasts 35 minutes and 25 seconds, c
✅ 6,099 prompt tokens | completion: 22 | The meeting lasts 35 minutes and 25 seconds, i
✅ 7,299 prompt tokens | completion: 30 | The meeting lasts 35 minutes and 25 seconds, c
✅ 8,499 prompt tokens | completion: 24 | The meeting lasts 35 minutes and 25 seconds, for a
✅ 9,699 prompt tokens | completion: 34 | The meeting lasts 35 minutes and 25 seconds, c
======================================================================
BINARY SEARCH FOR KIMI-K2 TOKEN THRESHOLD
❌ 12,099 prompt tokens | completion: 1 | (empty)
❌ 14,099 prompt tokens | completion: 1 | (empty)
✅ 16,099 prompt tokens | completion: 12 | The meeting lasts **35 minutes and 25 se
❌ 18,099 prompt tokens | completion: 1 | (empty)
✅ 20,099 prompt tokens | completion: 12 | The meeting lasts **35 minutes and 25 se
❌ 22,099 prompt tokens | completion: 1 | (empty)
✅ 24,099 prompt tokens | completion: 21 | The meeting lasts **35 minutes and 25 se
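A binary search only makes sense if there is a single length above which every request fails. The two sweeps can be checked for that property directly. The sketch below uses the (prompt tokens, completion tokens) pairs copied from the logs above and treats a completion of at most one token as a failure, which is my reading of the "(empty)" rows rather than something the logs state.

```python
# (prompt_tokens, completion_tokens) pairs copied from the two sweeps in this post.
RESULTS = [
    (1299, 12), (2499, 35), (3699, 22), (4899, 35),
    (6099, 22), (7299, 30), (8499, 24), (9699, 34),
    (12099, 1), (14099, 1), (16099, 12), (18099, 1),
    (20099, 12), (22099, 1), (24099, 21),
]


def failed(completion_tokens: int) -> bool:
    """Treat a completion of at most one token as an empty response (assumption)."""
    return completion_tokens <= 1


def failing_lengths(results: list[tuple[int, int]]) -> list[int]:
    """Prompt lengths whose request came back empty, in ascending order."""
    return [p for p, c in sorted(results) if failed(c)]


def has_single_threshold(results: list[tuple[int, int]]) -> bool:
    """True if, ordered by prompt length, every success precedes every failure."""
    flags = [failed(c) for _, c in sorted(results)]
    return flags == sorted(flags)  # False sorts before True


if __name__ == "__main__":
    print("failing lengths:", failing_lengths(RESULTS))
    print("single threshold:", has_single_threshold(RESULTS))
```

On this data the failures alternate with successes above 12K tokens, so there is no single cutoff. That pattern suggests the problem depends on something other than raw prompt length alone, though these 15 samples cannot say what.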