Instructions to use moonshotai/Kimi-K2-Instruct-0905 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use moonshotai/Kimi-K2-Instruct-0905 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="moonshotai/Kimi-K2-Instruct-0905", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("moonshotai/Kimi-K2-Instruct-0905", trust_remote_code=True, device_map="auto")

Local Apps

vLLM

How to use moonshotai/Kimi-K2-Instruct-0905 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "moonshotai/Kimi-K2-Instruct-0905"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2-Instruct-0905",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
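The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server from the previous step is listening on `localhost:8000`; the helper names `build_chat_request` and `ask` are made up for this example and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="moonshotai/Kimi-K2-Instruct-0905"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's answer. Passing `base_url="http://localhost:30000"` would target a server on port 30000 instead.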

Use Docker

docker model run hf.co/moonshotai/Kimi-K2-Instruct-0905

SGLang

How to use moonshotai/Kimi-K2-Instruct-0905 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moonshotai/Kimi-K2-Instruct-0905" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2-Instruct-0905",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moonshotai/Kimi-K2-Instruct-0905" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2-Instruct-0905",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use moonshotai/Kimi-K2-Instruct-0905 with Docker Model Runner:
```
docker model run hf.co/moonshotai/Kimi-K2-Instruct-0905
```

Considering a distilled version of 80B parameters

by snapo - opened Sep 5, 2025

Discussion

snapo

Sep 5, 2025

edited Sep 5, 2025

Hi Moonshot,
Have you considered creating a 78B MoE distilled version of Kimi K2?
You might ask why 78B. It's pretty simple: a good Q4_0 quantisation brings this down to about 40 GB, which means that anyone with 2 x 20-24GB GPUs could run this model at home in Q4 without any trouble and at immense speed. Everything above 80GB cannot be run on two local GPUs.

The community has models in the 0-30B range for a single GPU, but unfortunately there is not a single model for dual GPUs.

The 120B GPT-OSS model doesn't fit on two 20GB GPUs, and neither does the GLM Air model. If they were both 78B parameters, dual-GPU users could benefit a lot from them in Q4.

So would you consider creating a flash version with approx. 80B parameters?
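For anyone who wants to check the arithmetic, here is a rough sketch of the weight-only memory estimate. It assumes Q4_0 costs about 4.5 bits per weight (32 four-bit weights plus one fp16 scale per block), uses the nominal parameter counts from the post, and ignores KV cache, activations, and runtime overhead, so real usage will be higher. The function names are invented for this example.

```python
GB = 1e9  # decimal gigabytes

# Assumption: Q4_0 stores 32 four-bit weights plus one fp16 scale per block,
# i.e. 18 bytes per 32 weights = 4.5 bits per weight.
Q4_0_BITS = 4.5


def weight_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GB."""
    return n_params * bits_per_weight / 8 / GB


def fits(n_params, bits_per_weight, n_gpus, gb_per_gpu):
    """True if the weights alone fit in the combined VRAM (no KV cache, no overhead)."""
    return weight_gb(n_params, bits_per_weight) <= n_gpus * gb_per_gpu


if __name__ == "__main__":
    # Nominal sizes taken from the post; real checkpoints may differ.
    for name, params in [("78B (proposed)", 78e9), ("120B (nominal)", 120e9)]:
        size = weight_gb(params, Q4_0_BITS)
        print(f"{name}: {size:.1f} GB of weights, "
              f"fits 2x24GB: {fits(params, Q4_0_BITS, 2, 24)}, "
              f"fits 2x20GB: {fits(params, Q4_0_BITS, 2, 20)}")
```

Under these assumptions a 78B model needs roughly 43.9 GB for weights, which leaves a little headroom on 2 x 24GB but is over budget on 2 x 20GB, while a nominal 120B model needs about 67.5 GB. At a flat 4 bits per weight the 78B figure would drop to 39 GB.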

JasonLee996

Sep 5, 2025

Have you ever thought about why GPT-OSS and GLM Air don't fit on 2 x 24GB GPUs? ^_^

davics

Sep 5, 2025

> Have you ever thought about why GPT-OSS and GLM Air don't fit on 2 x 24GB GPUs? ^_^

why why tell me why

snapo

Sep 5, 2025

> Have you ever thought about why GPT-OSS and GLM Air don't fit on 2 x 24GB GPUs? ^_^

I would love to know why. So that NVIDIA can sell more big-memory GPUs for 10k USD? It can't be because of the new Ryzen AI PCs or NVIDIA Sparks, since neither of them has the compute. That's why two GPUs would be much more interesting, but I'd really be interested in your opinion.

CHNtentes

Sep 6, 2025

> > Have you ever thought about why GPT-OSS and GLM Air don't fit on 2 x 24GB GPUs? ^_^
>
> why why tell me why

Because an MoE model under 100B is not powerful enough?

snapo

Sep 6, 2025

> > > Have you ever thought about why GPT-OSS and GLM Air don't fit on 2 x 24GB GPUs? ^_^
> >
> > why why tell me why
>
> Because an MoE model under 100B is not powerful enough?

Wrong: layer depth (number of layers) * active parameters == intelligence.
The additional parameters in an MoE are just for knowledge retrieval, which is why they need more of them. But training LLMs with very high layer counts is extremely difficult.

WIDTH (dimension) is what most labs scale, because it is much cheaper to train.
LAYERS are what they should scale.
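To make the total-vs-active distinction concrete, here is a rough parameter-counting sketch for a simplified MoE transformer. It only does the bookkeeping and says nothing about how capable a model is. It assumes standard attention with four `d_model x d_model` projections and plain two-matrix feed-forward experts, and it ignores embeddings, router weights, norms, and biases. All function names and configuration numbers are hypothetical.

```python
def attention_params(d_model):
    """Q, K, V and output projections, each d_model x d_model (no biases)."""
    return 4 * d_model * d_model


def expert_params(d_model, d_ff):
    """One feed-forward expert: an up projection and a down projection."""
    return 2 * d_model * d_ff


def moe_params(n_layers, d_model, d_ff, n_experts, top_k):
    """Return (total, active) parameter counts for the simplified model."""
    attn = attention_params(d_model)
    expert = expert_params(d_model, d_ff)
    total = n_layers * (attn + n_experts * expert)   # everything stored in memory
    active = n_layers * (attn + top_k * expert)      # what a single token actually uses
    return total, active


if __name__ == "__main__":
    # Hypothetical configurations, chosen only to show the scaling behaviour.
    configs = {
        "base":         dict(n_layers=24, d_model=1024, d_ff=4096, n_experts=8,  top_k=2),
        "more experts": dict(n_layers=24, d_model=1024, d_ff=4096, n_experts=16, top_k=2),
        "deeper":       dict(n_layers=48, d_model=1024, d_ff=4096, n_experts=8,  top_k=2),
        "wider":        dict(n_layers=24, d_model=2048, d_ff=8192, n_experts=8,  top_k=2),
    }
    for name, cfg in configs.items():
        total, active = moe_params(**cfg)
        print(f"{name:>12}: total={total / 1e9:.2f}B active={active / 1e9:.2f}B")
```

In this simplified count, adding experts grows the total but leaves the active count unchanged, doubling the depth doubles both, and doubling the width (with `d_ff` scaled along) quadruples both.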
