Instructions to use moonshotai/Kimi-K2-Instruct-0905 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use moonshotai/Kimi-K2-Instruct-0905 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="moonshotai/Kimi-K2-Instruct-0905", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("moonshotai/Kimi-K2-Instruct-0905", trust_remote_code=True, dtype="auto")

vLLM

How to use moonshotai/Kimi-K2-Instruct-0905 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "moonshotai/Kimi-K2-Instruct-0905"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2-Instruct-0905",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
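
For readers who would rather call the server from Python than from curl, here is a minimal sketch of the same request using only the standard library. The helper names `build_chat_request` and `send_chat` are hypothetical, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes the server answers with an OpenAI-style JSON body.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Same model and prompt as the curl example above.
request = build_chat_request(
    "http://localhost:8000",
    "moonshotai/Kimi-K2-Instruct-0905",
    "What is the capital of France?",
)
# print(send_chat(request))  # uncomment once the server is running
```

The same sketch should work against any server that exposes this endpoint; only the base URL (host and port) would change.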

Use Docker

docker model run hf.co/moonshotai/Kimi-K2-Instruct-0905

SGLang

How to use moonshotai/Kimi-K2-Instruct-0905 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moonshotai/Kimi-K2-Instruct-0905" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2-Instruct-0905",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moonshotai/Kimi-K2-Instruct-0905" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2-Instruct-0905",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

RuntimeError: No GPU or XPU found. A GPU or XPU is needed for FP8 quantization.

by nrowhani - opened Sep 10, 2025

Discussion

nrowhani

Sep 10, 2025

Hi, please help me with this error on my Windows 11 Pro machine (a 4090 and 64 GB of DDR5 RAM).
There are two messages below: the warning "You are using a model of type kimi..." and the error "...No GPU or XPU found...".

from transformers import AutoConfig, AutoModel
model_name = r"C:\mystuff\LLM\Models\Kimi-K2-Instruct-0905"  # raw string so backslashes are not read as escapes
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
You are using a model of type kimi_k2 to instantiate a model of type deepseek_v3. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\mystuff\LLM\Models.venv_py312\Lib\site-packages\transformers\models\auto\auto_factory.py", line 597, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\mystuff\LLM\Models.venv_py312\Lib\site-packages\transformers\modeling_utils.py", line 288, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\mystuff\LLM\Models.venv_py312\Lib\site-packages\transformers\modeling_utils.py", line 5008, in from_pretrained
    hf_quantizer, config, dtype, device_map = get_hf_quantizer(
                                              ^^^^^^^^^^^^^^^^^
  File "c:\mystuff\LLM\Models.venv_py312\Lib\site-packages\transformers\quantizers\auto.py", line 319, in get_hf_quantizer
    hf_quantizer.validate_environment(
  File "c:\mystuff\LLM\Models.venv_py312\Lib\site-packages\transformers\quantizers\quantizer_finegrained_fp8.py", line 48, in validate_environment
    raise RuntimeError("No GPU or XPU found. A GPU or XPU is needed for FP8 quantization.")
RuntimeError: No GPU or XPU found. A GPU or XPU is needed for FP8 quantization.

nrowhani

Sep 12, 2025

Solved:
PyTorch needs to be installed as a CUDA-enabled build.

Also, transformers is not directly supported in python.exe...
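
For anyone hitting the same error, a quick way to check whether the installed PyTorch is a CUDA build is sketched below. The helper name `diagnose_torch_cuda` is hypothetical. The assumption is that the FP8 check fails whenever PyTorch cannot see a GPU, which is what a CPU-only wheel would cause even with a 4090 installed.

```python
import importlib.util


def diagnose_torch_cuda():
    """Report whether PyTorch is installed and whether it can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment at all.
        return {"torch_installed": False, "cuda_build": None, "cuda_available": False}

    import torch

    return {
        "torch_installed": True,
        # CUDA version the wheel was built against; None for CPU-only wheels.
        "cuda_build": torch.version.cuda,
        # True only if a CUDA build is installed and a usable GPU/driver is found.
        "cuda_available": torch.cuda.is_available(),
    }


report = diagnose_torch_cuda()
print(report)
```

If `cuda_build` comes back as `None`, the wheel is CPU-only, and reinstalling PyTorch from the CUDA option on the official install page should be the fix described above.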

nrowhani changed discussion status to closed Sep 12, 2025
