Considering a distilled version of 80B parameters
Hi Moonshot,
Have you ever considered creating a distilled 78B MoE version of Kimi K2?
You might ask why 78B. It's pretty simple: a very good quantisation such as Q4_0 brings this down to about 40 GB, which means anyone with 2 x 20-24GB GPUs could run this model at home in Q4 without any trouble and at immense speed. Anything above 80GB cannot be run on two local GPUs.
The community has models in the 0-30B range for a single GPU, but for dual GPUs there is unfortunately not a single model.
The 120B GPT-OSS doesn't fit into two 20GB GPUs, and neither does the GLM Air model. If they were both 78B parameters, dual-GPU users could profit a lot from them in Q4.
So would you consider creating a flash version with approximately 80B parameters?
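The sizing argument in this post can be checked with a rough, weights-only estimate. The sketch below assumes the Q4_0 layout used by llama.cpp/GGUF (blocks of 32 weights stored as 16 bytes of 4-bit values plus one 2-byte scale, so 4.5 bits per weight) and takes the nominal parameter counts from the post at face value. The helper names `weight_gib` and `fits` are made up for illustration, and KV cache and runtime overhead are ignored, so real memory use would be higher.

```python
# Rough weight-only memory estimate for a quantised model.
# Assumption: Q4_0 stores each block of 32 weights as 16 bytes of 4-bit
# values plus one 2-byte fp16 scale, i.e. 4.5 bits per weight.
Q4_0_BITS_PER_WEIGHT = (16 + 2) * 8 / 32  # 4.5

GIB = 1024 ** 3


def weight_gib(n_params: float, bits_per_weight: float = Q4_0_BITS_PER_WEIGHT) -> float:
    """Size of the weights alone in GiB (no KV cache, no runtime overhead)."""
    return n_params * bits_per_weight / 8 / GIB


def fits(n_params: float, gpus: int, vram_gib_each: float) -> bool:
    """True if the weights alone fit in the combined VRAM of all GPUs."""
    return weight_gib(n_params) <= gpus * vram_gib_each


if __name__ == "__main__":
    # Nominal parameter counts as quoted in the post, not measured values.
    for name, n in [("78B", 78e9), ("120B", 120e9)]:
        print(f"{name}: {weight_gib(n):.1f} GiB at Q4_0, "
              f"fits 2x24 GiB: {fits(n, 2, 24)}")
```

Under these assumptions the 78B weights come to roughly 40.9 GiB, which fits in 2 x 24 GiB but sits right at the edge of 2 x 20 GiB before any context is allocated, while a nominal 120B model needs about 62.9 GiB.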
Have you ever thought about why GPT-OSS and GLM Air don't fit on 2 x 24GB GPUs? ^_^
Why, why? Tell me why.
I would love to know why. So that NVIDIA can sell more big-memory GPUs for 10k USD? It can't be because of the new Ryzen AI PCs or NVIDIA's Spark machines, since neither of them has the compute. That's why two GPUs would be much more interesting. But your opinion really would be interesting.
Because an MoE model under 100B is not powerful enough?
Wrong. Layer depth (how many layers) * active parameters == intelligence.
The additional parameters in an MoE are just for knowledge retrieval; that's why they need more of them. But training LLMs with ultra-high layer counts is extremely difficult.
Width (dimension) is what most labs scale, because it is much cheaper to train.
Layers are what they should scale.
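To make the depth-versus-width trade-off concrete, here is a minimal sketch of how parameter count scales with each. It assumes the common dense-transformer approximation of about 12 * d_model^2 weights per layer (roughly 4 * d^2 for attention plus 8 * d^2 for an MLP with hidden size 4 * d) and ignores embeddings; MoE models deviate from this, and the layer counts and dimensions below are arbitrary illustration values. It only shows parameter scaling and says nothing about training cost or model quality.

```python
# Approximate non-embedding parameter count of a dense transformer.
# Assumption: ~4*d^2 attention weights + ~8*d^2 MLP weights per layer,
# i.e. about 12*d^2 per layer. Real architectures differ in the details.
def dense_params(n_layers: int, d_model: int) -> int:
    return 12 * n_layers * d_model ** 2


# Hypothetical configurations chosen only for illustration.
base = dense_params(n_layers=32, d_model=4096)
deeper = dense_params(n_layers=64, d_model=4096)  # double the depth
wider = dense_params(n_layers=32, d_model=8192)   # double the width

print(deeper / base)  # 2.0: parameters grow linearly with depth
print(wider / base)   # 4.0: parameters grow quadratically with width
```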