# Kimi-K2-Thinking Deployment Guide

> [!Note]
> This guide only provides example deployment commands for Kimi-K2-Thinking, which may not be the optimal configuration. Since inference engines are still updated frequently, please continue to follow the guidance on their homepages if you want to achieve better inference performance.

> The kimi_k2 reasoning parser and other related features have been merged into vLLM/SGLang and will be available in the next release. For now, please use the nightly-build Docker image.

## vLLM Deployment

The smallest deployment unit for Kimi-K2-Thinking INT4 weights with 256k sequence length on the mainstream H200 platform is a cluster of 8 GPUs with Tensor Parallelism (TP).
Running parameters for this environment are provided below. For other parallelism strategies, please refer to updates of the official documents.

### Tensor Parallelism

Here is a sample launch command with TP=8:

```bash
vllm serve $MODEL_PATH \
  --served-model-name kimi-k2-thinking \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --max-num-batched-tokens 32768 \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2
```

**Key parameter notes:**
- `--enable-auto-tool-choice`: Required when enabling tool usage.
- `--tool-call-parser kimi_k2`: Required when enabling tool usage.
- `--reasoning-parser kimi_k2`: Required for correctly processing reasoning content.
- `--max-num-batched-tokens 32768`: Enables chunked prefill to reduce peak memory usage.

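Once the server is up, it exposes an OpenAI-compatible API. As a sketch (the endpoint URL assumes vLLM's default `localhost:8000` and the model name matches `--served-model-name` above; adjust both for your deployment), a chat-completions request can be built and sent like this:

```shell
# Build a chat-completions request payload for the deployed model.
cat > request.json <<'EOF'
{
  "model": "kimi-k2-thinking",
  "messages": [{"role": "user", "content": "Briefly explain tensor parallelism."}],
  "max_tokens": 512
}
EOF
# Sanity-check the payload is valid JSON before sending.
python3 -m json.tool request.json > /dev/null && echo "payload ok"
# Send it once the server from the launch command above is running:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d @request.json
```
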
## SGLang Deployment

Similarly, here is an example using TP in SGLang for deployment.

### Tensor Parallelism

Here is a simple example command to run TP8 on H200 in a single node:

```bash
python -m sglang.launch_server --model-path $MODEL_PATH --tp 8 --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2
```

**Key parameter notes:**
- `--tool-call-parser kimi_k2`: Required when enabling tool usage.
- `--reasoning-parser kimi_k2`: Required for correctly processing reasoning content.

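With the reasoning parser enabled, the server returns the model's thinking separately from the final answer. A sketch of pulling both fields out of a response follows; the JSON below is an illustrative, hand-written sample of the response shape (with a `reasoning_content` field alongside `content`), not real model output:

```shell
# Illustrative response shape when --reasoning-parser kimi_k2 is enabled:
# the assistant message carries reasoning_content next to the usual content.
cat > sample_response.json <<'EOF'
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "The user greets me; a short reply suffices.",
        "content": "Hello! How can I help you today?"
      }
    }
  ]
}
EOF
# Extract the two fields separately.
python3 - <<'EOF'
import json

msg = json.load(open("sample_response.json"))["choices"][0]["message"]
print("reasoning:", msg["reasoning_content"])
print("answer:", msg["content"])
EOF
```
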
## KTransformers Deployment

### KTransformers+SGLang Inference Deployment

Launch with KTransformers + SGLang for CPU+GPU heterogeneous inference:

```bash
python -m sglang.launch_server \
  --model path/to/Kimi-K2-Thinking/ \
  --kt-amx-weight-path path/to/Kimi-K2-Instruct-CPU-weight/ \
  --kt-cpuinfer 56 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 200 \
  --kt-amx-method AMXINT4 \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 37 \
  --max-total-tokens 37000 \
  --enable-mixed-chunk \
  --tensor-parallel-size 8 \
  --enable-p2p-check \
  --disable-shared-experts-fusion
```

This setup achieves 577.74 tokens/s prefill and 45.91 tokens/s decode (37-way concurrency) on 8× NVIDIA L20 + 2× Intel 6454S.

More details: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Kimi-K2-Thinking.md

### KTransformers+LLaMA-Factory Fine-tuning Deployment

You can use the commands below to run LoRA SFT with KTransformers + LLaMA-Factory.

```bash
# For LoRA SFT
USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
# For chatting with the model after LoRA SFT
llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
# For serving an API with the model after LoRA SFT
llamafactory-cli api examples/inference/kimik2_lora_sft_kt.yaml
```

This achieves an end-to-end LoRA SFT throughput of 46.55 tokens/s on 2× NVIDIA 4090 + Intel 8488C with 1.97 TB RAM and 200 GB swap memory.

For more details, refer to https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/SFT_Installation_Guide_KimiK2.md.

## Others

Kimi-K2-Thinking reuses the `DeepSeekV3CausalLM` architecture, converting its weights into the proper shape to save redevelopment effort. To let inference engines distinguish it from DeepSeek-V3 and apply the best optimizations, we set `"model_type": "kimi_k2"` in `config.json`.

If you are using a framework that is not on the recommended list, you can still run the model by manually changing `model_type` to `"deepseek_v3"` in `config.json` as a temporary workaround. You may need to manually parse tool calls if no tool-call parser is available in your framework.
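
The workaround above can be scripted. This is a minimal sketch: `MODEL_DIR` is an illustrative path, and the stub `config.json` is created only so the sketch runs stand-alone; with a real checkpoint, point `MODEL_DIR` at the download directory, where `config.json` already exists.

```shell
# Flip model_type in config.json so generic DeepSeek-V3 code paths load it.
MODEL_DIR=${MODEL_DIR:-./Kimi-K2-Thinking}
mkdir -p "$MODEL_DIR"
# Stub config only for stand-alone runs; a real checkpoint ships its own.
[ -f "$MODEL_DIR/config.json" ] || echo '{"model_type": "kimi_k2"}' > "$MODEL_DIR/config.json"
python3 - "$MODEL_DIR/config.json" <<'EOF'
import json, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)
cfg["model_type"] = "deepseek_v3"  # temporary workaround: fall back to the DeepSeek-V3 code path
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
print("model_type is now", cfg["model_type"])
EOF
```

Remember to revert this change once your framework ships native `kimi_k2` support, so the engine can apply Kimi-specific optimizations again.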