# Kimi-K2 INT4MIX model - FastllmEE

The Kimi-K2 INT4MIX model for Fastllm

https://github.com/ztxz16/fastllm

# Install

``` sh
pip install ftllm
```

# Download the model

``` sh
ftllm download fastllm/Kimi-K2-Instruct-INT4MIX
```

# Run the model

``` sh
# Assuming the model was downloaded to /root/Kimi-K2-Instruct-INT4MIX
ftllm run /root/Kimi-K2-Instruct-INT4MIX     # chat mode
ftllm server /root/Kimi-K2-Instruct-INT4MIX  # API server mode (default model name = /root/Kimi-K2-Instruct-INT4MIX, port = 8080)
```

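Once the server is up, you can exercise it over HTTP. A minimal sketch, assuming the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the default port 8080 (the endpoint path and the default model name are our assumptions, not confirmed above):

``` sh
# Hypothetical request body for the chat endpoint; the model name is
# assumed to default to the path the model was loaded from.
PAYLOAD='{"model": "/root/Kimi-K2-Instruct-INT4MIX", "messages": [{"role": "user", "content": "Hello"}]}'

# POST it to the server started above.
curl -s -H "Content-Type: application/json" -d "$PAYLOAD" \
     http://127.0.0.1:8080/v1/chat/completions \
  || echo "server not reachable - start ftllm server first"
```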
# Optimization

## Single CPU

If you are running on a single CPU, set the thread count with the -t parameter (generally CPU core count minus 2).

If the speed is extremely slow, you may have too many threads; try reducing the count.

For example:

``` sh
ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12
```

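The "core count minus 2" heuristic above can be computed rather than hard-coded. A small sketch (the variable names are ours, not part of ftllm); it prints the launch command rather than running it:

``` sh
# Pick (logical cores - 2) threads, with a floor of 1, per the rule of thumb above.
CORES=$(nproc)
THREADS=$(( CORES > 2 ? CORES - 2 : 1 ))
echo "ftllm server /root/Kimi-K2-Instruct-INT4MIX -t $THREADS"
```

Drop the `echo` to launch directly; if generation is still slow, lower `THREADS` further.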
## Multi-CPU (multiple NUMA nodes)

On a machine with multiple CPU sockets, enable the CUDA + NUMA heterogeneous acceleration mode.

Set the thread count with the FASTLLM_NUMA_THREADS environment variable (typically the number of cores per NUMA node minus 2).

If performance is extremely slow, you may have too many threads; try reducing the count.

For example:

``` sh
export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
```
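To pick a value for FASTLLM_NUMA_THREADS, first check the machine's NUMA layout. A rough sketch using lscpu's parseable output (variable names are ours; it applies the per-node "cores minus 2" rule of thumb from above):

``` sh
# Count NUMA nodes from lscpu, then set FASTLLM_NUMA_THREADS to
# (logical cores per node - 2), with a floor of 1.
NODES=$(lscpu -p=NODE | grep -v '^#' | sort -u | wc -l)
CORES=$(nproc)
PER_NODE=$(( NODES > 0 ? CORES / NODES : CORES ))
export FASTLLM_NUMA_THREADS=$(( PER_NODE > 2 ? PER_NODE - 2 : 1 ))
echo "FASTLLM_NUMA_THREADS=$FASTLLM_NUMA_THREADS"
```

`numactl --hardware` (a separate package on most distros) shows the same topology with per-node memory sizes.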