File size: 2,573 Bytes
df51278
d01d90b
86af9b5
d01d90b
 
 
 
 
 
 
 
 
 
 
 
 
 
1fdf72b
d01d90b
 
 
 
 
 
f3779cb
 
d01d90b
 
 
 
 
 
 
 
 
 
 
 
f3779cb
d01d90b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1fdf72b
d01d90b
 
 
 
 
 
f3779cb
 
d01d90b
 
 
 
 
 
 
 
 
 
 
 
f3779cb
d01d90b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
[English readme](#english)

# Kimi-K2 INT4MIX 模型 - Fastllm

Fastllm 的 Kimi-K2 INT4MIX 模型

https://github.com/ztxz16/fastllm

# 安装

``` sh
pip install ftllm
```

# 下载模型:

``` sh
ftllm download fastllm/Kimi-K2-Instruct-INT4MIX
```

# 运行模型

``` sh
# 假设模型下载在 /root/Kimi-K2-Instruct-INT4MIX
ftllm run /root/Kimi-K2-Instruct-INT4MIX # 聊天模式
ftllm server /root/Kimi-K2-Instruct-INT4MIX # API 服务器模式(默认模型名称 = /root/Kimi-K2-Instruct-INT4MIX,端口 = 8080)
```

# 优化

## 单 CPU
如果您使用的是单个 CPU,请使用 -t 参数设置线程数(通常设置为 CPU 核心数 - 2)。

如果速度非常慢,可能是由于线程过多——考虑减少线程数。

例如:

``` sh
ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12
```

## 多 CPU(多 NUMA 节点)

如果使用多路 CPU 的机器,您需要启用 CUDA + NUMA 异构加速模式。

使用环境变量 FASTLLM_NUMA_THREADS 设置线程数(通常设置为每个 NUMA 节点的核心数 - 2)。

如果性能非常慢,可能是由于线程过多——考虑减少线程数。

例如:

``` sh
export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
```

---

# English

Kimi-K2 INT4MIX model for Fastllm

https://github.com/ztxz16/fastllm

# install

``` sh
pip install ftllm
```

# download model:

``` sh
ftllm download fastllm/Kimi-K2-Instruct-INT4MIX
```

# run model

``` sh
# Assuming the model is downloaded in /root/Kimi-K2-Instruct-INT4MIX
ftllm run /root/Kimi-K2-Instruct-INT4MIX # chat
ftllm server /root/Kimi-K2-Instruct-INT4MIX # api server (default model_name = /root/Kimi-K2-Instruct-INT4MIX, port = 8080)
```

# optimize

## single CPU
If you are using a single CPU, set the number of threads with the -t parameter (generally set to CPU core count - 2). 

If the speed is extremely slow, it may be due to too many threads—consider reducing them.

for example:

``` sh
ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12
```

## multi cpu (multi numa node)

If using a multi-socket CPU machine, you need to enable CUDA + NUMA heterogeneous acceleration mode. 

Set the number of threads using the environment variable FASTLLM_NUMA_THREADS (typically set to the number of cores per NUMA node - 2). 

If performance is extremely slow, it may be due to excessive threads—consider reducing them.

for example: 

``` sh
export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
```