File size: 2,573 Bytes
df51278 d01d90b 86af9b5 d01d90b 1fdf72b d01d90b f3779cb d01d90b f3779cb d01d90b 1fdf72b d01d90b f3779cb d01d90b f3779cb d01d90b |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 |
[English readme](#english)
# Kimi-K2 INT4MIX 模型 - Fastllm
Fastllm 的 Kimi-K2 INT4MIX 模型
https://github.com/ztxz16/fastllm
# 安装
``` sh
pip install ftllm
```
# 下载模型:
``` sh
ftllm download fastllm/Kimi-K2-Instruct-INT4MIX
```
# 运行模型
``` sh
# 假设模型下载在 /root/Kimi-K2-Instruct-INT4MIX
ftllm run /root/Kimi-K2-Instruct-INT4MIX # 聊天模式
ftllm server /root/Kimi-K2-Instruct-INT4MIX # API 服务器模式(默认模型名称 = /root/Kimi-K2-Instruct-INT4MIX,端口 = 8080)
```
# 优化
## 单 CPU
如果您使用的是单个 CPU,请使用 -t 参数设置线程数(通常设置为 CPU 核心数 - 2)。
如果速度非常慢,可能是由于线程过多——考虑减少线程数。
例如:
``` sh
ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12
```
## 多 CPU(多 NUMA 节点)
如果使用多路 CPU 的机器,您需要启用 CUDA + NUMA 异构加速模式。
使用环境变量 FASTLLM_NUMA_THREADS 设置线程数(通常设置为每个 NUMA 节点的核心数 - 2)。
如果性能非常慢,可能是由于线程过多——考虑减少线程数。
例如:
``` sh
export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
```
---
# English
Kimi-K2 INT4MIX model for Fastllm
https://github.com/ztxz16/fastllm
# install
``` sh
pip install ftllm
```
# download model:
``` sh
ftllm download fastllm/Kimi-K2-Instruct-INT4MIX
```
# run model
``` sh
# Assuming the model is downloaded in /root/Kimi-K2-Instruct-INT4MIX
ftllm run /root/Kimi-K2-Instruct-INT4MIX # chat
ftllm server /root/Kimi-K2-Instruct-INT4MIX # api server (default model_name = /root/Kimi-K2-Instruct-INT4MIX, port = 8080)
```
# optimize
## single CPU
If you are using a single CPU, set the number of threads with the -t parameter (generally set to CPU core count - 2).
If the speed is extremely slow, it may be due to too many threads—consider reducing them.
for example:
``` sh
ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12
```
## multi cpu (multi numa node)
If using a multi-socket CPU machine, you need to enable CUDA + NUMA heterogeneous acceleration mode.
Set the number of threads using the environment variable FASTLLM_NUMA_THREADS (typically set to the number of cores per NUMA node - 2).
If performance is extremely slow, it may be due to excessive threads—consider reducing them.
for example:
``` sh
export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
```
|