| [English readme](#english) | |
| # Kimi-K2 INT4MIX 模型 - Fastllm | |
| Fastllm 的 Kimi-K2 INT4MIX 模型 | |
| https://github.com/ztxz16/fastllm | |
| # 安装 | |
| ``` sh | |
| pip install ftllm | |
| ``` | |
| # 下载模型: | |
| ``` sh | |
| ftllm download fastllm/Kimi-K2-Instruct-INT4MIX | |
| ``` | |
| # 运行模型 | |
| ``` sh | |
| # 假设模型下载在 /root/Kimi-K2-Instruct-INT4MIX | |
| ftllm run /root/Kimi-K2-Instruct-INT4MIX # 聊天模式 | |
| ftllm server /root/Kimi-K2-Instruct-INT4MIX # API 服务器模式(默认模型名称 = /root/Kimi-K2-Instruct-INT4MIX,端口 = 8080) | |
| ``` | |
| # 优化 | |
| ## 单 CPU | |
| 如果您使用的是单个 CPU,请使用 -t 参数设置线程数(通常设置为 CPU 核心数 - 2)。 | |
| 如果速度非常慢,可能是由于线程过多——考虑减少线程数。 | |
| 例如: | |
| ``` sh | |
| ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12 | |
| ``` | |
| ## 多 CPU(多 NUMA 节点) | |
| 如果使用多路 CPU 的机器,您需要启用 CUDA + NUMA 异构加速模式。 | |
| 使用环境变量 FASTLLM_NUMA_THREADS 设置线程数(通常设置为每个 NUMA 节点的核心数 - 2)。 | |
| 如果性能非常慢,可能是由于线程过多——考虑减少线程数。 | |
| 例如: | |
| ``` sh | |
| export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1 | |
| ``` | |
| --- | |
| # English | |
| Kimi-K2 INT4MIX model for Fastllm | |
| https://github.com/ztxz16/fastllm | |
| # install | |
| ``` sh | |
| pip install ftllm | |
| ``` | |
| # download model: | |
| ``` sh | |
| ftllm download fastllm/Kimi-K2-Instruct-INT4MIX | |
| ``` | |
| # run model | |
| ``` sh | |
| # Assuming the model is downloaded in /root/Kimi-K2-Instruct-INT4MIX | |
| ftllm run /root/Kimi-K2-Instruct-INT4MIX # chat | |
| ftllm server /root/Kimi-K2-Instruct-INT4MIX # api server (default model_name = /root/Kimi-K2-Instruct-INT4MIX, port = 8080) | |
| ``` | |
| # optimize | |
| ## single CPU | |
| If you are using a single CPU, set the number of threads with the -t parameter (generally set to CPU core count - 2). | |
| If the speed is extremely slow, it may be due to too many threads—consider reducing them. | |
| for example: | |
| ``` sh | |
| ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12 | |
| ``` | |
| ## multi cpu (multi numa node) | |
| If using a multi-socket CPU machine, you need to enable CUDA + NUMA heterogeneous acceleration mode. | |
| Set the number of threads using the environment variable FASTLLM_NUMA_THREADS (typically set to the number of cores per NUMA node - 2). | |
| If performance is extremely slow, it may be due to excessive threads—consider reducing them. | |
| for example: | |
| ``` sh | |
| export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1 | |
| ``` | |