[English readme](#english)

# Kimi-K2 INT4MIX Model - Fastllm

The Kimi-K2 INT4MIX model for Fastllm

https://github.com/ztxz16/fastllm

# Installation

``` sh
pip install ftllm
```

# Download the model

``` sh
ftllm download fastllm/Kimi-K2-Instruct-INT4MIX
```

# Run the model

``` sh
# Assuming the model is downloaded to /root/Kimi-K2-Instruct-INT4MIX
ftllm run /root/Kimi-K2-Instruct-INT4MIX     # chat mode
ftllm server /root/Kimi-K2-Instruct-INT4MIX  # API server mode (default model_name = /root/Kimi-K2-Instruct-INT4MIX, port = 8080)
```
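
The API server speaks HTTP, so it can be exercised with curl once it is running. The request below assumes the server exposes an OpenAI-compatible chat completions route on the default port 8080; the route and payload shape are assumptions, so adjust them if your version differs:

``` sh
# Hypothetical request payload; the model name defaults to the model path.
PAYLOAD='{"model": "/root/Kimi-K2-Instruct-INT4MIX", "messages": [{"role": "user", "content": "Hello"}]}'
echo "$PAYLOAD"
# With the server from the step above running:
#   curl http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$PAYLOAD"
```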

# Optimization

## Single CPU

If you are running on a single CPU, set the thread count with the -t parameter (usually CPU core count - 2).

If generation is extremely slow, the thread count may be too high; try reducing it.

For example:

``` sh
ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12
```
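
The "core count - 2" rule above can be derived on the machine rather than hard-coded. A minimal sketch, assuming a Linux host where `nproc` is available (the `ftllm` invocation is shown as a comment since it needs the model in place):

``` sh
# Derive a thread count of (logical core count - 2), with a floor of 1.
THREADS=$(( $(nproc) - 2 ))
if [ "$THREADS" -lt 1 ]; then
  THREADS=1
fi
echo "$THREADS"
# Then pass it to the server, e.g.:
#   ftllm server /root/Kimi-K2-Instruct-INT4MIX -t "$THREADS"
```

Note that `nproc` reports logical CPUs; on SMT machines you may prefer the physical core count.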

## Multiple CPUs (multiple NUMA nodes)

On a machine with multiple CPU sockets, enable the CUDA + NUMA heterogeneous acceleration mode.

Set the thread count with the environment variable FASTLLM_NUMA_THREADS (usually cores per NUMA node - 2).

If performance is extremely slow, the thread count may be too high; try reducing it.

For example:

``` sh
export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
```
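
Before choosing a value for FASTLLM_NUMA_THREADS, it helps to check how many NUMA nodes the machine actually has. A sketch assuming Linux with `lscpu` (from util-linux); cores per node is approximated here as total cores divided by node count:

``` sh
# Count NUMA nodes; fall back to 1 if lscpu is unavailable.
NODES=$(lscpu 2>/dev/null | awk '/NUMA node\(s\)/ {print $3}')
NODES=${NODES:-1}

# Approximate cores per node, then apply the "minus 2" rule with a floor of 1.
PER_NODE=$(( $(nproc) / NODES - 2 ))
if [ "$PER_NODE" -lt 1 ]; then
  PER_NODE=1
fi
echo "$PER_NODE"
# Then:
#   export FASTLLM_NUMA_THREADS="$PER_NODE"
#   ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
```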

---

# English

Kimi-K2 INT4MIX model for Fastllm

https://github.com/ztxz16/fastllm

# Install

``` sh
pip install ftllm
```

# Download model

``` sh
ftllm download fastllm/Kimi-K2-Instruct-INT4MIX
```

# Run model

``` sh
# Assuming the model is downloaded to /root/Kimi-K2-Instruct-INT4MIX
ftllm run /root/Kimi-K2-Instruct-INT4MIX     # chat
ftllm server /root/Kimi-K2-Instruct-INT4MIX  # API server (default model_name = /root/Kimi-K2-Instruct-INT4MIX, port = 8080)
```
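
The API server speaks HTTP, so it can be exercised with curl once it is running. The request below assumes the server exposes an OpenAI-compatible chat completions route on the default port 8080; the route and payload shape are assumptions, so adjust them if your version differs:

``` sh
# Hypothetical request payload; the model name defaults to the model path.
PAYLOAD='{"model": "/root/Kimi-K2-Instruct-INT4MIX", "messages": [{"role": "user", "content": "Hello"}]}'
echo "$PAYLOAD"
# With the server from the step above running:
#   curl http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$PAYLOAD"
```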

# Optimize

## Single CPU

If you are using a single CPU, set the number of threads with the -t parameter (generally CPU core count - 2).

If the speed is extremely slow, it may be due to too many threads; consider reducing them.

For example:

``` sh
ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12
```
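
The "core count - 2" rule above can be derived on the machine rather than hard-coded. A minimal sketch, assuming a Linux host where `nproc` is available (the `ftllm` invocation is shown as a comment since it needs the model in place):

``` sh
# Derive a thread count of (logical core count - 2), with a floor of 1.
THREADS=$(( $(nproc) - 2 ))
if [ "$THREADS" -lt 1 ]; then
  THREADS=1
fi
echo "$THREADS"
# Then pass it to the server, e.g.:
#   ftllm server /root/Kimi-K2-Instruct-INT4MIX -t "$THREADS"
```

Note that `nproc` reports logical CPUs; on SMT machines you may prefer the physical core count.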

## Multi CPU (multiple NUMA nodes)

If you are using a multi-socket CPU machine, you need to enable the CUDA + NUMA heterogeneous acceleration mode.

Set the number of threads with the environment variable FASTLLM_NUMA_THREADS (typically the number of cores per NUMA node - 2).

If performance is extremely slow, it may be due to too many threads; consider reducing them.

For example:

``` sh
export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
```
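
Before choosing a value for FASTLLM_NUMA_THREADS, it helps to check how many NUMA nodes the machine actually has. A sketch assuming Linux with `lscpu` (from util-linux); cores per node is approximated here as total cores divided by node count:

``` sh
# Count NUMA nodes; fall back to 1 if lscpu is unavailable.
NODES=$(lscpu 2>/dev/null | awk '/NUMA node\(s\)/ {print $3}')
NODES=${NODES:-1}

# Approximate cores per node, then apply the "minus 2" rule with a floor of 1.
PER_NODE=$(( $(nproc) / NODES - 2 ))
if [ "$PER_NODE" -lt 1 ]; then
  PER_NODE=1
fi
echo "$PER_NODE"
# Then:
#   export FASTLLM_NUMA_THREADS="$PER_NODE"
#   ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
```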