fastllm
/

Kimi-K2-Instruct-INT4MIX

Model card Files Files and versions

Kimi-K2-Instruct-INT4MIX / README.md

fastllm's picture

Update README.md

1fdf72b verified 6 months ago

|

history blame contribute delete

2.57 kB

	[English readme](#english)

	# Kimi-K2 INT4MIX 模型 - Fastllm

	Fastllm 的 Kimi-K2 INT4MIX 模型

	https://github.com/ztxz16/fastllm

	# 安装

	``` sh
	pip install ftllm
	```

	# 下载模型：

	``` sh
	ftllm download fastllm/Kimi-K2-Instruct-INT4MIX
	```

	# 运行模型

	``` sh
	# 假设模型下载在 /root/Kimi-K2-Instruct-INT4MIX
	ftllm run /root/Kimi-K2-Instruct-INT4MIX # 聊天模式
	ftllm server /root/Kimi-K2-Instruct-INT4MIX # API 服务器模式（默认模型名称 = /root/Kimi-K2-Instruct-INT4MIX，端口 = 8080）
	```

	# 优化

	## 单 CPU
	如果您使用的是单个 CPU，请使用 -t 参数设置线程数（通常设置为 CPU 核心数 - 2）。

	如果速度非常慢，可能是由于线程过多——考虑减少线程数。

	例如：

	``` sh
	ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12
	```

	## 多 CPU（多 NUMA 节点）

	如果使用多路 CPU 的机器，您需要启用 CUDA + NUMA 异构加速模式。

	使用环境变量 FASTLLM_NUMA_THREADS 设置线程数（通常设置为每个 NUMA 节点的核心数 - 2）。

	如果性能非常慢，可能是由于线程过多——考虑减少线程数。

	例如：

	``` sh
	export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
	```

	---

	# English

	Kimi-K2 INT4MIX model for Fastllm

	https://github.com/ztxz16/fastllm

	# install

	``` sh
	pip install ftllm
	```

	# download model:

	``` sh
	ftllm download fastllm/Kimi-K2-Instruct-INT4MIX
	```

	# run model

	``` sh
	# Assuming the model is downloaded in /root/Kimi-K2-Instruct-INT4MIX
	ftllm run /root/Kimi-K2-Instruct-INT4MIX # chat
	ftllm server /root/Kimi-K2-Instruct-INT4MIX # api server (default model_name = /root/Kimi-K2-Instruct-INT4MIX, port = 8080)
	```

	# optimize

	## single CPU
	If you are using a single CPU, set the number of threads with the -t parameter (generally set to CPU core count - 2).

	If the speed is extremely slow, it may be due to too many threads—consider reducing them.

	for example:

	``` sh
	ftllm server /root/Kimi-K2-Instruct-INT4MIX -t 12
	```

	## multi cpu (multi numa node)

	If using a multi-socket CPU machine, you need to enable CUDA + NUMA heterogeneous acceleration mode.

	Set the number of threads using the environment variable FASTLLM_NUMA_THREADS (typically set to the number of cores per NUMA node - 2).

	If performance is extremely slow, it may be due to excessive threads—consider reducing them.

	for example:

	``` sh
	export FASTLLM_NUMA_THREADS=12 && ftllm server /root/Kimi-K2-Instruct-INT4MIX --device cuda --moe_device numa -t 1
	```