ComfyUI Prebuilt Extension Pack (llama-cpp-python + SageAttention)

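📦 Contents

| Package | Version tag | What it provides |
| --- | --- | --- |
| llama-cpp-python | 0.3.31+cu130.basic | CUDA 13.0 accelerated inference |
| SageAttention | 2.2.0 (development build) | V1 + V2 + V3 (Blackwell) |

⚠️ Version note: SageAttention reports 2.2.0, which is the development-build version tag. This package is compiled from the GitHub development branch, already includes the V3 Blackwell optimizations, and is built for the RTX 50 series (SM120).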

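💻 System requirements

| Item | Requirement |
| --- | --- |
| Operating system | Windows 10/11 (64-bit) |
| Python | 3.13.x (bundled with ComfyUI portable) |
| CUDA | 13.0 to 13.3 |
| GPU | NVIDIA RTX 50 series (Blackwell) works best |
| PyTorch | 2.x (CUDA build) |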

🚀 Installation

Method 1: Manual copy (recommended)

```bash
# 1. Extract the ZIP archive.
# 2. Copy the folders listed below into ComfyUI's site-packages directory:
#
#      ComfyUI_windows_portable\python_embeded\Lib\site-packages\
#
#    Folders to copy:
#
#      llama_cpp/
#      llama_cpp_python-0.3.31.dist-info/
#      sageattention/
#      sageattention-2.2.0-py3.13.egg-info/
```

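The manual copy can also be scripted. The following is a minimal sketch, not part of the package: `copy_packages` is a hypothetical helper, and the folder names are taken from the list above. It assumes the ZIP has already been extracted and that the site-packages directory exists.

```python
import shutil
from pathlib import Path

# Folder names as listed in the manual-copy instructions.
PACKAGE_DIRS = [
    "llama_cpp",
    "llama_cpp_python-0.3.31.dist-info",
    "sageattention",
    "sageattention-2.2.0-py3.13.egg-info",
]


def copy_packages(extracted_dir, site_packages, names=PACKAGE_DIRS):
    """Copy each named folder from extracted_dir into site_packages.

    Returns the folder names that were copied. Names that are missing
    from extracted_dir are skipped instead of raising.
    """
    src_root = Path(extracted_dir)
    dst_root = Path(site_packages)
    if not dst_root.is_dir():
        raise FileNotFoundError(f"site-packages not found: {dst_root}")
    copied = []
    for name in names:
        src = src_root / name
        if not src.is_dir():
            continue
        # dirs_exist_ok=True lets a repeated run overwrite an earlier copy.
        shutil.copytree(src, dst_root / name, dirs_exist_ok=True)
        copied.append(name)
    return copied
```

A call might look like `copy_packages(r"C:\path\to\extracted", r"ComfyUI_windows_portable\python_embeded\Lib\site-packages")`, where the first path is a placeholder for wherever the archive was unpacked. Comparing the returned list with `PACKAGE_DIRS` shows whether any folder was missing.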
Method 2: Install with pip (if the package is distributed as a wheel)

```powershell
cd ComfyUI_windows_portable\python_embeded
.\python.exe -m pip install package_name.whl
```

✅ Verifying the installation

Run the following from the python_embeded directory:

```powershell
# 1. Check llama-cpp-python
.\python.exe -c "import llama_cpp; print('llama-cpp-python:', llama_cpp.__version__)"

# 2. Check the SageAttention install path
.\python.exe -c "import sageattention; print('SageAttention:', sageattention.__file__)"

# 3. Check the V3 core function
.\python.exe -c "from sageattention import sageattn; print('✅ V3 (sageattn) loaded successfully')"

# 4. Run V3 on real tensors (requires FP16/BF16)
.\python.exe -c "import torch; from sageattention import sageattn; q=torch.randn(1,8,128,64,dtype=torch.float16).cuda(); k=torch.randn(1,8,128,64,dtype=torch.float16).cuda(); v=torch.randn(1,8,128,64,dtype=torch.float16).cuda(); o=sageattn(q,k,v); print('Output shape:', o.shape)"
```

Expected output:

```text
llama-cpp-python: 0.3.31
SageAttention: ...\python_embeded\Lib\site-packages\sageattention\__init__.py
✅ V3 (sageattn) loaded successfully
Output shape: torch.Size([1, 8, 128, 64])
```

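As a lighter-weight alternative to the one-liners, the presence of both packages can be checked without importing them, which avoids loading CUDA libraries just to confirm the files are in place. This is a sketch using only the standard library; `check_package` is a hypothetical helper, and the distribution names passed to it are assumed to match the copied metadata folders.

```python
import importlib.util
from importlib import metadata


def check_package(module_name, dist_name=None):
    """Return (found, version, location) for a top-level module without importing it."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return (False, None, None)
    try:
        # Reads the version from the installed .dist-info / .egg-info metadata.
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        version = None
    return (True, version, spec.origin)


if __name__ == "__main__":
    targets = [("llama_cpp", "llama-cpp-python"), ("sageattention", "sageattention")]
    for module, dist in targets:
        found, version, location = check_package(module, dist)
        print(f"{module}: found={found} version={version} location={location}")
```

Save it as a script and run it with `.\python.exe` from the python_embeded directory so that it inspects the embedded interpreter's site-packages. This only confirms that the files and metadata are present; the runtime test in step 4 is still needed to confirm the compiled kernels load.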
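⚠️ Important notes

SageAttention V3 data type restrictions

V3 only supports `torch.float16` and `torch.bfloat16`.
If your inputs are FP32, convert them before the call:

```python
q_fp16 = q.to(torch.float16)
k_fp16 = k.to(torch.float16)
v_fp16 = v.to(torch.float16)
o = sageattn(q_fp16, k_fp16, v_fp16)
```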

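When the input dtype is not known in advance, the conversion can be wrapped in a small guard. The sketch below is an illustration only: `needs_cast` and `cast_for_sageattn` are hypothetical helpers, not part of SageAttention. The dtype is compared by name so the helper itself does not need to import PyTorch.

```python
SUPPORTED_DTYPES = ("float16", "bfloat16")


def needs_cast(dtype):
    """True if dtype is not FP16/BF16. Accepts a torch.dtype or its string form."""
    # str(torch.float32) is "torch.float32", so compare on the last component.
    return str(dtype).split(".")[-1] not in SUPPORTED_DTYPES


def cast_for_sageattn(q, k, v, target_dtype):
    """Return q, k, v in one supported dtype.

    Tensors are left untouched when all three already share FP16 or BF16;
    otherwise all three are cast to target_dtype (for example torch.float16).
    """
    dtypes = {str(t.dtype) for t in (q, k, v)}
    if len(dtypes) == 1 and not needs_cast(q.dtype):
        return q, k, v
    return tuple(t.to(target_dtype) for t in (q, k, v))
```

Typical use would be `q, k, v = cast_for_sageattn(q, k, v, torch.float16)` immediately before calling `sageattn(q, k, v)`. Casting all three together keeps the dtypes consistent when only one of them arrived as FP32.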
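V3 hardware requirements

V3 is optimized for the Blackwell architecture (SM120) and performs best on the RTX 50 series.
Other architectures automatically fall back to the V1/V2 implementations.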
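To see which architecture the local GPU reports, PyTorch's compute capability query can be used. This is a rough sketch: `sm_name` and `is_sm120` are hypothetical helpers, and the check only identifies the hardware. It does not show which SageAttention code path is selected at runtime.

```python
def sm_name(capability):
    """Format a (major, minor) compute capability as an SM label, e.g. (12, 0) -> 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_sm120(capability):
    """True for compute capability 12.0, the SM120 target named in this README."""
    return tuple(capability) == (12, 0)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this interpreter")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            print(torch.cuda.get_device_name(0), sm_name(cap), "SM120:", is_sm120(cap))
        else:
            print("No CUDA device is visible to PyTorch")
```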

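Version note

SageAttention reports 2.2.0, the development-build version tag; the V3 code is already included.
This package is compiled from the GitHub development branch without any modifications.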

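📄 Sources and licenses

| Project | Source repository | License |
| --- | --- | --- |
| llama-cpp-python | abetlen/llama-cpp-python | MIT |
| SageAttention | thu-ml/SageAttention | Apache 2.0 |

This package is compiled from the official sources without any modifications.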

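❓ FAQ

Q: Why does SageAttention report 2.2.0 instead of V3?

2.2.0 is the version number of the GitHub development branch. This package is compiled from that branch, so it includes the V3 code but keeps the base version tag.

Q: How can I confirm that V3 is really included?

Run `from sageattention import sageattn`. A successful import indicates that V3 is included.

Q: Can I use it on GPUs other than the RTX 50 series?

Yes. The V3 Blackwell optimizations only take effect on the RTX 50 series; other GPUs automatically fall back to V1/V2.

Q: Why doesn't V3 support FP32?

V3 is tuned specifically for FP16/BF16 and Tensor Cores to maximize performance. If you need FP32 support, consider using V1 (`attn_qk_int8_per_block` and related functions).

Q: What should I do about the error "Input tensors must be in dtype of torch.float16 or torch.bfloat16"?

Your input tensors are FP32. Convert them to FP16 or BF16 before the call: `q.to(torch.float16)`.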

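📞 Reporting problems

If you run into a problem, please include the following information:

Windows version (`winver`)
Python version (`.\python.exe --version`)
CUDA version (`nvcc --version`)
GPU model
Full error message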
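Most of these items can be gathered in one step. The script below is a sketch using only the standard library: `run_tool` and `collect_report` are hypothetical helpers, and it assumes `nvcc` and `nvidia-smi` are on `PATH` (entries come back as `None` when they are not).

```python
import platform
import shutil
import subprocess
import sys


def run_tool(args):
    """Run a command and return its stdout, or None if the tool is missing or fails."""
    if shutil.which(args[0]) is None:
        return None
    try:
        result = subprocess.run(args, capture_output=True, text=True, timeout=15)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_report():
    """Gather OS, Python, CUDA toolkit, and GPU details (None where unavailable)."""
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "cuda_toolkit": run_tool(["nvcc", "--version"]),
        "gpu": run_tool(["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]),
    }


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

Run it with `.\python.exe` from the python_embeded directory so the reported Python version is the one ComfyUI actually uses, then paste the output together with the full error message.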

🔄 Changelog

| Date | Version | Changes |
| --- | --- | --- |
| 2026-07-02 | v1.0 | Initial release: llama-cpp-python 0.3.31 + SageAttention 2.2.0 (V3) |

Thanks for using this package! 🚀
