Add ApplyRoPE and RMSNorm kernels written in OpenAI Triton
#8 by wangzihan99 - opened
This PR adds ApplyRoPE and RMSNorm kernels written in OpenAI Triton. Compared with the flash-attn kernels, they offer better performance, support a wider range of GPU architectures (including V100 and T4), and require no pre-compilation. They are enabled automatically if Triton is installed (it is usually bundled with PyTorch 2.x).
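For readers unfamiliar with the two operations, here is a minimal sketch of what RMSNorm and rotary position embedding (RoPE) compute, plus one way the "enabled automatically if Triton is installed" check could look. It is plain Python for illustration only, not the Triton kernels from this PR. All function names (`triton_available`, `select_backend`, `rms_norm_reference`, `apply_rope_reference`), the `eps` and `base` defaults, and the rotate-half RoPE layout are assumptions, not taken from the PR.

```python
import importlib.util
import math


def triton_available() -> bool:
    """Return True if a top-level `triton` package can be found (hypothetical helper)."""
    return importlib.util.find_spec("triton") is not None


def select_backend() -> str:
    """Pick a backend name; a real model would dispatch to actual kernels here."""
    return "triton" if triton_available() else "reference"


def rms_norm_reference(x, weight, eps=1e-6):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def apply_rope_reference(x, position, base=10000.0):
    """Rotate-half RoPE on one head vector of even length (assumed layout)."""
    dim = len(x)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        # Rotation angle for pair (i, i + half) at this token position.
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out
```

A fused GPU kernel would perform the same arithmetic per row or per head in a single pass; the reference versions above are mainly useful as a correctness check against whichever backend `select_backend()` reports.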
wangzihan99 changed pull request status to open
wangzihan99 changed pull request status to closed