triton-layernorm
A portable Triton LayerNorm kernel that runs on NVIDIA and AMD GPUs without modification.
Generated using the triton-kernels skill from huggingface/kernels.
Usage
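import torch
from kernels import get_kernel
# Fetch the kernel from the Hugging Face Hub.
layernorm_kernel = get_kernel("jaygala223/triton-layernorm", version=1)
# Input of shape (M, N) = (4096, 2048); weight and bias match the last dimension.
x = torch.randn(4096, 2048, device="cuda", dtype=torch.float32)
weight = torch.randn(2048, device="cuda", dtype=torch.float32)
bias = torch.randn(2048, device="cuda", dtype=torch.float32)
# Normalize each row of x, then apply the affine parameters.
y = layernorm_kernel.layernorm(x, weight=weight, bias=bias, eps=1e-5)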
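One way to sanity-check the output is to compare it against `torch.nn.functional.layer_norm`. The sketch below assumes the `layernorm(x, weight=..., bias=..., eps=...)` call signature shown in the usage snippet; the helper name `max_abs_error` is hypothetical and not part of the kernel.

```python
import torch
import torch.nn.functional as F


def max_abs_error(layernorm_fn, x, weight, bias, eps=1e-5):
    """Largest absolute difference between layernorm_fn and PyTorch's LayerNorm.

    layernorm_fn is any callable accepting (x, weight=..., bias=..., eps=...),
    which is the calling convention assumed for the kernel here.
    """
    # PyTorch reference, normalizing over the last dimension.
    expected = F.layer_norm(x, (x.shape[-1],), weight=weight, bias=bias, eps=eps)
    actual = layernorm_fn(x, weight=weight, bias=bias, eps=eps)
    # Compare in fp32 so that fp16/bf16 outputs are measured on a common scale.
    return (actual.float() - expected.float()).abs().max().item()
```

With the tensors from the usage snippet, `max_abs_error(layernorm_kernel.layernorm, x, weight, bias)` would be expected to return a small value for fp32 inputs; what counts as acceptable for fp16 or bf16 is a judgment call that depends on the application.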
Performance (V100, fp32, M=4096 rows, N = normalized dimension)
| N | Triton (GB/s) | PyTorch (GB/s) | Speedup |
|---|---|---|---|
| 256 | 497 | 365 | 1.36x |
| 1024 | 727 | 548 | 1.33x |
| 2048 | 771 | 534 | 1.44x |
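The Speedup column is the ratio of the two bandwidth columns. The sketch below reproduces that arithmetic and also shows one common way to derive an effective bandwidth figure. The byte count (one read of the input plus one write of the output) is an assumption; the README does not state how the GB/s numbers were measured, and both function names are hypothetical.

```python
def effective_bandwidth_gbs(m, n, elem_bytes, seconds):
    """Effective bandwidth in GB/s for an (m, n) LayerNorm.

    Assumption: counts one read of x and one write of y, ignoring the
    much smaller weight and bias vectors.
    """
    bytes_moved = 2 * m * n * elem_bytes
    return bytes_moved / seconds / 1e9


def speedup(triton_gbs, pytorch_gbs):
    """Ratio of the two bandwidth figures, as in the Speedup column."""
    return triton_gbs / pytorch_gbs


# Reproduce the Speedup column from the table's bandwidth values.
for n, triton, pytorch in [(256, 497, 365), (1024, 727, 548), (2048, 771, 534)]:
    print(n, f"{speedup(triton, pytorch):.2f}x")
```

Under this byte-count assumption, a hypothetical fp32 run with M=4096 and N=2048 that takes 100 microseconds would correspond to about 671 GB/s.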
Features
- Row-wise reduction with fp32 accumulation
- Handles optional weight and bias (affine parameters)
- Correct masking for non-power-of-2 dimensions
- Supports fp32, fp16, bf16 inputs
- 1.33x to 1.44x faster than PyTorch's LayerNorm on V100 (fp32, M=4096)
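The reduction, masking, and dtype bullets can be illustrated in plain PyTorch. The sketch below is not the kernel's source: it imitates what a Triton program with a power-of-2 block size would typically do, by padding each row to the next power of 2, masking the padded lanes out of the statistics, and accumulating in fp32 regardless of the input dtype. The names `layernorm_reference` and `block` are hypothetical.

```python
import torch


def layernorm_reference(x, weight=None, bias=None, eps=1e-5):
    """Row-wise LayerNorm over the last dimension with fp32 accumulation."""
    n = x.shape[-1]
    block = 1 << (n - 1).bit_length()  # smallest power of 2 that is >= n
    mask = torch.arange(block, device=x.device) < n  # True for real columns

    # Load the row into a zero-padded fp32 buffer (a masked load with other=0).
    xf = torch.zeros(*x.shape[:-1], block, dtype=torch.float32, device=x.device)
    xf[..., :n] = x.float()

    # Padded lanes are zero, so they do not affect the sum; divide by n, not block.
    mean = xf.sum(dim=-1, keepdim=True) / n
    # Padded lanes would contribute (0 - mean)^2 to the variance, so mask them.
    diff = torch.where(mask, xf - mean, torch.zeros_like(xf))
    var = (diff * diff).sum(dim=-1, keepdim=True) / n

    y = (diff * torch.rsqrt(var + eps))[..., :n]  # drop the padded lanes
    if weight is not None:  # affine parameters are optional
        y = y * weight.float()
    if bias is not None:
        y = y + bias.float()
    return y.to(x.dtype)  # cast back to the input dtype
```

Calling `layernorm_reference(torch.randn(4, 100))` exercises the non-power-of-2 path, since a row of 100 elements is padded to a block of 128.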
kernels
apache-2.0