triton-layernorm

A portable Triton LayerNorm kernel that runs on NVIDIA and AMD GPUs without modification.

Generated using the triton-kernels skill from huggingface/kernels.

Usage

import torch
from kernels import get_kernel

# Download the kernel from the Hugging Face Hub.
layernorm_kernel = get_kernel("jaygala223/triton-layernorm", version=1)

# Input of shape (M, N) = (4096, 2048); normalization runs over each row.
x = torch.randn(4096, 2048, device="cuda", dtype=torch.float32)
weight = torch.randn(2048, device="cuda", dtype=torch.float32)
bias = torch.randn(2048, device="cuda", dtype=torch.float32)

y = layernorm_kernel.layernorm(x, weight=weight, bias=bias, eps=1e-5)

Performance (V100, fp32, M=4096)

N	Triton (GB/s)	PyTorch (GB/s)	Speedup
256	497	365	1.36x
1024	727	548	1.33x
2048	771	534	1.44x
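
The card does not say how the GB/s figures were measured. A common convention for memory-bound kernels, assumed here, is to count one read of the input and one write of the output and divide by the elapsed time. The sketch below shows that calculation and recomputes the Speedup column from the two bandwidth columns. The helper names `effective_bandwidth_gbs` and `speedup` are illustrative and are not part of the kernel's API.

```python
def effective_bandwidth_gbs(m, n, bytes_per_elem, seconds):
    """Effective bandwidth in GB/s.

    Assumption: traffic is one read of x plus one write of y (2 * M * N
    elements); the much smaller weight and bias vectors are ignored.
    """
    bytes_moved = 2 * m * n * bytes_per_elem
    return bytes_moved / seconds / 1e9


def speedup(triton_gbs, pytorch_gbs):
    """Ratio of the two bandwidth figures for the same problem size."""
    return triton_gbs / pytorch_gbs


# (N, Triton GB/s, PyTorch GB/s) taken from the table above.
rows = [(256, 497, 365), (1024, 727, 548), (2048, 771, 534)]
for n, triton_gbs, pytorch_gbs in rows:
    print(f"N={n}: {speedup(triton_gbs, pytorch_gbs):.2f}x")
```

Under this assumption, the fp32 problem with M=4096 and N=2048 moves 2 * 4096 * 2048 * 4 bytes, about 67 MB, per call.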

Features

Row-wise reduction with fp32 accumulation
Handles optional weight and bias (affine parameters)
Correct masking for non-power-of-2 dimensions
Supports fp32, fp16, bf16 inputs
1.33x to 1.44x faster than PyTorch's LayerNorm on V100 (fp32, M=4096; see the table above)
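
To illustrate why masking matters for non-power-of-2 dimensions, here is a minimal pure-Python sketch of one row of LayerNorm as a block-based kernel might compute it. It is not the kernel's source: the function names are hypothetical, and it assumes the block is padded to the next power of two. Python floats are 64-bit, so the fp32 accumulation itself is not modeled here, only the masking logic.

```python
import math


def next_power_of_2(n):
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def layernorm_row(row, weight=None, bias=None, eps=1e-5):
    """Normalize one row using a padded, masked block (illustrative only)."""
    n = len(row)
    block = next_power_of_2(n)
    mask = [i < n for i in range(block)]

    # Masked load: lanes past the end of the row read as 0.0.
    x = [float(row[i]) if mask[i] else 0.0 for i in range(block)]

    # Divide by the true row length, not the padded block size.
    mean = sum(x) / n

    # Re-apply the mask: otherwise padded lanes would hold -mean
    # and inflate the variance.
    centered = [x[i] - mean if mask[i] else 0.0 for i in range(block)]
    var = sum(c * c for c in centered) / n
    rstd = 1.0 / math.sqrt(var + eps)

    out = []
    for i in range(n):  # masked store: only the first n lanes are written
        y = centered[i] * rstd
        if weight is not None:  # optional affine scale
            y *= weight[i]
        if bias is not None:  # optional affine shift
            y += bias[i]
        out.append(y)
    return out


print(layernorm_row([1.0, 2.0, 3.0]))
```

A row of length 3 is padded to a block of 4 in this sketch. The second mask, applied after subtracting the mean, is the step that is easiest to forget.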

apache-2.0