triton-layernorm

A portable Triton LayerNorm kernel that runs on NVIDIA and AMD GPUs without modification.

Generated using the triton-kernels skill from huggingface/kernels.

Usage

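import torch
from kernels import get_kernel

# Download and load the kernel from the Hugging Face Hub.
layernorm_kernel = get_kernel("jaygala223/triton-layernorm", version=1)

# Input of shape (M, N); weight and bias have shape (N,).
x = torch.randn(4096, 2048, device="cuda", dtype=torch.float32)
weight = torch.randn(2048, device="cuda", dtype=torch.float32)
bias = torch.randn(2048, device="cuda", dtype=torch.float32)

# Normalize each row of x over its last dimension, then apply weight and bias.
y = layernorm_kernel.layernorm(x, weight=weight, bias=bias, eps=1e-5)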

Performance (V100, fp32, M=4096)

N       Triton (GB/s)   PyTorch (GB/s)   Speedup
256     497             365              1.36x
1024    727             548              1.33x
2048    771             534              1.44x
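
The README does not say how the GB/s figures were derived. A common convention for memory-bound kernels such as LayerNorm is to count one read of the input and one write of the output and divide by the measured time. The sketch below assumes that convention; the helper names `effective_bandwidth_gbs` and `speedup` are hypothetical and not part of the kernel's API. It also recomputes the Speedup column from the two throughput columns.

```python
def effective_bandwidth_gbs(m, n, elapsed_s, bytes_per_element=4):
    """Estimate effective memory bandwidth in GB/s for an (m, n) LayerNorm.

    Assumption (not stated in the README): traffic is one read of x plus
    one write of y; weight and bias traffic is ignored as negligible.
    """
    bytes_moved = 2 * m * n * bytes_per_element
    return bytes_moved / elapsed_s / 1e9


def speedup(triton_gbs, pytorch_gbs):
    """Ratio of the two throughputs, rounded to two decimals like the table."""
    return round(triton_gbs / pytorch_gbs, 2)


# Recompute the Speedup column from the GB/s columns of the table.
for n, triton_gbs, pytorch_gbs in [(256, 497, 365), (1024, 727, 548), (2048, 771, 534)]:
    print(n, f"{speedup(triton_gbs, pytorch_gbs):.2f}x")
```

The recomputed ratios agree with the Speedup column (1.36x, 1.33x, 1.44x), so the column is the ratio of the two throughputs.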

Features

  • Row-wise reduction with fp32 accumulation
  • Handles optional weight and bias (affine parameters)
  • Correct masking for non-power-of-2 dimensions
  • Supports fp32, fp16, bf16 inputs
  • 1.33x to 1.44x faster than PyTorch's LayerNorm on V100 (fp32, M=4096; see the Performance table)
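
To make the row-wise reduction and masking points concrete, here is a minimal pure-Python sketch of what one row's computation might look like when the block size is rounded up to a power of two. It is an illustration under assumptions, not the kernel's actual source: the names `next_power_of_2`, `layernorm_row`, and `layernorm` are hypothetical, and Python's double-precision floats stand in for the fp32 accumulator.

```python
import math


def next_power_of_2(n):
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def layernorm_row(row, weight=None, bias=None, eps=1e-5):
    """Normalize one row, mimicking a masked block-wide reduction."""
    n = len(row)
    block = next_power_of_2(n)  # hypothetical block size covering the row
    mask = [i < n for i in range(block)]
    # Out-of-range lanes load 0.0 so they do not disturb the sum.
    x = [row[i] if mask[i] else 0.0 for i in range(block)]

    mean = sum(x) / n  # divide by n, not by the padded block size
    # Masked lanes must be zeroed again after centering; otherwise each
    # would contribute (0 - mean)**2 to the variance.
    centered = [x[i] - mean if mask[i] else 0.0 for i in range(block)]
    var = sum(c * c for c in centered) / n
    rstd = 1.0 / math.sqrt(var + eps)

    out = []
    for i in range(n):  # store only the valid lanes
        y = centered[i] * rstd
        if weight is not None:  # affine parameters are optional
            y *= weight[i]
        if bias is not None:
            y += bias[i]
        out.append(y)
    return out


def layernorm(x, weight=None, bias=None, eps=1e-5):
    """Apply layernorm_row independently to every row of a 2D list."""
    return [layernorm_row(r, weight, bias, eps) for r in x]
```

The step most easily missed is the second mask after centering: for a row of length 3 padded to 4 lanes, skipping it would add an extra `mean**2` term to the variance.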
License: apache-2.0