Kernels

PyTorch operations are general-purpose. Hardware vendors and the community create specialized implementations that run faster on specific platforms. Installing these optimized kernels is a challenge because it requires matching compiler versions, CUDA toolkits, and platform-specific builds.

Platform               Supported devices
NVIDIA GPUs (CUDA)     Modern architectures with compute capability 7.0+ (Volta, Turing, Ampere, Hopper, Blackwell)
AMD GPUs (ROCm)        Compatible with ROCm-supported devices
Apple Silicon (Metal)  M-series chips (M1, M2, M3, M4 and newer)
Intel GPUs (XPU)       Intel Data Center GPU Max Series and compatible devices

Kernels solves this by distributing precompiled binaries through the Hub. It detects your platform at runtime and loads the right binary automatically.
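To make the idea of runtime platform detection concrete, here is a minimal sketch that maps the current PyTorch install to one of the platforms listed above. The function name detect_backend and its return strings are hypothetical, and this is not the detection logic Kernels itself uses; it only illustrates the kind of check involved. It returns "cpu" when PyTorch is not installed or no accelerator is found.

```python
def detect_backend() -> str:
    """Return a coarse label for the accelerator PyTorch can see (illustrative only)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe.
        return "cpu"

    if torch.cuda.is_available():
        # ROCm builds of PyTorch also report through torch.cuda;
        # torch.version.hip is set on those builds and None on CUDA builds.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"

    # Guard with getattr because older PyTorch builds may lack these modules.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "metal"

    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    return "cpu"


print(detect_backend())
```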

When use_kernels=True is set, Transformers identifies layers with available optimized kernel implementations. To reduce startup time, it downloads and caches kernels from the Hub only when they are needed. Kernels accelerate compute-intensive operations such as attention, normalization, and fused operations.
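A minimal sketch of enabling kernels at load time, assuming use_kernels is passed as a keyword argument to from_pretrained and that the kernels package is installed alongside Transformers. The helper name load_with_kernels is hypothetical, and the model identifier is left to the caller.

```python
def load_with_kernels(model_id: str):
    """Load a causal language model with Hub kernels enabled (illustrative helper)."""
    # Imported lazily so that defining this helper does not require Transformers.
    from transformers import AutoModelForCausalLM

    # Assumption: use_kernels is accepted by from_pretrained in the installed version.
    return AutoModelForCausalLM.from_pretrained(model_id, use_kernels=True)


# Usage (downloads the model and any matching kernels on first call):
# model = load_with_kernels("<your-model-id>")
```

Layers without a matching kernel should keep using their standard PyTorch implementation, so the call is expected to work even when only some layers are accelerated.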

Not all operations have kernel implementations. The library falls back to standard PyTorch when no kernel is available.

Determinism

Some kernels produce slightly different results than PyTorch because they reorder operations or use different accumulation strategies. The results are functionally equivalent, but the differences affect bit-for-bit reproducibility.
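The effect of operation reordering can be seen without any kernel at all, because floating-point addition is not associative. The sketch below uses plain Python floats as a stand-in for what happens inside a fused or reordered reduction; it does not reproduce any specific kernel.

```python
import math

# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The results differ in the last bits...
print(left_to_right == right_to_left)  # False

# ...but agree within a tight tolerance.
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-12))  # True
```

This is why comparisons between a kernel and the PyTorch reference are usually done with a tolerance rather than exact equality.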

For deterministic behavior, try the following.

  • Check kernel repository documentation for determinism guarantees. For example, the SDPA kernel in gpt-oss-metal-kernels matches the PyTorch implementation 97% of the time.
  • Disable specific kernels that affect your use case.
  • Set random seeds and PyTorch deterministic flags.
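The last item above can be sketched as follows. The helper name seed_everything is hypothetical. The PyTorch calls apply only when PyTorch is installed, and they govern PyTorch's own operations; whether a kernel loaded from the Hub honors them depends on that kernel.

```python
import random

try:
    import torch
except ImportError:  # keep the sketch usable without PyTorch
    torch = None


def seed_everything(seed: int) -> None:
    """Seed Python's RNG and, if available, PyTorch; request deterministic algorithms."""
    random.seed(seed)
    if torch is not None:
        # Seeds the CPU generator and all CUDA devices.
        torch.manual_seed(seed)
        # Ask PyTorch to use deterministic implementations where they exist.
        # On CUDA, some operations may additionally require the
        # CUBLAS_WORKSPACE_CONFIG environment variable to be set.
        torch.use_deterministic_algorithms(True)


seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
print(first == second)  # True: the same seed gives the same draw
```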

