Buckets:

hf-doc-build/doc-dev / kernels /pr_321 /en /builder /writing-kernels.md
rtrm's picture
|
download
raw
12.3 kB
# Writing Hub kernels with kernel-builder
## Introduction
The Kernel Hub allows Python libraries and applications to load compute
kernels directly from the [Hub](https://hf.co/). To support this kind
of dynamic loading, Hub kernels differ from traditional Python kernel
packages in that they are made to be:
- Portable: a kernel can be loaded from paths outside `PYTHONPATH`.
- Unique: multiple versions of the same kernel can be loaded in the
same Python process.
- Compatible: kernels must support all recent versions of Python and
the different PyTorch build configurations (various CUDA versions
and C++ ABIs). Furthermore, older C library versions must be supported.
`kernel-builder` is a set of tools that can build conforming kernels. It
takes care of:
- Building kernels for all supported PyTorch configurations (C++98/11 and
different CUDA versions).
- Compatibility with old glibc and libstdc++ versions, so that kernels also
work on older Linux distributions.
- Registering Torch ops, such that multiple versions the same kernel can be
loaded without namespace conflicts.
`kernel-builder` builds are configured through a `build.toml` file.
`build.toml` is a simple format that does not require intricate knowledge
of CMake or setuptools.
This page describes the directory layout of a kernel-builder project, the
format of the `build.toml` file, and some additional Python glue that
`kernel-builder` provides. We will use a [simple ReLU kernel](https://github.com/huggingface/kernels/tree/main/examples/kernels/relu)
as the running example. After reading this page, you may also want to have
a look at the more realistic [ReLU kernel with backprop and `torch.compile`](https://github.com/huggingface/kernels/tree/main/examples/kernels/relu-backprop-compile)
support.
## Setting up environment
In the [`terraform`](https://github.com/huggingface/kernels/tree/main/terraform) directory, we provide an
example of programatically spinning up an EC2 instance that is ready
with everything needed for you to start developing and building
kernels.
If you use a different provider, the Terraform bridges should be
similar and straightforward to modify.
## Kernel project layout
Kernel projects follow this general directory layout:
```text
relu
├── build.toml
├── relu_kernel
│ └── relu.cu
└── torch-ext
└── torch_binding.cpp
└── torch_binding.h
└── relu
└── __init__.py
└── tests
└── __init__.py
└── test_relu.py
```
In this example we can find:
- The build configuration in `build.toml`.
- One or more top-level directories containing kernels (`relu_kernel`).
- The `torch-ext` directory, which contains:
- `torch_binding.h`: contains declarations for kernel entry points
(from `kernel_a` and `kernel_b`).
- `torch_binding.cpp`: registers the entry points as Torch ops.
- `torch_ext/relu`: contains any Python wrapping the kernel needs. At the
bare minimum, it should contain an `__init__.py` file.
- Kernel tests in the directory `tests`.
## `build.toml`
`build.toml` tells `kernel-builder` what to build and how. It looks as
follows for the `relu` kernel:
```toml
[general]
name = "relu"
[torch]
src = [
"torch-ext/torch_binding.cpp",
"torch-ext/torch_binding.h"
]
[kernel.relu]
backend = "cuda"
src = [
"relu_kernel/relu.cu",
]
depends = [ "torch" ]
# If the kernel is only supported on specific capabilities, set the
# cuda-capabilities option:
#
# cuda-capabilities = [ "9.0", "10.0", "12.0" ]
```
### `general`
- `name` (required): the name of the kernel. The Python code for a Torch
extension must be stored in `torch-ext/`.
- `version` (int, **experimental**): the major version of the kernel.
The version is written to the kernel's `metadata.json` and is used
by the `kernels upload` command to upload the kernel to a version
branch named `v`.
- `backends` (required): a list of supported backends. Must be one or
more of `cpu`, `cuda`, `metal`, `rocm`, or `xpu`.
- `python-depends` (**experimental**): a list of additional Python dependencies
that the kernel requires. The only supported dependencies are `einops`
and `nvidia-cutlass-dsl`.
### `general.cuda`
- `maxver`: the maximum CUDA toolkit version (inclusive). This option
_must not_ be set under normal circumstances, since it can exclude Torch
build variants that are [required for compliant kernels](../kernel-requirements).
This option is provided for kernels that cause compiler errors on
newer CUDA toolkit versions.
- `minver`: the minimum required CUDA toolkit version. This option
_must not_ be set under normal circumstances, since it can exclude Torch
build variants that are [required for compliant kernels](../kernel-requirements).
This option is provided for kernels that require functionality only
provided by newer CUDA toolkits.
### `torch`
This section describes the Torch extension. In the future, there may be
similar sections for other frameworks. This section has the following
options:
- `src` (required): a list of source files and headers.
- `pyext` (optional): the list of extensions for Python files. Default:
`["py", "pyi"]`.
- `include` (optional): include directories relative to the project root.
Default: `[]`.
- `maxver` (optional): only build for this Torch version and earlier. Use cautiously, since this option produces
non-compliant kernels if the version range does not correspond to the [required variants](build-variants).
- `minver` (optional): only build for this Torch version and later. Use cautiously, since this option produces
non-compliant kernels if the version range does not correspond to the [required variants](build-variants).
### `kernel.`
Specification of a kernel with the name ``. Multiple `kernel.`
sections can be defined in the same `build.toml`.
See for example [`kernels-community/quantization`](https://huggingface.co/kernels-community/quantization/)
for an example with multiple kernel sections.
The following options can be set for a kernel:
- `backend` (required): the compute backend of the kernel. The currently
supported backends are `cpu`, `cuda`, `metal`, `rocm`, and `xpu`.
**The `cpu` backend is currently experimental and might still change.**
- `depends` (required): a list of dependencies. The supported dependencies
are listed in [`deps.nix`](https://github.com/huggingface/kernels/blob/main/builder/lib/deps.nix).
- `src` (required): a list of source files and headers.
- `include` (optional): include directories relative to the project root.
Default: `[]`.
Besides these shared options, the following backend-specific options
are available:
#### cuda
- `cuda-capabilities` (optional): a list of CUDA capabilities that the
kernel should be compiled for. When absent, the kernel will be built
using all capabilities that the builder supports. The effective
capabilities are the intersection of this list and the capabilities
supported by the CUDA compiler. It is recommended to leave this option
unspecified **unless** a kernel requires specific capabilities.
- `cuda_flags` (optional): additional flags to be passed to `nvcc`.
**Warning**: this option should only be used in exceptional circumstances.
Custom compile flags can interfere with the build process or break
compatibility requirements.
#### rocm
- `rocm-archs`: a list of ROCm architectures that the kernel should be
compiled for.
#### xpu
- `sycl_flags`: a list of additional flags to be passed to the SYCL
compiler.
## Torch bindings
### Defining bindings
Torch bindings are defined in C++, kernels commonly use two files:
- `torch_binding.h` containing function declarations.
- `torch_binding.cpp` registering the functions as Torch ops.
For instance, the `relu` kernel has the following declaration in
`torch_binding.h`:
```cpp
#pragma once
#include
void relu(torch::Tensor &out, torch::Tensor const &input);
```
This is a declaration for the actual kernel, which is in `relu_kernel/relu.cu`:
```cpp
#include
#include
#include
#include
__global__ void relu_kernel(float *__restrict__ out,
float const *__restrict__ input,
const int d) {
const int64_t token_idx = blockIdx.x;
for (int64_t idx = threadIdx.x; idx 0.0f ? x : 0.0f;
}
}
void relu(torch::Tensor &out,
torch::Tensor const &input)
{
TORCH_CHECK(input.scalar_type() == at::ScalarType::Float &&
input.scalar_type() == at::ScalarType::Float,
"relu_kernel only supports float32");
int d = input.size(-1);
int64_t num_tokens = input.numel() / d;
dim3 grid(num_tokens);
dim3 block(std::min(d, 1024));
const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
relu_kernel>>(out.data_ptr(),
input.data_ptr(), d);
}
```
This function is then registered as a Torch op in `torch_binding.cpp`:
```cpp
#include
#include "registration.h"
#include "torch_binding.h"
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.def("relu(Tensor! out, Tensor input) -> ()");
ops.impl("relu", torch::kCUDA, &relu);
}
REGISTER_EXTENSION(TORCH_EXTENSION_NAME)
```
This snippet uses macros from `registration.h` to register the function.
`registration.h` is generated by `kernel-builder` itself. A function
is registered through the `def`/`ops` methods. `ops` specifies the
function signature following the [function schema](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md#func).
`impl` associates the function name with the C/C++ function and
the applicable device.
## Using kernel functions from Python
The bindings are typically wrapped in Python code in `torch_ext/`.
The native code is exposed under the `torch.ops` namespace. However,
we add some unique material to the name of the extension to ensure that
different versions of the same extension can be loaded at the same time.
As a result, the extension is registered as
`torch.ops._`.
To deal with this uniqueness, `kernel_builder` generates a Python module
named `_ops` that contains an alias for the name. This can be used to
refer to the correct `torch.ops` module. For example:
```python
from typing import Optional
import torch
from ._ops import ops
def relu(x: torch.Tensor, out: Optional[torch.Tensor] = None) -> torch.Tensor:
if out is None:
out = torch.empty_like(x)
ops.relu(out, x)
return out
```
## Kernel tests
Kernel tests are stored in the `tests` directory. Since running all
kernel tests in CI may be prohibitively expensive, the `pyproject.toml`
generated by the builder adds support for the special `kernels_ci`
PyTest marker that can be used as follows:
```python
import pytest
@pytest.mark.kernels_ci
def test_relu():
...
```
We recommend that you to pick tests that together would catch most error
cases while running within 60 seconds.
You can run the tests (e.g. in CI) using:
```bash
$ nix run .#ci-test
```
If the kernel supports multiple backends, it will run the test for the
first supported backend that was found, obeying the following order: CUDA,
ROCm, XPU, Metal, CPU. If you would like to the tests for a specific build
variant, you can use `nix run .#ciTests.`. For instance:
```bash
$ nix run .#ciTests.torch210-cxx11-cpu-x86_64-linux
```
When running the tests on a non-NixOS systems, make sure that
[the CUDA driver library can be found](https://danieldk.eu/Software/Nix/Nix-CUDA-on-non-NixOS-systems#solutions).
## Kernel docs
We provide a utility to generate a system card for a given kernel, utilizing
information from its `build.toml` and metadata. This system card provides a
reasonable starting point and is meant to be edited afterward by the kernel
developer.
The template card is generated as a part of [`kernels init`](../cli-init)
command and is serialized in the root directory of the kernel.
The card will be filled automatically by the builder when using the
`build-and-upload` or `build-and-copy` command. It will be serialized
to the `build` sub-directory inside the main kernel directory. It
will be uploaded as `README.md` to the Hub.

Xet Storage Details

Size:
12.3 kB
·
Xet hash:
be1ab525d3b9ce8a7ad2ff2bc09a513fe11b7d6aaaedf7e810eced82533a5cc3

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.