Buckets:
| # Writing Hub kernels with kernel-builder | |
| ## Introduction | |
| The Kernel Hub allows Python libraries and applications to load compute | |
| kernels directly from the [Hub](https://hf.co/). To support this kind | |
| of dynamic loading, Hub kernels differ from traditional Python kernel | |
| packages in that they are made to be: | |
| - Portable: a kernel can be loaded from paths outside `PYTHONPATH`. | |
| - Unique: multiple versions of the same kernel can be loaded in the | |
| same Python process. | |
| - Compatible: kernels must support all recent versions of Python and | |
| the different PyTorch build configurations (various CUDA versions | |
| and C++ ABIs). Furthermore, older C library versions must be supported. | |
| `kernel-builder` is a set of tools that can build conforming kernels. It | |
| takes care of: | |
| - Building kernels for all supported PyTorch configurations (C++98/11 and | |
| different CUDA versions). | |
| - Compatibility with old glibc and libstdc++ versions, so that kernels also | |
| work on older Linux distributions. | |
| - Registering Torch ops, such that multiple versions the same kernel can be | |
| loaded without namespace conflicts. | |
| `kernel-builder` builds are configured through a `build.toml` file. | |
| `build.toml` is a simple format that does not require intricate knowledge | |
| of CMake or setuptools. | |
| This page describes the directory layout of a kernel-builder project, the | |
| format of the `build.toml` file, and some additional Python glue that | |
| `kernel-builder` provides. We will use a [simple ReLU kernel](https://github.com/huggingface/kernels/tree/main/examples/kernels/relu) | |
| as the running example. After reading this page, you may also want to have | |
| a look at the more realistic [ReLU kernel with backprop and `torch.compile`](https://github.com/huggingface/kernels/tree/main/examples/kernels/relu-backprop-compile) | |
| support. | |
| ## Setting up environment | |
| In the [`terraform`](https://github.com/huggingface/kernels/tree/main/terraform) directory, we provide an | |
| example of programatically spinning up an EC2 instance that is ready | |
| with everything needed for you to start developing and building | |
| kernels. | |
| If you use a different provider, the Terraform bridges should be | |
| similar and straightforward to modify. | |
| ## Kernel project layout | |
| Kernel projects follow this general directory layout: | |
| ```text | |
| relu | |
| ├── build.toml | |
| ├── relu_kernel | |
| │ └── relu.cu | |
| └── torch-ext | |
| └── torch_binding.cpp | |
| └── torch_binding.h | |
| └── relu | |
| └── __init__.py | |
| └── tests | |
| └── __init__.py | |
| └── test_relu.py | |
| ``` | |
| In this example we can find: | |
| - The build configuration in `build.toml`. | |
| - One or more top-level directories containing kernels (`relu_kernel`). | |
| - The `torch-ext` directory, which contains: | |
| - `torch_binding.h`: contains declarations for kernel entry points | |
| (from `kernel_a` and `kernel_b`). | |
| - `torch_binding.cpp`: registers the entry points as Torch ops. | |
| - `torch_ext/relu`: contains any Python wrapping the kernel needs. At the | |
| bare minimum, it should contain an `__init__.py` file. | |
| - Kernel tests in the directory `tests`. | |
| ## `build.toml` | |
| `build.toml` tells `kernel-builder` what to build and how. It looks as | |
| follows for the `relu` kernel: | |
| ```toml | |
| [general] | |
| name = "relu" | |
| [torch] | |
| src = [ | |
| "torch-ext/torch_binding.cpp", | |
| "torch-ext/torch_binding.h" | |
| ] | |
| [kernel.relu] | |
| backend = "cuda" | |
| src = [ | |
| "relu_kernel/relu.cu", | |
| ] | |
| depends = [ "torch" ] | |
| # If the kernel is only supported on specific capabilities, set the | |
| # cuda-capabilities option: | |
| # | |
| # cuda-capabilities = [ "9.0", "10.0", "12.0" ] | |
| ``` | |
| ### `general` | |
| - `name` (required): the name of the kernel. The Python code for a Torch | |
| extension must be stored in `torch-ext/`. | |
| - `version` (int, **experimental**): the major version of the kernel. | |
| The version is written to the kernel's `metadata.json` and is used | |
| by the `kernels upload` command to upload the kernel to a version | |
| branch named `v`. | |
| - `backends` (required): a list of supported backends. Must be one or | |
| more of `cpu`, `cuda`, `metal`, `rocm`, or `xpu`. | |
| - `python-depends` (**experimental**): a list of additional Python dependencies | |
| that the kernel requires. The only supported dependencies are `einops` | |
| and `nvidia-cutlass-dsl`. | |
| ### `general.cuda` | |
| - `maxver`: the maximum CUDA toolkit version (inclusive). This option | |
| _must not_ be set under normal circumstances, since it can exclude Torch | |
| build variants that are [required for compliant kernels](../kernel-requirements). | |
| This option is provided for kernels that cause compiler errors on | |
| newer CUDA toolkit versions. | |
| - `minver`: the minimum required CUDA toolkit version. This option | |
| _must not_ be set under normal circumstances, since it can exclude Torch | |
| build variants that are [required for compliant kernels](../kernel-requirements). | |
| This option is provided for kernels that require functionality only | |
| provided by newer CUDA toolkits. | |
| ### `torch` | |
| This section describes the Torch extension. In the future, there may be | |
| similar sections for other frameworks. This section has the following | |
| options: | |
| - `src` (required): a list of source files and headers. | |
| - `pyext` (optional): the list of extensions for Python files. Default: | |
| `["py", "pyi"]`. | |
| - `include` (optional): include directories relative to the project root. | |
| Default: `[]`. | |
| - `maxver` (optional): only build for this Torch version and earlier. Use cautiously, since this option produces | |
| non-compliant kernels if the version range does not correspond to the [required variants](build-variants). | |
| - `minver` (optional): only build for this Torch version and later. Use cautiously, since this option produces | |
| non-compliant kernels if the version range does not correspond to the [required variants](build-variants). | |
| ### `kernel.` | |
| Specification of a kernel with the name ``. Multiple `kernel.` | |
| sections can be defined in the same `build.toml`. | |
| See for example [`kernels-community/quantization`](https://huggingface.co/kernels-community/quantization/) | |
| for an example with multiple kernel sections. | |
| The following options can be set for a kernel: | |
| - `backend` (required): the compute backend of the kernel. The currently | |
| supported backends are `cpu`, `cuda`, `metal`, `rocm`, and `xpu`. | |
| **The `cpu` backend is currently experimental and might still change.** | |
| - `depends` (required): a list of dependencies. The supported dependencies | |
| are listed in [`deps.nix`](https://github.com/huggingface/kernels/blob/main/builder/lib/deps.nix). | |
| - `src` (required): a list of source files and headers. | |
| - `include` (optional): include directories relative to the project root. | |
| Default: `[]`. | |
| Besides these shared options, the following backend-specific options | |
| are available: | |
| #### cuda | |
| - `cuda-capabilities` (optional): a list of CUDA capabilities that the | |
| kernel should be compiled for. When absent, the kernel will be built | |
| using all capabilities that the builder supports. The effective | |
| capabilities are the intersection of this list and the capabilities | |
| supported by the CUDA compiler. It is recommended to leave this option | |
| unspecified **unless** a kernel requires specific capabilities. | |
| - `cuda_flags` (optional): additional flags to be passed to `nvcc`. | |
| **Warning**: this option should only be used in exceptional circumstances. | |
| Custom compile flags can interfere with the build process or break | |
| compatibility requirements. | |
| #### rocm | |
| - `rocm-archs`: a list of ROCm architectures that the kernel should be | |
| compiled for. | |
| #### xpu | |
| - `sycl_flags`: a list of additional flags to be passed to the SYCL | |
| compiler. | |
| ## Torch bindings | |
| ### Defining bindings | |
| Torch bindings are defined in C++, kernels commonly use two files: | |
| - `torch_binding.h` containing function declarations. | |
| - `torch_binding.cpp` registering the functions as Torch ops. | |
| For instance, the `relu` kernel has the following declaration in | |
| `torch_binding.h`: | |
| ```cpp | |
| #pragma once | |
| #include | |
| void relu(torch::Tensor &out, torch::Tensor const &input); | |
| ``` | |
| This is a declaration for the actual kernel, which is in `relu_kernel/relu.cu`: | |
| ```cpp | |
| #include | |
| #include | |
| #include | |
| #include | |
| __global__ void relu_kernel(float *__restrict__ out, | |
| float const *__restrict__ input, | |
| const int d) { | |
| const int64_t token_idx = blockIdx.x; | |
| for (int64_t idx = threadIdx.x; idx 0.0f ? x : 0.0f; | |
| } | |
| } | |
| void relu(torch::Tensor &out, | |
| torch::Tensor const &input) | |
| { | |
| TORCH_CHECK(input.scalar_type() == at::ScalarType::Float && | |
| input.scalar_type() == at::ScalarType::Float, | |
| "relu_kernel only supports float32"); | |
| int d = input.size(-1); | |
| int64_t num_tokens = input.numel() / d; | |
| dim3 grid(num_tokens); | |
| dim3 block(std::min(d, 1024)); | |
| const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); | |
| const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); | |
| relu_kernel>>(out.data_ptr(), | |
| input.data_ptr(), d); | |
| } | |
| ``` | |
| This function is then registered as a Torch op in `torch_binding.cpp`: | |
| ```cpp | |
| #include | |
| #include "registration.h" | |
| #include "torch_binding.h" | |
| TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) { | |
| ops.def("relu(Tensor! out, Tensor input) -> ()"); | |
| ops.impl("relu", torch::kCUDA, &relu); | |
| } | |
| REGISTER_EXTENSION(TORCH_EXTENSION_NAME) | |
| ``` | |
| This snippet uses macros from `registration.h` to register the function. | |
| `registration.h` is generated by `kernel-builder` itself. A function | |
| is registered through the `def`/`ops` methods. `ops` specifies the | |
| function signature following the [function schema](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md#func). | |
| `impl` associates the function name with the C/C++ function and | |
| the applicable device. | |
| ## Using kernel functions from Python | |
| The bindings are typically wrapped in Python code in `torch_ext/`. | |
| The native code is exposed under the `torch.ops` namespace. However, | |
| we add some unique material to the name of the extension to ensure that | |
| different versions of the same extension can be loaded at the same time. | |
| As a result, the extension is registered as | |
| `torch.ops._`. | |
| To deal with this uniqueness, `kernel_builder` generates a Python module | |
| named `_ops` that contains an alias for the name. This can be used to | |
| refer to the correct `torch.ops` module. For example: | |
| ```python | |
| from typing import Optional | |
| import torch | |
| from ._ops import ops | |
| def relu(x: torch.Tensor, out: Optional[torch.Tensor] = None) -> torch.Tensor: | |
| if out is None: | |
| out = torch.empty_like(x) | |
| ops.relu(out, x) | |
| return out | |
| ``` | |
| ## Kernel tests | |
| Kernel tests are stored in the `tests` directory. Since running all | |
| kernel tests in CI may be prohibitively expensive, the `pyproject.toml` | |
| generated by the builder adds support for the special `kernels_ci` | |
| PyTest marker that can be used as follows: | |
| ```python | |
| import pytest | |
| @pytest.mark.kernels_ci | |
| def test_relu(): | |
| ... | |
| ``` | |
| We recommend that you to pick tests that together would catch most error | |
| cases while running within 60 seconds. | |
| You can run the tests (e.g. in CI) using: | |
| ```bash | |
| $ nix run .#ci-test | |
| ``` | |
| If the kernel supports multiple backends, it will run the test for the | |
| first supported backend that was found, obeying the following order: CUDA, | |
| ROCm, XPU, Metal, CPU. If you would like to the tests for a specific build | |
| variant, you can use `nix run .#ciTests.`. For instance: | |
| ```bash | |
| $ nix run .#ciTests.torch210-cxx11-cpu-x86_64-linux | |
| ``` | |
| When running the tests on a non-NixOS systems, make sure that | |
| [the CUDA driver library can be found](https://danieldk.eu/Software/Nix/Nix-CUDA-on-non-NixOS-systems#solutions). | |
| ## Kernel docs | |
| We provide a utility to generate a system card for a given kernel, utilizing | |
| information from its `build.toml` and metadata. This system card provides a | |
| reasonable starting point and is meant to be edited afterward by the kernel | |
| developer. | |
| The template card is generated as a part of [`kernels init`](../cli-init) | |
| command and is serialized in the root directory of the kernel. | |
| The card will be filled automatically by the builder when using the | |
| `build-and-upload` or `build-and-copy` command. It will be serialized | |
| to the `build` sub-directory inside the main kernel directory. It | |
| will be uploaded as `README.md` to the Hub. | |
Xet Storage Details
- Size:
- 12.3 kB
- Xet hash:
- be1ab525d3b9ce8a7ad2ff2bc09a513fe11b7d6aaaedf7e810eced82533a5cc3
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.