---
tags:
- kernels
license: apache-2.0
---

# Activation
Activation is a Python package that provides custom CUDA-based activation kernels, primarily targeting AMD GPUs.

Currently implemented:

- [PolyNorm](https://arxiv.org/html/2411.03884v1)
- [RMSNorm](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html)
- **FusedAddRMSNorm**

  A fused operator that combines **residual addition** (`x + residual`) with **RMSNorm** in a single kernel.

  Instead of:

  ```python
  y = x + residual
  hidden_state = rms_norm(y, weight, eps)
  out = y + some_op(hidden_state)
  ```

  Fused as:

  ```python
  hidden_state, y = fused_add_rms_norm(x, residual, weight, eps)
  out = y + some_op(hidden_state)
  ```
- **FusedMulPolyNorm**

  A fused operator that combines **PolyNorm** with an **element-wise multiplication** by a tensor.

  Instead of:

  ```python
  y = poly_norm(x, weight, bias, eps)
  out = y * a
  ```

  Fused as:

  ```python
  out = fused_mul_poly_norm(x, a, weight, bias, eps)
  ```
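The fused operators are drop-in replacements for their unfused counterparts. As a minimal sketch of the intended semantics in pure PyTorch (not the CUDA kernels themselves, and assuming an order-3 PolyNorm as described in the paper, with a 3-element `weight` and a scalar `bias`; the `*_ref` names are illustrative, not part of the package API):

```python
import torch


def rms_norm_ref(x, weight, eps=1e-6):
    # RMSNorm: scale x by the reciprocal root-mean-square of its last dim
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight


def fused_add_rms_norm_ref(x, residual, weight, eps=1e-6):
    # Returns both the normalized hidden state and the pre-norm sum y,
    # matching the (hidden_state, y) pair produced by the fused kernel
    y = x + residual
    return rms_norm_ref(y, weight, eps), y


def poly_norm_ref(x, weight, bias, eps=1e-6):
    # Order-3 PolyNorm (per the paper): a weighted sum of
    # RMS-normalized powers of x, plus a bias
    def norm(u):
        return u * torch.rsqrt(u.pow(2).mean(-1, keepdim=True) + eps)

    return (weight[0] * norm(x) + weight[1] * norm(x**2)
            + weight[2] * norm(x**3) + bias)


def fused_mul_poly_norm_ref(x, a, weight, bias, eps=1e-6):
    # PolyNorm followed by an element-wise multiply, as one call
    return poly_norm_ref(x, weight, bias, eps) * a
```

These references are useful for correctness checks against the kernels; the fused versions exist to avoid materializing the intermediate tensors in separate launches.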
## Usage

```python
import torch
from kernels import get_kernel

activation = get_kernel("motif-technologies/activation")

torch.set_default_device("cuda")
poly_norm = activation.layers.PolyNorm(eps=1e-6)
x = torch.randn(10, 10)

print(poly_norm(x))
```
## Performance

- Test cases are taken from the Motif LLM.
- The results can be reproduced with the provided benchmarking tools; see the [benchmarks README](./benchmarks/README.md) for usage.
- Benchmark results may fluctuate, especially in the backward pass and at small dimension sizes.
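As a rough illustration only (this is not the repository's benchmarking tool), a single op can be timed with `torch.utils.benchmark.Timer`, which handles warmup and CUDA synchronization; the sketch below times a pure-PyTorch RMSNorm (the `rms_norm` helper is hypothetical) on whatever device is available:

```python
import torch
import torch.utils.benchmark as benchmark


def rms_norm(x, weight, eps=1e-6):
    # Pure-PyTorch RMSNorm used as the timed workload
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
w = torch.ones(1024, device=device)

timer = benchmark.Timer(
    stmt="rms_norm(x, w)",
    globals={"rms_norm": rms_norm, "x": x, "w": w},
)
measurement = timer.timeit(100)  # runs the statement 100 times
print(measurement)
```

Comparing such a baseline against the kernel from `get_kernel("motif-technologies/activation")` on the same inputs gives a quick sanity check before running the full benchmark suite.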
### RMSNorm

#### H100 Results

<details>
<summary>Forward Performance</summary>

![](./benchmarks/results/art_fwd/h100/rms_norm.png)

</details>

<details>
<summary>Backward Performance</summary>

![](./benchmarks/results/art_bwd/h100/rms_norm.png)

</details>

#### MI250 Results

<details>
<summary>Forward Performance</summary>

![](./benchmarks/results/art_fwd/mi250/rms_norm.png)

</details>

<details>
<summary>Backward Performance</summary>

![](./benchmarks/results/art_bwd/mi250/rms_norm.png)

</details>

---
### FusedAddRMSNorm

> [!NOTE]
> For the fused-operator benchmarks, the **non-fused baseline** was implemented with our **custom kernels**.

#### H100 Results

<details>
<summary>Forward Performance</summary>

![](./benchmarks/results/art_fwd/h100/fused_add_rms_norm.png)

</details>

<details>
<summary>Backward Performance</summary>

![](./benchmarks/results/art_bwd/h100/fused_add_rms_norm.png)

</details>

#### MI250 Results

<details>
<summary>Forward Performance</summary>

![](./benchmarks/results/art_fwd/mi250/fused_add_rms_norm.png)

</details>

<details>
<summary>Backward Performance</summary>

![](./benchmarks/results/art_bwd/mi250/fused_add_rms_norm.png)

</details>

---
### PolyNorm

#### H100 Results

<details>
<summary>Forward Performance</summary>

![](./benchmarks/results/art_fwd/h100/poly_norm.png)

</details>

<details>
<summary>Backward Performance</summary>

![](./benchmarks/results/art_bwd/h100/poly_norm.png)

</details>

#### MI250 Results

<details>
<summary>Forward Performance</summary>

![](./benchmarks/results/art_fwd/mi250/poly_norm.png)

</details>

<details>
<summary>Backward Performance</summary>

![](./benchmarks/results/art_bwd/mi250/poly_norm.png)

</details>

---
### FusedMulPolyNorm

> [!NOTE]
> For the fused-operator benchmarks, the **non-fused baseline** was implemented with our **custom kernels**.

#### H100 Results

<details>
<summary>Forward Performance</summary>

![](./benchmarks/results/art_fwd/h100/fused_mul_poly_norm.png)

</details>

<details>
<summary>Backward Performance</summary>

![](./benchmarks/results/art_bwd/h100/fused_mul_poly_norm.png)

</details>

#### MI250 Results

<details>
<summary>Forward Performance</summary>

![](./benchmarks/results/art_fwd/mi250/fused_mul_poly_norm.png)

</details>

<details>
<summary>Backward Performance</summary>

![](./benchmarks/results/art_bwd/mi250/fused_mul_poly_norm.png)

</details>
## Pre-commit Hooks

This project uses [pre-commit](https://pre-commit.com/) to automatically check and format code before commits.

### Setup

1. Install pre-commit:

   ```bash
   pip install pre-commit
   ```

2. Install the git hooks:

   ```bash
   pre-commit install
   ```

Once installed, the configured hooks run automatically on each commit.

### Included Hooks

The following tools are run via pre-commit:

- **[yapf](https://github.com/google/yapf)** – Python code formatter
- **[typos](https://github.com/crate-ci/typos)** – Spell checker for common typos
- **[isort](https://github.com/PyCQA/isort)** – Organizes and sorts Python imports
- **[clang-format](https://clang.llvm.org/docs/ClangFormat.html)** – Formats C++/CUDA code (`--style=file`)
- **[pymarkdown](https://github.com/jackdewinter/pymarkdown)** – Lints and auto-fixes Markdown files
- **[actionlint](https://github.com/rhysd/actionlint)** – Validates GitHub Actions workflows

### Usage

- Run all checks on the entire codebase:

  ```bash
  pre-commit run --all-files
  ```

- Run a specific hook (example: isort):

  ```bash
  pre-commit run isort --all-files
  ```