---
license: mit
tags:
- kernel
- kernels
- triton
- attention
- long-context
---

# Hydra

Hydra is an experimental bounded-residency attention kernel for long-context
decode. It keeps sink tokens, recent tokens, and selected older pages resident
instead of forcing each decode step to attend over the full KV cache.

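
To make the residency idea concrete, the following is a minimal sketch and not Hydra's implementation. It assumes the resident set is the union of a fixed number of sink tokens at the start of the cache, a recent window at the end, and whole pages chosen by some policy. The function name, `page_size`, and every parameter value are hypothetical, and the policy Hydra uses to select older pages is not described here, so the sketch takes the selected pages as an input.

```python
def resident_token_indices(kv_len, num_sink, num_recent, page_size, selected_pages):
    """Return the sorted KV positions kept resident for one decode step.

    Hypothetical illustration only: sinks + recent window + selected pages.
    """
    # Sink tokens: the first positions of the cache.
    resident = set(range(min(num_sink, kv_len)))
    # Recent tokens: a window ending at the newest position.
    resident.update(range(max(0, kv_len - num_recent), kv_len))
    # Selected older pages: whole pages chosen by an external policy.
    for page in selected_pages:
        start = page * page_size
        resident.update(range(start, min(start + page_size, kv_len)))
    return sorted(resident)


idx = resident_token_indices(
    kv_len=8192, num_sink=4, num_recent=256, page_size=64, selected_pages=[10, 57]
)
print(f"{len(idx)} of 8192 positions resident")
```

Under these made-up settings the decode step would attend to a few hundred positions rather than all 8192, which is the kind of bound on residency the description refers to.
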
Source code: [https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra](https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra)

This release is intentionally narrow. It is not a general replacement for
full attention, and it does not claim universal speedups or broad quality
preservation. The current target is fit and usability for specific
long-context inference workloads where the full-attention path is memory-bound.

## Usage

After the kernel is published:

```python
import torch
from kernels import get_kernel

hydra = get_kernel("Frosty40/hydra")

q = torch.randn(1, 32, 1, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 8192, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 8192, 128, device="cuda", dtype=torch.bfloat16)

out = hydra.hydra(q, k, v)
print(out.shape)
```

For local development from the public source checkout:

```python
from pathlib import Path
import sys

sys.path.insert(0, str(Path("hf-kernels") / "hydra" / "torch-ext"))

import hydra
```

`readme_example.py` uses the local source packet by default so it can run before
publication. Set `HYDRA_USE_HUB=1` after publication to exercise the Hub-loaded
path.

## API

```python
hydra.hydra(
    q,
    k,
    v,
    *,
    is_causal=True,
    sliding_window=None,
    policy_layer_idx=None,
    precision="high",
)
```

Current constraints:

- CUDA tensors only
- bf16 `q`, `k`, and `v`
- shape `(B, H, T, D)` with `D=128`
- causal attention only
- decode path supports `Tq == 1` with arbitrary `Tkv`
- prefill path requires `T % BLOCK_SIZE == 0`

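
The constraints above can be checked before dispatch with a small helper. The following is a sketch under stated assumptions: `check_hydra_inputs` is a hypothetical name and not part of the `hydra` package, `block_size=64` is a placeholder because the value of `BLOCK_SIZE` is not stated here, and the prefill divisibility rule is assumed to apply to the query length. It works on plain shape tuples and strings so it runs without a GPU.

```python
def check_hydra_inputs(q_shape, kv_shape, *, dtype="bfloat16", device="cuda",
                       is_causal=True, block_size=64):
    """Return a list of violated constraints; an empty list means none found.

    Hypothetical helper. block_size=64 is a placeholder, not Hydra's value.
    """
    problems = []
    if device != "cuda":
        problems.append("CUDA tensors only")
    if dtype != "bfloat16":
        problems.append("q, k, and v must be bf16")
    if len(q_shape) != 4 or len(kv_shape) != 4:
        problems.append("shape must be (B, H, T, D)")
        return problems  # remaining checks need all four dimensions
    if q_shape[3] != 128 or kv_shape[3] != 128:
        problems.append("D must be 128")
    if not is_causal:
        problems.append("causal attention only")
    tq = q_shape[2]
    # Decode (Tq == 1) accepts any Tkv; otherwise assume the prefill rule
    # applies to the query length.
    if tq != 1 and tq % block_size != 0:
        problems.append("prefill requires T % BLOCK_SIZE == 0")
    return problems


# Shapes from the usage example: one decode step over an 8192-token cache.
print(check_hydra_inputs((1, 32, 1, 128), (1, 8, 8192, 128)))
# A 100-token prefill is rejected under the assumed block size of 64.
print(check_hydra_inputs((1, 32, 100, 128), (1, 8, 100, 128)))
```

With real tensors, the same arguments could be derived from `tuple(q.shape)`, `q.device.type`, and the dtype name.
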
## Evidence Boundary

Submission-facing evidence must come from checked artifacts, not prose notes.
Treat evidence in three separate scopes:

- kernel/package validation: tests, CUDA parity logs, `kernel-builder` logs, and
  isolated decode benchmarks for this source packet
- broad Hydra research campaign: capacity, quality, sparse-attention comparison,
  edge/OOM, diagnostic, and model-family reports from the staging repo
- exact-model proof-of-concept: checked `Qwen/Qwen3.6-35B-A3B-FP8` rows for
  named GPUs only

The exact-Qwen proof-of-concept appendix in the staging repo is under:

```text
results/raw/qwen3p6_35b_a3b_fp8/
results/reports/QWEN3P6_FP8_EVIDENCE_TABLE.md
```

Each cited row must include all three:

- fit/headroom: GPU, context length, memory allocated/reserved, and OOM state
- quality/correctness: prompt/task ID and generated answer artifact
- speed/usability: wall time, generated tokens, tokens/sec, and comparison target

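
One way to enforce the three-part rule above is to check each row for missing fields before citing it. The following is a minimal sketch: the field names, units, and the `missing_evidence` helper are hypothetical and do not reflect the schema of the staging repo's evidence table, and the example row holds made-up placeholder values, not Hydra measurements.

```python
# Hypothetical field names for the three required evidence scopes.
REQUIRED_FIELDS = {
    "fit/headroom": ["gpu", "context_length", "memory_allocated_gib",
                     "memory_reserved_gib", "oom"],
    "quality/correctness": ["task_id", "answer_artifact"],
    "speed/usability": ["wall_time_s", "generated_tokens", "tokens_per_sec",
                        "comparison_target"],
}


def missing_evidence(row):
    """Map each evidence scope to the fields that are absent or None in row."""
    missing = {}
    for scope, fields in REQUIRED_FIELDS.items():
        absent = [name for name in fields if row.get(name) is None]
        if absent:
            missing[scope] = absent
    return missing


def tokens_per_sec(generated_tokens, wall_time_s):
    """Throughput as generated tokens divided by wall time in seconds."""
    return generated_tokens / wall_time_s


# Placeholder row with made-up values; it deliberately lacks an answer artifact.
row = {
    "gpu": "EXAMPLE-GPU",
    "context_length": 4096,
    "memory_allocated_gib": 10.0,
    "memory_reserved_gib": 12.0,
    "oom": False,
    "task_id": "example-task-001",
    "wall_time_s": 8.0,
    "generated_tokens": 64,
    "tokens_per_sec": tokens_per_sec(64, 8.0),
    "comparison_target": "full attention",
}
print(missing_evidence(row))
```

A row that returns a non-empty mapping is missing one of the three required parts and should not be cited.
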
Do not cite proxy models, loader-only probes, failed dependency checks, or
non-matching model runs as Hydra benchmark results. Do not describe the
exact-Qwen proof-of-concept subset as the full Hydra validation campaign.

## Current Proof-Of-Concept Scope

The current exact-Qwen artifact-backed proof-of-concept scope is:

| GPU | Model | Scope |
| --- | --- | --- |
| RTX PRO 6000 WS | `Qwen/Qwen3.6-35B-A3B-FP8` | 32k/80k/160k repeat packet, 160k c96 warm packet, and frontier/headroom sweeps |
| RTX 3090 | `Qwen/Qwen3.6-35B-A3B-FP8` | 2k/3k/4k/6k/8k fit probes and completed 10k/12k/14k edge sweep |

The 3090 result should be framed as fit/usability evidence, not a speedup
claim. Token rates are slow in the long-context edge rows. The broader Hydra
campaign includes additional GPUs, tasks, and comparison lanes outside this
exact-model appendix.

## Validation Required Before Merge

Minimum gates for source changes:

```bash
cd hf-kernels/hydra
python3 -m pytest -q tests
nix run .#ci-test
python3 benchmarks/benchmark_hydra_decode.py --repo .
python3 readme_example.py
```

Run the CUDA tests on real GPUs. Local syntax checks are not enough for a
kernel submission.

## Benchmark Snapshot

The current 8192-token decode smoke/benchmark matrix is intentionally reported
as kernel/package evidence, not as a universal speedup claim.

| GPU | Package smoke decode | HF benchmark mean |
| --- | ---: | ---: |
| RTX 3060 | 0.2574 ms | 0.3229 ms |
| RTX 3070 | 0.1474 ms | 0.2532 ms |
| RTX 3080 | 0.2051 ms | 0.3157 ms |
| RTX 3090 | 0.1492 ms | 0.3107 ms |
| RTX 4070 Ti | 0.1261 ms | 0.2215 ms |
| RTX 4090 | 0.1132 ms | 0.2245 ms |
| A100 SXM4 | 0.1408 ms | 0.2568 ms |
| RTX PRO 6000 Blackwell | 0.1158 ms | 0.1371 ms |
| RTX A6000 | builder smoke 0.2166 ms | 0.3230 ms |

The final `kernel-builder` gate passed on a Vast RTX A6000 with
`BUILDER_VARIANT=torch210-cxx11-cu128-x86_64-linux`: local pytest `6 passed`,
decode smoke `0.2166 ms/iter`, builder pytest `4 passed, 2 skipped`, exit
status `0`.