# Hydra Kernel Card

## Summary

Hydra provides a bounded-residency decode attention path for long-context
inference. The implementation is Python plus Triton and is packaged here as a
Hugging Face universal CUDA kernel source directory.

Source code: [https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra](https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra)

## Intended Use

Use Hydra for experiments where full decode attention over a long KV cache is
memory-bound and a bounded resident set is acceptable for evaluation.

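
To make the memory-bound case concrete, the arithmetic below estimates how many bytes of K and V a single decode step touches per layer for a full bf16 cache versus a bounded resident set. The batch size, head count, context length, and resident token count are hypothetical values chosen for illustration; none of them come from Hydra's configuration.

```python
BYTES_PER_BF16 = 2  # bf16 stores each element in 2 bytes


def kv_bytes(batch: int, heads: int, tokens: int, head_dim: int = 128) -> int:
    """Bytes occupied by K and V for one layer at the given token count."""
    # The leading factor of 2 covers the two tensors, K and V.
    return 2 * batch * heads * tokens * head_dim * BYTES_PER_BF16


# Hypothetical sizes, for illustration only.
full = kv_bytes(batch=1, heads=8, tokens=131_072)
resident = kv_bytes(batch=1, heads=8, tokens=4_096)

print(f"full cache per layer:   {full / 2**20:.0f} MiB")
print(f"resident set per layer: {resident / 2**20:.0f} MiB")
print(f"reduction factor:       {full // resident}x")
```

Full decode attention reads every cached key and value on each step, so the first figure grows linearly with context length while the second stays fixed at whatever bound is chosen.
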
Hydra is not intended as a drop-in universal FlashAttention replacement. Users
should keep an exact-model fallback path and validate quality for their prompt
set.

## Kernel Interface

The exported call is:

```python
hydra.hydra(q, k, v, is_causal=True, sliding_window=None)
```

Inputs are bf16 tensors shaped `(B, H, T, D)` with `D=128`. The decode path
supports `Tq == 1`; the prefill path requires sequence length to be a multiple
of the compile-time block size.

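
The input contract above can be checked before calling the kernel. The sketch below works on plain shape tuples so it runs without a GPU. `classify_call` is a hypothetical helper, not part of the Hydra package, and the block size of 64 is a placeholder assumption, since the card does not state the compile-time value. It also assumes the prefill length requirement applies to the query sequence length.

```python
HEAD_DIM = 128  # the only head dimension the card lists as supported


def classify_call(q_shape, k_shape, v_shape, dtype="bfloat16", block_size=64):
    """Return 'decode' or 'prefill'; raise ValueError if the contract is violated.

    block_size=64 is a placeholder; the real value is fixed at compile time.
    """
    if dtype != "bfloat16":
        raise ValueError(f"expected bfloat16 inputs, got {dtype}")
    for name, shape in (("q", q_shape), ("k", k_shape), ("v", v_shape)):
        if len(shape) != 4:
            raise ValueError(f"{name} must be 4-D (B, H, T, D), got {tuple(shape)}")
        if shape[-1] != HEAD_DIM:
            raise ValueError(f"{name} has D={shape[-1]}, expected D={HEAD_DIM}")
    tq = q_shape[2]
    if tq == 1:
        return "decode"
    if tq % block_size != 0:
        raise ValueError(f"prefill length {tq} is not a multiple of {block_size}")
    return "prefill"


print(classify_call((1, 8, 1, 128), (1, 8, 4096, 128), (1, 8, 4096, 128)))
print(classify_call((1, 8, 256, 128), (1, 8, 256, 128), (1, 8, 256, 128)))
```

Failing early with a descriptive error pairs naturally with an exact fallback path: a rejected call can be routed to the exact implementation instead of reaching the kernel.
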
## Evidence

This card separates the kernel contribution from benchmark appendices:

- kernel/package validation: import, CSR, CUDA decode parity, builder, example,
  and isolated decode benchmark gates
- broad Hydra campaign context: multi-GPU bounded-residency testing, comparison
  lanes, capacity/OOM boundaries, and diagnostics in the staging repo
- exact-Qwen proof-of-concept: summary-backed demo rows for:
  - RTX PRO 6000 WS with `Qwen/Qwen3.6-35B-A3B-FP8`
  - RTX 3090 with `Qwen/Qwen3.6-35B-A3B-FP8`

Use `results/reports/QWEN3P6_FP8_EVIDENCE_TABLE.md` in the staging repo as the
claim ledger for the exact-Qwen proof-of-concept only. The table is generated
from raw summary JSON, answer artifacts, and logs. It intentionally excludes
incomplete scopes from completed benchmark rows.

## Non-Claims

- no universal speedup claim
- no production-readiness claim
- no broad quality-preservation claim without scorer or inspection evidence
- no proxy/profile/loader-only benchmark claims
- no results from non-Qwen or non-FP8 runs in the exact-Qwen proof-of-concept table
- no framing that treats the exact-Qwen proof-of-concept as the full Hydra campaign

## Required Validation

For source changes, run:

- import and CSR tests
- CUDA decode parity against PyTorch SDPA on small tensors
- kernel-builder `ci-test`
- one isolated decode benchmark
- one exact-model reproduction on a named GPU/config
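
The parity item in the checklist above has a simple shape: compute single-query attention with a trusted reference, compute it again with the candidate, and bound the maximum absolute difference. The sketch below shows that shape in plain Python so it runs anywhere. Both functions are stand-ins. In the real gate the reference would be `torch.nn.functional.scaled_dot_product_attention` and the candidate would be the Hydra decode call. The blocked online-softmax candidate here is only an illustrative substitute and is not claimed to be Hydra's algorithm.

```python
import math
import random


def reference_decode(q, keys, values):
    """Exact single-query softmax attention over all keys."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def blocked_decode(q, keys, values, block=4):
    """Stand-in candidate: same attention, accumulated block by block."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        ks, vs = keys[start:start + block], values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the check is reproducible
D, T = 8, 10            # deliberately tiny, as the checklist suggests
q = [rng.gauss(0, 1) for _ in range(D)]
keys = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
values = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]

err = max_abs_error(reference_decode(q, keys, values),
                    blocked_decode(q, keys, values))
assert err < 1e-9, f"parity failed: max abs error {err}"
```

The tolerance here suits Python's float64 arithmetic. A bf16 comparison on CUDA would need a much looser bound, chosen and recorded alongside the test.
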