hydra / README.md

Publish Hydra kernel source packet

7298fd0 verified 21 days ago

4.87 kB

	---
	license: mit
	tags:
	- kernel
	- kernels
	- triton
	- attention
	- long-context
	---

	# Hydra

	Hydra is an experimental bounded-residency attention kernel for long-context
	decode. It keeps sink tokens, recent tokens, and selected older pages resident
	instead of forcing each decode step to attend over the full KV cache.

	Source code: [https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra](https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra)

	This release is intentionally narrow. It is not a general replacement for
	full attention, and it does not claim universal speedups or broad quality
	preservation. The current target is fit and usability for specific
	long-context inference workloads where the full-attention path is memory-bound.

	## Usage

	After the kernel is published:

	```python
	import torch
	from kernels import get_kernel

	hydra = get_kernel("Frosty40/hydra")

	q = torch.randn(1, 32, 1, 128, device="cuda", dtype=torch.bfloat16)
	k = torch.randn(1, 8, 8192, 128, device="cuda", dtype=torch.bfloat16)
	v = torch.randn(1, 8, 8192, 128, device="cuda", dtype=torch.bfloat16)

	out = hydra.hydra(q, k, v)
	print(out.shape)
	```

	For local development from the public source checkout:

	```python
	from pathlib import Path
	import sys

	sys.path.insert(0, str(Path("hf-kernels") / "hydra" / "torch-ext"))
	import hydra
	```

	`readme_example.py` uses the local source packet by default so it can run before
	publication. Set `HYDRA_USE_HUB=1` after publication to exercise the Hub-loaded
	path.

	## API

	```python
	hydra.hydra(
	q,
	k,
	v,
	*,
	is_causal=True,
	sliding_window=None,
	policy_layer_idx=None,
	precision="high",
	)
	```

	Current constraints:

	- CUDA tensors only
	- bf16 `q`, `k`, and `v`
	- shape `(B, H, T, D)` with `D=128`
	- causal attention only
	- decode path supports `Tq == 1` with arbitrary `Tkv`
	- prefill path requires `T % BLOCK_SIZE == 0`

	## Evidence Boundary

	Submission-facing evidence must come from checked artifacts, not prose notes.
	Treat evidence in three separate scopes:

	- kernel/package validation: tests, CUDA parity logs, `kernel-builder` logs, and
	isolated decode benchmarks for this source packet
	- broad Hydra research campaign: capacity, quality, sparse-attention comparison,
	edge/OOM, diagnostic, and model-family reports from the staging repo
	- exact-model proof-of-concept: checked `Qwen/Qwen3.6-35B-A3B-FP8` rows for
	named GPUs only

	The exact-Qwen proof-of-concept appendix in the staging repo is under:

	```text
	results/raw/qwen3p6_35b_a3b_fp8/
	results/reports/QWEN3P6_FP8_EVIDENCE_TABLE.md
	```

	Each cited row must include all three:

	- fit/headroom: GPU, context length, memory allocated/reserved, and OOM state
	- quality/correctness: prompt/task ID and generated answer artifact
	- speed/usability: wall time, generated tokens, tokens/sec, and comparison target

	Do not cite proxy models, loader-only probes, failed dependency checks, or
	non-matching model runs as Hydra benchmark results. Do not describe the
	exact-Qwen proof-of-concept subset as the full Hydra validation campaign.

	## Current Proof-Of-Concept Scope

	The current exact-Qwen artifact-backed proof-of-concept scope is:

	\| GPU \| Model \| Scope \|
	\| --- \| --- \| --- \|
	\| RTX PRO 6000 WS \| `Qwen/Qwen3.6-35B-A3B-FP8` \| 32k/80k/160k repeat packet, 160k c96 warm packet, and frontier/headroom sweeps \|
	\| RTX 3090 \| `Qwen/Qwen3.6-35B-A3B-FP8` \| 2k/3k/4k/6k/8k fit probes and completed 10k/12k/14k edge sweep \|

	The 3090 result should be framed as fit/usability evidence, not a speedup
	claim. Token rates are slow in the long-context edge rows. The broader Hydra
	campaign includes additional GPUs, tasks, and comparison lanes outside this
	exact-model appendix.

	## Validation Required Before Merge

	Minimum gates for source changes:

	```bash
	cd hf-kernels/hydra
	python3 -m pytest -q tests
	nix run .#ci-test
	python3 benchmarks/benchmark_hydra_decode.py --repo .
	python3 readme_example.py
	```

	Run the CUDA tests on real GPUs. Local syntax checks are not enough for a
	kernel submission.

	## Benchmark Snapshot

	The current 8192-token decode smoke/benchmark matrix is intentionally reported
	as kernel/package evidence, not as a universal speedup claim.

	\| GPU \| Package smoke decode \| HF benchmark mean \|
	\| --- \| ---: \| ---: \|
	\| RTX 3060 \| 0.2574 ms \| 0.3229 ms \|
	\| RTX 3070 \| 0.1474 ms \| 0.2532 ms \|
	\| RTX 3080 \| 0.2051 ms \| 0.3157 ms \|
	\| RTX 3090 \| 0.1492 ms \| 0.3107 ms \|
	\| RTX 4070 Ti \| 0.1261 ms \| 0.2215 ms \|
	\| RTX 4090 \| 0.1132 ms \| 0.2245 ms \|
	\| A100 SXM4 \| 0.1408 ms \| 0.2568 ms \|
	\| RTX PRO 6000 Blackwell \| 0.1158 ms \| 0.1371 ms \|
	\| RTX A6000 \| builder smoke 0.2166 ms \| 0.3230 ms \|

	The final `kernel-builder` gate passed on a Vast RTX A6000 with
	`BUILDER_VARIANT=torch210-cxx11-cu128-x86_64-linux`: local pytest `6 passed`,
	decode smoke `0.2166 ms/iter`, builder pytest `4 passed, 2 skipped`, exit
	status `0`.