	# Hydra Kernel Card

	## Summary

	Hydra provides a bounded-residency decode attention path for long-context
	inference. The implementation is Python plus Triton and is packaged here as a
	Hugging Face universal CUDA kernel source directory.

	Source code: [https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra](https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra)

	## Intended Use

	Use Hydra for experiments where full decode attention over a long KV cache is
	memory-bound and a bounded resident set is acceptable for evaluation.

	Hydra is not intended as a drop-in universal FlashAttention replacement. Users
	should keep an exact-model fallback path and validate quality for their prompt
	set.
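
To make the memory argument concrete, here is a rough sizing sketch for a bf16 KV cache with `D=128`. The layer count, KV-head count, and token counts are hypothetical values chosen for easy arithmetic; they are not taken from any Hydra benchmark or from the Qwen model. The sketch also assumes that a bounded resident set means a fixed number of token positions kept resident, which the card does not spell out.

```python
def kv_cache_bytes(num_layers, num_kv_heads, tokens, head_dim=128, bytes_per_elem=2):
    """Bytes needed to hold K and V for `tokens` positions of one sequence."""
    # The factor of 2 covers the separate K and V tensors; bf16 is 2 bytes per element.
    return 2 * num_layers * num_kv_heads * tokens * head_dim * bytes_per_elem


# Hypothetical model shape, chosen only to keep the arithmetic readable.
NUM_LAYERS = 32
NUM_KV_HEADS = 8

full = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, tokens=131_072)
resident = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, tokens=4_096)

GIB = 1024 ** 3
print(f"full cache:   {full / GIB:.2f} GiB")      # full cache:   16.00 GiB
print(f"resident set: {resident / GIB:.2f} GiB")  # resident set: 0.50 GiB
```

Under these assumed numbers, every decode step over the full cache reads 16 GiB of keys and values, while a 4,096-token resident set reads 0.5 GiB. Whether that reduction is acceptable for quality is exactly what the fallback path and prompt-set validation are meant to establish.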

	## Kernel Interface

	The exported call is:

	```python
	hydra.hydra(q, k, v, is_causal=True, sliding_window=None)
	```

	Inputs `q`, `k`, and `v` are bf16 tensors shaped `(B, H, T, D)` with `D=128`.
	The decode path supports a single query position (`Tq == 1`); the prefill path
	requires the sequence length to be a multiple of the compile-time block size.
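
The shape constraints above can be checked before a call is dispatched. The following is a minimal sketch that works on plain shape tuples and a dtype name, so it runs without a GPU. The helper name `classify_hydra_call`, the `block_size=64` default, and the requirement that `k` and `v` share one shape are assumptions made for illustration; the card does not state the compile-time block size, and the sketch applies the multiple-of-block-size rule to the query length.

```python
HEAD_DIM = 128  # stated in the card


def classify_hydra_call(q_shape, k_shape, v_shape, dtype="bfloat16", block_size=64):
    """Return "decode" or "prefill", or raise ValueError for unsupported inputs.

    block_size is a hypothetical stand-in for the kernel's compile-time
    block size, which the card does not specify.
    """
    if dtype != "bfloat16":
        raise ValueError(f"expected bf16 inputs, got {dtype}")

    for name, shape in (("q", q_shape), ("k", k_shape), ("v", v_shape)):
        if len(shape) != 4:
            raise ValueError(f"{name} must be shaped (B, H, T, D), got {shape}")
        if shape[-1] != HEAD_DIM:
            raise ValueError(f"{name} must have D={HEAD_DIM}, got D={shape[-1]}")

    # Assumption: K and V always share one shape.
    if tuple(k_shape) != tuple(v_shape):
        raise ValueError(f"k and v shapes differ: {k_shape} vs {v_shape}")

    tq = q_shape[2]
    if tq == 1:
        return "decode"
    if tq > 0 and tq % block_size == 0:
        return "prefill"
    raise ValueError(
        f"Tq={tq} is neither 1 (decode) nor a positive multiple of {block_size} (prefill)"
    )


# One new query token attending over a 4096-token cache.
print(classify_hydra_call((1, 8, 1, 128), (1, 8, 4096, 128), (1, 8, 4096, 128)))  # decode
```

A caller could use the `ValueError` branch as the trigger for the exact-model fallback path rather than as a hard failure.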

	## Evidence

	This card separates the kernel contribution from benchmark appendices:

	- kernel/package validation: import, CSR, CUDA decode parity, builder, example,
	  and isolated decode benchmark gates
	- broad Hydra campaign context: multi-GPU bounded-residency testing, comparison
	  lanes, capacity/OOM boundaries, and diagnostics in the staging repo
	- exact-Qwen proof-of-concept: summary-backed demo rows for:
	  - RTX PRO 6000 WS with `Qwen/Qwen3.6-35B-A3B-FP8`
	  - RTX 3090 with `Qwen/Qwen3.6-35B-A3B-FP8`

	Use `results/reports/QWEN3P6_FP8_EVIDENCE_TABLE.md` in the staging repo as the
	claim ledger for the exact-Qwen proof-of-concept only. The table is generated
	from raw summary JSON, answer artifacts, and logs. It intentionally excludes
	incomplete scopes from completed benchmark rows.

	## Non-Claims

	- no universal speedup claim
	- no production-readiness claim
	- no broad quality-preservation claim without scorer or inspection evidence
	- no proxy/profile/loader-only benchmark claims
	- no results from non-Qwen or non-FP8 runs in the exact-Qwen proof-of-concept table
	- no framing that treats the exact-Qwen proof-of-concept as the full Hydra campaign

	## Required Validation

	For source changes, run:

	- import and CSR tests
	- CUDA decode parity against PyTorch SDPA on small tensors
	- kernel-builder `ci-test`
	- one isolated decode benchmark
	- one exact-model reproduction on a named GPU/config
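
The decode parity item in the list above compares kernel output with a trusted reference on small tensors. The sketch below shows the shape of such a check in plain Python so it runs without CUDA or PyTorch. It is a stand-in, not the repository's test: in the real gate the candidate would come from `hydra.hydra(...)` and the reference from `torch.nn.functional.scaled_dot_product_attention`. The helper names and the `atol=2e-2` tolerance are hypothetical.

```python
import math


def reference_decode_attention(q, keys, values):
    """Single-query softmax attention for one head, in plain Python floats.

    q is a list of D floats; keys and values are lists of T lists of D floats.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def check_parity(candidate, reference, atol=2e-2):
    """Return (passed, max_abs_error). atol is a hypothetical bf16-scale tolerance."""
    err = max(abs(c - r) for c, r in zip(candidate, reference))
    return err <= atol, err


# Two identical keys give equal weights, so the output is the mean of the values.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
reference = reference_decode_attention(q, keys, values)
print(reference)  # [2.0, 4.0]

# A candidate within tolerance passes; one that is far off fails.
print(check_parity([2.01, 4.0], reference)[0])  # True
print(check_parity([2.5, 4.0], reference)[0])   # False
```

Reporting the maximum absolute error alongside the pass/fail flag makes it easier to see how close a change sits to the tolerance.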