---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
- sparse-attention
- ann-attention
- distillation
- search-projection
- inference-optimization
library_name: pytorch
---

# ann-sparseattention

Search projections for ANN-substituted attention on [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).

Code: [github.com/unixsysdev/ann-sparseattention](https://github.com/unixsysdev/ann-sparseattention)

## Current status

Research prototype. The trained projections work, the runtime is a correctness prototype, and the eval envelope is narrow. Treat reported numbers as preliminary.

**Validated:** 6-layer pilot on Qwen3-4B-Instruct-2507; WikiText-103 PPL preserved at K=128 (gap ≈ +0.7%); learned projections retrieve attention-relevant keys.

**Not yet validated:** 34-layer / whole-model substitution; long-context tasks (LongBench, RULER, needle); wall-clock speedup vs FlashAttention/SDPA; KV-cache decode-mode integration; GPU-resident ANN kernel.

**Runtime caveat:** the FAISS path here builds CPU indexes per batch, and the gather step uses dense-style tensor expansion. Compute-reduction numbers below are *algorithmic scoring reductions, not measured wall-clock speedups.*

## Relation to RetrievalAttention

RetrievalAttention (Liu et al., 2024) shows that **vanilla ANN over the model's native Q, K vectors fails** because Q and K live in mismatched distributions — they were never trained to be each other's nearest neighbors, only to score via dot product. Their fix is at *index time*: an attention-aware graph construction (RoarGraph-style).

This work attacks the same problem from the opposite direction. We **train a tiny shared projection** (`W_Qs, W_Ks → R^64`) so that `q_search` and `k_search` live in the same distribution by construction. Off-the-shelf FAISS HNSW with default parameters then suffices.

| | Search space | Index | Trainable |
|---|---|---|---|
| Raw Q/K + vanilla ANN | original Q/K | off-the-shelf | no — fails (Q/K OOD) |
| RetrievalAttention | original Q/K | attention-aware graph | no |
| **This work** | **learned Q\_s / K\_s** | **off-the-shelf** | **yes (~2-11M params)** |

Contribution: *eliminate Q/K mismatch at index-build time via distillation, instead of patching it at search time.* The clean validating experiment — vanilla FAISS over raw Q/K vs. learned Q\_s/K\_s vs. exact teacher top-K — is the next planned run.

## What's in this repo

Per-layer linear search projections `(W_Qs, W_Ks)` of shape `[2560, 64]`, trained against the frozen base model's attention via contrastive + distillation losses. At inference these produce 64-d "search vectors" that let an off-the-shelf FAISS HNSW index pick the top-K keys to attend to, replacing dense `O(L²)` attention with `O(L·K)` ANN-substituted attention. (A minimal sketch of this substitution appears after the pilot results below.)

Layers covered (pilot): `[4, 8, 12, 16, 20, 24]` — 6 of 36 layers, ~2M trainable params.

## Pilot results (final, 2K steps on WikiText-103)

| Step | Recall@K=128 | PPL gap (full vs ANN) |
|---|---|---|
| 500 | 47.4% | 1.21% |
| 1000 | 50.7% | 0.68% |
| 1500 | 50.9% | 0.68% |
| **2000 (final)** | **50.9%** | **0.71%** |

PPL gap is the primary signal — at <1% relative gap, the model's output quality is preserved under ANN substitution. Recall plateaus around step 1000 because the softmax-relevant keys concentrate in the top ~30; disagreement on positions 30-128 falls on the near-zero-weight tail and doesn't affect output.
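For concreteness, here is the minimal sketch of the substitution referenced above. It is illustrative, not the repo's runtime (that lives in the repo's `inference` module): single head, single sequence, CPU FAISS, and all function and argument names are assumptions.

```python
# Hedged sketch of ANN-substituted attention for one layer and one head.
# Mirrors the prototype's caveats: CPU FAISS index per sequence, causal
# mask enforced by filtering retrieved ids. Names/shapes are assumptions.
import faiss
import numpy as np
import torch
import torch.nn.functional as F

def ann_attention(hidden, q, k, v, w_qs, w_ks, topk=128):
    # hidden: [L, 2560] layer input; q, k, v: [L, d_head] for one head;
    # w_qs, w_ks: [2560, 64] trained search projections for this layer.
    L = hidden.shape[0]

    # 64-d search vectors: q_search and k_search share a distribution by
    # construction, so an untuned off-the-shelf HNSW index suffices.
    q_s = (hidden @ w_qs).float().cpu().numpy()
    k_s = (hidden @ w_ks).float().cpu().numpy()

    index = faiss.IndexHNSWFlat(64, 32)       # CPU index per sequence (prototype)
    index.add(k_s)
    _, ids = index.search(q_s, min(topk, L))  # [L, topk] candidate key ids

    out = torch.zeros_like(q)
    for t in range(L):
        # Causal mask by filtering retrieved ids; always keep position t.
        # (-1 marks "not found" padding in FAISS results.)
        found = ids[t][(ids[t] >= 0) & (ids[t] <= t)]
        cand = torch.from_numpy(np.unique(np.append(found, t))).to(q.device)
        scores = (q[t] @ k[cand].T) / k.shape[-1] ** 0.5
        out[t] = F.softmax(scores, dim=-1) @ v[cand]  # O(K) scoring per query
    return out
```

A deployable runtime would batch queries, keep the index and gather on GPU, and handle the KV cache in decode mode; none of that exists yet (see Current status).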
### K-retrieve Pareto (pilot step 2000, FAISS HNSW)

`PPL_full = 9.958`

| K | Recall@K | PPL_ANN | PPL gap |
|---|---|---|---|
| 16 | 24.9% | 10.71 | +7.51% |
| 32 | 22.8% | 10.41 | +4.51% |
| 64 | 23.1% | 10.20 | +2.42% |
| 128 | 26.0% | 10.04 | +0.82% |
| 256 | 31.6% | 9.88 | **−0.79%** |
| 512 | 40.8% | 9.67 | **−2.89%** |

On this small WikiText slice, K ≥ 256 produced lower measured PPL than the full-attention reference. A plausible explanation is sparse-softmax denoising, but with only 12 eval batches, sample noise is also a candidate, as are packed-boundary artifacts (the pilot trained with packing on; the default in the repo is now off) and partial-layer substitution acting as a regularizer. We treat it as a hypothesis to be confirmed via an exact-topK oracle (full QK^T → top-K → restricted attention) at the same K, which separates "denoising from any sparsity" from "denoising from learned projections." Code-level sanity checks pass: same input sequences for `ppl_full` vs `ppl_ann`, intact causal mask in retrieval, single softmax over retrieved K with no wrapper leakage between iterations.

### Compute / quality knobs (FLOP-counted)

`L = 4096`. Compute reduction is for the attention scoring step, ≈ `L / K`. These are FLOP estimates, not measured wall-clock — the FAISS path in this repo is a research prototype that does CPU index builds and GPU↔CPU transfers, so it is not the right thing to time.

| K | PPL gap | Attention scoring reduction |
|---|---|---|
| 512 | −2.89% | ~8× |
| 256 | −0.79% | ~16× |
| 128 | +0.82% | ~32× |
| 64 | +2.42% | ~64× |
| 32 | +4.51% | ~128× |
| 16 | +7.51% | ~256× |

Eval scope: 12 sequences × 4K tokens of WikiText-103 validation (~50K tokens). Read these as "what we observed on this slice", not as population-level estimates. The K-sweep recall numbers (24–41%) and the in-training `evaluate()` recall (50.9% at K=128) come from different sampled subsets of the streaming split and shouldn't be directly compared. The repo also reports `mass@K` (the sum of teacher attention probability captured by the search top-K) — the more direct retrieval-quality metric when the softmax is sharp.

### Per-layer recall (pilot)

| Layer | Recall@K=128 | Recall@K=512 |
|---|---|---|
| 4 | 15.8% | 34.7% |
| 8 | 22.2% | 38.7% |
| 12 | 23.4% | 39.1% |
| 16 | 31.9% | 45.2% |
| 20 | 31.4% | 42.6% |
| 24 | 31.1% | 44.4% |

Early layers are harder for content-addressable retrieval — their attention is more local/positional than semantic. The ordering is consistent across K, so it's a property of the layer rather than noise.

### Caveats / what's next

- **Packing**: pilot training and eval ran with sequence packing on (no segment-level causal mask, since transformers' default forward doesn't build one). The relative PPL gap between full and ANN is internally consistent under this confound, but the negative gap at K≥256 has at least three candidate explanations we haven't disentangled — (a) sparse-softmax denoising, (b) ANN happening to filter cross-document keys that full attention attends to, (c) sample noise on a small eval. The default config now has packing off, so the next run isolates (a).
- **Exact-topK oracle**: a four-way Pareto (full vs. exact top-K vs. exact attention over search-selected top-K vs. search-ANN) is the natural follow-up to separate "denoising from any sparsity" from "denoising from learned projections"; see the sketch after this list.
- **Wall-clock**: not measured. The FAISS path in the repo is a CPU-side research prototype, not a deployable runtime. A GPU-resident top-K kernel is the next engineering step.
- **34-layer headline** run is queued (`make_headline_config()` is wired); its checkpoints will be mirrored here when it runs.
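A minimal sketch of that oracle, under the same hedges as before (single head, illustrative names, not the repo's implementation):

```python
# Exact-topK oracle: full QK^T -> causal top-K -> attention restricted to
# those keys. Dense scoring is used only to *select* keys, so any PPL change
# vs. full attention is attributable to sparsity itself, not to retrieval
# quality. Hedged sketch; the repo's actual eval code may differ.
import torch
import torch.nn.functional as F

def exact_topk_attention(q, k, v, topk=128):
    # q, k, v: [L, d_head] for one head.
    L, d = q.shape
    scores = (q @ k.T) / d ** 0.5                            # [L, L]
    causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=q.device),
                        diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    # K-th largest visible score per query row; everything below is dropped.
    # Early rows with fewer than K visible keys keep all of them.
    kth = scores.topk(min(topk, L), dim=-1).values[:, -1:]
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                     # single softmax
```

Running the K sweep above through this oracle at matched K gives the "any sparsity" baseline that the negative-gap hypothesis needs.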
## Files

| File | What |
|---|---|
| `search_step_1000.pt` | Mid-training checkpoint (step 1000, 0.68% PPL gap) |
| `search_step_2000.pt` | Final pilot checkpoint (step 2000, 0.71% PPL gap) |

Each contains `{step, search_module: state_dict, optimizer, scheduler, config}`.

## Loading

```python
import torch
from transformers import AutoModelForCausalLM

# Search module class is in the GitHub repo (model.py)
from model import SearchProjectionModule

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)

search = SearchProjectionModule(
    d_model=2560,
    d_search=64,
    layer_indices=[4, 8, 12, 16, 20, 24],
    use_mlp=False,
).to(base.device).to(torch.bfloat16)

# Checkpoint also holds optimizer/scheduler state, hence weights_only=False.
ckpt = torch.load("search_step_2000.pt", map_location="cpu", weights_only=False)
search.load_state_dict(ckpt["search_module"])
```

Use `inference.install_ann_attention(...)` (in the GitHub repo) to monkey-patch the trained layers and run with FAISS HNSW retrieval at inference time.

## Training recipe

- Frozen base: Qwen3-4B-Instruct-2507 (36 layers, hidden 2560, GQA 32:8).
- Data: WikiText-103 raw, 4K-token sequences (packing was on at training time; the default in the repo is now off — see Caveats).
- 2000 steps, batch 8, lr 1e-4 (cosine, 100-step warmup), AdamW.
- `α=β=1` (contrastive + KL distillation losses, averaged over the trained layers); a hedged sketch of this objective appears at the end of this card.
- bf16 weights, fp32 loss math.
- SDPA attention (B200, no flash-attn package needed).
- Liger fused RMSNorm/SwiGLU/RoPE on the frozen base.
- Total wall-clock: ~25 min on a single B200.

## License

The search projections are released under Apache-2.0 (matching the base model).
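For reference, the objective sketch named in the Training recipe. This is a reconstruction from the card's description, not the repo's code: the temperature, positive-key selection, and normalization details are assumptions.

```python
# Hedged sketch of the per-layer training loss: contrastive (teacher top-K
# keys as positives) + KL distillation from the teacher's attention
# distribution, combined with alpha = beta = 1. Reconstruction, not train.py.
import torch
import torch.nn.functional as F

def search_loss(q_s, k_s, teacher_attn, topk=128, tau=0.1):
    # q_s, k_s: [L, 64] search vectors for one trained layer (fp32 loss math).
    # teacher_attn: [L, L] frozen model's causal attention probabilities.
    L = q_s.shape[0]
    sims = (q_s @ k_s.T) / tau
    causal = torch.triu(torch.ones_like(sims, dtype=torch.bool), diagonal=1)
    log_p = F.log_softmax(sims.masked_fill(causal, float("-inf")), dim=-1)
    log_p = log_p.masked_fill(causal, 0.0)   # keep 0 * log_p well-defined

    # Contrastive term: pull each query toward its teacher top-K keys.
    pos = torch.zeros_like(teacher_attn)
    pos.scatter_(-1, teacher_attn.topk(min(topk, L), dim=-1).indices, 1.0)
    pos = pos.masked_fill(causal, 0.0)
    contrastive = -(pos * log_p).sum(-1).div(pos.sum(-1).clamp(min=1)).mean()

    # Distillation term: forward KL from the teacher's attention distribution
    # to the search softmax, over causally visible positions only.
    t = teacher_attn.masked_fill(causal, 0.0)
    kl = (t * (t.clamp(min=1e-9).log() - log_p)).sum() / L

    return contrastive + kl                  # alpha = beta = 1
```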