How It Works: Architectural Notes
This document explains how the eviction patch hooks into a DeepseekV3Attention layer and where the design choices live. Read this before modifying the patch or porting it to a different attention class.
The patch surface
install_kv_eviction(model, ...) walks model.modules() and finds every DeepseekV3Attention instance. For each one it:
- Stashes the original forward method as an attribute on the module (so remove_kv_eviction can restore it later).
- Creates an _EvictionState object holding budget, n_sink, n_recent, evict_every, and a per-layer accumulated-score tensor score.
- Replaces the layer's forward with a closure that wraps the original. The closure runs the original forward, captures the attention probabilities, accumulates them into state.score, then calls _maybe_evict on the cache.
The original forward path is not modified. We sit outside the math, observing the attention probabilities and managing the cache as a side effect.
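The wrap-and-restore pattern described above can be sketched without PyTorch. This is a toy illustration only: ToyAttention, install, remove, and the _orig_forward attribute name are hypothetical stand-ins rather than the patch's actual code, and a real closure would also accumulate scores and call _maybe_evict.

```python
class ToyAttention:
    """Hypothetical stand-in for an attention layer; returns (output, attn_probs)."""

    def forward(self, x):
        probs = [1.0 / len(x)] * len(x)  # uniform attention, for illustration
        return sum(x) / len(x), probs


def install(layer, observed):
    """Wrap layer.forward so every call also records the attention probabilities."""
    original = layer.forward          # bound method of the unpatched layer
    layer._orig_forward = original    # stashed so remove() knows a patch is present

    def wrapped(x):
        out, probs = original(x)      # the original math runs untouched
        observed.append(probs)        # side effect: observe the probabilities
        return out, probs

    layer.forward = wrapped           # instance attribute shadows the class method


def remove(layer):
    """Undo install() by dropping the instance-level override."""
    if hasattr(layer, "_orig_forward"):
        del layer.forward
        del layer._orig_forward


layer, seen = ToyAttention(), []
install(layer, seen)
layer.forward([1.0, 2.0, 3.0])   # observed
remove(layer)
layer.forward([4.0, 5.0])        # no longer observed
```

Because the wrapper only reads what the original forward returns, removing it leaves the layer exactly as it was.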
What gets evicted
Eviction operates on the layer's KV cache. With transformers.cache_utils.DynamicCache (the default for HuggingFace generation), each layer has a key_cache[layer_idx] and value_cache[layer_idx] tensor of shape [batch, heads, seq, dim]. We slice along the seq dimension, keeping:
- Sinks: indices [0, n_sink) always.
- Recent: indices [seq - n_recent, seq) always.
- Heavy hitters: the top budget indices by accumulated attention mass, drawn from the middle range [n_sink, seq - n_recent).
The middle range is where eviction actually does work; the sink and recent ranges are non-evictable.
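The three ranges can be combined into one index list per sequence. The sketch below uses NumPy in place of torch tensors and assumes budget >= 1; select_keep_indices is a hypothetical name, not a function from the patch.

```python
import numpy as np


def select_keep_indices(score, n_sink, n_recent, budget):
    """Return sorted cache positions to keep for one sequence (assumes budget >= 1).

    score: 1-D array of accumulated attention mass, one entry per cached token.
    """
    seq = score.shape[0]
    if seq <= n_sink + budget + n_recent:
        return np.arange(seq)                      # under budget: keep everything

    sinks = np.arange(n_sink)                      # [0, n_sink)
    recent = np.arange(seq - n_recent, seq)        # [seq - n_recent, seq)
    middle = np.arange(n_sink, seq - n_recent)     # the only evictable range

    # argpartition places the `budget` largest middle scores in the last slots.
    top = np.argpartition(score[middle], -budget)[-budget:]
    heavy = np.sort(middle[top])                   # restore positional order

    return np.concatenate([sinks, heavy, recent])


score = np.array([0.1, 0.2, 5.0, 0.3, 4.0, 0.1, 0.2, 0.9, 0.8, 0.7])
keep = select_keep_indices(score, n_sink=2, n_recent=2, budget=2)
print(keep.tolist())  # [0, 1, 2, 4, 8, 9]
```

Positions 8 and 9 survive as part of the recent window even though position 7 has a higher score, and positions 0 and 1 survive as sinks despite their low scores.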
Score accumulation
Per-token attention mass is accumulated as:
state.score[batch, token] += sum_over_heads(attn_probs[batch, head, *, token])
That is, for each token in the cache, sum the attention probability that every query at this step assigned to it, across all heads. The accumulation runs every step. Over time, tokens that many heads attend to repeatedly accumulate large scores, while tokens that are attended to only once or twice fall behind in the ranking.
Cross-head, not per-head. This is a defensible default but it has a downside: heads that specialize (e.g., one head dedicated to attention sinks, another to retrieval) have their scores pooled together. Per-head policies are sometimes superior in production; this implementation does not currently expose that knob. PRs welcome.
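As a minimal sketch of the accumulation rule, the snippet below uses NumPy in place of torch tensors and assumes the attention probabilities have shape [batch, heads, q_len, k_len]. The function name accumulate is hypothetical.

```python
import numpy as np


def accumulate(score, attn_probs):
    """Add this step's attention mass to the running per-token score.

    score:      [batch, k_len] running totals
    attn_probs: [batch, heads, q_len, k_len] softmax output for this step
    """
    # Collapse heads (axis 1) and queries (axis 2); what remains is the mass
    # each cached token received at this step.
    score += attn_probs.sum(axis=(1, 2))
    return score


# One sequence, two heads, one query, four cached tokens.
probs = np.array([[[[0.7, 0.1, 0.1, 0.1]],      # head 0
                   [[0.4, 0.4, 0.1, 0.1]]]])    # head 1
score = accumulate(np.zeros((1, 4)), probs)
```

Here token 0 receives 0.7 + 0.4 = 1.1 of the total mass. The per-head view (0.7 versus 0.4) is lost in the sum, which is the pooling trade-off noted above.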
When eviction triggers
evict_every=N controls how often _maybe_evict actually runs. Default is 1 (every step). Larger values reduce per-step overhead but increase peak cache size between evictions:
peak_cache_size <= n_sink + budget + n_recent + (evict_every - 1)
For most workloads evict_every=1 is fine. If the eviction step shows up in profiling as a measurable cost (rare, since argpartition on a few thousand floats is fast), increase to 8 or 16.
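The bound can be checked with a small simulation. This sketch assumes single-token decode steps and samples the cache length at the end of each step, after the eviction check has run; prefill is not modeled, and simulate_peak is a hypothetical helper.

```python
def simulate_peak(n_sink, budget, n_recent, evict_every, steps):
    """Return the largest cache length observed at the end of any decode step."""
    cap = n_sink + budget + n_recent
    size, peak = cap, cap              # start with a cache that is exactly full
    for step in range(1, steps + 1):
        size += 1                      # one new token per decode step
        if step % evict_every == 0:
            size = min(size, cap)      # eviction trims back to the cap
        peak = max(peak, size)
    return peak


peak = simulate_peak(n_sink=4, budget=256, n_recent=64, evict_every=8, steps=100)
print(peak)  # 331, i.e. 4 + 256 + 64 + (8 - 1)
```

Within an eviction step the cache briefly holds one more token before it is trimmed, so the transient in-step maximum is one higher than the end-of-step figure.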
Edge cases
A few cases the code handles explicitly:
- Cache below threshold. If len(cache) <= n_sink + budget + n_recent, _maybe_evict is a no-op. No eviction happens until the cache grows past the budget.
- First call. The accumulated score tensor is initialized lazily on the first forward pass once we know batch size and current cache length.
- Cache reset between generations. reset_eviction_scores(model) zeroes all accumulated scores. Call this between independent generations; otherwise the previous generation's heavy hitters bias the next one.
- Model on multiple devices. The score tensor lives on the same device as the cache. With device_map="auto" model sharding, each layer's score lives on its layer's device; nothing special to do.
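Lazy initialization and reset can be sketched as a small state object. ToyEvictionState is hypothetical and is not the patch's _EvictionState: it uses NumPy instead of torch, and its reset drops the score array instead of zeroing it in place, which is a simplification chosen so the next generation can start at any length.

```python
import numpy as np


class ToyEvictionState:
    """Hypothetical per-layer state; not the patch's actual _EvictionState."""

    def __init__(self):
        self.score = None                      # shape unknown until first forward

    def update(self, step_mass):
        """step_mass: [batch, k_len] attention mass for the current cache."""
        batch, k_len = step_mass.shape
        if self.score is None:                 # first call: allocate lazily
            self.score = np.zeros((batch, k_len))
        elif self.score.shape[1] < k_len:      # cache grew: pad new positions with 0
            pad = k_len - self.score.shape[1]
            self.score = np.pad(self.score, ((0, 0), (0, pad)))
        self.score += step_mass

    def reset(self):
        """Forget everything; the next update() re-allocates."""
        self.score = None


state = ToyEvictionState()
state.update(np.ones((1, 3)))
state.update(np.ones((1, 4)))   # one token was appended since the last step
```

After the two updates the first three tokens hold a score of 2 and the newly appended token holds 1. Calling state.reset() between independent generations discards that history.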
Failure modes to watch for
If eviction breaks, you'll typically see one of these:
- Repetitive garbage output -> sinks were evicted. Verify n_sink >= 4 and that index 0 of key_cache is preserved across eviction calls.
- Topical drift / stale context -> recent window too small. Increase n_recent.
- Sudden quality cliff at long context -> heavy-hitter score accumulation has saturated or normalized incorrectly. Check that state.score is being updated every step (not just on eviction-trigger steps).
- Memory growing despite eviction -> the cache class is not DynamicCache and the patch's slicing assumptions don't hold. Print type(past_key_value) from inside the patched forward to verify.
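For the first failure mode, a sink-preservation check might look like the sketch below. It compares NumPy arrays for illustration; sinks_preserved is a hypothetical helper, and with real tensors the same comparison could be made with torch.equal on the sliced views.

```python
import numpy as np


def sinks_preserved(keys_before, keys_after, n_sink):
    """True if the first n_sink positions along the seq axis are unchanged.

    Both arrays are [batch, heads, seq, dim]; seq may differ after eviction.
    """
    return np.array_equal(keys_before[:, :, :n_sink], keys_after[:, :, :n_sink])


before = np.arange(2 * 6 * 3, dtype=float).reshape(1, 2, 6, 3)
good = before[:, :, [0, 1, 3, 5]]      # sinks 0 and 1 kept
bad = before[:, :, [1, 3, 4, 5]]       # position 0 was dropped

ok_good = sinks_preserved(before, good, n_sink=2)
ok_bad = sinks_preserved(before, bad, n_sink=2)
```

Snapshot the keys before an eviction call and compare afterwards. A False result points at the index selection rather than at the model.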
Porting to a different attention class
To adapt this patch for non-MLA attention (Llama, Qwen, Mistral, Gemma standard MHA / GQA):
- Find the attention class in the relevant transformers/models/<arch>/modeling_<arch>.py.
- Identify how it stores K/V in the cache. Most use DynamicCache with [batch, heads, seq, head_dim]. MLA stores expanded K/V at [batch, heads, seq, qk_dim] and [batch, heads, seq, v_dim] separately because qk_dim != v_dim. For uniform-dim attention, the slicing in _maybe_evict simplifies.
- Replace the DeepseekV3Attention class lookup in install_kv_eviction with the relevant class.
- Verify the attention probability tensor shape returned by the original forward; the score-accumulation code assumes [batch, heads, q_len, k_len].
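The cache-slicing step is the part that depends on the K/V layout. The sketch below uses NumPy in place of torch and shows the same keep indices applied to K and V whose last dimensions differ, as in the MLA case; slice_cache is a hypothetical name, and the shapes in the example are arbitrary.

```python
import numpy as np


def slice_cache(keys, values, keep):
    """Gather the kept positions from K and V along the seq axis.

    keys:   [batch, heads, seq, qk_dim]
    values: [batch, heads, seq, v_dim]   (v_dim may differ from qk_dim)
    keep:   [batch, n_keep] integer positions, one row per sequence
    """
    idx = keep[:, None, :, None]       # broadcast over heads and the feature dim
    return (np.take_along_axis(keys, idx, axis=2),
            np.take_along_axis(values, idx, axis=2))


rng = np.random.default_rng(0)
keys = rng.normal(size=(1, 2, 5, 4))     # qk_dim = 4
values = rng.normal(size=(1, 2, 5, 3))   # v_dim = 3
keep = np.array([[0, 2, 4]])

new_k, new_v = slice_cache(keys, values, keep)
```

Only the seq axis is indexed, so the function does not care whether the two feature dimensions match. In PyTorch, torch.gather does not broadcast its index, so the index would need to be expanded to the full output shape first.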
The H2O recipe itself does not need to change; only the cache management code does.
Testing methodology
The smoke test in src/kv_eviction_mla.py exercises the eviction-decision logic on a _FakeCache that mirrors DynamicCache's structure but skips the model. Real-model validation requires a GPU and a checkpoint. We recommend:
- Run the smoke test (no GPU required) to verify the patch logic.
- Patch a small DeepSeek/Kimi model (e.g., a 1B or 7B variant) and run a perplexity-sweep notebook on a public benchmark prompt set.
- Run RULER 128K NIAH at your target budget. Compare to full-cache baseline.
- Stress-test on your actual workload distribution.
A canonical RULER benchmark with this code is on the project roadmap.