| Intended Task/Domain: |
KV cache pruning / inference optimization for transformer-based language models (prefilling and decoding). |
| Model Type: |
Feed-forward surrogate (Linear or 2-layer MLP) applied to transformer hidden states. |
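As a hedged illustration only, the following PyTorch sketch shows what the 2-layer MLP variant of the surrogate could look like; the hidden size (4096), inner width (512), and KV-head count (8) are hypothetical placeholders, not KVzap's actual configuration.

```python
import torch
import torch.nn as nn

class KVPruningScorer(nn.Module):
    """Sketch of a feed-forward surrogate scorer (assumed architecture).

    Maps per-token transformer hidden states (T, D_h) to per-token,
    per-KV-head pruning scores (T, H). A single
    nn.Linear(d_hidden, n_kv_heads) would be the Linear variant.
    """

    def __init__(self, d_hidden: int, n_kv_heads: int, d_inner: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_hidden, d_inner),
            nn.SiLU(),
            nn.Linear(d_inner, n_kv_heads),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (T, D_h) -> log-space scores: (T, H)
        return self.mlp(hidden_states)

# Hypothetical dimensions for illustration.
scorer = KVPruningScorer(d_hidden=4096, n_kv_heads=8)
scores = scorer(torch.randn(1024, 4096))  # shape: (1024, 8)
```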
| Intended Users: |
ML researchers and engineers integrating KV cache pruning into LLM inference stacks and benchmarking long-context / long-decoding performance. |
| Output: |
Numeric pruning scores in log-space with shape $(T, H)$ (per token, per KV head), used to threshold which KV pairs to retain and to enforce a local sliding window.
| Describe how the model works: |
KVzap maps transformer hidden states of shape $(T, D_h)$ to per-token, per-KV-head scores of shape $(T, H)$. Tokens scoring below a threshold $\tau$ are pruned from the KV cache; a recent sliding window is always retained. A minimal sketch of this retention logic is given below.
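For illustration only, here is a minimal sketch of thresholding plus sliding-window retention, assuming a hypothetical `kv_retention_mask` helper; the threshold value and window size shown are placeholders, not recommended settings.

```python
import torch

def kv_retention_mask(scores: torch.Tensor, tau: float, window: int) -> torch.Tensor:
    """Return a boolean mask of shape (T, H): True = keep the KV pair.

    scores: (T, H) log-space pruning scores from the surrogate.
    tau:    threshold; tokens scoring below tau are pruned.
    window: number of most recent tokens that are always retained.
    """
    keep = scores >= tau           # threshold in log-space
    if window > 0:
        keep[-window:, :] = True   # local sliding window is never pruned
    return keep

# Hypothetical values for illustration.
scores = torch.randn(1024, 8)
mask = kv_retention_mask(scores, tau=-2.0, window=128)
compression_ratio = mask.numel() / mask.sum().item()
```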
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: |
Not Applicable. |
| Technical Limitations & Mitigation: |
Over-pruning can degrade the host model; mitigate by selecting $\tau$ conservatively and validating on target tasks.
| Verified to have met prescribed NVIDIA quality standards: |
Yes. |
| Performance Metrics: |
Downstream accuracy under compression (e.g., RULER, LongBench, AIME25), compression ratio, and runtime overhead (KVzap's compute/memory overhead is designed to be negligible relative to the transformer layers).
| Potential Known Risks: |
The scores predicted by KVzap are used to prune the host model's KV cache. Pruning may change host-model (LLM) outputs even at low thresholds, since host models are not trained with KV cache pruning.
| Licensing: |
Apache License 2.0. |