# KVzap-linear-Qwen3-8B / explainability.md
Field | Response
:------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
Intended Task/Domain: | KV cache pruning / inference optimization for transformer-based language models (prefilling and decoding).
Model Type: | Feed-forward surrogate (Linear or 2-layer MLP) applied to transformer hidden states.
Intended Users: | ML researchers and engineers integrating KV cache pruning into LLM inference stacks and benchmarking long-context / long-decoding performance.
Output: | Numeric pruning scores in log-space with shape \((T, H)\) (per token, per KV head), used to threshold which KV pairs to retain and to enforce a local sliding window.
Describe how the model works: | KVzap maps transformer hidden states \((T, D_h)\) to per-token, per-KV-head scores \((T, H)\). Tokens below a threshold \(\tau\) are pruned from the KV cache; a recent sliding window is always retained.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable.
Technical Limitations & Mitigation: | Over-pruning can degrade the host model; mitigate by selecting \(\tau\) conservatively and validating on target tasks.
Verified to have met prescribed NVIDIA quality standards: | Yes.
Performance Metrics: | Downstream accuracy under compression (e.g., RULER, LongBench, AIME25), compression ratio, and runtime overhead (KVzap compute/memory overhead is designed to be negligible relative to transformer layers).
Potential Known Risks: | The scores predicted by KVzap are used to prune the KV cache. Even with low thresholds, pruning may change the host model's (LLM's) outputs, since host models were not trained with KV cache pruning.
Licensing: | [Apache License 2.0.](https://www.apache.org/licenses/LICENSE-2.0)
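To make the scoring-and-pruning scheme in the table concrete, here is a minimal sketch in PyTorch. It is illustrative only: the names (`score_head`, `keep`), the toy sizes, and the use of a plain `nn.Linear` as the surrogate are assumptions, not the actual KVzap implementation or API.

```python
import torch

# Toy dimensions: T tokens, hidden size D_h, H KV heads (assumed values)
T, D_h, H = 16, 64, 4
window = 4    # recent sliding window that is always retained
tau = 0.0     # pruning threshold in log-space (hypothetical choice)

hidden = torch.randn(T, D_h)           # transformer hidden states, shape (T, D_h)
score_head = torch.nn.Linear(D_h, H)   # linear surrogate: (T, D_h) -> (T, H)
scores = score_head(hidden)            # per-token, per-KV-head log-space scores

keep = scores > tau                    # retain tokens scoring above the threshold
keep[-window:, :] = True               # the recent sliding window is always kept

# keep[t, h] == True means the KV pair for token t at head h stays in the cache
compression = 1.0 - keep.float().mean().item()
```

A production integration would apply `keep` to the cached K/V tensors per layer; the surrogate's cost is a single matmul per layer, which is why the overhead is negligible relative to the transformer layers themselves.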