---
license: apache-2.0
datasets:
- nvidia/Nemotron-Pretraining-Dataset-sample
library_name: transformers
tags:
- nvidia
- pytorch
track_downloads: true
---
# KVzap
KVzap is a KV-cache pruning method that accelerates LLM inference in both the prefilling and decoding stages. It applies a lightweight model to the hidden states to predict an importance score for every KV pair, then prunes the pairs whose score falls below a given threshold.
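The pruning step can be sketched as follows. This is a minimal illustration, not the KVzap implementation: `scorer` stands in for the lightweight scoring model (here an untrained linear layer, purely hypothetical), and the function names and shapes are assumptions for the example.

```python
import torch


def prune_kv_cache(hidden_states, keys, values, scorer, threshold=0.0):
    """Keep only KV pairs whose predicted importance is >= threshold.

    hidden_states: (seq_len, hidden_dim) hidden states for the cached tokens
    keys, values:  (seq_len, head_dim) KV cache entries for one head
    scorer:        lightweight model mapping a hidden state to a scalar score
    """
    scores = scorer(hidden_states).squeeze(-1)  # (seq_len,) importance scores
    keep = scores >= threshold                  # boolean mask over KV pairs
    return keys[keep], values[keep], keep


# Toy usage with random tensors and an untrained linear scorer
# (a stand-in; the real method uses a trained lightweight model).
torch.manual_seed(0)
seq_len, hidden_dim, head_dim = 8, 16, 4
hidden = torch.randn(seq_len, hidden_dim)
keys = torch.randn(seq_len, head_dim)
values = torch.randn(seq_len, head_dim)
scorer = torch.nn.Linear(hidden_dim, 1)

pruned_keys, pruned_values, mask = prune_kv_cache(hidden, keys, values, scorer)
```

Because pruning is a simple threshold on per-pair scores, the compression ratio is controlled directly by the threshold rather than by a fixed cache budget.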
KVzap is trained as a fast approximation of KVzip+ on 1.2M samples from Nemotron-Pretraining-Dataset-sample. Training code is available in the kvpress repository (source).