---
license: apache-2.0
datasets:
  - nvidia/Nemotron-Pretraining-Dataset-sample
library_name: transformers
tags:
  - nvidia
  - pytorch
track_downloads: true
---

# KVzap


KVzap is a KV cache pruning method that accelerates LLM inference during both prefilling and decoding. It applies a lightweight model to the hidden states to predict an importance score for every KV pair, then prunes the pairs whose score falls below a given threshold.
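The score-and-prune step above can be sketched in PyTorch as follows. This is a minimal illustration, not KVzap's actual architecture: `score_head`, the tensor shapes, and the threshold value are all placeholder assumptions.

```python
import torch

# Hypothetical shapes: batch=1, heads=4, seq_len=6, hidden_dim=32, head_dim=8.
hidden = torch.randn(1, 6, 32)           # hidden states (batch, seq_len, hidden_dim)
keys = torch.randn(1, 4, 6, 8)           # KV cache keys (batch, heads, seq_len, head_dim)
values = torch.randn(1, 4, 6, 8)         # KV cache values, same shape as keys

# Stand-in for the lightweight scoring model: one importance score per KV pair.
score_head = torch.nn.Linear(32, 1)      # placeholder, not KVzap's real scorer
scores = score_head(hidden).squeeze(-1)  # (batch, seq_len)

# Prune every KV pair whose score is below the threshold (placeholder value).
threshold = 0.0
keep = scores[0] >= threshold            # (seq_len,) boolean mask
kept_keys = keys[:, :, keep, :]          # shortened along the sequence dimension
kept_values = values[:, :, keep, :]
```

After pruning, attention is computed over the shortened cache, so both memory and per-step compute shrink with the number of discarded pairs.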

KVzap is trained as a fast approximation of KVzip+ on 1.2M samples from Nemotron-Pretraining-Dataset-sample. Training code is available in the kvpress repository (source).