
KVzap


KVzap is a KV cache pruning method that accelerates LLM inference in both prefilling and decoding. It applies a lightweight model to the hidden states to predict an importance score for every KV pair, then prunes those whose score falls below a given threshold.
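The score-and-prune step described above can be sketched as follows. This is a minimal NumPy illustration, not the trained predictor: the shapes, the random linear scorer, and the threshold value are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes; the real model scores KV pairs per layer and head.
seq_len, hidden_dim, head_dim = 8, 16, 4

hidden_states = rng.standard_normal((seq_len, hidden_dim))
keys = rng.standard_normal((seq_len, head_dim))
values = rng.standard_normal((seq_len, head_dim))

# Lightweight linear scorer on the hidden states (a stand-in for the
# trained importance predictor; weights here are random for illustration).
w = rng.standard_normal(hidden_dim)
scores = hidden_states @ w

# Keep only the KV pairs whose score meets the threshold.
threshold = 0.0
keep = scores >= threshold
keys, values = keys[keep], values[keep]
```

After pruning, `keys` and `values` hold only the retained entries, so subsequent attention operates on a smaller cache; raising the threshold trades accuracy for more aggressive compression.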

KVzap is trained as a fast approximation of KVzip+, using 1.2M samples from Nemotron-Pretraining-Dataset-sample. Training code is available in the kvpress repository (source).

Model size: 1.05M parameters (F32, Safetensors)
