---
license: apache-2.0
datasets:
- nvidia/Nemotron-Pretraining-Dataset-sample
library_name: transformers
tags:
- nvidia
- pytorch
track_downloads: true
---
# KVzap
KVzap is a KV-cache pruning method that accelerates LLM inference in both the prefilling and decoding stages. It applies a lightweight model to the hidden states to predict an importance score for every KV pair, then prunes the pairs whose score falls below a given threshold.
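The pruning step can be sketched as follows. This is a minimal illustration, not the KVzap implementation: `scorer` stands in for the lightweight scoring model (here an untrained linear layer, purely hypothetical), and the function names and shapes are assumptions for the example.

```python
import torch


def prune_kv_cache(hidden_states, keys, values, scorer, threshold=0.0):
    """Keep only KV pairs whose predicted importance is >= threshold.

    hidden_states: (seq_len, hidden_dim) hidden states for the cached tokens
    keys, values:  (seq_len, head_dim) KV cache entries for one head
    scorer:        lightweight model mapping a hidden state to a scalar score
    """
    scores = scorer(hidden_states).squeeze(-1)  # (seq_len,) importance scores
    keep = scores >= threshold                  # boolean mask over KV pairs
    return keys[keep], values[keep], keep


# Toy usage with random tensors and an untrained linear scorer
# (a stand-in; the real method uses a trained lightweight model).
torch.manual_seed(0)
seq_len, hidden_dim, head_dim = 8, 16, 4
hidden = torch.randn(seq_len, hidden_dim)
keys = torch.randn(seq_len, head_dim)
values = torch.randn(seq_len, head_dim)
scorer = torch.nn.Linear(hidden_dim, 1)

pruned_keys, pruned_values, mask = prune_kv_cache(hidden, keys, values, scorer)
```

Because pruning is a simple threshold on per-pair scores, the compression ratio is controlled directly by the threshold rather than by a fixed cache budget.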
KVzap is trained as a fast approximation of KVzip+ on 1.2M samples from Nemotron-Pretraining-Dataset-sample. Training code is available in the kvpress repository (source).