KVzap-linear-Qwen3-32B

Model Overview

Description:

KVzap predicts per-token importance scores from a transformer’s hidden states to enable fast, input-dependent KV cache pruning during both prefilling and decoding. KVzap (KVzap-Linear and KVzap-MLP) was developed by NVIDIA as part of the KVpress project and the NVIDIA/KVzap model collection. This model is ready for commercial/non-commercial use.

License/Terms of Use:

Apache License 2.0.

Deployment Geography:

Global

Use Case:

Researchers and developers optimizing transformer inference (KV cache memory and throughput) in long-context and long-decoding settings, and benchmarking KV cache pruning methods (e.g., RULER, LongBench, reasoning workloads).

Release Date:

Hugging Face 01/12/2026 via https://huggingface.co/collections/nvidia/kvzap

Reference(s):

  • KVpress repository: https://github.com/NVIDIA/kvpress
  • Hugging Face collection: https://huggingface.co/collections/nvidia/kvzap

Model Architecture:

Architecture Type: Feed-forward neural network (MLP) / Linear projection

Network Architecture: Per-layer surrogate model applied to hidden states; either (i) a single linear layer (KVzap-Linear) or (ii) a 2-layer MLP with GELU (KVzap-MLP). Output dimension equals the number of KV heads (e.g., H = 8).
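The per-layer surrogate described above can be sketched in PyTorch as follows. This is a minimal illustration based on the architecture description only; the class name, `variant` flag, and `mlp_dim` default are assumptions, and the released checkpoints may use different module names and hyperparameters.

```python
import torch
import torch.nn as nn


class KVzapScorer(nn.Module):
    """Per-layer surrogate: hidden states (T, D_h) -> per-KV-head scores (T, H).

    Hypothetical sketch of the two variants described in the model card:
    a single linear layer (KVzap-Linear) or a 2-layer MLP with GELU (KVzap-MLP).
    """

    def __init__(self, hidden_size: int, num_kv_heads: int,
                 variant: str = "linear", mlp_dim: int = 1024):
        super().__init__()
        if variant == "linear":
            # KVzap-Linear: a single linear projection
            self.net = nn.Linear(hidden_size, num_kv_heads)
        else:
            # KVzap-MLP: 2-layer MLP with GELU activation
            self.net = nn.Sequential(
                nn.Linear(hidden_size, mlp_dim),
                nn.GELU(),
                nn.Linear(mlp_dim, num_kv_heads),
            )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (T, D_h); returns approximate log-importance scores (T, H)
        return self.net(hidden_states)


# Example with Qwen3-32B-like dimensions (D_h = 5120, H = 8 KV heads)
scorer = KVzapScorer(hidden_size=5120, num_kv_heads=8, variant="linear")
scores = scorer(torch.randn(16, 5120))
print(scores.shape)  # torch.Size([16, 8])
```

One scorer of this shape would be attached to each transformer layer of the base model, which is why the parameter counts quoted below scale with depth and hidden size.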

This model was developed based on: hidden states produced by the target base LLM (e.g., Qwen3-8B, Llama-3.1-8B-Instruct, Qwen3-32B).

Number of model parameters: Depends on base model and variant; typically in the range ~1.1M to ~210M parameters for the best-performing released configurations (e.g., Llama-3.1-8B-Instruct Linear ~1.1M; Qwen3-8B MLP ~76M; Qwen3-32B MLP ~210M).

Input(s):

Input Type(s): Text-derived hidden-state tensors (numeric)

Input Format(s): PyTorch tensor (e.g., torch.float16/torch.bfloat16)

Input Parameters: Two-dimensional (2D) hidden states with shape (T, D_h), where T is the sequence length and D_h is the base model hidden size (e.g., 4096 or 5120).

Other Properties Related to Input: KVzap operates on the base model’s hidden states at each transformer layer and position; it does not take raw text directly.

Output(s):

Output Type(s): Numeric scores (log-space)

Output Format(s): PyTorch tensor

Output Parameters: Two-dimensional (2D) scores with shape (T, H), where H is the number of KV heads (e.g., 8).

Other Properties Related to Output: Outputs approximate log(s⁺), where s⁺ is the KVzip+ importance score derived from attention weights and value/residual norms. Scores are used for threshold-based pruning and for enforcing a local sliding window (e.g., the last w = 128 tokens are always kept).
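The pruning rule above (keep tokens whose score exceeds a threshold, and always keep a local window) can be sketched as a boolean keep-mask over the KV cache. The function name and threshold value are illustrative, not part of the released API.

```python
import torch


def keep_mask(scores: torch.Tensor, threshold: float, window: int = 128) -> torch.Tensor:
    """Boolean keep-mask for KV cache pruning from per-head log-scores.

    scores: (T, H) predicted log-importance scores (one column per KV head).
    Returns a (T, H) boolean mask: True = keep the corresponding KV entry.
    Hypothetical sketch of the threshold-based pruning with an always-kept
    local sliding window, as described in the model card.
    """
    mask = scores > threshold        # threshold-based pruning, per head
    mask[-window:, :] = True         # the last `window` tokens are always kept
    return mask


scores = torch.randn(4096, 8)        # toy scores for a 4096-token context
mask = keep_mask(scores, threshold=0.0, window=128)
print(mask.shape, bool(mask[-128:].all()))  # torch.Size([4096, 8]) True
```

Because the mask is per KV head, different heads can retain different subsets of tokens, which is what makes the per-head score output useful.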

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • PyTorch
  • Hugging Face Transformers (for base model inference and integration)
  • KVpress (integration layer)

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper
  • NVIDIA Volta

Preferred/Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

KVzap v1 (KVzap-Linear and KVzap-MLP), with separate checkpoints per supported base model family and size.

Training Dataset

Link: https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample

Data Modality

  • Text

Text Training Data Size: Less than a billion tokens

Data Collection Method by dataset: Automated

Labeling Method by dataset

  • Automated (targets computed from base-model attention and norms)

Properties: Text prompts used to generate hidden-state / score pairs. Supervision labels are derived automatically from the target base model’s attention behavior (KVzip+).

Testing Dataset

Link: https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample

Data Modality

  • Text

Data Collection Method by dataset: Automated

Labeling Method by dataset

  • Automated (targets computed from base-model attention and norms)

Properties: Validation split at the prompt level (subset-dependent); the validation set includes 45 prompts, corresponding to 22.5k held-out pairs per KV head.

Evaluation

Data Collection Method by dataset: Automated

Labeling Method by dataset: Automated

Properties: Evaluation was performed on the testing dataset, consisting of 22.5k pairs (X, y), where X is the hidden state and y is the KVzip+ score. The KVzap model was applied to the hidden states X to obtain predictions y_pred. On the testing dataset, the average Pearson R² between the KVzip+ scores y and the KVzap predictions y_pred ranged from 0.60 to 0.80, depending on the attention head.
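The per-head Pearson R² metric used above can be computed as the squared Pearson correlation between targets and predictions, evaluated independently for each KV head. This is a generic sketch of the metric, not the project's evaluation script; the function name and toy data are illustrative.

```python
import torch


def pearson_r2_per_head(y: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    """Squared Pearson correlation between targets and predictions, per KV head.

    y, y_pred: (N, H) tensors of KVzip+ scores and KVzap predictions.
    Returns an (H,) tensor with one R^2 value per head.
    """
    yc = y - y.mean(dim=0, keepdim=True)          # center targets per head
    pc = y_pred - y_pred.mean(dim=0, keepdim=True)  # center predictions per head
    r = (yc * pc).sum(dim=0) / (yc.norm(dim=0) * pc.norm(dim=0))
    return r ** 2


# Toy example: 22.5k pairs across 8 KV heads with partially correlated predictions
y = torch.randn(22500, 8)
y_pred = 0.8 * y + 0.2 * torch.randn_like(y)
print(pearson_r2_per_head(y, y_pred))
```

Because Pearson R² is invariant to affine rescaling of the predictions, it measures how well the predicted ranking of scores tracks the true KVzip+ scores, which is what matters for threshold-based pruning.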

Inference:

Acceleration Engine: PyTorch / CUDA; integration typically via KVpress and Hugging Face Transformers for base-model execution.

Test Hardware: NVIDIA GPUs (e.g., H100 used for experiments reported in the paper).

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.