KV Admission: Learning What to Write for Efficient Long-Context Inference

This repository contains the official weights for the Write-Gate MLP introduced in the paper KV Admission: Learning What to Write for Efficient Long-Context Inference.

Abstract

Long-context LLM inference is bottlenecked by quadratic attention complexity and linear KV cache growth. Prior approaches (KV Selection or Eviction) mitigate this post-hoc, but overlook the root inefficiency: indiscriminate writing to memory. We propose Write-Gated KV (WG-KV) to introduce a missing primitive: KV Admission. Instead of blindly persisting every token, WG-KV employs a lightweight, learnable mechanism to predict token utility before cache entry. By filtering out low-utility states early, WG-KV maintains a compact global cache alongside a sliding local cache. This reduces memory usage by 46-68% and delivers significant prefill and decode speedups, while maintaining compatibility with FlashAttention and Paged-KV systems.
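To make the admission idea concrete, here is a toy sketch in plain Python. It is not the Write-Gate MLP or the kernels from the official repository: the function name `admitted_positions`, the fixed `threshold`, and the hand-written utility scores are all assumptions made for illustration. It only shows the bookkeeping implied by the abstract, where recent tokens stay in a sliding local cache and older tokens persist in the global cache only if their predicted utility is high enough.

```python
def admitted_positions(utilities, threshold, local_window):
    """Return the token positions that remain cached (toy illustration).

    utilities    -- hypothetical per-token utility scores in [0, 1]
    threshold    -- assumed cut-off for admission to the global cache
    local_window -- number of most recent tokens always kept locally
    """
    n = len(utilities)
    local_start = max(0, n - local_window)
    kept = []
    for pos, utility in enumerate(utilities):
        in_local_window = pos >= local_start   # sliding local cache
        admitted = utility >= threshold        # written to the global cache
        if in_local_window or admitted:
            kept.append(pos)
    return kept


# Made-up scores for eight tokens; a real gate would predict these.
scores = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.7, 0.1]
kept = admitted_positions(scores, threshold=0.5, local_window=2)
print(kept)                         # [0, 3, 6, 7]
print(1 - len(kept) / len(scores))  # 0.5 (fraction of entries not stored)
```

In this toy run, positions 6 and 7 are kept because they fall inside the local window, while positions 0 and 3 are kept because their scores clear the threshold. The numbers are invented and say nothing about the savings reported in the paper.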

Resources

Usage

Using these checkpoints requires the environment and custom implementations provided in the official repository, including a modified version of the Transformers library and a custom vLLM fork for sparse prefill kernels. Please refer to the official installation guide.

After setting up the environment, you can run inference using the trained gate by specifying the checkpoint path:

python scripts/inference.py \
  --model_name meta-llama/Llama-3.1-8B-Instruct \
  --filtering_path weights/llama-3.1-8b-instruct-0.04.pt

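The checkpoints follow the naming convention {model_name}-{lambda}.pt, where lambda (λ) controls the trade-off between sparsity and accuracy. For example, llama-3.1-8b-instruct-0.04.pt is the gate for Llama-3.1-8B-Instruct trained with λ = 0.04.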
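If you script over several checkpoints, the naming convention can be split back into its two parts. The helper below is a minimal sketch and is not part of the official repository; `parse_checkpoint_name` is a hypothetical name, and it assumes the λ value is always the text after the last hyphen in the file name.

```python
from pathlib import Path


def parse_checkpoint_name(path):
    """Split '{model_name}-{lambda}.pt' into (model_name, lambda)."""
    stem = Path(path).stem                       # drops the directory and '.pt'
    model_name, _, lam = stem.rpartition("-")    # split on the last hyphen only
    return model_name, float(lam)


name, lam = parse_checkpoint_name("weights/llama-3.1-8b-instruct-0.04.pt")
print(name)  # llama-3.1-8b-instruct
print(lam)   # 0.04
```

Splitting on the last hyphen matters because the model name itself contains hyphens.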

Citation

@misc{wgkv,
   title={KV Admission: Learning What to Write for Efficient Long-Context Inference},
   author={Yen-Chieh Huang and Pi-Cheng Hsiu and Rui Fang and Ming-Syan Chen},
   year={2025},
   eprint={2512.17452},
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2512.17452}, 
}