TrimKV: Token Retention for Memory-Bounded Key-Value Eviction

This repository contains the weights for TRIM-KV-Qwen3-1.7B-Math, as presented in the paper Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs.

The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call token retention, and then decay this importance exponentially over time to mimic standard inference running with eviction.

The retention score is query-agnostic and captures the long-term utility of tokens. This is different from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step.

Quick Start

Installation

pip install trimkv

Usage

import torch
from trimkv.models.qwen3 import TrimKVQwen3ForCausalLM
from trimkv.cache_utils import PagedTrimKVCache
from transformers import AutoTokenizer

model_path = "ngocbh/TrimKV-Qwen3-1.7B-Math"

model = TrimKVQwen3ForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    load_trimkv_weights=True,
    use_cache=True,
    device_map="cuda",
)
model.config._attn_implementation = "flash_attention_2"

tokenizer = AutoTokenizer.from_pretrained(
    model.config.base_model, use_fast=True, padding_side="left"
)

# PagedTrimKVCache is the inference-time cache used by TRIM-KV. 
# It allocates a global pool of blocks and reassigns them to heads on the fly.
past_key_values = PagedTrimKVCache(
    num_layers=model.config.num_hidden_layers,
    num_heads=model.config.num_key_value_heads,
    max_seq_len=32768,
    memory_size=128,
    num_blocks_ratio=1.0,
    buffer_size=32,
    strategy="fixed_budget",
    device="cuda",
)

# Use model.generate as normal — pass past_key_values to enable TrimKV eviction.

Citation

@article{bui2025cache,
  title={Cache what lasts: Token retention for memory-bounded kv cache in llms},
  author={Bui, Ngoc and Sharma, Shubham and Lamba, Simran and Mishra, Saumitra and Ying, Rex},
  journal={arXiv preprint arXiv:2512.03324},
  year={2025}
}
@article{bui2025make,
  title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
  author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},
  journal={arXiv preprint arXiv:2512.03324},
  year={2025}
}
Downloads last month
36
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ngocbh/TrimKV-Qwen3-1.7B-Math

Finetuned
Qwen/Qwen3-1.7B
Finetuned
(752)
this model

Dataset used to train ngocbh/TrimKV-Qwen3-1.7B-Math

Collection including ngocbh/TrimKV-Qwen3-1.7B-Math

Papers for ngocbh/TrimKV-Qwen3-1.7B-Math