---
base_model:
- Qwen/Qwen3-4B
datasets:
- open-r1/OpenR1-Math-220k
license: apache-2.0
pipeline_tag: text-generation
tags:
- math
- dbtrimkv
- trimkv
- kv-cache
- compression
---

> DBTrimKV is the dynamic-budget variant of TrimKV: a single global KV budget is shared across layers and heads and reallocated on the fly, with the retention gate's final projection tied across layers.

This repository hosts the **DBTrimKV** retention-gate weights for `Qwen/Qwen3-4B` (32768-token training context, M = 128). The base-model weights are not included; they are loaded from `Qwen/Qwen3-4B` at runtime, and the retention-gate weights from `trimkv_weights.pth` are overlaid on top.

This model was introduced in the paper [Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction](https://huggingface.co/papers/2605.09649).

For the full list of released checkpoints, training recipes, and benchmark scripts, see the GitHub repository: **https://github.com/ngocbh/trimkv**.

## Quick start

To use this model, please install the `trimkv` library from the [GitHub repo](https://github.com/ngocbh/trimkv).

```python
import torch
from trimkv.models.qwen3 import TrimKVQwen3ForCausalLM
from trimkv.cache_utils import PagedTrimKVCache
from transformers import AutoTokenizer

# Load the base model and overlay the retention-gate weights from this repo.
model = TrimKVQwen3ForCausalLM.from_pretrained(
    "ngocbh/DBTrimKV-Qwen3-4B-Math",
    torch_dtype=torch.bfloat16,
    load_trimkv_weights=True,
    download_from="huggingface",
    use_cache=True,
    device_map="cuda",
)
model.config._attn_implementation = "flash_attention_2"

# The tokenizer comes from the base model recorded in the config.
tokenizer = AutoTokenizer.from_pretrained(
    model.config.base_model, use_fast=True, padding_side="left"
)

# KV cache with the same memory size (M = 128) used during training.
past_key_values = PagedTrimKVCache(
    num_layers=model.config.num_hidden_layers,
    num_heads=model.config.num_key_value_heads,
    max_seq_len=32768,
    memory_size=128,
    num_blocks_ratio=1.0,
    buffer_size=32,
    strategy="fixed_budget",
    device="cuda",
)

# Use as a normal HF model: pass `past_key_values=past_key_values` to .generate
```

See [`examples/test_qwen3.py`](https://github.com/ngocbh/trimkv/blob/main/examples/test_qwen3.py) in the GitHub repo for a full runnable example.

## Training details

- **Base model**: `Qwen/Qwen3-4B`
- **Variant**: **DBTrimKV** (`retention_gate=rg10`)
- **Training dataset**: `open-r1/OpenR1-Math-220k`
- **Training memory size M**: `128`
- **Training context length**: `32768`
- **Loss**: `fwkl_ntp`
- **Attention impl**: `rg_attn_flex`

## Citation

```bibtex
@article{bui2025cache,
  title={Cache what lasts: Token retention for memory-bounded kv cache in llms},
  author={Bui, Ngoc and Sharma, Shubham and Lamba, Simran and Mishra, Saumitra and Ying, Rex},
  journal={arXiv preprint arXiv:2512.03324},
  year={2025}
}

@article{bui2025make,
  title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
  author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},
  journal={arXiv preprint arXiv:2605.09649},
  year={2026}
}
```
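
## Rough cache-size estimate

For a sense of scale for the training settings listed under "Training details", the sketch below compares the KV-cache footprint of a full 32768-token context with a trimmed cache. It is a back-of-the-envelope calculation, not something taken from the `trimkv` library, and it rests on several assumptions: `kv_cache_bytes` is a hypothetical helper; the layer count (36), KV-head count (8), and head dimension (128) are what we believe the public `Qwen/Qwen3-4B` config reports (in practice, read them from `model.config`); values are stored in bfloat16 (2 bytes); and `M = 128` plus `buffer_size = 32` is treated as the average number of token slots per KV head. Since DBTrimKV reallocates a global budget, individual heads may hold more or fewer tokens than this average, and paging overhead is ignored.

```python
def kv_cache_bytes(
    tokens_per_head: int,
    num_layers: int = 36,      # assumed Qwen3-4B value; check model.config
    num_kv_heads: int = 8,     # assumed Qwen3-4B value; check model.config
    head_dim: int = 128,       # assumed Qwen3-4B value; check model.config
    bytes_per_elem: int = 2,   # bfloat16
) -> int:
    """Bytes needed for keys and values when every KV head in every layer
    stores `tokens_per_head` tokens (an average, under the assumptions above)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * tokens_per_head


full = kv_cache_bytes(32768)        # untrimmed cache at the training context length
trimmed = kv_cache_bytes(128 + 32)  # M = 128 retained tokens plus a 32-token buffer

print(f"full cache:    {full / 2**30:.2f} GiB")     # 4.50 GiB
print(f"trimmed cache: {trimmed / 2**20:.2f} MiB")  # 22.50 MiB
print(f"reduction:     {full / trimmed:.0f}x")      # 205x
```

These figures cover a single sequence; multiply by the batch size for batched generation.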