ngocbh/TrimKV-Qwen3-4B-Math

@@ -17,37 +17,15 @@ tags:
 > **TRIM-KV** is an efficient and learnable key–value eviction strategy designed to improve the efficiency of large language models (LLMs) in long-horizon inference.
-This model is a Qwen3-4B variant fine-tuned with TRIM-KV on the `OpenR1-Math-220k` dataset. It is based on the research paper [Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction](https://huggingface.co/papers/2605.09649).
-The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call *token retention*, and then decay this importance exponentially over time to mimic the standard inference running with eviction.
-The retention score is query-agnostic and captures the long-term utility of tokens. This is different from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step, making them local, myopic, and highly dependent on the transient decoding state.
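
The retention-then-decay idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the `trimkv` implementation: the `CacheEntry` class, the helper names, and the specific decay form `retention ** age` are assumptions made for this example, and the retention scores are hard-coded here rather than learned.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CacheEntry:
    position: int     # decoding step at which the key-value pair was created
    retention: float  # hypothetical retention score in (0, 1), fixed at creation


def decayed_score(entry: CacheEntry, step: int) -> float:
    """Importance of an entry at `step`, decayed exponentially with its age."""
    age = step - entry.position
    return entry.retention ** age


def evict_to_budget(cache: list[CacheEntry], step: int, budget: int) -> list[CacheEntry]:
    """Keep the `budget` entries with the highest decayed score, in position order."""
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(cache, key=lambda e: decayed_score(e, step), reverse=True)
    return sorted(ranked[:budget], key=lambda e: e.position)


def simulate(retentions: list[float], budget: int) -> list[int]:
    """Append one entry per step, evict down to `budget`, return surviving positions."""
    cache: list[CacheEntry] = []
    for step, retention in enumerate(retentions):
        cache.append(CacheEntry(position=step, retention=retention))
        cache = evict_to_budget(cache, step, budget)
    return [entry.position for entry in cache]


if __name__ == "__main__":
    print(simulate([0.99, 0.5, 0.9, 0.2, 0.95], budget=3))  # prints [0, 2, 4]
```

Note that the score depends only on the entry's own retention value and its age, never on the current query, which is the query-agnostic property the text contrasts with attention scores. Tokens with low retention (positions 1 and 3 in the example) are dropped once the budget is exceeded, while a high-retention token such as position 0 survives despite being the oldest.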
-- **Paper:** [Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction](https://huggingface.co/papers/2605.09649)
-- **Code:** [Official GitHub Repository](https://github.com/ngocbh/trimkv)
-### Why TRIM-KV?
-It's fast
-<div align="center">
-    <img width="1000" alt="teaser" src="https://github.com/ngocbh/trimkv/blob/main/assets/speed.png?raw=true"/>
-</div>
-It's smart
-<div align="center">
-    <img width="1000" alt="teaser" src="https://github.com/ngocbh/trimkv/blob/main/assets/performance.png?raw=true"/>
-</div>
-And it's interpretable
-<div align="center">
-    <img width="1000" alt="teaser" src="https://github.com/ngocbh/trimkv/blob/main/assets/eviction.png?raw=true"/>
-</div>
----
 ## Getting Started
@@ -56,9 +34,7 @@ And it's interpretable
 To use this model, you need to install the `trimkv` library from the [official repository](https://github.com/ngocbh/trimkv):
 ```sh
-git clone https://github.com/ngocbh/trimkv.git
-cd trimkv
-pip install -e .
 ```
 ### Quick Start
@@ -102,6 +78,12 @@ For a runnable end-to-end example, see [`examples/test_qwen3.py`](https://github
 ## Citation
 ```bibtex
 @article{bui2025make,
   title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
   author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},

 > **TRIM-KV** is an efficient and learnable key–value eviction strategy designed to improve the efficiency of large language models (LLMs) in long-horizon inference.
+This model is a Qwen3-4B variant fine-tuned with TRIM-KV on the `OpenR1-Math-220k` dataset.
+The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call *token retention*, and then decay this importance exponentially over time to mimic standard inference running with eviction.
+The retention score is query-agnostic and captures the long-term utility of tokens. This is different from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step.
+- **Paper:** [Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs](https://huggingface.co/papers/2605.09649)
+- **Code:** [GitHub - ngocbh/trimkv](https://github.com/ngocbh/trimkv)
+- **arXiv:** [2512.03324](https://arxiv.org/abs/2512.03324)
 ## Getting Started
 To use this model, you need to install the `trimkv` library from the [official repository](https://github.com/ngocbh/trimkv):
 ```sh
+pip install trimkv
 ```
 ### Quick Start
 ## Citation
 ```bibtex
+@article{bui2025cache,
+  title={Cache What Lasts: Token Retention for Memory-Bounded {KV} Cache in {LLMs}},
+  author={Bui, Ngoc and Sharma, Shubham and Lamba, Simran and Mishra, Saumitra and Ying, Rex},
+  journal={arXiv preprint arXiv:2512.03324},
+  year={2025}
+}
 @article{bui2025make,
   title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
   author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},