Add pipeline tag and paper link to model card
Hi! I'm Niels from the community science team at Hugging Face.
I've opened this PR to improve the model card for this repository. Specifically, I've:
- Added the `text-generation` pipeline tag to the metadata to improve discoverability on the Hub.
- Added a link to the original research paper: [Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction](https://huggingface.co/papers/2605.09649).
- Formatted the authors' information and linked the official GitHub repository.
This helps users understand the context and technical foundations of the model.
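For readers new to the method, the token-retention idea described in the model card can be illustrated with a toy example. The sketch below is not the TRIM-KV implementation: the function names, the `score ** age` decay form, and the sample scores are assumptions chosen for illustration, and in the real model the retention scores are learned rather than hard-coded.

```python
from typing import List, Tuple

# Hypothetical cache entry: (creation_step, retention_score in (0, 1)).
Entry = Tuple[int, float]


def decayed_retention(score: float, age: int) -> float:
    """Exponentially decayed importance (assumed form: score ** age)."""
    return score ** age


def evict_to_budget(cache: List[Entry], step: int, budget: int) -> List[Entry]:
    """Keep the `budget` entries with the highest decayed retention at `step`."""
    ranked = sorted(
        cache,
        key=lambda entry: decayed_retention(entry[1], step - entry[0]),
        reverse=True,
    )
    return sorted(ranked[:budget])  # restore creation order


# Illustrative scores only; no query is involved in the ranking.
cache = [(0, 0.99), (1, 0.50), (2, 0.90), (3, 0.60)]
kept = evict_to_budget(cache, step=4, budget=2)
print(kept)
```

In this toy run the oldest entry survives because its high score decays slowly, while the most recent entry is evicted despite its age of one step. The ranking depends only on each entry's score and age, which is what makes it query-agnostic, unlike attention scores.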
README.md
CHANGED
@@ -1,9 +1,10 @@
 ---
-license: apache-2.0
-datasets:
-- open-r1/OpenR1-Math-220k
 base_model:
 - Qwen/Qwen3-14B
 tags:
 - math
 - trimkv
@@ -12,15 +13,18 @@ tags:
 - Compression
 ---
 
-
 
 The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call *token retention*, and then decay this importance exponentially over time to mimic the standard inference running with eviction.
 
 The retention score is query-agnostic and captures the long-term utility of tokens. This is different from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step, making them local, myopic, and highly dependent on the transient decoding state.
 
-
 <a href="https://arxiv.org/pdf/2512.03324"><img src="https://img.shields.io/badge/arxiv-2512.03324-red?style=for-the-badge"></a>
 
 
 ### Why TRIM-KV?
 
@@ -62,8 +66,6 @@ And it's interpretable
 pip install -r requirements.txt
 ```
 
-This is a minimal set of requirements for training purposes. Additional dependencies may be needed for running specific experiments. We provided a full example of the environment used in our experiments in [`examples/env.yaml`](examples/env.yaml).
-
 ### Installation
 
 From the root of the repo:
@@ -72,7 +74,7 @@ From the root of the repo:
 git clone https://github.com/ngocbh/trimkv.git
 cd trimkv
 pip install -e .
-```
 
 ---
 
@@ -84,7 +86,7 @@ from trimkv.models.qwen3 import TrimKVQwen3ForCausalLM
 from trimkv.cache_utils import TrimKVCache
 from transformers import AutoTokenizer
 
-model_path = "
 download_from = "huggingface" # options: "wandb", "local", "huggingface"
 
 model = TrimKVQwen3ForCausalLM.from_pretrained(
@@ -112,7 +114,7 @@ tokenizer = AutoTokenizer.from_pretrained(
 # Note: TRIM-KV uses TrimKVCache under the hood. So please pass TrimKVCache to model.generate
 ```
 
-For a runnable end-to-end example, see [`examples/test_qwen3.py`](examples/test_qwen3.py).
 
 ## Released Models
 
@@ -126,4 +128,13 @@ For a runnable end-to-end example, see [`examples/test_qwen3.py`](examples/test_
 | Phi-3-mini-128k-instruct | [TrimKV-Phi-3-mini-128k-instruct](https://huggingface.co/ngocbh/TrimKV-Phi-3-mini-128k-instruct) | LongAlpaca | 128K | 512 |
 | DeepSeek-R1-Distill-Llama-8B | [TrimKV-DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/ngocbh/TrimKV-DeepSeek-R1-Distill-Llama-8B) | OpenR1-Math-220k | 32K | 256 |
 
-
@@ -1,9 +1,10 @@
 ---
 base_model:
 - Qwen/Qwen3-14B
+datasets:
+- open-r1/OpenR1-Math-220k
+license: apache-2.0
+pipeline_tag: text-generation
 tags:
 - math
 - trimkv
@@ -12,15 +13,18 @@ tags:
 - Compression
 ---
 
+# TrimKV: Token Retention for Memory-Bounded Key-Value Eviction
+
+TRIM-KV is an efficient and learnable key–value eviction strategy designed to improve the efficiency of large language models (LLMs) in long-horizon inference. It was introduced in the paper [Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction](https://huggingface.co/papers/2605.09649) by Ngoc Bui, Hieu Trung Nguyen, Arman Cohan, and Rex Ying.
 
 The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call *token retention*, and then decay this importance exponentially over time to mimic the standard inference running with eviction.
 
 The retention score is query-agnostic and captures the long-term utility of tokens. This is different from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step, making them local, myopic, and highly dependent on the transient decoding state.
 
 <a href="https://arxiv.org/pdf/2512.03324"><img src="https://img.shields.io/badge/arxiv-2512.03324-red?style=for-the-badge"></a>
 
+- **Official Code:** [GitHub - ngocbh/trimkv](https://github.com/ngocbh/trimkv)
+- **Paper:** [https://huggingface.co/papers/2605.09649](https://huggingface.co/papers/2605.09649)
 
 ### Why TRIM-KV?
 
@@ -62,8 +66,6 @@ And it's interpretable
 pip install -r requirements.txt
 ```
 
 ### Installation
 
 From the root of the repo:
@@ -72,7 +74,7 @@ From the root of the repo:
 git clone https://github.com/ngocbh/trimkv.git
 cd trimkv
 pip install -e .
+```
 
 ---
 
@@ -84,7 +86,7 @@ from trimkv.models.qwen3 import TrimKVQwen3ForCausalLM
 from trimkv.cache_utils import TrimKVCache
 from transformers import AutoTokenizer
 
+model_path = "ngocbh/TrimKV-Qwen3-14B-Math"
 download_from = "huggingface" # options: "wandb", "local", "huggingface"
 
 model = TrimKVQwen3ForCausalLM.from_pretrained(
@@ -112,7 +114,7 @@ tokenizer = AutoTokenizer.from_pretrained(
 # Note: TRIM-KV uses TrimKVCache under the hood. So please pass TrimKVCache to model.generate
 ```
 
+For a runnable end-to-end example, see [`examples/test_qwen3.py`](https://github.com/ngocbh/trimkv/blob/main/examples/test_qwen3.py).
 
 ## Released Models
 
@@ -126,4 +128,13 @@ For a runnable end-to-end example, see [`examples/test_qwen3.py`](examples/test_
 | Phi-3-mini-128k-instruct | [TrimKV-Phi-3-mini-128k-instruct](https://huggingface.co/ngocbh/TrimKV-Phi-3-mini-128k-instruct) | LongAlpaca | 128K | 512 |
 | DeepSeek-R1-Distill-Llama-8B | [TrimKV-DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/ngocbh/TrimKV-DeepSeek-R1-Distill-Llama-8B) | OpenR1-Math-220k | 32K | 256 |
 
+## Citation
+
+```bibtex
+@article{bui2025make,
+title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
+author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},
+journal={arXiv preprint arXiv:2512.03324},
+year={2025}
+}
+```