Add pipeline tag, paper link, and improve model card documentation

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +26 -42
README.md CHANGED
@@ -1,9 +1,10 @@
1
  ---
2
- license: apache-2.0
3
- datasets:
4
- - open-r1/OpenR1-Math-220k
5
  base_model:
6
  - Qwen/Qwen3-4B
 
 
 
 
7
  tags:
8
  - math
9
  - trimkv
@@ -12,13 +13,18 @@ tags:
12
  - Compression
13
  ---
14
 
15
- > TRIM-KV is an efficient and learnable key–value eviction strategy designed to improve the efficiency of large language models (LLMs) in long-horizon inference.
 
 
 
 
16
 
17
  The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call *token retention*, and then decay this importance exponentially over time to mimic the standard inference running with eviction.
18
 
19
  The retention score is query-agnostic and captures the long-term utility of tokens. This is different from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step, making them local, myopic, and highly dependent on the transient decoding state.
20
 
21
- <a href="https://arxiv.org/pdf/2512.03324"><img src="https://img.shields.io/badge/arxiv-2512.03324-red?style=for-the-badge"></a>
 
22
 
23
  ### Why TRIM-KV?
24
 
@@ -41,40 +47,21 @@ And it's interpretable
41
  <img width="1000" alt="teaser" src="https://github.com/ngocbh/trimkv/blob/main/assets/eviction.png?raw=true"/>
42
  </div>
43
 
44
- <div align="center">
45
- <img width="1000" alt="teaser" src="https://github.com/ngocbh/trimkv/blob/main/assets/vis.png?raw=true"/>
46
- </div>
47
-
48
  ---
49
 
50
  ## Getting Started
51
 
52
- ### Requirements
53
-
54
- - Python 3.11 or higher (tested with 3.12)
55
- - PyTorch 2.7.0 or higher (tested with 2.8.0)
56
- - FlashAttention 2.7.2.post1 or higher (tested with 2.8.0)
57
- - Transformers 4.57.1
58
-
59
- ```sh
60
- pip install -r requirements.txt
61
- ```
62
-
63
- This is a minimal set of requirements for training purposes. Additional dependencies may be needed for running specific experiments. We provided a full example of the environment used in our experiments in [`examples/env.yaml`](examples/env.yaml).
64
-
65
  ### Installation
66
 
67
- From the root of the repo:
68
 
69
  ```sh
70
  git clone https://github.com/ngocbh/trimkv.git
71
  cd trimkv
72
  pip install -e .
73
- ````
74
-
75
- ---
76
 
77
- ## Quick Start
78
 
79
  ```python
80
  import torch
@@ -82,7 +69,7 @@ from trimkv.models.qwen3 import TrimKVQwen3ForCausalLM
82
  from trimkv.cache_utils import TrimKVCache
83
  from transformers import AutoTokenizer
84
 
85
- model_path = "<TrimKV model_path here>"
86
  download_from = "huggingface" # options: "wandb", "local", "huggingface"
87
 
88
  model = TrimKVQwen3ForCausalLM.from_pretrained(
@@ -101,7 +88,7 @@ model.config.memory_size = 512
101
  model.config.buffer_size = 128
102
 
103
  tokenizer = AutoTokenizer.from_pretrained(
104
- model.config.base_model,
105
  use_fast=True,
106
  padding_side="left",
107
  )
@@ -110,18 +97,15 @@ tokenizer = AutoTokenizer.from_pretrained(
110
  # Note: TRIM-KV uses TrimKVCache under the hood. So please pass TrimKVCache to model.generate
111
  ```
112
 
113
- For a runnable end-to-end example, see [`examples/test_qwen3.py`](examples/test_qwen3.py).
114
-
115
- ## Released Models
116
 
117
- | Base Model | TRIM-KV Checkpoints | Training Datasets | Training Context Len | Training $M$ |
118
- |------------------------------|-----------------------------------------------|--------------------------|-------------------------|--------------|
119
- | Qwen3-1.7B | [TRIM-KV-Qwen3-1.7B-Math](https://huggingface.co/ngocbh/TrimKV-Qwen3-1.7B-Math) | OpenR1-Math-220k | 16K | 256 |
120
- | Qwen3-4B | [TRIM-KV-Qwen3-4B-Math](https://huggingface.co/ngocbh/TrimKV-Qwen3-4B-Math) | OpenR1-Math-220k | 16K | 256 |
121
- | Qwen3-8B | [TRIM-KV-Qwen3-8B-Math](https://huggingface.co/ngocbh/TrimKV-Qwen3-8B-Math) | OpenR1-Math-220k | 16K | 256 |
122
- | Qwen3-14B | [TRIM-KV-Qwen3-14B-Math](https://huggingface.co/ngocbh/TrimKV-Qwen3-14B-Math) | OpenR1-Math-220k | 16K | 256 |
123
- | Qwen3-4B-Instruct-2507 | [TrimKV-Qwen3-4B-Instruct-2507](https://huggingface.co/ngocbh/TrimKV-Qwen3-4B-Instruct-2507) | Synth-Long, BookSum, Buddhi | 128K | 1024 |
124
- | Phi-3-mini-128k-instruct | [TrimKV-Phi-3-mini-128k-instruct](https://huggingface.co/ngocbh/TrimKV-Phi-3-mini-128k-instruct) | LongAlpaca | 128K | 512 |
125
- | DeepSeek-R1-Distill-Llama-8B | [TrimKV-DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/ngocbh/TrimKV-DeepSeek-R1-Distill-Llama-8B) | OpenR1-Math-220k | 32K | 256 |
126
 
127
- ---
 
 
 
 
 
 
 
 
1
  ---
 
 
 
2
  base_model:
3
  - Qwen/Qwen3-4B
4
+ datasets:
5
+ - open-r1/OpenR1-Math-220k
6
+ license: apache-2.0
7
+ pipeline_tag: text-generation
8
  tags:
9
  - math
10
  - trimkv
 
13
  - Compression
14
  ---
15
 
16
+ # TrimKV-Qwen3-4B-Math
17
+
18
+ > **TRIM-KV** is an efficient and learnable key–value eviction strategy designed to improve the efficiency of large language models (LLMs) in long-horizon inference.
19
+
20
+ This model is a Qwen3-4B variant fine-tuned with TRIM-KV on the `OpenR1-Math-220k` dataset. It is based on the research paper [Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction](https://huggingface.co/papers/2605.09649).
21
 
22
  The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call *token retention*, and then decay this importance exponentially over time to mimic the standard inference running with eviction.
23
 
24
  The retention score is query-agnostic and captures the long-term utility of tokens. This is different from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step, making them local, myopic, and highly dependent on the transient decoding state.
25
 
26
+ - **Paper:** [Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction](https://huggingface.co/papers/2605.09649)
27
+ - **Code:** [Official GitHub Repository](https://github.com/ngocbh/trimkv)
28
 
29
  ### Why TRIM-KV?
30
 
 
47
  <img width="1000" alt="teaser" src="https://github.com/ngocbh/trimkv/blob/main/assets/eviction.png?raw=true"/>
48
  </div>
49
 
 
 
 
 
50
  ---
51
 
52
  ## Getting Started
53
 
 
 
 
 
 
 
 
 
 
 
 
 
 
54
  ### Installation
55
 
56
+ To use this model, you need to install the `trimkv` library from the [official repository](https://github.com/ngocbh/trimkv):
57
 
58
  ```sh
59
  git clone https://github.com/ngocbh/trimkv.git
60
  cd trimkv
61
  pip install -e .
62
+ ```
 
 
63
 
64
+ ### Quick Start
65
 
66
  ```python
67
  import torch
 
69
  from trimkv.cache_utils import TrimKVCache
70
  from transformers import AutoTokenizer
71
 
72
+ model_path = "ngocbh/TrimKV-Qwen3-4B-Math"
73
  download_from = "huggingface" # options: "wandb", "local", "huggingface"
74
 
75
  model = TrimKVQwen3ForCausalLM.from_pretrained(
 
88
  model.config.buffer_size = 128
89
 
90
  tokenizer = AutoTokenizer.from_pretrained(
91
+ "Qwen/Qwen3-4B",
92
  use_fast=True,
93
  padding_side="left",
94
  )
 
97
  # Note: TRIM-KV uses TrimKVCache under the hood. So please pass TrimKVCache to model.generate
98
  ```
99
 
100
+ For a runnable end-to-end example, see [`examples/test_qwen3.py`](https://github.com/ngocbh/trimkv/blob/main/examples/test_qwen3.py).
 
 
101
 
102
+ ## Citation
 
 
 
 
 
 
 
 
103
 
104
+ ```bibtex
105
+ @article{bui2025make,
106
+ title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
107
+ author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},
108
+ journal={arXiv preprint arXiv:2512.03324},
109
+ year={2025}
110
+ }
111
+ ```
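
The model card in this diff describes token retention only in prose: each key-value pair gets a query-agnostic score at creation time, and that score decays exponentially as decoding proceeds. The sketch below is a toy illustration of that idea, not the `trimkv` implementation. The `Entry` class, the `retention ** age` decay form, and the keep-the-top-`memory_size` eviction rule are all assumptions made for illustration; only the name `memory_size` is borrowed from the Quick Start config.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    position: int     # decoding step at which the key-value pair was created
    retention: float  # query-agnostic score in (0, 1], fixed at creation (hypothetical)


def decayed_importance(entry: Entry, step: int) -> float:
    """Importance at `step`, assuming decay of the form retention ** age."""
    return entry.retention ** (step - entry.position)


def evict_to_budget(cache: list[Entry], step: int, memory_size: int) -> list[Entry]:
    """Keep the `memory_size` entries with the highest decayed importance."""
    if len(cache) <= memory_size:
        return list(cache)
    ranked = sorted(cache, key=lambda e: decayed_importance(e, step), reverse=True)
    # Restore positional order so the surviving entries stay in sequence.
    return sorted(ranked[:memory_size], key=lambda e: e.position)


def simulate(retentions: list[float], memory_size: int) -> list[int]:
    """Append one entry per step, evict down to the budget, return surviving positions."""
    cache: list[Entry] = []
    for step, score in enumerate(retentions):
        cache.append(Entry(position=step, retention=score))
        cache = evict_to_budget(cache, step, memory_size)
    return [e.position for e in cache]


if __name__ == "__main__":
    # Tokens with high retention survive even when old; low-retention tokens fade quickly.
    print(simulate([0.99, 0.5, 0.9, 0.2, 0.95], memory_size=3))
```

Because the score is fixed at creation and the decay depends only on age, ranking entries never requires the current query. That is the contrast with attention-score-based eviction that the model card draws.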