Add pipeline tag and link to paper
#1
by nielsr HF Staff - opened
README.md
CHANGED
@@ -1,9 +1,10 @@
 ---
-license: apache-2.0
-datasets:
-- open-r1/OpenR1-Math-220k
 base_model:
 - Qwen/Qwen3-4B
+datasets:
+- open-r1/OpenR1-Math-220k
+license: apache-2.0
+pipeline_tag: text-generation
 tags:
 - math
 - dbtrimkv
@@ -16,12 +17,16 @@ tags:

 This repository hosts the **DBTrimKV** retention-gate weights for `Qwen/Qwen3-4B` (32768-token training context, M = 128). The base-model weights are not included — they are loaded from `Qwen/Qwen3-4B` at runtime and the retention-gate weights from `trimkv_weights.pth` are overlaid on top.

+This model was introduced in the paper [Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction](https://huggingface.co/papers/2605.09649).
+
 <a href="https://arxiv.org/pdf/2512.03324"><img src="https://img.shields.io/badge/arxiv-2512.03324-red?style=for-the-badge"></a>

 For the full list of released checkpoints, training recipes, and benchmark scripts, see the GitHub repository: **https://github.com/ngocbh/trimkv**.

 ## Quick start

+To use this model, please install the `trimkv` library from the [GitHub repo](https://github.com/ngocbh/trimkv).
+
 ```python
 import torch
 from trimkv.models.qwen3 import TrimKVQwen3ForCausalLM
@@ -60,14 +65,21 @@ See [`examples/test_qwen3.py`](https://github.com/ngocbh/trimkv/blob/main/exampl

 ## Training details

-- Base model: `Qwen/Qwen3-4B`
-- Variant: **DBTrimKV** (`retention_gate=rg10`)
-- Training dataset: open-r1/OpenR1-Math-220k
-- Training memory size M: `128`
-- Training context length: `32768`
-- Loss: `fwkl_ntp`
-- Attention impl: `rg_attn_flex`
+- **Base model**: `Qwen/Qwen3-4B`
+- **Variant**: **DBTrimKV** (`retention_gate=rg10`)
+- **Training dataset**: `open-r1/OpenR1-Math-220k`
+- **Training memory size M**: `128`
+- **Training context length**: `32768`
+- **Loss**: `fwkl_ntp`
+- **Attention impl**: `rg_attn_flex`

 ## Citation

-
+```bibtex
+@article{bui2025make,
+title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
+author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},
+journal={arXiv preprint arXiv:2512.03324},
+year={2025}
+}
+```