bys0318 committed on
Commit 4ea7d1a · verified · 1 Parent(s): 8ce593e

Create README.md

Files changed (1)
  1. README.md +64 -0
README.md ADDED
@@ -0,0 +1,64 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - agentica-org/DeepScaleR-Preview-Dataset
+ base_model:
+ - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
+ tags:
+ - reinforcement-learning
+ language:
+ - en
+ - zh
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+ <p align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64ed568ccf6118a9379a61b8/BHITqJU33sXqf-Jbytrxg.png" width="100"/>
+ <b><span style="font-size:28px">SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression</span></b>
+ </p>
+
+
+
+ <p align="center">
+ 📃 <a href="https://arxiv.org" target="_blank">Paper</a> • 📝 <a href="https://arxiv.org" target="_blank">Wandb</a>
+ </p>
+
+ ---
+
+ ## 🔍 Overview
+
+ **SIRI (Scaling Iterative Reinforcement Learning with Interleaved Compression)** is a reinforcement-learning-based framework designed to improve both the efficiency and the accuracy of **Large Reasoning Models (LRMs)**.
+
+ Traditional RL training often leads to **overthinking**: long, redundant reasoning traces. Prior methods that compress outputs (length penalties, pruning, or skipping thought tokens) improve efficiency but hurt accuracy.
+
+ SIRI addresses this trade-off by **iteratively alternating between compression and expansion of the reasoning budget** (the maximum rollout length), controlled by a cosine length scheduler. This balances concise reasoning against long-horizon exploration.
+
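The cosine length scheduler described in the Overview can be pictured with a minimal Python sketch. It is not taken from the SIRI code: the function name `cosine_length_budget`, its parameters, the numeric values, and the choice to start each cycle at the largest budget are all assumptions made for illustration.

```python
import math


def cosine_length_budget(step: int, period: int, min_len: int, max_len: int) -> int:
    """Return a hypothetical maximum rollout length (in tokens) for a training step.

    Within each cycle of `period` steps the budget starts at `max_len`,
    shrinks to `min_len` at mid-cycle (compression), and grows back to
    `max_len` by the end of the cycle (expansion).
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    phase = 2.0 * math.pi * (step % period) / period
    weight = 0.5 * (1.0 + math.cos(phase))  # 1 at cycle start, 0 at mid-cycle
    return int(round(min_len + (max_len - min_len) * weight))


# Illustrative values only; the README does not state the actual schedule.
for step in (0, 25, 50, 75, 100):
    print(step, cosine_length_budget(step, period=100, min_len=8000, max_len=16000))
```

A trainer following this idea would truncate rollouts at the returned budget for the current step, so the model alternates between tight and generous length limits across cycles.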
+ <p align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64ed568ccf6118a9379a61b8/SXow6xntEgrwhvWtzvrkE.png" alt="pareto_front" width="500"/>
+ </p>
+
+ ---
+
+ ## 🚀 Key Features
+
+ - **Interleaved Compression–Expansion**:
+   - *Compression phase*: forces concise, high-density reasoning by limiting rollout length.
+   - *Expansion phase*: restores longer rollouts to encourage exploration and planning.
+ - **Token Efficiency without Accuracy Loss**: Unlike previous compression methods, SIRI improves accuracy *while reducing average token usage*.
+ - **Iterative RL Training**: Built on GRPO with modifications from DAPO (decoupled clip-high/clip-low ranges, KL penalty removed).
+ - **Generalization Across Model Sizes**: Validated on both **1.5B** and **7B** models.
+
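To make the "GRPO with modifications from DAPO" item more concrete, here is a rough, dependency-free Python sketch of the two ingredients it names: group-relative advantages and a clipped surrogate with separate lower and upper clip ranges and no KL term. The function names, the use of the population standard deviation, and the default values `eps_low=0.2` and `eps_high=0.28` are assumptions for illustration and are not taken from the SIRI training code.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token surrogate with decoupled clip ranges and no KL penalty term."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio pi_new / pi_old
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (PPO-style) choice between the clipped and unclipped terms.
    return min(ratio * advantage, clipped * advantage)


# Hypothetical group of four rollouts: two correct (reward 1), two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(clipped_token_objective(logp_new=0.5, logp_old=0.0, advantage=advantages[0]))
```

Using a larger upper range than lower range lets low-probability tokens with positive advantage grow more before clipping takes effect, which is the usual motivation given for decoupling the two bounds.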
+ ---
+
+ ## 📊 Benchmarks
+
+
+ ![perf](https://cdn-uploads.huggingface.co/production/uploads/64ed568ccf6118a9379a61b8/0S2d9VZTiaoGI6_N9Vrh2.png)
+
+
+ ---
+
+ ## 📝 Citation
+
+ ```bibtex