Add metadata (license, library, pipeline tag) and update paper/code links

#1 by nielsr (HF Staff) - opened

Files changed (1): README.md
---
pipeline_tag: text-generation
library_name: transformers
license: apache-2.0
---

<!-- <p align="center" width="100%">
<img src="./docs/static/images/logo_resize.png" width="80%">
</p> -->

<div align="center">
<h1 align="center"> SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
</h1>
</div>

<p align="center">
<img src="assets/mainprocess.png">
</p>
- **Authors**: [Zhi Zheng](https://zz1358m.github.io/zhizheng.github.io/), [Wee Sun Lee](https://scholar.google.com/citations?user=8PCrLgwAAAAJ&hl=en)
- **Institute**: School of Computing, National University of Singapore, Singapore
- **Resources**: [📖[Paper](https://huggingface.co/papers/2511.06411)] [[💻Code](https://github.com/zz1358m/SofT-GRPO-master)] [[🏠Twitter](https://x.com/zhengzhi20/status/1988071729789628731)] [[🤗Huggingface](https://huggingface.co/zz1358m/SofT-GRPO-master)]

## 📧 Feedback welcome

We greatly appreciate feedback and questions regarding the current status of this work.

Please feel free to contact Zhi Zheng at [zhi.zheng@u.nus.edu](mailto:zhi.zheng@u.nus.edu).

## 💡 Highlights

- 🔥 **The First Powerful RLVR Algorithm for Soft-Thinking Reasoning:** We introduce **SofT-GRPO**, a novel policy optimization algorithm designed to reinforce the soft-thinking reasoning paradigm in LLMs.

- ⚙️ **Gumbel-Softmax Noise in Rollout:** SofT-GRPO injects Gumbel-Softmax noise into the group rollout process, actively obtaining diverse yet valid soft-thinking reasoning paths.

- ⚙️ **Gumbel Reparameterization:** We propose a gradient estimation approach via the Gumbel reparameterization trick, enabling precise attribution of reward improvements to the LLM's output probability distributions during policy optimization.

- 📝 **Comprehensive Experiments and High Effectiveness:** We conduct comprehensive experiments across LLMs of 1.5B–7B parameters on five benchmarks, demonstrating that SofT-GRPO consistently outperforms discrete-token GRPO baselines, especially at larger sampling budgets (Pass@16 and Pass@32). SofT-GRPO also improves the out-of-domain generalization ability of LLMs.

- 🔥 **Showing the Prospects of Soft-Thinking:** Can soft-thinking be the answer for better effectiveness?
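
The two Gumbel bullets above can be sketched in a few lines. The following is an illustrative toy example of Gumbel-Softmax sampling over a vocabulary distribution, not the repository's implementation; the vocabulary size, temperature, and embedding table are made up.

```python
import torch

torch.manual_seed(0)

vocab, dim = 8, 4
logits = torch.randn(1, vocab, requires_grad=True)  # hypothetical next-token logits
tau = 0.5                                           # temperature: lower -> closer to one-hot

# Reparameterized sample: g_i ~ Gumbel(0, 1), y = softmax((logits + g) / tau).
# The noise g is independent of logits, so gradients flow through y back to logits.
g = -torch.log(-torch.log(torch.rand(1, vocab)))
y = torch.softmax((logits + g) / tau, dim=-1)

embedding = torch.randn(vocab, dim)  # hypothetical token-embedding table
soft_token = y @ embedding           # probability-weighted mix of token embeddings

soft_token.sum().backward()
assert y.shape == (1, vocab) and soft_token.shape == (1, dim)
assert logits.grad is not None       # differentiable w.r.t. the policy's logits
```

Lowering `tau` pushes `y` toward a one-hot vector, recovering discrete sampling; the actual rollout and gradient estimator live in the modified SGLang/verl code in this repository.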

## 📜 News

**[2025/9/24]** [Code](https://github.com/zz1358m/SofT-GRPO-master), [Weights](https://huggingface.co/zz1358m/SofT-GRPO-master), and [Paper](https://arxiv.org/abs/2511.06411) are released!

## 👨‍💻 Todo

- [x] SGLang & verl code modifications (e.g., activating overlap for efficiency).

## 🛠️ Usage

### 1. Clone the repository
```bash
git clone https://github.com/zz1358m/SofT-GRPO-master
cd SofT-GRPO-master
```

### 2. Install dependencies
##### Option 1: For inference only
```bash
conda create -n soft_grpo python=3.11.13 -y && conda activate soft_grpo
pip install pip==25.2
pip install torch==2.6.0 transformers==4.51.1 tensorboard==2.20.0 sgl_kernel==0.1.1 accelerate==1.10.1 torch_memory_saver==0.0.8 uvloop==0.21.0 jsonlines math_verify openai
pip install flash_attn==2.7.3 --no-build-isolation  # building may take ~20 min; if you hit an "undefined symbol" error, retry this command or download a prebuilt wheel from the official flash-attention GitHub

cd Soft-Thinking+noise+loss-main/sglang_soft_thinking_pkg
pip install -e "python[all]"
cd ../..
```

##### Option 2: For inference & SofT-GRPO fine-tuning

After completing Option 1, build verl-0.4.x:
```bash
cd verl-0.4.x
pip3 install -e .
cd ..
```

Alternatively, install from the requirements file (not recommended):
```bash
pip install -r requirements.txt
```


---

### 3. Evaluating SofT-GRPO fine-tuned LLMs with the soft-thinking pattern

#### Step 1: Download the SofT-GRPO and GRPO weights from [[🤗Huggingface](https://huggingface.co/zz1358m/SofT-GRPO-master)]

#### Step 2: Evaluate GRPO under the discrete-token CoT pattern.
```bash
./Soft-Thinking+noise+loss-main/run_sample_discrete-token_grpo.sh
```

#### Step 3: Evaluate GRPO under the soft-thinking reasoning pattern.
```bash
./Soft-Thinking+noise+loss-main/run_sample_gumbel_grpo.sh
```

#### Step 4: Evaluate SofT-GRPO under the soft-thinking reasoning pattern.
```bash
./Soft-Thinking+noise+loss-main/run_sample_gumbel.sh
```
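
These evaluation scripts report Pass@k metrics (e.g., Pass@16, Pass@32). For reference, below is a minimal sketch of the standard unbiased Pass@k estimator computed from `n` samples per problem with `c` correct; whether the scripts compute it exactly this way is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so every draw of k contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 32 samples per problem and 8 correct, a single draw succeeds a quarter of the time:
print(pass_at_k(32, 8, 1))   # 0.25
print(pass_at_k(32, 8, 16))  # much higher than Pass@1
```

Averaging this quantity over all problems gives the benchmark-level Pass@k.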

---

### 4. Training with SofT-GRPO

#### Option 1: Train SofT-GRPO on DeepSeek-R1-Distill-Qwen-1.5B
```bash
./SofT-GRPO-deepscaler-8k.sh  # adjust the LLM path and dataset path accordingly
```

#### Option 2: Train SofT-GRPO on DeepSeek-R1-Distill-Qwen-7B
```bash
./SofT-GRPO-deepscaler-8k-qwen7.sh  # adjust the LLM path and dataset path accordingly
```

#### Option 3: Train SofT-GRPO on Llama-3.2-3B-Instruct
```bash
./SofT-GRPO-deepscaler-8k-llama3.sh  # adjust the LLM path and dataset path accordingly
```

---

### Alternative: Using Docker

For a quick and reproducible setup, a pre-built Docker image is also available on GitHub Packages. You can pull and run it with the following commands:

#### Step 1: Pull the pre-built image from GitHub Packages

```bash
docker pull ghcr.io/kuangrepi/soft-grpo:latest
```

#### Step 2: Run the container with GPU access and an interactive shell

```bash
docker run --gpus all -it --rm ghcr.io/kuangrepi/soft-grpo:latest
```

## ✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and a citation 📝:

```bibtex
@misc{zheng2025softgrposurpassingdiscretetokenllm,
  title={SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization},
  author={Zhi Zheng and Wee Sun Lee},
  year={2025},
  eprint={2511.06411},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2511.06411},
}
```

## ❤️ Acknowledgments

- [Soft-Thinking](https://github.com/eric-ai-lab/Soft-Thinking): The codebase we built upon. Thanks for their wonderful work.
- [verl-0.4.x](https://github.com/volcengine/verl/tree/v0.4.x): Our work is based on this codebase as well.
- [SIM-CoT](https://github.com/InternLM/SIM-CoT): We use their template for this README!
- [Yu Gu](https://github.com/kuangrepi): Undergraduate student from Nanjing University who volunteered to help with code re-organization!