zz1358m committed
Commit a54c8b6 · verified · 1 parent: a58dbfa

Upload 2 files

Files changed (2):
  1. README.md +135 -1
  2. requirements.txt +206 -0
README.md CHANGED
@@ -1,3 +1,137 @@
  ---
- license: mit
  ---
+ <!-- <p align="center" width="100%">
+ <img src="./docs/static/images/logo_resize.png" width="80%">
+ </p> -->
+
+ <div align="center">
+ <h1 align="center"> SofT-GRPO: Reinforcing the LLM Soft-Thinking Policy with Gumbel Reparameterization
+ </h1>
+ </div>
+
+ <p align="center">
+ <img src="assets/mainprocess.png">
+ </p>
+
+
+ - **Authors**: [Zhi Zheng](https://zz1358m.github.io/zhizheng.github.io/), [Wee Sun Lee](https://scholar.google.com/citations?user=8PCrLgwAAAAJ&hl=en)
+ - **Institutes**: School of Computing, National University of Singapore, Singapore
+ - **Resources**: [📖[Paper]()] [[🏠Twitter]()] [[🤗Huggingface](https://huggingface.co/zz1358m/SofT-GRPO-master)]
+
+
+ ## 📧 Feedback Welcome
+
+ We greatly appreciate feedback and questions about the current status of this work.
+
+ Please feel free to contact Zhi Zheng at [zhi.zheng@u.nus.edu](mailto:zhi.zheng@u.nus.edu).
+
+
+ ## 💡 Highlights
+
+ - 🔥 **The First Powerful RLVR Algorithm for Soft-Thinking Reasoning:** We introduce **SofT-GRPO**, a novel policy optimization algorithm designed to reinforce the soft-thinking reasoning paradigm in LLMs.
+
+ - ⚙️ **Gumbel-Softmax Noise in Rollout:** SofT-GRPO injects Gumbel-Softmax noise into the group rollout process, actively generating diverse yet valid soft-thinking reasoning paths.
+
+ - ⚙️ **Gumbel Reparameterization:** We propose a gradient estimation approach based on Gumbel reparameterization, enabling precise attribution of improvements to the LLM's output probability distributions during policy optimization.
+
+ - 🔥 **Comprehensive Experiments and High Effectiveness:** Across LLMs with 1.5B–7B parameters on five benchmarks, SofT-GRPO consistently outperforms discrete-token GRPO baselines, especially at higher sample counts (Pass@16 and Pass@32).
+
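The Gumbel-Softmax noise in the rollout can be illustrated with a minimal NumPy sketch of the standard Gumbel-Softmax trick (an illustration only, not the repository's implementation; the temperature `tau` and toy logits are made-up values):

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Sample a relaxed (soft) token distribution via the Gumbel-Softmax trick.

    Adding i.i.d. Gumbel(0, 1) noise to the logits and applying a
    temperature-scaled softmax yields a differentiable sample that
    approaches a one-hot draw from softmax(logits) as tau -> 0.
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z = z - z.max()                  # stabilize the softmax
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # toy next-token logits
p = gumbel_softmax(logits, tau=0.5)
# p lies on the probability simplex: non-negative entries summing to 1.
# Different Gumbel draws give different but valid soft reasoning steps.
```

Because the noise enters through a deterministic, differentiable function of the logits, gradients can be propagated back to the model's output distribution, which is what the Gumbel reparameterization in the next bullet exploits.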
+ ## 📜 News
+
+ **[2025/9/24]** [Code](), [Weights](), and [Paper](https://arxiv.org/pdf/2509.20317) are released!
+
+ ## 👨‍💻 Todo
+
+ - [x] SGLang & verl code modifications (e.g., activate overlap for efficiency).
+
+ ## 🛠️ Usage
+
+ ### 1. Clone the repository
+ ```bash
+ git clone https://github.com/zz1358m/SofT-GRPO-master
+ cd SofT-GRPO-master
+ ```
+
+ ### 2. Install dependencies
+ ##### Option 1: For inference only
+ ```bash
+ conda create -n st python=3.11 -y && conda activate st
+ pip install --upgrade pip
+ pip install torch transformers accelerate jsonlines math_verify openai torch_memory_saver
+ pip install flash_attn --no-build-isolation # may take ~20 min; try `pip install flash_attn==2.7.3 --no-build-isolation` if you hit an undefined-symbol error
+
+ cd Soft-Thinking+noise+loss-main/sglang_soft_thinking_pkg
+ pip install -e "python[all]"
+ cd ../..
+ ```
+
+ ##### Option 2: For inference & SofT-GRPO fine-tuning
+ ```bash
+ pip install -r requirements.txt
+ ```
+ or build verl-0.4.x after completing Option 1:
+ ```bash
+ cd verl-0.4.x
+ pip3 install --no-deps -e .
+ cd ..
+ ```
+
+
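After either option, a quick sanity check can confirm the key packages are importable (a generic snippet; the package list is an assumption based on the install commands above):

```python
import importlib.util

# Packages the install steps above are expected to provide.
packages = ["torch", "transformers", "accelerate", "sglang", "verl"]

missing = [p for p in packages if importlib.util.find_spec(p) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All key packages found.")
```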
  ---
+
+ ### 3. Evaluating SofT-GRPO fine-tuned LLMs with the soft-thinking pattern
+
+ #### Step 1: Download the SofT-GRPO and GRPO weights from [[🤗Huggingface](https://huggingface.co/zz1358m/SofT-GRPO-master)]
+
+ #### Step 2: Evaluate GRPO under the discrete-token CoT pattern.
+ ```bash
+ ./Soft-Thinking+noise+loss-main/run_sample_discrete-token_grpo.sh
+ ```
+
+ #### Step 3: Evaluate GRPO under the soft-thinking reasoning pattern.
+ ```bash
+ ./Soft-Thinking+noise+loss-main/run_sample_gumbel_grpo.sh
+ ```
+
+ #### Step 4: Evaluate SofT-GRPO under the soft-thinking reasoning pattern.
+ ```bash
+ ./Soft-Thinking+noise+loss-main/run_sample_gumbel.sh
+ ```
+
+
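The difference between the two evaluation patterns lies in how the next reasoning-step input is formed: discrete-token CoT commits to a single sampled token, while soft thinking feeds the model a probability-weighted mixture of token embeddings. A minimal NumPy sketch of that mixing step (illustrative only; the embedding table `E` and distribution `p` are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                      # toy vocabulary size and embedding dim
E = rng.normal(size=(V, d))      # token embedding table

p = np.array([0.6, 0.2, 0.1, 0.07, 0.03])   # next-token distribution

# Discrete-token CoT: commit to one token's embedding.
hard_input = E[p.argmax()]

# Soft thinking: feed the expectation of the embedding under p instead.
soft_input = p @ E               # sum_i p[i] * E[i]

assert hard_input.shape == soft_input.shape == (d,)
```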
  ---
+
+ ### 4. Training with SofT-GRPO
+
+ #### Option 1: Train SofT-GRPO on DeepSeek-R1-Distill-Qwen-1.5B
+ ```bash
+ ./SofT-GRPO-deepscaler-8k.sh # change the LLM path and dataset path accordingly
+ ```
+
+ #### Option 2: Train SofT-GRPO on DeepSeek-R1-Distill-Qwen-7B
+ ```bash
+ ./SofT-GRPO-deepscaler-8k-qwen7.sh # change the LLM path and dataset path accordingly
+ ```
+
+ #### Option 3: Train SofT-GRPO on Llama-3.2-3B-Instruct
+ ```bash
+ ./SofT-GRPO-deepscaler-8k-llama3.sh # change the LLM path and dataset path accordingly
+ ```
+
+
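Like standard GRPO, SofT-GRPO scores each rollout relative to its group. A sketch of the usual group-relative advantage (this is the standard GRPO formulation, not necessarily the exact computation in the training scripts above):

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    """Standard GRPO advantage: normalize each rollout's reward by the
    mean and standard deviation of its own rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, a group of 4 rollouts with verifiable 0/1 rewards (toy values):
adv = group_relative_advantage([1.0, 0.0, 1.0, 0.0])
# Correct rollouts receive positive advantage, incorrect ones negative.
```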
+
+ ## ✒️ Citation
+
+ If you find our work helpful for your research, please consider giving a star ⭐ and a citation 📝
+
+ ```bibtex
+ ```
+
+ ## ❤️ Acknowledgments
+
+ - [Soft-Thinking](https://github.com/eric-ai-lab/Soft-Thinking): The codebase we built upon. Thanks for their wonderful work!
+ - [verl-0.4.x](https://github.com/volcengine/verl/tree/v0.4.x): Our work is also based on this codebase.
+ - [SIM-CoT](https://github.com/InternLM/SIM-CoT): We use their template for this README!
requirements.txt ADDED
@@ -0,0 +1,206 @@
+ # Editable install with no version control (sglang==0.4.6.post1)
+ -e ./Soft-Thinking+noise+loss-main/sglang_soft_thinking_pkg/python
+ # Editable install with no version control (verl==0.4.0)
+ -e ./verl-0.4.x
+ absl-py==2.3.1
+ accelerate==1.10.1
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.15
+ aiosignal==1.4.0
+ airportsdata==20250909
+ annotated-types==0.7.0
+ anthropic==0.68.1
+ antlr4-python3-runtime==4.9.3
+ anyio==4.11.0
+ asttokens==3.0.0
+ attrs==25.3.0
+ blobfile==3.0.0
+ build==1.3.0
+ cachetools==6.2.0
+ certifi==2025.8.3
+ cffi==2.0.0
+ charset-normalizer==3.4.3
+ click==8.3.0
+ cloudpickle==3.1.1
+ codetiming==1.4.0
+ compressed-tensors==0.11.0
+ contourpy==1.3.3
+ cuda-bindings==13.0.1
+ cuda-pathfinder==1.2.3
+ cuda-python==13.0.1
+ cycler==0.12.1
+ datasets==4.1.1
+ decorator==5.2.1
+ decord==0.6.0
+ dill==0.4.0
+ diskcache==5.6.3
+ distro==1.9.0
+ docstring_parser==0.17.0
+ einops==0.8.1
+ executing==2.2.1
+ expecttest==0.3.0
+ fastapi==0.117.1
+ fastuuid==0.13.5
+ filelock==3.19.1
+ flash-attn==2.7.3
+ flashinfer-python==0.2.3
+ fonttools==4.60.1
+ frozendict==2.4.6
+ frozenlist==1.7.0
+ fsspec==2025.9.0
+ gitdb==4.0.12
+ GitPython==3.1.45
+ grpcio==1.75.1
+ h11==0.16.0
+ hf-xet==1.1.10
+ hf_transfer==0.1.9
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.35.1
+ hydra-core==1.3.2
+ idna==3.10
+ importlib_metadata==8.7.0
+ iniconfig==2.1.0
+ interegular==0.3.3
+ ipython==9.5.0
+ ipython_pygments_lexers==1.1.1
+ jedi==0.19.2
+ Jinja2==3.1.6
+ jiter==0.11.0
+ joblib==1.5.2
+ jsonlines==4.0.0
+ jsonschema==4.25.1
+ jsonschema-specifications==2025.9.1
+ kiwisolver==1.4.9
+ lark==1.3.0
+ latex2sympy2_extended==1.10.2
+ litellm==1.77.5
+ llguidance==0.7.30
+ lxml==6.0.2
+ Markdown==3.9
+ MarkupSafe==3.0.3
+ math-verify==0.8.0
+ matplotlib==3.10.6
+ matplotlib-inline==0.1.7
+ modelscope==1.30.0
+ mpmath==1.3.0
+ msgpack==1.1.1
+ msgspec==0.19.0
+ multidict==6.6.4
+ multiprocess==0.70.16
+ nanobind==2.9.2
+ nest-asyncio==1.6.0
+ networkx==3.5
+ ninja==1.13.0
+ numpy==2.3.3
+ nvidia-cublas-cu12==12.4.5.8
+ nvidia-cuda-cupti-cu12==12.4.127
+ nvidia-cuda-nvrtc-cu12==12.4.127
+ nvidia-cuda-runtime-cu12==12.4.127
+ nvidia-cudnn-cu12==9.1.0.70
+ nvidia-cudnn-frontend==1.14.1
+ nvidia-cufft-cu12==11.2.1.3
+ nvidia-cufile-cu12==1.13.1.3
+ nvidia-curand-cu12==10.3.5.147
+ nvidia-cusolver-cu12==11.6.1.9
+ nvidia-cusparse-cu12==12.3.1.170
+ nvidia-cusparselt-cu12==0.6.2
+ nvidia-cutlass-dsl==4.2.1
+ nvidia-ml-py==13.580.82
+ nvidia-nccl-cu12==2.21.5
+ nvidia-nvjitlink-cu12==12.4.127
+ nvidia-nvtx-cu12==12.4.127
+ omegaconf==2.3.0
+ openai==1.109.1
+ openai-harmony==0.0.4
+ orjson==3.11.3
+ outlines==0.1.11
+ outlines_core==0.1.26
+ packaging==25.0
+ pandas==2.3.2
+ parso==0.8.5
+ partial-json-parser==0.2.1.1.post6
+ peft==0.17.1
+ pexpect==4.9.0
+ pillow==11.3.0
+ platformdirs==4.4.0
+ pluggy==1.6.0
+ prometheus_client==0.23.1
+ prompt_toolkit==3.0.52
+ propcache==0.3.2
+ protobuf==6.32.1
+ psutil==7.1.0
+ ptyprocess==0.7.0
+ pure_eval==0.2.3
+ pyarrow==21.0.0
+ pybase64==1.4.2
+ pybind11==3.0.1
+ pycountry==24.6.1
+ pycparser==2.23
+ pycryptodomex==3.23.0
+ pydantic==2.11.9
+ pydantic_core==2.33.2
+ Pygments==2.19.2
+ pylatexenc==2.10
+ pynvml==13.0.1
+ pyparsing==3.2.5
+ pyproject_hooks==1.2.0
+ pytest==8.4.2
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.1.1
+ python-multipart==0.0.20
+ pytz==2025.2
+ pyvers==0.1.0
+ PyYAML==6.0.3
+ pyzmq==27.1.0
+ ray==2.49.2
+ referencing==0.36.2
+ regex==2025.9.18
+ requests==2.32.5
+ rpds-py==0.27.1
+ safetensors==0.6.2
+ scikit-learn==1.7.2
+ scipy==1.16.2
+ sentence-transformers==5.1.1
+ sentencepiece==0.2.1
+ sentry-sdk==2.39.0
+ setproctitle==1.3.7
+ sgl-kernel==0.1.1
+ six==1.17.0
+ smmap==5.0.2
+ sniffio==1.3.1
+ soundfile==0.13.1
+ stack-data==0.6.3
+ starlette==0.48.0
+ sympy==1.13.1
+ tabulate==0.9.0
+ tensorboard==2.20.0
+ tensorboard-data-server==0.7.2
+ tensordict==0.10.0
+ threadpoolctl==3.6.0
+ tiktoken==0.11.0
+ timm==1.0.16
+ tokenizers==0.21.4
+ torch==2.6.0
+ torch_memory_saver==0.0.8
+ torchao==0.9.0
+ torchaudio==2.8.0
+ torchdata==0.11.0
+ torchvision==0.21.0
+ tqdm==4.67.1
+ traitlets==5.14.3
+ transformers==4.51.1
+ triton==3.2.0
+ typing-inspection==0.4.1
+ typing_extensions==4.15.0
+ tzdata==2025.2
+ urllib3==2.5.0
+ uvicorn==0.37.0
+ uvloop==0.21.0
+ wandb==0.22.0
+ wcwidth==0.2.14
+ Werkzeug==3.1.3
+ xgrammar==0.1.17
+ xxhash==3.5.0
+ yarl==1.20.1
+ zipp==3.23.0