<div align="center">

# MarsRL

<div>
Advancing <strong>M</strong>ulti-<strong>A</strong>gent <strong>R</strong>easoning <strong>S</strong>ystem via <strong>R</strong>einforcement <strong>L</strong>earning with Agentic Pipeline Parallelism
</div>
</div>

## Overview

Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference process. Multi-agent reasoning systems offer a promising alternative by employing multiple agents, such as a Solver, a Verifier, and a Corrector, to iteratively refine solutions. While such systems are effective with closed-source models like Gemini 2.5 Pro, they struggle to generalize to open-source models due to insufficient critique and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to enhance efficiency in handling long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 85.6% to 93.3% and BeyondAIME from 65.3% to 72.6%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.
<div align="center">
<img src="home.jpg" width="80%" />
</div>
## V-C Reasoning System Evaluation Instructions
### Step 1: Download our released model or another open-source model
Supported models: Qwen3, DeepSeek-V3.1, and DeepSeek-R1. You can modify `llm_client.py` to use other models.
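The exact structure of `llm_client.py` is repository-specific, but adding a model typically comes down to registering its name and request defaults. The sketch below is illustrative only: `MODEL_CONFIGS`, `build_request`, and the token limits are hypothetical names and values, not this repository's actual API.

```python
# Hypothetical sketch of how llm_client.py might register supported models.
# MODEL_CONFIGS, build_request, and the max_tokens values are illustrative.

MODEL_CONFIGS = {
    "Qwen3-30B-A3B-Thinking-2507": {"max_tokens": 32768},
    "DeepSeek-V3.1": {"max_tokens": 32768},
    "DeepSeek-R1": {"max_tokens": 65536},
}

def build_request(model_name: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completion payload for a supported model."""
    if model_name not in MODEL_CONFIGS:
        # To use another model, register it in MODEL_CONFIGS first.
        raise ValueError(f"unsupported model: {model_name}")
    cfg = MODEL_CONFIGS[model_name]
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": cfg["max_tokens"],
    }
```

Under this kind of layout, supporting a new model is a one-line config change rather than an edit to the request logic.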
## Acknowledgements
- Our implementation is heavily built on [verl](https://github.com/volcengine/verl).
- Our models are trained on top of [Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507).
- Our V-C Reasoning system is built on the [IMO25 pipeline](https://github.com/lyang36/IMO25).

Thanks for their wonderful work.
## Citation
```bibtex
TODO
```