forestliutc committed
Commit c6c684d · verified · 1 Parent(s): d33dce6

Update README.md

Files changed (1): README.md (+4 -2)
README.md CHANGED
@@ -1,20 +1,20 @@
 <div align="center">
 
 # MarsRL
-
 <div>
 Advancing <strong>M</strong>ulti-<strong>A</strong>gent <strong>R</strong>easoning <strong>S</strong>ystem via <strong>R</strong>einforcement <strong>L</strong>earning with Agentic Pipeline Parallelism
 </div>
 </div>
 
 ## Overview
-
+<hr />
 Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference pass. Multi-agent reasoning systems offer a promising alternative: multiple agents, including a Solver, a Verifier, and a Corrector, iteratively refine solutions. While effective with closed-source models such as Gemini 2.5 Pro, these systems struggle to generalize to open-source models due to insufficient critique and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to improve efficiency on long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 85.6% to 93.3% and BeyondAIME from 65.3% to 72.6%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.
 <div align="center">
 <img src="home.jpg" width="80%" />
 </div>
 
 ## V-C Reasoning System Evaluation Instructions
+<hr />
 ### Step 1: Download our released model or other open-source models
 Supported models: Qwen3 / DeepSeek-V3.1 / DeepSeek-R1. You can modify llm_client.py to use other models.
 
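The Solver–Verifier–Corrector loop described in the overview can be sketched in miniature. The stub agents below are purely illustrative (MarsRL's real agents are LLMs, and none of these function names come from its codebase); the sketch shows only the refine-until-verified control flow.

```python
# Illustrative sketch of a Solver -> Verifier -> Corrector refinement loop.
# All names here are hypothetical stand-ins, not MarsRL's actual API.

def solve(problem):
    """Solver: propose an initial answer (stub: a deliberately rough guess)."""
    return problem["guess"]

def verify(problem, answer):
    """Verifier: return (is_correct, critique)."""
    if answer == problem["target"]:
        return True, ""
    return False, ("answer too low" if answer < problem["target"] else "answer too high")

def correct(answer, critique):
    """Corrector: revise the answer using the critique (stub: step toward target)."""
    return answer + 1 if "low" in critique else answer - 1

def vc_reasoning(problem, max_rounds=8):
    """Iteratively refine the Solver's answer until the Verifier accepts it."""
    answer = solve(problem)
    for _ in range(max_rounds):
        ok, critique = verify(problem, answer)
        if ok:
            return answer
        answer = correct(answer, critique)
    return answer  # best effort after max_rounds

print(vc_reasoning({"guess": 3, "target": 6}))  # → 6
```

In the actual system each round would spend fresh inference budget, which is how the iterative loop deepens reasoning beyond a single pass's output-length limit.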
@@ -36,6 +36,7 @@ This step will generate a file named "eval_overalljsonl" in the input_dir. Your
 
 
 ## Acknowledgements
+<hr />
 - Our implementation is heavily built on [verl](https://github.com/volcengine/verl).
 - Our models are trained on top of [Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507).
 - Our V-C Reasoning system is built on the [IMO25 pipeline](https://github.com/lyang36/IMO25).
@@ -43,6 +44,7 @@ This step will generate a file named "eval_overalljsonl" in the input_dir. Your
 Thanks for their wonderful work.
 
 ## Citation
+<hr />
 ```bibtex
 TODO
 ```
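Step 1 of the evaluation instructions notes that other models can be used by modifying llm_client.py. Since that file's internals are not shown here, the following is only a hypothetical sketch of the kind of model-dispatch table one might add, assuming the models are served behind OpenAI-compatible endpoints (the registry entries, URLs, and helper name are all illustrative).

```python
# Hypothetical dispatch table for llm_client.py-style model switching.
# URLs and served model ids are placeholders, not MarsRL's configuration.

MODEL_REGISTRY = {
    # key -> (base URL of an OpenAI-compatible server, served model id)
    "qwen3": ("http://localhost:8000/v1", "Qwen/Qwen3-30B-A3B-Thinking-2507"),
    "deepseek-v3.1": ("http://localhost:8001/v1", "deepseek-ai/DeepSeek-V3.1"),
    "deepseek-r1": ("http://localhost:8002/v1", "deepseek-ai/DeepSeek-R1"),
}

def build_request(model_key, messages, temperature=0.6):
    """Build an OpenAI-style /chat/completions payload for the chosen model."""
    base_url, model_id = MODEL_REGISTRY[model_key]
    return {
        "url": f"{base_url}/chat/completions",
        "json": {"model": model_id, "messages": messages, "temperature": temperature},
    }

req = build_request("qwen3", [{"role": "user", "content": "2+2?"}])
print(req["url"])  # → http://localhost:8000/v1/chat/completions
```

Adding a new model under this scheme would then be a one-line registry entry rather than a code change in the request path.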
 
1
  <div align="center">
2
 
3
  # MarsRL
 
4
  <div>
5
  Advancing <strong>M</strong>ulti-<strong>A</strong>gent <strong>R</strong>easoning <strong>S</strong>ystem via <strong>R</strong>einforcement <strong>L</strong>earning with Agentic Pipeline Parallelism
6
  </div>
7
  </div>
8
 
9
  ## Overview
10
+ <hr />
11
  Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference process. Multi-agent reasoning systems offer a promising alternative by employing multiple agents including Solver, Verifier, and Corrector, to iteratively refine solutions. While effective in closed-source models like Gemini 2.5 Pro, they struggle to generalize to open-source models due to insufficient critic and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to enhance efficiency in handling long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 85.6\% to 93.3\% and BeyondAIME from 65.3\% to 72.6\%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.
12
  <div align="center">
13
  <img src="home.jpg" width="80%" />
14
  </div>
15
 
16
  ## V-C Reasoning System Evaluation Instructions
17
+ <hr />
18
  ### step1: Download our released model or other open source models
19
  Supported models: Qwen3/DeepSeekV3.1/DeepSeek R1. You can modify the llm_client.py to use other models.
20
 
 
36
 
37
 
38
  ## Acknowledgements
39
+ <hr />
40
  - Our implementation is heaviliy built on [verl](https://github.com/volcengine/verl).
41
  - Our models are trained on top of [Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507).
42
  - Our V-C Reasoning system is built on [IMO25 pipline](https://github.com/lyang36/IMO25).
 
44
  Thanks for their wonderful work.
45
 
46
  ## Citation
47
+ <hr />
48
  ```bibtex
49
  TODO
50
  ```