arxiv:2602.00759

Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

Published on Jan 31 · Submitted by Zhipeng Chen on Feb 3

Abstract

Adaptive Ability Decomposing (A²D) enhances reinforcement learning with verifiable rewards by decomposing complex questions into simpler sub-questions, improving LLM reasoning through guided exploration without requiring a teacher model.

AI-generated summary

Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A²D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A²D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we analyze the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner's exploration and exploitation abilities.
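
To make the two-stage recipe concrete, below is a minimal Python sketch of the pipeline the abstract describes. Everything in it is an assumption for illustration: the Model interface, rlvr_update, the prompt formats, and the decomposer's reward are hypothetical placeholders, not code or APIs from the paper.

```python
"""Illustrative sketch of the A²D two-stage pipeline (assumptions, not the paper's code)."""
from dataclasses import dataclass, field


@dataclass
class Example:
    question: str
    answer: str                            # verifiable ground-truth answer
    sub_questions: list[str] = field(default_factory=list)


class Model:
    """Stand-in for an LLM policy; generate() is a placeholder for a real model call."""
    def generate(self, prompt: str) -> str:
        return ""                          # real decoding would happen here

    def clone(self) -> "Model":
        return Model()


def rlvr_update(model: Model, prompt: str, response: str, reward: float) -> None:
    """Placeholder for one RLVR policy-gradient step (e.g., a GRPO/PPO update)."""


def verify(prediction: str, answer: str) -> float:
    """Binary verifiable reward: exact match on the final answer."""
    return 1.0 if prediction.strip() == answer.strip() else 0.0


def train_a2d(base: Model, dataset: list[Example]) -> tuple[Model, Model]:
    # Stage 1: train the decomposer with RLVR, with no teacher distillation.
    # Assumed reward: sub-questions count as good if they help a frozen copy
    # of the base model reach the verifiable answer.
    decomposer = base.clone()
    for ex in dataset:
        subs = decomposer.generate(f"Decompose into sub-questions:\n{ex.question}").splitlines()
        guided = ex.question + "\n" + "\n".join(subs)
        reward = verify(base.generate(guided), ex.answer)
        rlvr_update(decomposer, ex.question, "\n".join(subs), reward)

    # Annotate every training question with the trained decomposer's sub-questions.
    for ex in dataset:
        ex.sub_questions = decomposer.generate(
            f"Decompose into sub-questions:\n{ex.question}"
        ).splitlines()

    # Stage 2: train the reasoner with RLVR under sub-question guidance.
    reasoner = base.clone()
    for ex in dataset:
        guided = ex.question + "\nSub-questions:\n" + "\n".join(ex.sub_questions)
        pred = reasoner.generate(guided)
        rlvr_update(reasoner, guided, pred, verify(pred, ex.answer))
    return decomposer, reasoner
```

In this sketch the decomposer shares the reasoner's verifiable reward signal, which is one plausible reading of training it "via RLVR without distillation"; the paper's actual reward design, prompt templates, and training loop may differ.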

Community

Paper submitter

This is new work from Renmin University of China and ByteDance Seed. We introduce a novel RLVR algorithm that allows a single base model to evolve into two complementary models, a Decomposer and a Reasoner, which mutually reinforce each other. We warmly welcome the community's interest and feedback!
