Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
Abstract
Generative Reward Models can be improved by structuring Chain-of-Thought reasoning into breadth and depth components and optimizing them through supervised fine-tuning and reinforcement learning with verifiable rewards.
Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at https://huggingface.co/collections/DonJoey/mix-grm, and the code is released at https://github.com/Don-Joey/Mix-GRM.
Community
🚀 Is making CoT longer really the silver bullet for Reward Models?
As long-CoT dominates the LLM landscape, the standard approach to improving Generative Reward Models (LLM-as-a-Judge) has been straightforward: just force the model to generate longer reasoning traces. But does one size really fit all?
In our new paper, "Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models," we prove that when it comes to evaluation, structure matters just as much as length.
🔥 The Core Problem:
Real-world evaluation is fundamentally divided:
Subjective Preference (e.g., Chat): Requires Breadth (B-CoT)—evaluating multiple dimensions like tone, format, and helpfulness simultaneously.
Objective Correctness (e.g., Math/Code): Requires Depth (D-CoT)—rigorous, step-by-step deductive verification.
Forcing a model to "think longer" on a subjective chat task often just accumulates noise, while applying broad multi-aspect checks to a math problem misses critical logical flaws.
💡 Enter Mix-GRM & Key Discoveries:
1. 🧠 Synergizing Structures: We designed a framework that equips the GRM with both Breadth (B-CoT) and Depth (D-CoT) reasoning capabilities.
2. ⚡ "Emergent Polarization": We trained the model using Reinforcement Learning with Verifiable Rewards (RLVR) relying exclusively on final verdict supervision—with zero explicit routing labels. Amazingly, the model's structural alignment surged to 95%. It autonomously learned to polarize its reasoning, dynamically selecting Breadth for Preference and Depth for Correctness.
3. 📉 Highly Compute-Efficient: Unlike length-scaling baselines (like Self-Consistency) that burn massive amounts of tokens, Mix-GRM achieves superior performance while keeping token consumption within the same order of magnitude as standard single-pass reasoning.
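To make the verdict-only supervision concrete, here is a minimal sketch of what such a verifiable reward could look like. This is illustrative only — the function name, verdict format, and extraction logic are our own assumptions, not taken from the Mix-GRM codebase; the key point it captures is that the reward checks only the final verdict, never the reasoning style (B-CoT vs. D-CoT).

```python
def verdict_reward(model_output: str, gold_choice: str) -> float:
    """Verifiable reward for a GRM rollout: 1.0 if the final verdict
    matches the ground-truth preferred/correct answer, else 0.0.
    Note: no routing label (Breadth vs. Depth) is ever supervised —
    only the verdict line at the end of the CoT is checked.
    (Hypothetical format: a trailing line like "Verdict: A".)"""
    verdict = None
    # Scan from the end so the last stated verdict wins.
    for line in reversed(model_output.strip().splitlines()):
        if line.lower().startswith("verdict:"):
            verdict = line.split(":", 1)[1].strip().upper()
            break
    return 1.0 if verdict == gold_choice.upper() else 0.0

# The CoT content (breadth-style here) is free-form and unrewarded;
# only the verdict determines the reward signal.
rollout = "Comparing tone, format, and helpfulness across responses...\nVerdict: A"
print(verdict_reward(rollout, "A"))  # → 1.0
```

Under this kind of binary signal, any polarization of reasoning style toward Breadth on preference tasks and Depth on correctness tasks would be emergent rather than supervised, which is exactly the phenomenon the paper reports.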