LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
Abstract
LongTraceRL addresses long-context reasoning challenges in large language models through tiered distractor construction and rubric reward design for improved reasoning quality.
Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information amid extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and by sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability). This produces training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. The rubric reward is applied only to responses with correct final answers (a positive-only strategy), which distinguishes reasoning quality among correct responses and prevents reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
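The two ideas in the abstract, tiered distractors and a positive-only rubric reward, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Trajectory` record and its field names, the case-insensitive substring matching of entities, the additive combination of outcome and rubric terms, and the `rubric_weight` value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: set[str]  # document ids that appeared in search results
    opened: set[str]     # document ids the agent actually read
    cited: set[str]      # document ids the agent cited in its answer


def tier_distractors(traj: Trajectory) -> dict[str, set[str]]:
    """Split non-cited documents into the two confusability tiers."""
    high = traj.opened - traj.cited                   # read but not cited
    low = traj.retrieved - traj.opened - traj.cited   # surfaced but never opened
    return {"high": high, "low": low}


def rubric_reward(
    response: str,
    final_answer_correct: bool,
    gold_entities: list[str],
    rubric_weight: float = 0.5,  # assumed weight, not taken from the paper
) -> float:
    """Outcome reward plus an entity-coverage bonus for correct answers only."""
    if not final_answer_correct:
        # Positive-only strategy: wrong answers earn no rubric credit,
        # so mentioning gold entities cannot compensate for a wrong answer.
        return 0.0
    if not gold_entities:
        return 1.0
    text = response.lower()
    # Assumed matching rule: case-insensitive substring match per entity.
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return 1.0 + rubric_weight * hits / len(gold_entities)
```

Under these assumptions, two responses with the same correct final answer receive different rewards depending on how many gold entities from the reasoning chain they mention, while an incorrect response always receives zero.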
Community
Similar papers recommended by the Semantic Scholar API (posted by Librarian Bot):
- Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents (2026)
- CoSearch: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search (2026)
- IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning (2026)
- Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models (2026)
- Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization (2026)
- Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning (2026)
- Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training (2026)
Models citing this paper 3
THU-KEG/LongTraceRL-8B
Datasets citing this paper 1
THU-KEG/LongTraceRL