SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks
Abstract
SCOPE is a self-play framework that trains language models on open-ended tasks through policy co-evolution, achieving superior performance on both targeted and held-out benchmarks without external supervision.
Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
Community
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning (2026)
- Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text (2026)
- EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics (2026)
- Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback (2026)
- Prompt-Level Reward Specifications for Open-Ended Post-Training (2026)
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents (2026)
- Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2605.31433 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper