arxiv:2604.14054

π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Published on May 25

Abstract

Self-play training for deep search agents is enhanced by incorporating question construction paths as privileged information to enable dense feedback and improve learning efficiency.

AI-generated summary

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information: self-play can provide high-quality privileged information for self-distillation at low cost and at scale, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play (π-Play), a novel multi-agent self-evolution framework combining self-play and self-distillation. In π-Play, an examiner generates tasks together with QCPs, and a teacher uses the QCP as privileged context to densely supervise a student via self-distillation. This design transforms sparse-reward self-play into dense-feedback co-evolution. Extensive experiments show that data-free π-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3× over conventional self-play. Code is available at https://github.com/zhyaoch/pi-play.
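
The contrast the abstract draws between a sparse outcome reward and dense teacher supervision can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not state the loss, so the per-step KL divergence used here is an assumption, and every function name and the toy logits are hypothetical. The teacher's logits stand in for a model that has seen the QCP as privileged context; the student's logits stand in for a model that has not.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dense_distillation_losses(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
) -> List[float]:
    """One loss value per step: KL(teacher || student).

    Assumption: the teacher logits were produced with the QCP in context,
    the student logits without it. The choice of KL is illustrative only.
    """
    return [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]


def sparse_outcome_reward(predicted: str, gold: str) -> float:
    """A single 0/1 signal for the whole trajectory (exact match, hypothetical)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


if __name__ == "__main__":
    # Toy trajectory: 3 steps over a vocabulary of 3 actions.
    teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
    student = [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]

    # Dense feedback: every step gets its own signal.
    print(dense_distillation_losses(teacher, student))
    # Sparse feedback: one number for the whole trajectory.
    print(sparse_outcome_reward("Lyon", "Paris"))
```

In this toy setting the outcome reward says only that the final answer was wrong, while the per-step losses indicate which steps diverged from the privileged teacher. That is the sense in which dense feedback can ease credit assignment.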
