InfoPO: Information-Driven Policy Optimization for User-Centric Agents
Abstract
InfoPO optimizes agent-user collaboration by identifying valuable interaction turns through information-gain rewards and adaptive variance-gated fusion for improved decision-making.
Real-world user requests to LLM agents are often underspecified, so agents must interact to acquire missing information before making correct downstream decisions. However, current multi-turn GRPO-based methods typically compute rewards at the trajectory level, which creates credit-assignment problems and leaves rollout groups with weak advantage signals. A promising remedy is to identify valuable interaction turns at a finer granularity and use them to drive more targeted learning. To this end, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution relative to a masked-feedback counterfactual. InfoPO then combines this signal with task outcomes via an adaptive variance-gated fusion, so the agent learns which information matters while remaining aligned with the task goal. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It is also robust under user-simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
Community
🌟 We introduce InfoPO (Information-Driven Policy Optimization) — a practical way to train multi-turn LLM agents with turn-level credit assignment.
🧠 Key idea: treat interaction as active uncertainty reduction. We compute a counterfactual information-gain reward by comparing the agent’s next-action distribution with vs. without user feedback (masked-feedback counterfactual), so the model learns which turns actually matter.
🎯 Why it matters: outcome-only rewards in multi-turn GRPO-style training can be sparse and noisy. InfoPO provides dense, targeted learning signals, and we further keep it task-aligned via an adaptive variance-gated fusion with task outcomes.
📊 Results: consistent gains and improved stability across diverse interactive settings (e.g., intent clarification, collaborative coding, tool-augmented decision making), including UserGym, ColBench, and τ²-Bench.
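The two ingredients above can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the use of KL divergence as the information-gain measure, and the specific std-ratio gating formula are illustrative assumptions consistent with the abstract's description (counterfactual info-gain reward, variance-gated fusion with outcome rewards):

```python
import numpy as np

def info_gain_reward(logits_with_feedback, logits_masked):
    # Illustrative: KL(p_with || p_masked) over the agent's next-action
    # distribution. A large divergence means the user's feedback measurably
    # changed what the agent would do next, so this turn carried information.
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()
    p = softmax(np.asarray(logits_with_feedback, dtype=float))
    q = softmax(np.asarray(logits_masked, dtype=float))
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def fused_advantage(outcome_rewards, info_rewards, eps=1e-8):
    # Hypothetical variance-gated fusion: when outcome rewards within a
    # rollout group are (near-)identical, their advantage is uninformative,
    # so the gate shifts weight toward the dense info-gain signal.
    r_out = np.asarray(outcome_rewards, dtype=float)
    r_info = np.asarray(info_rewards, dtype=float)
    gate = r_out.std() / (r_out.std() + r_info.std() + eps)
    mix = gate * r_out + (1.0 - gate) * r_info
    # Group-normalized advantage, GRPO-style.
    return (mix - mix.mean()) / (mix.std() + eps)
```

For example, a rollout group where every trajectory succeeds (identical outcome rewards) yields zero outcome variance, so the gate falls to zero and the advantages are driven entirely by the per-turn information-gain rewards.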