InfoPO: Information-Driven Policy Optimization for User-Centric Agents
Abstract
InfoPO optimizes agent-user collaboration by identifying valuable interaction turns through information-gain rewards and adaptive variance-gated fusion for improved decision-making.
Real-world user requests to LLM agents are often underspecified, so agents must interact to acquire missing information before making correct downstream decisions. However, current multi-turn GRPO-based methods typically compute rewards at the trajectory level, which creates credit-assignment problems and leaves rollout groups with weak advantage signals. A promising remedy is to identify valuable interaction turns at a finer granularity and use them to drive more targeted learning. To this end, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution relative to a masked-feedback counterfactual. InfoPO then combines this signal with task outcomes via an adaptive variance-gated fusion, so the agent learns which information matters while remaining aligned with the task goal. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It is also robust under user-simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
Community
🌟 We introduce InfoPO (Information-Driven Policy Optimization) — a practical way to train multi-turn LLM agents with turn-level credit assignment.
🧠 Key idea: treat interaction as active uncertainty reduction. We compute a counterfactual information-gain reward by comparing the agent’s next-action distribution with vs. without user feedback (masked-feedback counterfactual), so the model learns which turns actually matter.
🎯 Why it matters: outcome-only rewards in multi-turn GRPO-style training can be sparse and noisy. InfoPO provides dense, targeted learning signals, and we further keep it task-aligned via an adaptive variance-gated fusion with task outcomes.
📊 Results: consistent gains and improved stability across diverse interactive settings (e.g., intent clarification, collaborative coding, tool-augmented decision making), including UserGym, ColBench, and τ²-Bench.
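The two ingredients above can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the use of KL divergence as the information-gain measure, and the specific std-ratio gating formula are illustrative assumptions consistent with the abstract's description (counterfactual info-gain reward, variance-gated fusion with outcome rewards):

```python
import numpy as np

def info_gain_reward(logits_with_feedback, logits_masked):
    # Illustrative: KL(p_with || p_masked) over the agent's next-action
    # distribution. A large divergence means the user's feedback measurably
    # changed what the agent would do next, so this turn carried information.
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()
    p = softmax(np.asarray(logits_with_feedback, dtype=float))
    q = softmax(np.asarray(logits_masked, dtype=float))
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def fused_advantage(outcome_rewards, info_rewards, eps=1e-8):
    # Hypothetical variance-gated fusion: when outcome rewards within a
    # rollout group are (near-)identical, their advantage is uninformative,
    # so the gate shifts weight toward the dense info-gain signal.
    r_out = np.asarray(outcome_rewards, dtype=float)
    r_info = np.asarray(info_rewards, dtype=float)
    gate = r_out.std() / (r_out.std() + r_info.std() + eps)
    mix = gate * r_out + (1.0 - gate) * r_info
    # Group-normalized advantage, GRPO-style.
    return (mix - mix.mean()) / (mix.std() + eps)
```

For example, a rollout group where every trajectory succeeds (identical outcome rewards) yields zero outcome variance, so the gate falls to zero and the advantages are driven entirely by the per-turn information-gain rewards.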