---
base_model:
  - Qwen/Qwen3-4B-Instruct-2507
language:
  - en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
  - agent
  - reinforcement-learning
  - game-playing
  - game2048
  - sokoban
---

# ProAct: Agentic Lookahead in Interactive Environments

## 📖 Introduction

This repository contains the official model weights for the paper "ProAct: Agentic Lookahead in Interactive Environments".

Existing LLM agents often struggle in interactive environments requiring long-horizon planning due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm:

  1. **GLAD (Grounded LookAhead Distillation), Stage 1:** We use Monte-Carlo Tree Search (MCTS) to probe the environment and generate high-quality trajectories. The resulting search trees are compressed into concise, causal reasoning chains and distilled into the model via Supervised Fine-Tuning (SFT).
  2. **MC-Critic (Monte-Carlo Critic), Stage 2:** A plug-and-play auxiliary value estimator that uses lightweight environment rollouts to calibrate value estimates, providing a low-variance signal that stabilizes policy-gradient algorithms such as PPO and GRPO without relying on expensive model-based value approximation.
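To give a feel for the rollout-based value calibration idea behind MC-Critic, here is a minimal, self-contained sketch in a toy chain environment. This is an illustration of Monte-Carlo value estimation in general, not the authors' implementation; the environment, action space, and hyperparameters are all made up for the example.

```python
# Toy illustration of Monte-Carlo value estimation via lightweight
# environment rollouts (the general idea behind MC-Critic; NOT the
# paper's actual code or environments).
import random

def rollout_value(step_fn, state, depth=10, gamma=0.99):
    """Discounted return of a single random rollout from `state`."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        state, reward, done = step_fn(state, random.choice([0, 1]))
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def mc_value(step_fn, state, n_rollouts=32, **kw):
    """Average many cheap rollouts for a lower-variance value estimate."""
    return sum(rollout_value(step_fn, state, **kw)
               for _ in range(n_rollouts)) / n_rollouts

# A trivial chain environment: action 1 moves right, action 0 moves left;
# reaching position 3 yields reward 1.0 and ends the episode.
def step(state, action):
    nxt = state + (1 if action == 1 else -1)
    return nxt, (1.0 if nxt >= 3 else 0.0), nxt >= 3

random.seed(0)
print(round(mc_value(step, 0), 3))
```

In the full framework this estimate would replace or calibrate a learned value baseline during policy-gradient updates; here it simply shows how averaging cheap rollouts drives down the variance of a single-rollout estimate.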

Experiments show that the ProAct model (based on Qwen3-4B-Instruct) significantly outperforms open-source baselines and rivals state-of-the-art closed-source models in both stochastic (2048) and deterministic (Sokoban) environments.

## 📂 Repository Structure

This repository contains model weights for different tasks (2048, Sokoban) and training stages (SFT, RL), organized into separate subfolders:

| Subfolder | Task | Stage | Description |
|---|---|---|---|
| `2048_sft` | 2048 | SFT (Stage 1) | Model trained using GLAD on MCTS-generated trajectories. |
| `2048_rl` | 2048 | RL (Stage 2) | Model further fine-tuned using RL with MC-Critic, initialized from the SFT checkpoint. |
| `sokoban_sft` | Sokoban | SFT (Stage 1) | GLAD SFT model for the Sokoban task. |
| `sokoban_rl` | Sokoban | RL (Stage 2) | MC-Critic RL model for the Sokoban task. |

## 🚀 Sample Usage

You can deploy the model weights using vLLM. For example, to serve the `2048_rl` checkpoint:

```shell
# Start the vLLM server
vllm serve biang889/ProAct --subfolder 2048_rl \
  --served-model-name ProAct \
  --host 0.0.0.0 \
  --port 8080 \
  --tensor-parallel-size 1
```

Once served, you can interact with the model via an OpenAI-compatible API.
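As a sketch of such a call, the snippet below builds a chat-completion request against the server started above using only the Python standard library. The endpoint path and `"ProAct"` model name follow the OpenAI-compatible API convention and the `--served-model-name` flag; the prompt is a placeholder, and the actual request is left commented out so the snippet does not require a running server.

```python
# Hedged sketch: query the vLLM server via its OpenAI-compatible
# /v1/chat/completions endpoint. The model name must match the
# --served-model-name used when launching the server.
import json
import urllib.request

payload = {
    "model": "ProAct",
    "messages": [
        {"role": "user",
         "content": "You are playing 2048. Which move should I make next?"},
    ],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the server running, send the request and print the reply:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (for example, the official `openai` Python SDK pointed at `http://localhost:8080/v1`) works the same way.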

## 📜 Citation

If you find this project useful in your research, please cite our paper:

```bibtex
@misc{yu2026proactagenticlookaheadinteractive,
      title={ProAct: Agentic Lookahead in Interactive Environments},
      author={Yangbin Yu and Mingyu Yang and Junyou Li and Yiming Gao and Feiyu Liu and Yijun Yang and Zichuan Lin and Jiafei Lyu and Yicheng Liu and Zhicong Lu and Deheng Ye and Jie Jiang},
      year={2026},
      eprint={2602.05327},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.05327},
}
```