---
license: apache-2.0
language:
  - en
base_model:
  - Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: question-answering
tags:
  - agent
  - reinforcement-learning
  - game-playing
  - game2048
  - sokoban
---

# ProAct: Agentic Lookahead in Interactive Environments

## 📖 Introduction

This repository contains the official model weights for the paper "ProAct: Agentic Lookahead in Interactive Environments".

Existing LLM agents often struggle in interactive environments requiring long-horizon planning due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm:

  1. GLAD (Grounded LookAhead Distillation): The first stage. We use Monte-Carlo Tree Search (MCTS) to probe the environment and generate high-quality trajectories. These complex search trees are then compressed into concise, causal reasoning chains and distilled into the model via Supervised Fine-Tuning (SFT).
  2. MC-Critic (Monte-Carlo Critic): The second stage. This is a plug-and-play auxiliary value estimator. It leverages lightweight environment rollouts to calibrate value estimates, providing a low-variance signal that stabilizes policy gradient algorithms like PPO and GRPO without relying on expensive model-based value approximation.
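
The rollout-based value estimate behind MC-Critic can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `mc_value_estimate`, the `step_fn` and `policy_fn` interfaces, and the toy counter environment are all hypothetical, chosen only to show how averaging the discounted returns of a few cheap rollouts yields a value estimate for a state.

```python
import random
from typing import Any, Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the ProAct codebase):
#   step_fn(state, action, rng) -> (next_state, reward, done)
#   policy_fn(state, rng) -> action
StepFn = Callable[[Any, Any, random.Random], Tuple[Any, float, bool]]
PolicyFn = Callable[[Any, random.Random], Any]


def mc_value_estimate(
    state: Any,
    step_fn: StepFn,
    policy_fn: PolicyFn,
    num_rollouts: int = 8,
    horizon: int = 10,
    gamma: float = 1.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Average the discounted return of several rollouts started from `state`."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_rollouts):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):
            action = policy_fn(s, rng)
            s, reward, done = step_fn(s, action, rng)
            ret += discount * reward
            discount *= gamma
            if done:
                break
        total += ret
    return total / num_rollouts


# Toy environment for illustration only: a counter that pays reward 1 per
# step and terminates once it reaches 3.
def toy_step(state: int, action: int, rng: random.Random) -> Tuple[int, float, bool]:
    next_state = state + 1
    return next_state, 1.0, next_state >= 3


def toy_policy(state: int, rng: random.Random) -> int:
    return 0  # the toy environment ignores the action


value_undiscounted = mc_value_estimate(0, toy_step, toy_policy)
value_discounted = mc_value_estimate(0, toy_step, toy_policy, gamma=0.5)
```

In a policy-gradient setup, an estimate of this kind would serve as the baseline subtracted from the observed return. How ProAct combines it with PPO or GRPO is described in the paper and is not reproduced here.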

Experiments show that the ProAct model (built on Qwen3-4B-Instruct-2507) significantly outperforms open-source baselines and rivals state-of-the-art closed-source models in both stochastic (2048) and deterministic (Sokoban) environments.

## 📂 Repository Structure

This repository contains model weights for different tasks (2048, Sokoban) and training stages (SFT, RL), organized into separate subfolders:

| Subfolder | Task | Stage | Description |
| --- | --- | --- | --- |
| `2048_sft` | 2048 | SFT (Stage 1) | Model trained using GLAD on MCTS-generated trajectories. |
| `2048_rl` | 2048 | RL (Stage 2) | Model further fine-tuned using RL with MC-Critic, initialized from the SFT checkpoint. |
| `sokoban_sft` | Sokoban | SFT (Stage 1) | GLAD SFT model for the Sokoban task. |
| `sokoban_rl` | Sokoban | RL (Stage 2) | MC-Critic RL model for the Sokoban task. |
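
Because each checkpoint lives in its own subfolder, loading one likely requires passing `subfolder` to `from_pretrained`. The sketch below assumes that every subfolder holds a standard `transformers` checkpoint, including tokenizer files. The helper names `checkpoint_subfolder` and `load_checkpoint` are hypothetical, and the repository id in the usage comment is a placeholder to replace with the actual Hub id.

```python
VALID_TASKS = ("2048", "sokoban")
VALID_STAGES = ("sft", "rl")


def checkpoint_subfolder(task: str, stage: str) -> str:
    """Map a (task, stage) pair to the subfolder naming used in this repository."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {VALID_TASKS}")
    if stage not in VALID_STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {VALID_STAGES}")
    return f"{task}_{stage}"


def load_checkpoint(repo_id: str, task: str, stage: str):
    """Load the tokenizer and model for one task/stage checkpoint.

    Assumes each subfolder contains both model and tokenizer files.
    """
    # Imported lazily so the naming helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    subfolder = checkpoint_subfolder(task, stage)
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder)
    return tokenizer, model


# Usage (placeholder repo id, downloads weights when run):
# tokenizer, model = load_checkpoint("your-namespace/ProAct", "2048", "rl")
```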