---
license: apache-2.0
language:
  - en
base_model:
  - Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: question-answering
tags:
  - agent
  - reinforcement-learning
  - game-playing
  - game2048
  - sokoban
---

# ProAct: Agentic Lookahead in Interactive Environments

[Paper] [Code] [Project Page]

## 📖 Introduction

This repository contains the official model weights for the paper "ProAct: Agentic Lookahead in Interactive Environments".

Existing LLM agents often struggle in interactive environments requiring long-horizon planning due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm:

- **GLAD (Grounded LookAhead Distillation):** The first stage. We use Monte-Carlo Tree Search (MCTS) to probe the environment and generate high-quality trajectories. These complex search trees are then compressed into concise, causal reasoning chains and distilled into the model via Supervised Fine-Tuning (SFT).
- **MC-Critic (Monte-Carlo Critic):** The second stage. This is a plug-and-play auxiliary value estimator. It leverages lightweight environment rollouts to calibrate value estimates, providing a low-variance signal that stabilizes policy gradient algorithms like PPO and GRPO without relying on expensive model-based value approximation.
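
The idea behind rollout-based value estimation can be illustrated with a small, self-contained sketch. This is not the ProAct implementation: `ChainEnv`, `mc_value`, the rollout policy, and all parameter values are hypothetical stand-ins chosen for illustration. The sketch only shows the general technique of averaging discounted returns over several rollouts from a copy of the current state; how the paper combines such estimates with PPO or GRPO is not covered here.

```python
import copy
import random
from statistics import mean


class ChainEnv:
    """Hypothetical toy environment: walk on positions 0..goal.

    Reaching `goal` gives reward 1.0 and ends the episode; the episode
    also ends after `max_steps` steps.
    """

    def __init__(self, goal=5, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.t = 0
        self.done = False

    def step(self, action):
        """Apply an action in {-1, +1} and return the reward."""
        self.pos = max(0, self.pos + action)
        self.t += 1
        reached = self.pos == self.goal
        self.done = reached or self.t >= self.max_steps
        return 1.0 if reached else 0.0


def mc_value(env, rollout_policy, n_rollouts=32, gamma=0.99, seed=0):
    """Estimate the value of the env's current state by Monte-Carlo rollouts.

    Each rollout runs on a deep copy, so the caller's env is left untouched.
    The estimate is the mean discounted return over all rollouts.
    """
    rng = random.Random(seed)
    returns = []
    for _ in range(n_rollouts):
        sim = copy.deepcopy(env)
        total, discount = 0.0, 1.0
        while not sim.done:
            reward = sim.step(rollout_policy(sim, rng))
            total += discount * reward
            discount *= gamma
        returns.append(total)
    return mean(returns)


# Usage: value of the start state under a uniformly random rollout policy.
random_policy = lambda env, rng: rng.choice([-1, 1])
value = mc_value(ChainEnv(), random_policy)
```

Averaging over more rollouts lowers the variance of the estimate at the cost of more environment steps, which is the trade-off a rollout-based critic has to balance.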

Experiments show that the ProAct model (based on Qwen3-4B-Instruct-2507) significantly outperforms open-source baselines and rivals state-of-the-art closed-source models in both stochastic (2048) and deterministic (Sokoban) environments.

## 📂 Repository Structure

This repository contains model weights for different tasks (2048, Sokoban) and training stages (SFT, RL), organized into separate subfolders:

| Subfolder | Task | Stage | Description |
|---|---|---|---|
| `2048_sft` | 2048 | SFT (Stage 1) | Model trained using GLAD on MCTS-generated trajectories. |
| `2048_rl` | 2048 | RL (Stage 2) | Model further fine-tuned using RL with MC-Critic, initialized from the SFT checkpoint. |
| `sokoban_sft` | Sokoban | SFT (Stage 1) | GLAD SFT model for the Sokoban task. |
| `sokoban_rl` | Sokoban | RL (Stage 2) | MC-Critic RL model for the Sokoban task. |
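
Because the checkpoints live in subfolders, loading one typically means passing the subfolder name to the loader. The sketch below uses the `subfolder` argument of `from_pretrained` in the `transformers` library. The helper names `checkpoint_subfolder` and `load_checkpoint` are hypothetical, the repository ID is left as a placeholder for you to fill in, and the sketch assumes each subfolder contains both tokenizer and model files, which this README does not confirm.

```python
# Subfolder names copied from the table above.
SUBFOLDERS = {
    ("2048", "sft"): "2048_sft",
    ("2048", "rl"): "2048_rl",
    ("sokoban", "sft"): "sokoban_sft",
    ("sokoban", "rl"): "sokoban_rl",
}


def checkpoint_subfolder(task: str, stage: str) -> str:
    """Map a (task, stage) pair to its subfolder name (case-insensitive)."""
    key = (task.lower(), stage.lower())
    if key not in SUBFOLDERS:
        raise ValueError(f"Unknown task/stage combination: {task!r}, {stage!r}")
    return SUBFOLDERS[key]


def load_checkpoint(repo_id: str, task: str, stage: str):
    """Load tokenizer and model from one subfolder of the given Hub repository.

    Assumption: tokenizer and model files are both stored in the subfolder.
    """
    # Imported lazily so checkpoint_subfolder works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    subfolder = checkpoint_subfolder(task, stage)
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder)
    return tokenizer, model


# Example (replace the placeholder with this repository's actual Hub ID):
# tokenizer, model = load_checkpoint("<namespace>/<repo-name>", "2048", "rl")
```

Since each RL checkpoint is described as initialized from the corresponding SFT checkpoint, loading both stages for the same task is a simple way to compare them.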