	---
	license: apache-2.0
	language:
	- en
	base_model:
	- Qwen/Qwen3-4B-Instruct-2507
	pipeline_tag: question-answering
	tags:
	- agent
	- reinforcement-learning
	- game-playing
	- game2048
	- sokoban
	---

	# ProAct: Agentic Lookahead in Interactive Environments

	<div align="center">

	[[Paper](https://arxiv.org/abs/2602.05327)] [[Code](https://github.com/GreatX3/ProAct)]
	[[Project Page](https://github.com/GreatX3/ProAct)]
	</div>

	## 📖 Introduction

	This repository contains the official model weights for the paper "ProAct: Agentic Lookahead in Interactive Environments".

	Existing LLM agents often struggle in interactive environments requiring long-horizon planning due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm:

	1. GLAD (Grounded LookAhead Distillation), the first stage: Monte-Carlo Tree Search (MCTS) probes the environment to generate high-quality trajectories. The resulting search trees are compressed into concise, causal reasoning chains and distilled into the model via Supervised Fine-Tuning (SFT).
	2. MC-Critic (Monte-Carlo Critic), the second stage: a plug-and-play auxiliary value estimator that uses lightweight environment rollouts to calibrate value estimates. This provides a low-variance signal that stabilizes policy-gradient algorithms such as PPO and GRPO without relying on expensive model-based value approximation.
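
The rollout-based value estimate behind the second stage can be illustrated with a minimal sketch. This is not the paper's implementation: the toy environment, the random rollout policy, and all names (`mc_value_estimate`, `toy_step`, `random_policy`) are hypothetical and only show the general idea of averaging discounted returns over a few short rollouts.

```python
import random
from typing import Callable, Tuple

State = int
Action = int
# Assumed interface: step(state, action, rng) -> (next_state, reward, done)
StepFn = Callable[[State, Action, random.Random], Tuple[State, float, bool]]
PolicyFn = Callable[[State, random.Random], Action]


def mc_value_estimate(
    step: StepFn,
    policy: PolicyFn,
    state: State,
    n_rollouts: int = 16,
    horizon: int = 8,
    gamma: float = 0.99,
    seed: int = 0,
) -> float:
    """Average the discounted return of short rollouts started from `state`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s, rng)
            s, r, done = step(s, a, rng)
            ret += discount * r
            discount *= gamma
            if done:
                break
        total += ret
    return total / n_rollouts


def toy_step(state: State, action: Action, rng: random.Random):
    """Hypothetical toy environment: reward 1 on every step, never terminates."""
    return state + action, 1.0, False


def random_policy(state: State, rng: random.Random) -> Action:
    """Hypothetical lightweight rollout policy: pick one of two actions at random."""
    return rng.choice([0, 1])


value = mc_value_estimate(
    toy_step, random_policy, state=0, n_rollouts=4, horizon=3, gamma=0.5
)
print(value)  # 1 + 0.5 + 0.25 = 1.75
```

In an RL loop, an estimate of this kind would typically serve as a baseline or value target for the policy-gradient update. How ProAct combines it with PPO or GRPO is described in the paper and is not reproduced here.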

	Experiments show that the ProAct model (built on Qwen3-4B-Instruct-2507) significantly outperforms open-source baselines and rivals state-of-the-art closed-source models in both stochastic (2048) and deterministic (Sokoban) environments.

	## 📂 Repository Structure

	This repository contains model weights for different tasks (2048, Sokoban) and training stages (SFT, RL), organized into separate subfolders:

	| Subfolder | Task | Stage | Description |
	| :--- | :--- | :--- | :--- |
	| `2048_sft` | 2048 | SFT (Stage 1) | Model trained using GLAD on MCTS-generated trajectories. |
	| `2048_rl` | 2048 | RL (Stage 2) | Model further fine-tuned using RL with MC-Critic, initialized from the SFT checkpoint. |
	| `sokoban_sft` | Sokoban | SFT (Stage 1) | GLAD SFT model for the Sokoban task. |
	| `sokoban_rl` | Sokoban | RL (Stage 2) | MC-Critic RL model for the Sokoban task. |
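
Because the checkpoints live in subfolders, loading one requires selecting the right subfolder. The sketch below shows one way this might look with the `transformers` library and its `subfolder` argument to `from_pretrained`. It rests on several assumptions: that the repository ID is `biang889/ProAct`, that each subfolder holds a complete checkpoint including tokenizer files, and that the weights load as a causal language model like the Qwen3 base. The helper names `subfolder_for` and `load_checkpoint` are hypothetical.

```python
# Assumption: repository ID as it appears on the hosting page.
REPO_ID = "biang889/ProAct"

# Subfolder names taken from the table above.
SUBFOLDERS = {
    ("2048", "sft"): "2048_sft",
    ("2048", "rl"): "2048_rl",
    ("sokoban", "sft"): "sokoban_sft",
    ("sokoban", "rl"): "sokoban_rl",
}


def subfolder_for(task: str, stage: str) -> str:
    """Map a (task, stage) pair to the checkpoint subfolder name."""
    key = (task.lower(), stage.lower())
    if key not in SUBFOLDERS:
        raise ValueError(f"Unknown task/stage combination: {task!r}, {stage!r}")
    return SUBFOLDERS[key]


def load_checkpoint(task: str, stage: str):
    """Download and load one checkpoint (hypothetical helper, large download)."""
    # Imported lazily so subfolder_for works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    sub = subfolder_for(task, stage)
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, subfolder=sub)
    model = AutoModelForCausalLM.from_pretrained(REPO_ID, subfolder=sub)
    return tokenizer, model


print(subfolder_for("Sokoban", "RL"))  # sokoban_rl
```

Calling `load_checkpoint("2048", "rl")` would then fetch the Stage 2 model for 2048, provided the assumptions above hold. The prompt format expected by each checkpoint is not specified in this README; consult the linked code repository for it.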