license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: question-answering
tags:
- agent
- reinforcement-learning
- game-playing
- game2048
- sokoban
ProAct: Agentic Lookahead in Interactive Environments
[Paper] [Code] [Project Page]
This repository contains the official model weights for the paper "ProAct: Agentic Lookahead in Interactive Environments".
Existing LLM agents often struggle in interactive environments requiring long-horizon planning due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm:
- GLAD (Grounded LookAhead Distillation): The first stage. We use Monte-Carlo Tree Search (MCTS) to probe the environment and generate high-quality trajectories. These complex search trees are then compressed into concise, causal reasoning chains and distilled into the model via Supervised Fine-Tuning (SFT).
- MC-Critic (Monte-Carlo Critic): The second stage. This is a plug-and-play auxiliary value estimator. It leverages lightweight environment rollouts to calibrate value estimates, providing a low-variance signal that stabilizes policy gradient algorithms like PPO and GRPO without relying on expensive model-based value approximation.
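
The idea behind a rollout-based value estimate can be illustrated with a minimal sketch. This is not the ProAct implementation: `ToyChainEnv`, `mc_value`, and `random_policy` are hypothetical names, the toy environment stands in for 2048 or Sokoban, and the averaging of discounted rollout returns is the generic Monte-Carlo estimator rather than the exact critic described in the paper.

```python
import random


class ToyChainEnv:
    """Hypothetical stand-in environment: reach the right end of a chain."""

    def __init__(self, length=5, position=0):
        self.length = length
        self.position = position

    def clone(self):
        # Rollouts run on a copy so the real environment state is untouched.
        return ToyChainEnv(self.length, self.position)

    def legal_actions(self):
        return [-1, 1]

    def step(self, action):
        # Move left or right, clamped to the chain; reward 1.0 at the goal.
        self.position = max(0, min(self.length, self.position + action))
        done = self.position == self.length
        return (1.0 if done else 0.0), done


def random_policy(env, rng):
    """Cheap rollout policy: pick a legal action uniformly at random."""
    return rng.choice(env.legal_actions())


def mc_value(env, rollout_policy, num_rollouts=16, horizon=20, gamma=0.99, rng=None):
    """Average discounted return over several rollouts from the current state."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_rollouts):
        sim = env.clone()
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            reward, done = sim.step(rollout_policy(sim, rng))
            ret += discount * reward
            discount *= gamma
            if done:
                break
        total += ret
    return total / num_rollouts


if __name__ == "__main__":
    state = ToyChainEnv()
    baseline = mc_value(state, random_policy, num_rollouts=64)
    print(f"estimated value of the start state: {baseline:.3f}")
```

In a policy-gradient setting, an estimate like `baseline` could be subtracted from an observed return to form an advantage. How ProAct combines the critic with PPO or GRPO is specified in the paper and code, not in this sketch.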
Experiments show that the ProAct model (based on Qwen3-4B-Instruct-2507) significantly outperforms open-source baselines and rivals state-of-the-art closed-source models in both stochastic (2048) and deterministic (Sokoban) environments.
📂 Repository Structure
This repository contains model weights for different tasks (2048, Sokoban) and training stages (SFT, RL), organized into separate subfolders:
| Subfolder | Task | Stage | Description |
|---|---|---|---|
| `2048_sft` | 2048 | SFT (Stage 1) | Model trained using GLAD on MCTS-generated trajectories. |
| `2048_rl` | 2048 | RL (Stage 2) | Model further fine-tuned using RL with MC-Critic, initialized from the SFT checkpoint. |
| `sokoban_sft` | Sokoban | SFT (Stage 1) | GLAD SFT model for the Sokoban task. |
| `sokoban_rl` | Sokoban | RL (Stage 2) | MC-Critic RL model for the Sokoban task. |
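
Because each checkpoint lives in its own subfolder, loading one likely means passing that subfolder name to the loader. The sketch below uses the `subfolder` argument of `from_pretrained` in the `transformers` library. The helper names are hypothetical, the repository ID is a placeholder you must supply, and it assumes the tokenizer files sit in the same subfolder as the weights.

```python
# Subfolder names taken from the table above, keyed by (task, stage).
CHECKPOINTS = {
    ("2048", "sft"): "2048_sft",
    ("2048", "rl"): "2048_rl",
    ("sokoban", "sft"): "sokoban_sft",
    ("sokoban", "rl"): "sokoban_rl",
}


def checkpoint_subfolder(task, stage):
    """Map a task and training stage to its subfolder (hypothetical helper)."""
    key = (task.lower(), stage.lower())
    if key not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: task={task!r}, stage={stage!r}")
    return CHECKPOINTS[key]


def load_checkpoint(repo_id, task, stage):
    """Load tokenizer and model from one subfolder of the repository.

    Assumption: tokenizer files are stored alongside the weights.
    """
    # Imported lazily so the mapping helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    subfolder = checkpoint_subfolder(task, stage)
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder)
    return tokenizer, model


if __name__ == "__main__":
    print(checkpoint_subfolder("Sokoban", "RL"))
    # Placeholder repository ID; replace with this repository's actual ID:
    # tokenizer, model = load_checkpoint("<org>/<repo>", "2048", "rl")
```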