---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: question-answering
tags:
- agent
- reinforcement-learning
- game-playing
- game2048
- sokoban
---

# ProAct: Agentic Lookahead in Interactive Environments

<div align="center">

[[Paper](https://arxiv.org/abs/2602.05327)] [[Code](https://github.com/GreatX3/ProAct)]
[[Project Page](https://github.com/GreatX3/ProAct)]

</div>

## 📖 Introduction

This repository contains the official model weights for the paper **"ProAct: Agentic Lookahead in Interactive Environments"**.

Existing LLM agents often struggle in interactive environments requiring long-horizon planning due to **compounding errors** when simulating future states. To address this, we propose **ProAct**, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm:

1. **GLAD (Grounded LookAhead Distillation)**: The first stage. We use Monte-Carlo Tree Search (MCTS) to probe the environment and generate high-quality trajectories. These complex search trees are then compressed into concise, causal **reasoning chains** and distilled into the model via Supervised Fine-Tuning (SFT).
2. **MC-Critic (Monte-Carlo Critic)**: The second stage. This is a plug-and-play auxiliary value estimator. It leverages lightweight environment rollouts to calibrate value estimates, providing a low-variance signal that stabilizes policy gradient algorithms like **PPO** and **GRPO** without relying on expensive model-based value approximation.
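
To make the rollout idea behind the second stage more concrete, here is a minimal sketch of a generic Monte-Carlo value estimate: the value of a state is approximated by averaging discounted returns over a few cheap rollouts. This is not the paper's MC-Critic implementation. `toy_step`, `mc_value_estimate`, and all parameters are hypothetical names chosen for illustration, and the environment is a toy stand-in for 2048 or Sokoban.

```python
import random
from typing import Callable, Tuple


def toy_step(state: int, action: int, goal: int = 3) -> Tuple[int, float, bool]:
    """Hypothetical toy environment: walk along a line, reward 1.0 on reaching `goal`."""
    next_state = state + action
    done = next_state >= goal
    return next_state, (1.0 if done else 0.0), done


def mc_value_estimate(
    state: int,
    step_fn: Callable[[int, int], Tuple[int, float, bool]],
    rollout_policy: Callable[[int], int],
    num_rollouts: int = 8,
    max_steps: int = 10,
    gamma: float = 0.99,
) -> float:
    """Average the discounted return of `num_rollouts` rollouts started from `state`."""
    if num_rollouts < 1:
        raise ValueError("num_rollouts must be at least 1")
    returns = []
    for _ in range(num_rollouts):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(max_steps):
            s, reward, done = step_fn(s, rollout_policy(s))
            ret += discount * reward  # accumulate the discounted reward
            discount *= gamma
            if done:
                break
        returns.append(ret)
    return sum(returns) / len(returns)


if __name__ == "__main__":
    # A random rollout policy (an assumption; any cheap policy could be used here).
    random_policy = lambda s: random.choice([0, 1])
    print(mc_value_estimate(0, toy_step, random_policy, num_rollouts=100))
```

In an actor-critic setup, an estimate of this kind could serve as a baseline when computing advantages. How ProAct actually combines its critic with PPO or GRPO is described in the paper and the linked code.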

Experiments show that the **ProAct** model (based on **Qwen3-4B-Instruct-2507**) significantly outperforms open-source baselines and rivals state-of-the-art closed-source models in both stochastic (**2048**) and deterministic (**Sokoban**) environments.

## 📂 Repository Structure

This repository contains model weights for different tasks (2048, Sokoban) and training stages (SFT, RL), organized into separate subfolders:

| Subfolder | Task | Stage | Description |
| :--- | :--- | :--- | :--- |
| **`2048_sft`** | 2048 | SFT (Stage 1) | Model trained using **GLAD** on MCTS-generated trajectories. |
| **`2048_rl`** | 2048 | RL (Stage 2) | Model further fine-tuned using RL with **MC-Critic**, initialized from the SFT checkpoint. |
| **`sokoban_sft`** | Sokoban | SFT (Stage 1) | GLAD SFT model for the Sokoban task. |
| **`sokoban_rl`** | Sokoban | RL (Stage 2) | MC-Critic RL model for the Sokoban task. |
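
Because the checkpoints live in subfolders, loading one requires pointing the loader at the right subfolder. The sketch below shows one way this might look with the `subfolder` argument of `from_pretrained` in Hugging Face `transformers`. It assumes each subfolder holds both the model weights and the tokenizer files, which this card does not confirm. `checkpoint_subfolder` and `load_checkpoint` are hypothetical helper names, and `repo_id` must be set to this repository's Hub ID.

```python
# Subfolder names as listed in the table above.
CHECKPOINTS = {
    ("2048", "sft"): "2048_sft",
    ("2048", "rl"): "2048_rl",
    ("sokoban", "sft"): "sokoban_sft",
    ("sokoban", "rl"): "sokoban_rl",
}


def checkpoint_subfolder(task: str, stage: str) -> str:
    """Map a (task, stage) pair to its subfolder name (hypothetical helper)."""
    key = (task.lower(), stage.lower())
    if key not in CHECKPOINTS:
        raise ValueError(f"Unknown checkpoint: task={task!r}, stage={stage!r}")
    return CHECKPOINTS[key]


def load_checkpoint(repo_id: str, task: str, stage: str):
    """Load tokenizer and model from one subfolder of the given Hub repository.

    Assumes the subfolder contains tokenizer files as well as weights.
    """
    # Imported lazily so the mapping helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    subfolder = checkpoint_subfolder(task, stage)
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder)
    return tokenizer, model
```

For example, `load_checkpoint(repo_id, "sokoban", "rl")` would fetch the Stage 2 Sokoban checkpoint, where `repo_id` is the Hub ID of this repository.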