---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: question-answering
tags:
- agent
- reinforcement-learning
- game-playing
- game2048
- sokoban
---

# ProAct: Agentic Lookahead in Interactive Environments

<div align="center">

[[Paper](https://arxiv.org/abs/24xx.xxxxx)] [[Code](https://github.com/GreatX3/ProAct)] [[Project Page](https://github.com/GreatX3/ProAct)]

</div>

## 📖 Introduction

This repository contains the official model weights for the paper **"ProAct: Agentic Lookahead in Interactive Environments"**.

Existing LLM agents often struggle in interactive environments that demand long-horizon planning, because **errors compound** as they simulate future states. To address this, we propose **ProAct**, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm:

1. **GLAD (Grounded LookAhead Distillation)**: the first stage. We use Monte-Carlo Tree Search (MCTS) to probe the environment and generate high-quality trajectories, then compress the resulting search trees into concise, causal **reasoning chains** that are distilled into the model via Supervised Fine-Tuning (SFT).
2. **MC-Critic (Monte-Carlo Critic)**: the second stage, a plug-and-play auxiliary value estimator. It leverages lightweight environment rollouts to calibrate value estimates, providing a low-variance signal that stabilizes policy-gradient algorithms such as **PPO** and **GRPO** without relying on expensive model-based value approximation.

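To make the GLAD compression step concrete, here is a minimal illustrative sketch: a search tree is reduced to a linear "reasoning chain" by following the most-visited branch at each depth. The `Node` fields, the chain format, and the selection rule are assumptions for illustration only, not the paper's actual data schema.

```python
# Hypothetical sketch of GLAD's tree-to-chain compression. All names and
# formats here are illustrative assumptions, not the released pipeline.
from dataclasses import dataclass, field

@dataclass
class Node:
    action: str              # move that led to this node ("up", "left", ...)
    visits: int = 0          # MCTS visit count
    value: float = 0.0       # mean return observed from this node
    children: list = field(default_factory=list)

def principal_variation(root: Node) -> list:
    """Follow the most-visited child at every depth."""
    chain, node = [], root
    while node.children:
        node = max(node.children, key=lambda c: c.visits)
        chain.append(node)
    return chain

def to_reasoning_chain(root: Node) -> str:
    """Render the search tree as a concise causal chain for an SFT target."""
    steps = [f"{n.action} (value {n.value:.2f})" for n in principal_variation(root)]
    return " -> ".join(steps)

# Toy tree: "left" was explored far more often than "up".
root = Node("start", children=[
    Node("left", visits=80, value=0.7, children=[Node("down", visits=50, value=0.9)]),
    Node("up", visits=20, value=0.3),
])
print(to_reasoning_chain(root))  # left (value 0.70) -> down (value 0.90)
```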
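The MC-Critic idea can likewise be sketched in a few lines: estimate a state's value by averaging discounted returns over a handful of cheap rollouts in the real environment, and use that estimate as a low-variance baseline for the policy gradient. The toy chain environment, function names, and hyperparameters below are assumptions, not the released implementation.

```python
# Illustrative Monte-Carlo value estimate via lightweight environment rollouts.
# Everything here (env, horizon, rollout count) is a toy assumption.
import random

def rollout_return(state, env_step, rng, horizon=10, gamma=0.99):
    """One lightweight rollout: take random actions, accumulate discounted reward."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        state, reward, done = env_step(state, rng.choice([0, 1]))
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def mc_value(state, env_step, n_rollouts=32, seed=0):
    """MC-Critic-style estimate: mean return over n_rollouts cheap rollouts."""
    rng = random.Random(seed)
    return sum(rollout_return(state, env_step, rng) for _ in range(n_rollouts)) / n_rollouts

# Toy deterministic chain environment: action 1 moves right; reward 1 at state 5.
def env_step(state, action):
    state = state + 1 if action == 1 else state
    reward = 1.0 if state == 5 else 0.0
    return state, reward, state >= 5

estimate = mc_value(0, env_step)
print(round(estimate, 3))
```

Because the rollouts query the environment directly rather than a learned world model, the estimate is grounded (no compounding simulation error) and, with a fixed seed, reproducible across critic evaluations.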
Experiments show that the **ProAct** model (based on **Qwen3-4B-Instruct**) significantly outperforms open-source baselines and rivals state-of-the-art closed-source models in both stochastic (**2048**) and deterministic (**Sokoban**) environments.

## 📂 Repository Structure

This repository contains model weights for different tasks (2048, Sokoban) and training stages (SFT, RL), organized into separate subfolders:

| Subfolder | Task | Stage | Description |
| :--- | :--- | :--- | :--- |
| **`2048_sft`** | 2048 | SFT (Stage 1) | Model trained using **GLAD** on MCTS-generated trajectories. |
| **`2048_rl`** | 2048 | RL (Stage 2) | Model further fine-tuned using RL with **MC-Critic**, initialized from the SFT checkpoint. |
| **`sokoban_sft`** | Sokoban | SFT (Stage 1) | GLAD SFT model for the Sokoban task. |
| **`sokoban_rl`** | Sokoban | RL (Stage 2) | MC-Critic RL model for the Sokoban task. |