---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-4B-Instruct-2507
pipeline_tag: question-answering
tags:
- agent
- reinforcement-learning
- game-playing
- game2048
- sokoban
---

# ProAct: Agentic Lookahead in Interactive Environments

<div align="center">

[[Paper](https://arxiv.org/abs/24xx.xxxxx)] [[Code](https://github.com/GreatX3/ProAct)] [[Project Page](https://github.com/GreatX3/ProAct)]

</div>

## 📖 Introduction

This repository contains the official model weights for the paper **"ProAct: Agentic Lookahead in Interactive Environments"**.

Existing LLM agents often struggle in interactive environments that require long-horizon planning, because errors compound when they simulate future states. To address this, we propose **ProAct**, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm:

1. **GLAD (Grounded LookAhead Distillation)**: In the first stage, we use Monte-Carlo Tree Search (MCTS) to probe the environment and generate high-quality trajectories. The resulting search trees are compressed into concise, causal **reasoning chains** and distilled into the model via Supervised Fine-Tuning (SFT).
2. **MC-Critic (Monte-Carlo Critic)**: The second stage adds a plug-and-play auxiliary value estimator. It uses lightweight environment rollouts to calibrate value estimates, providing a low-variance signal that stabilizes policy-gradient algorithms such as **PPO** and **GRPO** without relying on expensive model-based value approximation.
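
The rollout-based value estimation in the second stage can be pictured with a generic Monte-Carlo return estimator. This is an illustrative sketch only, not the paper's implementation: the `env` interface (`reset_to`, `step` returning `(state, reward, done)`) and `rollout_policy` are hypothetical stand-ins.

```python
# Generic Monte-Carlo value estimate via lightweight rollouts, in the
# spirit of MC-Critic. All interfaces here are assumptions for illustration.
def mc_value(env, state, rollout_policy, n_rollouts=8, horizon=20, gamma=0.99):
    """Average discounted return over short rollouts started from `state`."""
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset_to(state)  # assumed: env can be reset to an arbitrary state
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = rollout_policy(s)
            s, reward, done = env.step(a)
            ret += discount * reward
            discount *= gamma
            if done:
                break
        total += ret
    return total / n_rollouts
```

Averaging several short rollouts keeps the estimate's variance low while staying cheap, which is what makes such a signal useful for stabilizing policy-gradient updates.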

Experiments show that the **ProAct** model (based on **Qwen3-4B-Instruct**) significantly outperforms open-source baselines and rivals state-of-the-art closed-source models in both stochastic (**2048**) and deterministic (**Sokoban**) environments.

## 📂 Repository Structure

This repository contains model weights for different tasks (2048, Sokoban) and training stages (SFT, RL), organized into separate subfolders:

| Subfolder | Task | Stage | Description |
| :--- | :--- | :--- | :--- |
| **`2048_sft`** | 2048 | SFT (Stage 1) | Model trained using **GLAD** on MCTS-generated trajectories. |
| **`2048_rl`** | 2048 | RL (Stage 2) | Model further fine-tuned using RL with **MC-Critic**, initialized from the SFT checkpoint. |
| **`sokoban_sft`** | Sokoban | SFT (Stage 1) | GLAD SFT model for the Sokoban task. |
| **`sokoban_rl`** | Sokoban | RL (Stage 2) | MC-Critic RL model for the Sokoban task. |
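
A checkpoint from one of these subfolders can be loaded with the standard `transformers` `subfolder` argument. The repository id below is a placeholder assumption; the subfolder names come from the table above.

```python
# Hypothetical loading sketch. Requires `transformers`; the repo id is an
# assumption, and the subfolder names mirror the table in this README.
VALID_CHECKPOINTS = {"2048_sft", "2048_rl", "sokoban_sft", "sokoban_rl"}

def load_checkpoint(task: str, stage: str, repo_id: str = "biang889/ProAct"):
    """Load one ProAct checkpoint (task in {"2048", "sokoban"}, stage in {"sft", "rl"})."""
    subfolder = f"{task}_{stage}"
    if subfolder not in VALID_CHECKPOINTS:
        raise ValueError(f"unknown checkpoint {subfolder!r}")
    # Imported lazily so the name validation above has no heavy dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder)
    return tokenizer, model
```

For example, `load_checkpoint("2048", "rl")` would fetch the Stage-2 model for the 2048 task from the `2048_rl` subfolder.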