Lee93whut committed on
Commit ·
385cc9f
1
Parent(s): a264030
docs: README — results table, architecture, quickstart, references
- Algorithm comparison table (R1/R2/R3 success rate + SPL)
- Technical highlights: 3-dashboard TensorBoard, episode warmup,
SPL metric, potential-based shaping, anti-loop dual-layer,
BFS reachability, single random source design
- Architecture diagram (train.py → maze_env → model → app.py)
- Quickstart: install, download weights, run demo, train from scratch
- References: Mnih 2015, van Hasselt 2016, Wang 2016,
Ng 1999, Anderson 2018, Henderson 2018
README.md
CHANGED
|
@@ -10,7 +10,7 @@ license: mit
|
|
| 10 |
|
| 11 |
# RL Maze Navigator
|
| 12 |
|
| 13 |
-
### Benchmarking DQN variants on procedurally-generated mazes · SPL evaluation ·
|
| 14 |
|
| 15 |
[](https://github.com/Lee93whut/rl-maze/actions/workflows/test.yml)
|
| 16 |
[](https://www.python.org/)
|
|
@@ -22,15 +22,15 @@ license: mit
|
|
| 22 |
|
| 23 |
---
|
| 24 |
|
| 25 |
-
## Algorithm Comparison Results (Round
|
| 26 |
|
| 27 |
> Holdout evaluation: 100 independent maps **never seen** during training (seed+200000), greedy inference with ε=0.
|
| 28 |
> Metric: [SPL](https://arxiv.org/abs/1807.06757) (Anderson et al. 2018, a standard evaluation metric for navigation).
|
| 29 |
-
> Round
|
| 30 |
|
| 31 |
| Algorithm | Success rate | SPL | Peak success rate | Convergence episode |
|
| 32 |
|------|:------:|:---:|:---------:|:-----------:|
|
| 33 |
-
| **Double DQN** (
|
| 34 |
| Double DQN (R2) | 64.0% | 0.633 | 74.0% | 3300 |
|
| 35 |
| Vanilla DQN (R1) | 56.0% | 0.559 | — | 1921 |
|
| 36 |
| Double DQN (R1) | 61.0% | 0.605 | — | 948 |
|
|
@@ -98,15 +98,27 @@ if self.distance_shaping_alpha != 0.0:
|
|
| 98 |
|
| 99 |
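A wall-hit step leaves the position unchanged and does not trigger shaping, so a wall hit never receives a zero shaping reward that could mislead the policy. `α=0.5` sets the shaping magnitude to 50% of the base step reward (-1), which gives a sense of direction without overwhelming the goal reward (+100). This follows the potential-based shaping theory of Ng et al. (1999), which guarantees that the optimal policy is unchanged.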
|
| 100 |
|
| 101 |
-
### 5. Anti-Loop
|
| 102 |
|
| 103 |
-
|
| 104 |
|
| 105 |
-
|
| 106 |
|
| 107 |
-
|
| 108 |
|
| 109 |
-
|
| 110 |
|
| 111 |
### 6. BFS Connectivity Guarantee
|
| 112 |
|
|
@@ -215,7 +227,7 @@ rl-maze/
|
|
| 215 |
│
|
| 216 |
├── src/
|
| 217 |
│ ├── train.py DQN training main loop (warmup + three dashboards + blind-test evaluation)
|
| 218 |
-
│ ├── model.py DQNNetwork
|
| 219 |
│ ├── replay_buffer.py Ring-buffer experience replay pool
|
| 220 |
│ └── report.py Generator for the four-algorithm Holdout comparison report
|
| 221 |
│
|
|
@@ -250,8 +262,7 @@ rl-maze/
|
|
| 250 |
|------|---------|:--------------:|:---:|
|
| 251 |
| Round 1 | Initial version (`ep=2000`, `decay=0.995`) | 61.0% | 0.605 |
|
| 252 |
| Round 2 | `ep=6000`, `decay=0.9985` | 64.0% | 0.633 |
|
| 253 |
-
| Round 3 | `buffer=80k`, `target_freq=1500`, `shaping=0.5` | 74.0% | 0.735 |
|
| 254 |
-
| **Round 4** | `visited_map` 4th channel + EVAL-based checkpoint | **75.0%** | **0.735** |
|
| 255 |
|
| 256 |
For the full hyperparameter diagnosis and the papers behind it, see [`docs/hyperparameter_study.md`](docs/hyperparameter_study.md).
|
| 257 |
|
|
|
|
| 10 |
|
| 11 |
# RL Maze Navigator
|
| 12 |
|
| 13 |
+
### Benchmarking DQN variants on procedurally-generated mazes · SPL evaluation · 74% Holdout success rate
|
| 14 |
|
| 15 |
[](https://github.com/Lee93whut/rl-maze/actions/workflows/test.yml)
|
| 16 |
[](https://www.python.org/)
|
|
|
|
| 22 |
|
| 23 |
---
|
| 24 |
|
| 25 |
+
## Algorithm Comparison Results (Round 3, final)
|
| 26 |
|
| 27 |
> Holdout evaluation: 100 independent maps **never seen** during training (seed+200000), greedy inference with ε=0.
|
| 28 |
> Metric: [SPL](https://arxiv.org/abs/1807.06757) (Anderson et al. 2018, a standard evaluation metric for navigation).
|
| 29 |
+
> Round 3 hyperparameters: `buffer=80000`, `target_update_freq=1500`, `distance_shaping_alpha=0.5`; validated with Double DQN only.
|
| 30 |
|
| 31 |
| Algorithm | Success rate | SPL | Peak success rate | Convergence episode |
|
| 32 |
|------|:------:|:---:|:---------:|:-----------:|
|
| 33 |
+
| **Double DQN** (R3) | **74.0%** | **0.735** | **84.0%** | 3750 |
|
| 34 |
| Double DQN (R2) | 64.0% | 0.633 | 74.0% | 3300 |
|
| 35 |
| Vanilla DQN (R1) | 56.0% | 0.559 | — | 1921 |
|
| 36 |
| Double DQN (R1) | 61.0% | 0.605 | — | 948 |
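For readers unfamiliar with SPL, here is a minimal sketch of the metric as defined by Anderson et al. (2018): the mean over episodes of `S_i * l_i / max(p_i, l_i)`, where `S_i` is the success flag, `l_i` the shortest-path length and `p_i` the length of the path actually taken. The function name and argument layout are illustrative and are not taken from this repository.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        path_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    n = len(successes)
    if n == 0 or n != len(shortest_lengths) or n != len(path_lengths):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if success:
            # A successful episode scores 1.0 only if it followed a shortest path.
            total += shortest / max(taken, shortest)
        # Failed episodes contribute 0 regardless of path length.
    return total / n


# Three episodes: optimal success, success with a 2x detour, failure.
example = spl([True, True, False], [10, 10, 10], [10, 20, 5])
print(round(example, 3))
```

Because each term is at most 1 for a success and 0 for a failure, SPL can never exceed the success rate. An SPL close to the success rate, as in the table above, suggests that successful episodes follow near-shortest paths.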
|
|
|
|
| 98 |
|
| 99 |
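A wall-hit step leaves the position unchanged and does not trigger shaping, so a wall hit never receives a zero shaping reward that could mislead the policy. `α=0.5` sets the shaping magnitude to 50% of the base step reward (-1), which gives a sense of direction without overwhelming the goal reward (+100). This follows the potential-based shaping theory of Ng et al. (1999), which guarantees that the optimal policy is unchanged.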
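A minimal sketch of potential-based shaping in the sense of Ng et al. (1999), where the bonus is `γ·Φ(s') − Φ(s)`. It assumes the potential is `Φ(s) = −α · distance(s, goal)` with Manhattan distance and `γ = 1.0`; the distance measure, the default `gamma`, and the function names are assumptions for illustration, not the code in `maze_env`.

```python
from typing import Tuple

Pos = Tuple[int, int]


def manhattan(a: Pos, b: Pos) -> int:
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def shaping_bonus(prev_pos: Pos, next_pos: Pos, goal: Pos,
                  alpha: float = 0.5, gamma: float = 1.0) -> float:
    """Potential-based shaping term gamma * phi(next) - phi(prev)."""
    if next_pos == prev_pos:
        # Wall hit: the position did not change, so shaping is skipped.
        return 0.0
    phi_prev = -alpha * manhattan(prev_pos, goal)
    phi_next = -alpha * manhattan(next_pos, goal)
    return gamma * phi_next - phi_prev


goal = (0, 5)
print(shaping_bonus((0, 0), (0, 1), goal))  # one step closer to the goal
print(shaping_bonus((0, 1), (0, 0), goal))  # one step further away
```

Under these assumptions a step toward the goal adds `+0.5` and a step away adds `-0.5`, which is half the magnitude of the base step reward of -1.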
|
| 100 |
|
| 101 |
+
### 5. Anti-Loop Dual-Layer Protection
|
| 102 |
|
| 103 |
+
**Training**: a reward penalty that grows with each revisit of a cell steers the Q-function away from looping paths:
|
| 104 |
|
| 105 |
+
```python
|
| 106 |
+
if revisit_penalty != 0.0 and not info.get("hit_wall", False):
|
| 107 |
+
    visit_cnt = ep_visited.get(cur_pos, 0)
|
| 108 |
+
    if visit_cnt > 0:
|
| 109 |
+
        reward += revisit_penalty * visit_cnt  # the more visits, the heavier the penalty
|
| 110 |
+
    ep_visited[cur_pos] = visit_cnt + 1
|
| 111 |
+
```
|
| 112 |
+
|
| 113 |
+
**Inference** (demo): a Q-value penalty on frequently revisited cells acts as an additional safety net:
|
| 114 |
|
| 115 |
+
```python
|
| 116 |
+
visit_cnt = visited_count.get(cur_pos, 0)
|
| 117 |
+
if visit_cnt >= 2:
|
| 118 |
+
    q_values[action] -= 3.0 * visit_cnt
|
| 119 |
+
```
|
| 120 |
|
| 121 |
+
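The two layers have separate responsibilities: the training layer modifies the reward shaping so that the Q-function internalizes loop avoidance; the inference layer adjusts Q-values directly as a fallback and does not affect the training distribution.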
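The inference-layer snippet can be read in more than one way. The sketch below shows one possible interpretation: before the greedy argmax, each action's Q-value is reduced in proportion to how often the cell it leads to has already been visited. The action encoding in `MOVES`, the function name, and the choice to penalize the target cell are assumptions for illustration; only the `3.0` factor and the `>= 2` threshold come from the snippet.

```python
from typing import Dict, Sequence, Tuple

Pos = Tuple[int, int]

# Hypothetical action encoding: 0=up, 1=down, 2=left, 3=right.
MOVES: Dict[int, Pos] = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def penalized_greedy_action(q_values: Sequence[float], cur_pos: Pos,
                            visited_count: Dict[Pos, int],
                            penalty: float = 3.0, min_visits: int = 2) -> int:
    """Greedy action after penalizing moves into frequently visited cells."""
    adjusted = list(q_values)  # copy, so the network output is left untouched
    for action, (d_row, d_col) in MOVES.items():
        target = (cur_pos[0] + d_row, cur_pos[1] + d_col)
        visit_cnt = visited_count.get(target, 0)
        if visit_cnt >= min_visits:
            adjusted[action] -= penalty * visit_cnt
    return max(range(len(adjusted)), key=adjusted.__getitem__)


q = [1.0, 0.9, 0.0, 0.0]
print(penalized_greedy_action(q, (1, 1), {}))           # no history: plain argmax
print(penalized_greedy_action(q, (1, 1), {(0, 1): 2}))  # cell above visited twice
```

Since the adjustment is applied to a copy at action-selection time, nothing is written to the replay buffer, which is consistent with the statement that the inference layer does not affect the training distribution.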
|
| 122 |
|
| 123 |
### 6. BFS Connectivity Guarantee
|
| 124 |
|
|
|
|
| 227 |
│
|
| 228 |
├── src/
|
| 229 |
│ ├── train.py DQN training main loop (warmup + three dashboards + blind-test evaluation)
|
| 230 |
+
│ ├── model.py DQNNetwork / DuelingDQNNetwork (3-Conv + 2-FC)
|
| 231 |
│ ├── replay_buffer.py Ring-buffer experience replay pool
|
| 232 |
│ └── report.py Generator for the four-algorithm Holdout comparison report
|
| 233 |
│
|
|
|
|
| 262 |
|------|---------|:--------------:|:---:|
|
| 263 |
| Round 1 | Initial version (`ep=2000`, `decay=0.995`) | 61.0% | 0.605 |
|
| 264 |
| Round 2 | `ep=6000`, `decay=0.9985` | 64.0% | 0.633 |
|
| 265 |
+
| Round 3 | `buffer=80k`, `target_freq=1500`, `shaping=0.5` | **74.0%** | **0.735** |
|
|
|
|
| 266 |
|
| 267 |
For the full hyperparameter diagnosis and the papers behind it, see [`docs/hyperparameter_study.md`](docs/hyperparameter_study.md).
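The `decay` values in the table control how quickly exploration is annealed. As a rough worked example, assume ε starts at 1.0 and is multiplied by `decay` once per episode; this is an assumption, since the actual schedule and any ε floor are defined in `train.py` and not shown here. The sketch computes how many episodes pass before ε falls to a threshold. The function name and the 0.05 threshold are illustrative.

```python
import math


def episodes_until(decay: float, threshold: float, eps_start: float = 1.0) -> int:
    """Smallest n such that eps_start * decay**n <= threshold."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be in (0, 1)")
    if not 0.0 < threshold < eps_start:
        raise ValueError("threshold must be in (0, eps_start)")
    # Solve eps_start * decay**n <= threshold for n; log(decay) is negative.
    return math.ceil(math.log(threshold / eps_start) / math.log(decay))


for decay in (0.995, 0.9985):
    print(decay, episodes_until(decay, 0.05))
```

Under these assumptions, `decay=0.995` reaches ε ≤ 0.05 after 598 episodes and `decay=0.9985` after 1996, so the Round 2 schedule explores for more than three times as many episodes.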
|
| 268 |
|