| # ───────────────────────────────────────────── | |
| # Maze Environment Configuration | |
| # ───────────────────────────────────────────── | |
| maze: | |
| grid_size: 10 | |
| obstacle_density: 0.25 | |
| max_steps: 200 | |
| # The seed field has been removed: normal training samples maps randomly to ensure generalization. | |
| # When a fixed map is needed, use the overfit section or call env.reset(seed=X) explicitly. | |
| # ───────────────────────────────────────────── | |
| # Reward shaping | |
| # ───────────────────────────────────────────── | |
| rewards: | |
| goal: 100 | |
| wall_hit: -10 | |
| step: -1 | |
| distance_shaping_alpha: 0.0 # distance-shaping coefficient; 0 = disabled | |
| # env.py supports this parameter internally (extra per-step reward = alpha × ΔManhattan distance), | |
| # but after the refactor train.py no longer passes this field through, so the effective value is 0.0. | |
| # To enable it, append `distance_shaping_alpha=distance_shaping_alpha` where src/train.py | |
| # constructs MazeEnv(...) by hand, and parse | |
| # reward_cfg.get("distance_shaping_alpha", 0.0) where the config is read. | |
| revisit_penalty: 0.0 # moved to the state layer (visited_map, channel 4, encodes visit history; Markov-correct) | |
| # a reward-layer revisit_penalty violates the Markov property; deprecated | |
| # ───────────────────────────────────────────── | |
| # DQN Training Hyperparameters | |
| # ───────────────────────────────────────────── | |
| dqn: | |
| # ── Reproducibility ────────────────────────── | |
| seed: 42 | |
| # ── Algorithm variant ───────────────────────────────────────────────── | |
| # vanilla : DQNNetwork + Vanilla Target (Mnih et al., 2015) | |
| # double : DQNNetwork + Double DQN (van Hasselt et al., AAAI 2016) | |
| # dueling : DuelingDQN + Vanilla Target (Wang et al., 2016) ← best (R4 Holdout 84%) | |
| # double_dueling : DuelingDQN + Double DQN (the two improvements are orthogonal and stack) | |
| algorithm: "dueling" | |
| # ── Replay Buffer ──────────────────────────── | |
| buffer_capacity: 80000 # max transitions stored (ring-list, O(batch_size) sampling) | |
| # r2=20000 turned over in about 250 episodes, so successful samples vanished quickly; from r3 on, enlarged to 80000 (about 1000 episodes) | |
| batch_size: 64 # SGD mini-batch size | |
| # ── Training schedule ──────────────────────── | |
| num_episodes: 5000 # total training episodes | |
| # with r1=2000 the curve had not converged; from r2 on, raised to 6000 | |
| # from r4 on, changed to 5000: R3 peaked at ep=3750, so 5000 leaves headroom and saves time | |
| learning_rate: 0.0005 | |
| gamma: 0.99 # discount factor | |
| # ── ε-greedy exploration ───────────────────── | |
| epsilon_start: 1.0 | |
| epsilon_end: 0.05 | |
| epsilon_decay: 0.9985 # multiplicative decay per episode (after warmup) | |
| # r1=0.995 made exploration bottom out at ep≈800, so sample diversity dried up over the remaining 1200 episodes | |
| # from r2 on, adjusted to 0.9985: exploration only bottoms out at ep≈2189, covering the full effective training period | |
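| # Rough check of the floor episode. Assumption (not confirmed here): ε is multiplied by | |
| # epsilon_decay once per post-warmup episode and clipped at epsilon_end. | |
| # n = ln(epsilon_end / epsilon_start) / ln(epsilon_decay) | |
| # decay 0.995  : ln(0.05) / ln(0.995)  ≈ 598 decay episodes | |
| # decay 0.9985 : ln(0.05) / ln(0.9985) ≈ 1996 decay episodes | |
| # Add the warmup length in effect for the run to get the absolute episode number. | |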
| # ── Target network sync ────────────────────── | |
| target_update_freq: 1500 # hard-copy every N gradient update steps | |
| # r2=500: with random start/goal the Q variance is large, and syncing too often caused target drift; from r3 on, adjusted to 1500 | |
| # ── Episode-based warmup ───────────────────── | |
| warmup_episodes: 200 # first N episodes: pure random (ε=1.0), no grad updates | |
| # ── TensorBoard three-category logging ─────── | |
| eval_every: 100 # Evaluation_Exam/ frequency (episodes) | |
| # r4 changed this from 50 to 100: less EVAL overhead, faster training (saves about 20%) | |
| num_test_mazes: 50 # blind test mazes per evaluation | |
| # ── Logging & saving ───────────────────────── | |
| log_dir: "runs" # TensorBoard log root | |
| save_dir: "results" # directory for best_model.pth | |
| success_window: 100 # rolling window for success-rate metric | |
| save_window: 50 # rolling window for best-model save trigger | |
| print_every: 10 # console print frequency (episodes) | |
| # ── Start / Goal position ───────────────────── | |
| # false (default): fixed start (1,1) and goal (N-2,N-2); compatible with existing trained models | |
| # true: start and goal are picked at random each episode, and evaluation is randomized too; requires retraining the model | |
| random_start_goal: true | |
| # ───────────────────────────────────────────── | |
| # Overfit (debug) mode: 5×5 tiny maze | |
| # ───────────────────────────────────────────── | |
| overfit: | |
| grid_size: 5 | |
| obstacle_density: 0.0 # no random obstacles → deterministic map | |
| max_steps: 50 | |
| seed: 0 | |
| num_episodes: 500 | |
| epsilon_decay: 0.990 | |
| warmup_episodes: 50 # shorter warmup for overfit debug | |
| batch_size: 32 | |
| target_update_freq: 100 | |
| eval_every: 50 | |
| num_test_mazes: 10 | |
| print_every: 50 | |
| algorithm: "double_dueling" | |