# ─────────────────────────────────────────────
# Maze Environment Configuration
# ─────────────────────────────────────────────
maze:
  grid_size: 10
  obstacle_density: 0.25
  max_steps: 200
  # The seed field has been removed: regular training uses randomized, diverse
  # maps to support generalization.
  # To pin the map, use the overfit section or call env.reset(seed=X) explicitly.
# ─────────────────────────────────────────────
# Reward shaping
# ─────────────────────────────────────────────
rewards:
  goal: 100
  wall_hit: -10
  step: -1
  distance_shaping_alpha: 0.0  # distance-shaping coefficient; 0 = disabled
  # env.py supports this parameter internally (extra per-step reward =
  # alpha × Δ Manhattan distance), but since the refactor train.py no longer
  # passes the field through, so the effective value is 0.0.
  # To enable it, add `distance_shaping_alpha=distance_shaping_alpha` where
  # src/train.py constructs MazeEnv(...) by hand, and parse
  # reward_cfg.get("distance_shaping_alpha", 0.0) where the config is read.
  revisit_penalty: 0.0  # moved to the state layer: the visited_map 4th channel encodes visit history (Markov-correct)
  # A reward-level revisit_penalty violates the Markov property and is deprecated.
# ─────────────────────────────────────────────
# DQN Training Hyperparameters
# ─────────────────────────────────────────────
dqn:
  # ── Reproducibility ──────────────────────────
  seed: 42
  # ── Algorithm variant ────────────────────────
  #   vanilla        : DQNNetwork + Vanilla Target (Mnih et al., 2015)
  #   double         : DQNNetwork + Double DQN     (van Hasselt et al., AAAI 2016)
  #   dueling        : DuelingDQN + Vanilla Target (Wang et al., 2016)  ← best (R4 holdout 84%)
  #   double_dueling : DuelingDQN + Double DQN     (the two improvements are orthogonal and stack)
  algorithm: "dueling"
  # ── Replay Buffer ────────────────────────────
  buffer_capacity: 80000  # max transitions stored (ring-list, O(batch_size) sampling)
  # r2 used 20000, which turned over in about 250 episodes, so successful
  # samples disappeared quickly; r3 onward uses 80000 (about 1000 episodes).
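  # Worked check of the turnover figures (implied by the numbers quoted for
  # r2 and r3, not measured): 20000 / 250 = 80 and 80000 / 1000 = 80, so both
  # estimates assume an average of about 80 transitions per episode.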
  batch_size: 64  # SGD mini-batch size
  # ── Training schedule ────────────────────────
  num_episodes: 5000  # total training episodes
  # r1 used 2000 and the curve had not converged; r2 raised it to 6000.
  # r4 onward uses 5000: R3 peaked at ep=3750, so 5000 leaves headroom and saves time.
  learning_rate: 0.0005
  gamma: 0.99  # discount factor
  # ── ε-greedy exploration ─────────────────────
  epsilon_start: 1.0
  epsilon_end: 0.05
  epsilon_decay: 0.9985  # multiplicative decay per episode (after warmup)
  # r1 used 0.995: exploration bottomed out at ep≈800, and sample diversity
  # dried up over the remaining 1200 episodes.
  # r2 onward uses 0.9985: ε reaches its floor only at ep≈2189, covering the
  # full effective training period.
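  # Worked estimate of when ε reaches its floor (a sketch; assumes ε is
  # multiplied by epsilon_decay once per post-warmup episode, is clipped at
  # epsilon_end, and that every run used warmup_episodes = 200):
  #   n = ln(epsilon_end / epsilon_start) / ln(epsilon_decay)
  #   0.995  : ln(0.05) / ln(0.995)  ≈ 598 decay episodes
  #   0.9985 : ln(0.05) / ln(0.9985) ≈ 1996 decay episodes
  # Adding the 200 warmup episodes gives roughly ep 798 and ep 2196, in line
  # with the ep≈800 and ep≈2189 figures in the comments above; the exact
  # episode depends on how the training loop counts warmup and applies decay.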
  # ── Target network sync ──────────────────────
  target_update_freq: 1500  # hard-copy every N gradient update steps
  # r2 used 500: with random start/goal positions the Q-value variance is high,
  # and syncing too often caused target drift; r3 onward uses 1500.
  # ── Episode-based warmup ─────────────────────
  warmup_episodes: 200  # first N episodes: pure random (ε=1.0), no grad updates
  # ── TensorBoard three-category logging ───────
  eval_every: 100  # Evaluation_Exam/ frequency (episodes)
  # r4 changed this from 50 to 100 to cut EVAL overhead and speed up training (saves about 20%).
  num_test_mazes: 50  # blind test mazes per evaluation
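  # Worked count of evaluation cost (assumes one evaluation every eval_every
  # training episodes across the whole run, each on num_test_mazes mazes):
  #   eval_every = 100 : 5000 / 100 = 50 evaluations  × 50 mazes = 2500 eval episodes
  #   eval_every = 50  : 5000 / 50  = 100 evaluations × 50 mazes = 5000 eval episodes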
  # ── Logging & saving ─────────────────────────
  log_dir: "runs"  # TensorBoard log root
  save_dir: "results"  # directory for best_model.pth
  success_window: 100  # rolling window for success-rate metric
  save_window: 50  # rolling window for best-model save trigger
  print_every: 10  # console print frequency (episodes)
  # ── Start / Goal position ────────────────────
  # false (default): fixed start (1,1) and goal (N-2,N-2); compatible with existing trained models
  # true           : start and goal are randomized every episode, and evaluation is randomized too; the model must be retrained
  random_start_goal: true
# ─────────────────────────────────────────────
# Overfit (debug) mode: 5×5 tiny maze
# ─────────────────────────────────────────────
overfit:
  grid_size: 5
  obstacle_density: 0.0  # no random obstacles → deterministic map
  max_steps: 50
  seed: 0
  num_episodes: 500
  epsilon_decay: 0.990
  warmup_episodes: 50  # shorter warmup for overfit debug
  batch_size: 32
  target_update_freq: 100
  eval_every: 50
  num_test_mazes: 10
  print_every: 50
  algorithm: "double_dueling"