
TIL-26-AE: Automated Exploration Bomberman Agent

  • Repository: E-Rong/til-26-ae-agent
  • Challenge: The Intelligent League (TIL) - Automated Exploration (AE)
  • Base Environment: e-rong/til-26-ae Space
  • Model Repo: E-Rong/til-26-ae-agent (checkpoints + inference code)


Table of Contents

  1. Research & Literature Review
  2. Problem Analysis
  3. Development Decisions
  4. Training Phases
  5. Results
  6. Artifacts
  7. Next Steps

1. Research & Literature Review

1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is autonomous exploration.

1.2 Key Papers

| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| Pommerman: A Multi-Agent Benchmark | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| MAPPO | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| Invalid Action Masking | 2006.14171 | Masks logits before softmax | Directly applicable |
| PPO Algorithms | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |

1.3 Why MaskablePPO?

Bomberman agents cannot move into walls or out of bounds, and cannot place bombs with an empty stockpile. The observation includes action_mask: uint8[6]. Standard PPO would waste ~30-40% of samples on illegal moves. MaskablePPO masks the logits before the softmax, ensuring only legal actions are sampled.
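sb3-contrib implements this inside MaskablePPO; the core idea can be sketched in plain numpy (a minimal sketch of the technique, not the library's code):

```python
import numpy as np

def masked_softmax(logits, action_mask):
    """Set logits of illegal actions to -inf before the softmax,
    so illegal actions receive exactly zero probability."""
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    z = masked - masked.max()      # shift for numerical stability
    exp = np.exp(z)                # exp(-inf) = 0 for masked entries
    return exp / exp.sum()

logits = np.array([1.0, 2.0, 0.5, -1.0, 0.0, 3.0])
mask = np.array([1, 1, 0, 0, 1, 0], dtype=np.uint8)  # 1 = legal action
probs = masked_softmax(logits, mask)
```

Illegal actions get probability 0 and the remaining mass renormalizes over the legal ones, so no rollout samples are spent on moves the environment would reject.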

1.4 Why Curriculum Learning?

Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.

1.5 Why Not DQN?

DQN struggles with action masking (it requires a custom architecture to mask Q-values). PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and PPO has mature masking support in sb3-contrib.


2. Problem Analysis

2.1 Environment Structure

  • Grid size: 16×16
  • Agents: Configurable (default 2 teams, Phase 3 uses 3)
  • Observations: Dict with agent_viewcone[7×5×25], base_viewcone[5×5×25], direction, location, health, action_mask[6], etc.
  • Actions: Discrete(6) - FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
  • Episode length: ~200 steps

2.2 Observation Flattening

The Dict observation is flattened to a 1511-dim vector: agent_viewcone (875) + base_viewcone (625) + 11 scalars.
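A minimal sketch of the flattening, using the field names from Section 2.1; the single `scalars` vector standing in for the 11 scalar entries (direction, location, health, ...) is an assumption about how they are grouped:

```python
import numpy as np

def flatten_obs(obs):
    """Concatenate the Dict observation into a fixed-order flat vector:
    875 (agent viewcone) + 625 (base viewcone) + 11 scalars = 1511."""
    parts = [
        obs["agent_viewcone"].ravel().astype(np.float32),  # 7*5*25 = 875
        obs["base_viewcone"].ravel().astype(np.float32),   # 5*5*25 = 625
        obs["scalars"].astype(np.float32),                 # 11 scalar entries
    ]
    return np.concatenate(parts)

# Dummy observation with the shapes listed in Section 2.1
obs = {
    "agent_viewcone": np.zeros((7, 5, 25), dtype=np.uint8),
    "base_viewcone": np.zeros((5, 5, 25), dtype=np.uint8),
    "scalars": np.zeros(11, dtype=np.float32),
}
flat = flatten_obs(obs)
```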

2.3 Action Masking

Critical bug found: Monitor must wrap outside ActionMasker, not inside; otherwise get_action_masks() fails because Monitor does not itself expose action_masks().
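The correct construction order can be illustrated with toy stand-in classes (these are not the real sb3 classes; the assumption is gymnasium-style attribute forwarding down the wrapper chain):

```python
class ToyEnv:
    """Bare stand-in for the base environment."""
    pass

class ToyActionMasker:
    """Adds action_masks(), analogous to sb3_contrib's ActionMasker."""
    def __init__(self, env, mask_fn):
        self.env, self.mask_fn = env, mask_fn
    def action_masks(self):
        return self.mask_fn(self.env)
    def __getattr__(self, name):          # forward everything else inward
        return getattr(self.env, name)

class ToyMonitor:
    """Stats wrapper; defines no action_masks() of its own."""
    def __init__(self, env):
        self.env = env
    def __getattr__(self, name):          # forward lookups inward
        return getattr(self.env, name)

def mask_fn(env):
    return [1, 1, 1, 1, 1, 0]             # PLACE_BOMB currently illegal

# Correct order: Monitor outside, ActionMasker inside.
env = ToyMonitor(ToyActionMasker(ToyEnv(), mask_fn))
masks = env.action_masks()                # resolved through the chain
```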


3. Development Decisions

3.1 Single-Agent Wrapper

The wrapper controls only agent_0; opponents use random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.
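A sketch of the reduction (the PettingZoo-style parallel interface with an action dict is an assumption, as is the `_ToyMAEnv` stand-in used for the demo):

```python
class SingleAgentWrapper:
    """Expose a multi-agent env as single-agent: the learner drives
    agent_0, a fixed opponent policy drives everyone else."""
    def __init__(self, ma_env, opponent_policy):
        self.ma_env = ma_env
        self.opponent_policy = opponent_policy

    def reset(self):
        obs = self.ma_env.reset()
        return obs["agent_0"]

    def step(self, action):
        actions = {"agent_0": action}
        for agent in self.ma_env.agents:
            if agent != "agent_0":
                actions[agent] = self.opponent_policy(agent)
        obs, rew, done, info = self.ma_env.step(actions)
        return obs["agent_0"], rew["agent_0"], done["agent_0"], info.get("agent_0", {})

class _ToyMAEnv:
    """Tiny stand-in multi-agent env for demonstration only."""
    agents = ["agent_0", "agent_1"]
    def reset(self):
        return {a: 0 for a in self.agents}
    def step(self, actions):
        obs = {a: 1 for a in self.agents}
        rew = {a: float(actions[a]) for a in self.agents}
        done = {a: False for a in self.agents}
        return obs, rew, done, {}

env = SingleAgentWrapper(_ToyMAEnv(), opponent_policy=lambda agent: 4)  # opponents STAY
first_obs = env.reset()
obs, reward, done, info = env.step(5)    # agent_0 acts; opponent is filled in
```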

3.2 3-Phase Curriculum

| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| 1 | Random | 500k | Learn movement, bombs, basics |
| 2 | Random + exploration bonus | 500k | Prevent camping exploit |
| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |

3.3 Library Philosophy

  • stable-baselines3 for PPO core
  • sb3-contrib for MaskablePPO + ActionMasker
  • huggingface_hub for persistent checkpoint storage

3.4 Why Push to the Hub Every 50k Steps

Sandbox resets (T4 container recycling) wiped the local /app/data/ directory multiple times. Hub checkpointing saved the project at 400k steps when training crashed.
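The cadence logic can be sketched as a periodic-checkpoint callback (a hypothetical class, not sb3's API; `upload_fn` stands in for a call like `huggingface_hub.upload_file`):

```python
class HubCheckpointCallback:
    """Trigger an upload whenever `interval` steps have elapsed since
    the last one. Real training would pass a function that saves the
    model zip and pushes it to the Hub."""
    def __init__(self, upload_fn, interval=50_000):
        self.upload_fn = upload_fn
        self.interval = interval
        self.last_upload = 0

    def on_step(self, num_timesteps):
        if num_timesteps - self.last_upload >= self.interval:
            self.upload_fn(f"ckpt_{num_timesteps}.zip")
            self.last_upload = num_timesteps

# Demo: record upload names instead of touching the network.
uploads = []
cb = HubCheckpointCallback(uploads.append, interval=50_000)
for t in range(0, 200_001, 2048):     # steps land on rollout boundaries
    cb.on_step(t)
```

Because steps arrive in rollout-sized chunks, uploads land on the first boundary at or after each 50k mark rather than exactly on it.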


4. Training Phases

4.1 Phase 1: Foundation (vs Random)

  • Duration: 500,352 steps
  • Result: 92% win rate, avg reward 180.1, 100% survival
  • Challenges: wrapper ordering, dependency issues, sandbox resets

4.2 Phase 2: Exploration Shaping (COMPLETE)

  • Duration: 500,408 additional steps (600,352 → 1,001,760)
  • Mechanism: visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths)
  • Hardware: A10G, ~50 FPS
  • Wall time: ~2h 45min
  • Result: 93.0% win rate, avg reward 153.4, avg bombs 20.1
  • Key insight: reward decreased (180 → 153) but win rate increased (92% → 93%), confirming that exploration makes the policy more robust at the cost of safe base-camping reward.
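The Phase 2 shaping can be sketched as follows; note that whether the anneal factor is tanh(...) or 1 - tanh(...) of average enemy deaths is an assumption (shown here fading the bonus out as the agent starts winning fights):

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)   # per-cell visit counter, reset per episode

def exploration_bonus(cell, avg_enemy_deaths):
    """Visit-count bonus 1/(1+visits), annealed away as combat success
    (avg enemy deaths) rises. Exact scaling is an assumption."""
    bonus = 1.0 / (1 + visit_counts[cell])
    visit_counts[cell] += 1
    return (1.0 - math.tanh(avg_enemy_deaths)) * bonus

first = exploration_bonus((3, 4), avg_enemy_deaths=0.0)   # fresh cell
second = exploration_bonus((3, 4), avg_enemy_deaths=0.0)  # revisit, smaller bonus
```

This gives full bonus on the first visit to a cell, halves it on the second, and vanishes once the agent reliably kills opponents, which matches the observed trade of camping reward for robustness.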

4.3 Phase 3: Curriculum Self-Play (PENDING)

  • Script: phase3_curriculum.py (ready on Hub)
  • Plan: 5-stage rule-based curriculum, static → random → simple_bomb → evasive → mixed
  • Duration: 1M steps
  • Advancement gate: >55% win rate per stage
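The advancement rule can be sketched in a few lines (a sketch of the gate described above, not the actual phase3_curriculum.py code):

```python
STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]
WIN_RATE_GATE = 0.55   # advance when eval win rate exceeds 55%

def next_stage(stage_idx, win_rate):
    """Move to the next curriculum stage once the win-rate gate is
    cleared; hold the current (or final) stage otherwise."""
    if win_rate > WIN_RATE_GATE and stage_idx < len(STAGES) - 1:
        return stage_idx + 1
    return stage_idx

stage = next_stage(0, win_rate=0.62)   # "static" cleared, advance to "random"
```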


5. Results

5.1 Phase 1 Results

| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | 92.0% |
| Avg Reward (eval) | 180.1 |
| Survival Rate | 100.0% |

5.2 Phase 2 Results

| Metric | Value |
|---|---|
| Timesteps | 1,001,760 total (500,408 new) |
| FPS | 50 (A10G) |
| Wall time | ~2h 45min |
| Win Rate (eval) | 93.0% |
| Avg Reward (eval) | 153.4 |
| Avg Bombs | 20.1 |

6. Artifacts

| File | Purpose |
|---|---|
| phase1_final.zip | Phase 1 complete checkpoint |
| phase2_final.zip | Phase 2 complete checkpoint |
| phase2_ckpt_*.zip | Phase 2 intermediates (650k–1M) |
| phase2_eval_results.txt | Phase 2 evaluation metrics |
| ae_manager.py | Inference code |
| docs/ae.md | This documentation |

7. Next Steps

  • Submit Phase 3 HF Job (phase3_curriculum.py)
  • Monitor 5-stage curriculum progression
  • Evaluate final model vs mixed rule-based opponents
  • Future: CNN policy, opponent modeling, LSTM memory

Last updated: 2026-05-14 - Phase 2 complete, Phase 3 ready