TIL-26-AE: Automated Exploration Bomberman Agent
Repository: E-Rong/til-26-ae-agent
Challenge: The Intelligent League (TIL) - Automated Exploration (AE)
Base Environment: e-rong/til-26-ae Space
Model Repo: E-Rong/til-26-ae-agent (checkpoints + inference code)
Table of Contents
- Research & Literature Review
- Problem Analysis
- Development Decisions
- Training Phases
- Results
- Artifacts
- Next Steps
1. Research & Literature Review
1.1 Domain: Multi-Agent Bomberman RL
The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is autonomous exploration.
1.2 Key Papers
| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| Pommerman: A Multi-Agent Benchmark | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| MAPPO | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| Invalid Action Masking | 2006.14171 | Masks logits before softmax | Directly applicable |
| PPO Algorithms | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |
1.3 Why MaskablePPO?
Bomberman agents cannot move into walls or out of bounds, and cannot place a bomb when none remain in their stockpile. The observation includes action_mask: uint8[6]. Standard PPO would waste roughly 30-40% of its samples on illegal moves; MaskablePPO masks the logits before the softmax, ensuring only legal actions are sampled.
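The mechanism can be sketched in a few lines of numpy (this illustrates the idea, not sb3-contrib's actual implementation): illegal logits are set to -inf, so their softmax probability is exactly zero.

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Zero out illegal actions by masking logits to -inf before softmax."""
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    z = masked - masked.max()  # subtract max of legal logits for numerical stability
    probs = np.exp(z)          # exp(-inf) == 0, so illegal actions get probability 0
    return probs / probs.sum()

# 6 actions; mask says actions 2 and 5 are illegal (e.g. no bomb in stockpile)
logits = np.array([0.5, 1.2, 3.0, -0.3, 0.1, 2.0])
mask = np.array([1, 1, 0, 1, 1, 0], dtype=np.uint8)
probs = masked_softmax(logits, mask)
```

Sampling from `probs` can then never produce an illegal action, which is exactly why no rollout samples are wasted.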
1.4 Why Curriculum Learning?
Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.
1.5 Why Not DQN?
DQN struggles with action masking (it requires a custom architecture). PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and MaskablePPO has mature masking support in sb3-contrib.
2. Problem Analysis
2.1 Environment Structure
- Grid size: 16×16
- Agents: Configurable (default 2 teams, Phase 3 uses 3)
- Observations: Dict with agent_viewcone [7×5×25], base_viewcone [5×5×25], direction, location, health, action_mask [6], etc.
- Actions: Discrete(6) → FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- Episode length: ~200 steps
2.2 Observation Flattening
Flattened to 1511-dim vector: agent_viewcone(875) + base_viewcone(625) + 11 scalars.
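A minimal sketch of that flattening (field names mirror the observation Dict above; collapsing the 11 scalar fields into a single hypothetical `scalars` entry is a simplification for the example):

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    """Flatten the Dict observation into a single 1511-dim float32 vector."""
    parts = [
        obs["agent_viewcone"].ravel(),                  # 7*5*25 = 875
        obs["base_viewcone"].ravel(),                   # 5*5*25 = 625
        np.asarray(obs["scalars"], dtype=np.float32),   # 11 scalars (direction, location, health, ...)
    ]
    return np.concatenate(parts).astype(np.float32)

# Dummy observation with the shapes described in section 2.1
obs = {
    "agent_viewcone": np.zeros((7, 5, 25)),
    "base_viewcone": np.zeros((5, 5, 25)),
    "scalars": np.zeros(11),
}
vec = flatten_obs(obs)  # shape (1511,)
```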
2.3 Action Masking
Critical bug found: Monitor must wrap outside ActionMasker, i.e. Monitor(ActionMasker(env)), not inside. With the order reversed, get_action_masks() fails because Monitor does not expose action_masks().
3. Development Decisions
3.1 Single-Agent Wrapper
The wrapper controls only agent_0; opponents follow random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the multi-agent problem to single-agent RL in a non-stationary environment.
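A sketch of the idea, assuming a PettingZoo-style parallel API (dict-keyed reset/step); `StubEnv` is a stand-in for the real TIL-26-AE environment, not its actual interface:

```python
import random

class SingleAgentWrapper:
    """The learner drives agent_0; every other agent follows an opponent policy."""

    def __init__(self, parallel_env, opponent_policy=None):
        self.env = parallel_env
        # Default opponent: uniform random over the 6 discrete actions (Phase 1-2 style)
        self.opponent = opponent_policy or (lambda obs: random.randrange(6))

    def reset(self):
        return self.env.reset()["agent_0"]

    def step(self, action):
        # Build the joint action dict, then return only agent_0's view of the result
        joint = {agent: action if agent == "agent_0" else self.opponent(None)
                 for agent in self.env.agents}
        obs, rewards, dones, infos = self.env.step(joint)
        return obs["agent_0"], rewards["agent_0"], dones["agent_0"], infos["agent_0"]

# Tiny stub standing in for the real environment
class StubEnv:
    agents = ["agent_0", "agent_1"]
    def reset(self):
        return {a: 0 for a in self.agents}
    def step(self, joint):
        # For illustration, each agent's reward echoes its own action
        return ({a: 0 for a in self.agents},
                {a: float(joint[a]) for a in self.agents},
                {a: False for a in self.agents},
                {a: {} for a in self.agents})

env = SingleAgentWrapper(StubEnv())
obs = env.reset()
obs, reward, done, info = env.step(3)
```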
3.2 3-Phase Curriculum
| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| 1 | Random | 500k | Learn movement, bombs, basics |
| 2 | Random + exploration bonus | 500k | Prevent camping exploit |
| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |
3.3 Philosophy
- `stable-baselines3` for the PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage
3.4 Why Hub Every 50k Steps
Sandbox resets (T4 container recycling) caused local /app/data/ loss multiple times. Hub checkpointing saved the project at 400k steps when training crashed.
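The periodic-upload logic can be sketched as a small callback. The `upload` function is injected so the sketch stays testable offline; in practice it would be something like `huggingface_hub.HfApi().upload_file` targeting E-Rong/til-26-ae-agent, and the checkpoint filename pattern here is illustrative.

```python
class HubCheckpointCallback:
    """Push a checkpoint to the Hub every `every` environment steps."""

    def __init__(self, upload, every=50_000):
        self.upload = upload   # e.g. a closure around HfApi().upload_file
        self.every = every
        self.last = 0          # step count at the last upload

    def on_step(self, num_timesteps: int):
        # Fire once each time another `every` steps have elapsed
        if num_timesteps - self.last >= self.every:
            self.upload(f"phase_ckpt_{num_timesteps}.zip")
            self.last = num_timesteps

# Simulate a 200k-step run, recording what would have been uploaded
uploads = []
cb = HubCheckpointCallback(uploads.append, every=50_000)
for t in range(0, 200_001, 10_000):
    cb.on_step(t)
```

With this schedule a crash costs at most 50k steps of progress, which is what saved the run described above.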
4. Training Phases
4.1 Phase 1: Foundation (vs Random)
Duration: 500,352 steps
Result: 92% win rate, 180.1 avg reward, 100% survival
Challenges: wrapper ordering, dependency issues, sandbox resets
4.2 Phase 2: Exploration Shaping (COMPLETE)
Duration: 500,408 additional steps (600,352 → 1,001,760)
Mechanism: visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths)
Hardware: A10G, ~50 FPS
Wall time: ~2h 45min
Result: 93.0% win rate, 153.4 avg reward, 20.1 avg bombs
Key insight: reward decreased (180 → 153) but win rate increased (92% → 93%), confirming that exploration makes the policy more robust at the cost of safe base-camping reward.
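The shaping above can be sketched as follows. The per-cell bonus 1/(1+visits) is from the text; using 1 - tanh(avg_enemy_deaths) as the annealing weight (so the bonus fades once the agent starts winning fights) is our assumption about how the tanh term is applied.

```python
import math
from collections import defaultdict

class VisitBonus:
    """Phase 2 exploration shaping: count-based bonus with adaptive annealing."""

    def __init__(self):
        self.visits = defaultdict(int)  # visit count per grid cell

    def bonus(self, cell, avg_enemy_deaths: float) -> float:
        b = 1.0 / (1 + self.visits[cell])         # novelty: 1, 1/2, 1/3, ...
        self.visits[cell] += 1
        weight = 1.0 - math.tanh(avg_enemy_deaths)  # assumed annealing schedule
        return weight * b

vb = VisitBonus()
first = vb.bonus((3, 4), avg_enemy_deaths=0.0)   # fresh cell, no kills yet: full bonus
second = vb.bonus((3, 4), avg_enemy_deaths=0.0)  # revisit: bonus halves
```

As avg_enemy_deaths grows, the weight approaches zero, so a competent policy is no longer paid just for wandering.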
4.3 Phase 3: Curriculum Self-Play (PENDING)
Script: phase3_curriculum.py (ready on Hub)
Plan: 5-stage rule-based curriculum (static → random → simple_bomb → evasive → mixed)
Duration: 1M steps
Advancement gate: >55% win rate per stage
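The plan above can be sketched as a simple gate object; the rolling window of 100 episodes is an assumption, as the source only specifies the >55% threshold and the stage order.

```python
from collections import deque

STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]

class CurriculumGate:
    """Advance to the next opponent stage once the rolling win rate exceeds the gate."""

    def __init__(self, window=100, threshold=0.55):
        self.stage = 0
        self.results = deque(maxlen=window)  # recent episode outcomes (True = win)
        self.threshold = threshold

    def record(self, won: bool):
        self.results.append(won)
        full = len(self.results) == self.results.maxlen
        if full and sum(self.results) / len(self.results) > self.threshold:
            if self.stage < len(STAGES) - 1:
                self.stage += 1       # advance the curriculum
                self.results.clear()  # restart the window against the new opponent

    @property
    def opponent(self) -> str:
        return STAGES[self.stage]

gate = CurriculumGate(window=10)
for _ in range(10):
    gate.record(True)  # 100% win rate over the window triggers an advance
```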
5. Results
5.1 Phase 1 Results
| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | 92.0% |
| Avg Reward (eval) | 180.1 |
| Survival Rate | 100.0% |
5.2 Phase 2 Results
| Metric | Value |
|---|---|
| Timesteps | 1,001,760 total (500,408 new) |
| FPS | 50 (A10G) |
| Wall time | ~2h 45min |
| Win Rate (eval) | 93.0% |
| Avg Reward (eval) | 153.4 |
| Avg Bombs | 20.1 |
6. Artifacts
| File | Purpose |
|---|---|
| `phase1_final.zip` | Phase 1 complete checkpoint |
| `phase2_final.zip` | Phase 2 complete checkpoint |
| `phase2_ckpt_*.zip` | Phase 2 intermediates (650k → 1M) |
| `phase2_eval_results.txt` | Phase 2 evaluation metrics |
| `ae_manager.py` | Inference code |
| `docs/ae.md` | This documentation |
7. Next Steps
- Submit Phase 3 HF Job (`phase3_curriculum.py`)
- Monitor 5-stage curriculum progression
- Evaluate final model vs mixed rule-based opponents
- Future: CNN policy, opponent modeling, LSTM memory
Last updated: 2026-05-14 (Phase 2 complete, Phase 3 ready)