
TIL-26-AE: Automated Exploration Bomberman Agent

  • Repository: E-Rong/til-26-ae-agent
  • Challenge: The Intelligent League (TIL) - Automated Exploration (AE)
  • Base Environment: e-rong/til-26-ae Space
  • Model Repo: E-Rong/til-26-ae-agent (checkpoints + inference code)


Table of Contents

  1. Research & Literature Review
  2. Problem Analysis
  3. Development Decisions
  4. Training Phases
  5. Results
  6. Artifacts
  7. Next Steps

1. Research & Literature Review

1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is autonomous exploration.

1.2 Key Papers

| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| Pommerman: A Multi-Agent Benchmark | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| MAPPO | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| Invalid Action Masking | 2006.14171 | Masks logits before softmax | Directly applicable |
| PPO Algorithms | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |

1.3 Why MaskablePPO?

Bomberman agents cannot move into walls or out of bounds, and cannot place bombs with an empty stockpile. The observation includes action_mask: uint8[6]. Standard PPO would waste ~30-40% of samples on illegal moves. MaskablePPO masks the logits before the softmax, ensuring only legal actions are sampled.
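sb3-contrib implements this inside MaskablePPO; the core idea can be sketched in plain numpy (a minimal sketch of the technique, not the library's code):

```python
import numpy as np

def masked_softmax(logits, action_mask):
    """Set logits of illegal actions to -inf before the softmax,
    so illegal actions receive exactly zero probability."""
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    z = masked - masked.max()      # shift for numerical stability
    exp = np.exp(z)                # exp(-inf) = 0 for masked entries
    return exp / exp.sum()

logits = np.array([1.0, 2.0, 0.5, -1.0, 0.0, 3.0])
mask = np.array([1, 1, 0, 0, 1, 0], dtype=np.uint8)  # 1 = legal action
probs = masked_softmax(logits, mask)
```

Illegal actions get probability 0 and the remaining mass renormalizes over the legal ones, so no rollout samples are spent on moves the environment would reject.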

1.4 Why Curriculum Learning?

Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.

1.5 Why Not DQN?

DQN struggles with action masking (it requires a custom architecture to mask Q-values). PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and PPO has mature masking support in sb3-contrib.


2. Problem Analysis

2.1 Environment Structure

  • Grid size: 16×16
  • Agents: Configurable (default 2 teams, Phase 3 uses 3)
  • Observations: Dict with agent_viewcone[7×5×25], base_viewcone[5×5×25], direction, location, health, action_mask[6], etc.
  • Actions: Discrete(6) - FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
  • Episode length: ~200 steps

2.2 Observation Flattening

The Dict observation is flattened to a 1511-dim vector: agent_viewcone (875) + base_viewcone (625) + 11 scalars.
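A minimal sketch of the flattening, using the field names from Section 2.1; the single `scalars` vector standing in for the 11 scalar entries (direction, location, health, ...) is an assumption about how they are grouped:

```python
import numpy as np

def flatten_obs(obs):
    """Concatenate the Dict observation into a fixed-order flat vector:
    875 (agent viewcone) + 625 (base viewcone) + 11 scalars = 1511."""
    parts = [
        obs["agent_viewcone"].ravel().astype(np.float32),  # 7*5*25 = 875
        obs["base_viewcone"].ravel().astype(np.float32),   # 5*5*25 = 625
        obs["scalars"].astype(np.float32),                 # 11 scalar entries
    ]
    return np.concatenate(parts)

# Dummy observation with the shapes listed in Section 2.1
obs = {
    "agent_viewcone": np.zeros((7, 5, 25), dtype=np.uint8),
    "base_viewcone": np.zeros((5, 5, 25), dtype=np.uint8),
    "scalars": np.zeros(11, dtype=np.float32),
}
flat = flatten_obs(obs)
```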

2.3 Action Masking

Critical bug found: Monitor must wrap outside ActionMasker, not inside; otherwise get_action_masks() fails because Monitor does not itself expose action_masks().
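The correct construction order can be illustrated with toy stand-in classes (these are not the real sb3 classes; the assumption is gymnasium-style attribute forwarding down the wrapper chain):

```python
class ToyEnv:
    """Bare stand-in for the base environment."""
    pass

class ToyActionMasker:
    """Adds action_masks(), analogous to sb3_contrib's ActionMasker."""
    def __init__(self, env, mask_fn):
        self.env, self.mask_fn = env, mask_fn
    def action_masks(self):
        return self.mask_fn(self.env)
    def __getattr__(self, name):          # forward everything else inward
        return getattr(self.env, name)

class ToyMonitor:
    """Stats wrapper; defines no action_masks() of its own."""
    def __init__(self, env):
        self.env = env
    def __getattr__(self, name):          # forward lookups inward
        return getattr(self.env, name)

def mask_fn(env):
    return [1, 1, 1, 1, 1, 0]             # PLACE_BOMB currently illegal

# Correct order: Monitor outside, ActionMasker inside.
env = ToyMonitor(ToyActionMasker(ToyEnv(), mask_fn))
masks = env.action_masks()                # resolved through the chain
```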


3. Development Decisions

3.1 Single-Agent Wrapper

The wrapper controls only agent_0; opponents use random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.
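A sketch of the reduction (the PettingZoo-style parallel interface with an action dict is an assumption, as is the `_ToyMAEnv` stand-in used for the demo):

```python
class SingleAgentWrapper:
    """Expose a multi-agent env as single-agent: the learner drives
    agent_0, a fixed opponent policy drives everyone else."""
    def __init__(self, ma_env, opponent_policy):
        self.ma_env = ma_env
        self.opponent_policy = opponent_policy

    def reset(self):
        obs = self.ma_env.reset()
        return obs["agent_0"]

    def step(self, action):
        actions = {"agent_0": action}
        for agent in self.ma_env.agents:
            if agent != "agent_0":
                actions[agent] = self.opponent_policy(agent)
        obs, rew, done, info = self.ma_env.step(actions)
        return obs["agent_0"], rew["agent_0"], done["agent_0"], info.get("agent_0", {})

class _ToyMAEnv:
    """Tiny stand-in multi-agent env for demonstration only."""
    agents = ["agent_0", "agent_1"]
    def reset(self):
        return {a: 0 for a in self.agents}
    def step(self, actions):
        obs = {a: 1 for a in self.agents}
        rew = {a: float(actions[a]) for a in self.agents}
        done = {a: False for a in self.agents}
        return obs, rew, done, {}

env = SingleAgentWrapper(_ToyMAEnv(), opponent_policy=lambda agent: 4)  # opponents STAY
first_obs = env.reset()
obs, reward, done, info = env.step(5)    # agent_0 acts; opponent is filled in
```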

3.2 3-Phase Curriculum

| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| 1 | Random | 500k | Learn movement, bombs, basics |
| 2 | Random + exploration bonus | 500k | Prevent camping exploit |
| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |

3.3 Library Philosophy

  • stable-baselines3 for PPO core
  • sb3-contrib for MaskablePPO + ActionMasker
  • huggingface_hub for persistent checkpoint storage

3.4 Why Push to the Hub Every 50k Steps

Sandbox resets (T4 container recycling) wiped the local /app/data/ directory multiple times. Hub checkpointing saved the project at 400k steps when training crashed.
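The cadence logic can be sketched as a periodic-checkpoint callback (a hypothetical class, not sb3's API; `upload_fn` stands in for a call like `huggingface_hub.upload_file`):

```python
class HubCheckpointCallback:
    """Trigger an upload whenever `interval` steps have elapsed since
    the last one. Real training would pass a function that saves the
    model zip and pushes it to the Hub."""
    def __init__(self, upload_fn, interval=50_000):
        self.upload_fn = upload_fn
        self.interval = interval
        self.last_upload = 0

    def on_step(self, num_timesteps):
        if num_timesteps - self.last_upload >= self.interval:
            self.upload_fn(f"ckpt_{num_timesteps}.zip")
            self.last_upload = num_timesteps

# Demo: record upload names instead of touching the network.
uploads = []
cb = HubCheckpointCallback(uploads.append, interval=50_000)
for t in range(0, 200_001, 2048):     # steps land on rollout boundaries
    cb.on_step(t)
```

Because steps arrive in rollout-sized chunks, uploads land on the first boundary at or after each 50k mark rather than exactly on it.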


4. Training Phases

4.1 Phase 1: Foundation (vs Random)

  • Duration: 500,352 steps
  • Result: 92% win rate, avg reward 180.1, 100% survival
  • Challenges: wrapper ordering, dependency issues, sandbox resets

4.2 Phase 2: Exploration Shaping (COMPLETE)

  • Duration: 500,408 additional steps (600,352 → 1,001,760)
  • Mechanism: visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths)
  • Hardware: A10G, ~50 FPS
  • Wall time: ~2h 45min
  • Result: 93.0% win rate, avg reward 153.4, avg bombs 20.1
  • Key insight: reward decreased (180 → 153) but win rate increased (92% → 93%), confirming that exploration makes the policy more robust at the cost of safe base-camping reward.
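The Phase 2 shaping can be sketched as follows; note that whether the anneal factor is tanh(...) or 1 - tanh(...) of average enemy deaths is an assumption (shown here fading the bonus out as the agent starts winning fights):

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)   # per-cell visit counter, reset per episode

def exploration_bonus(cell, avg_enemy_deaths):
    """Visit-count bonus 1/(1+visits), annealed away as combat success
    (avg enemy deaths) rises. Exact scaling is an assumption."""
    bonus = 1.0 / (1 + visit_counts[cell])
    visit_counts[cell] += 1
    return (1.0 - math.tanh(avg_enemy_deaths)) * bonus

first = exploration_bonus((3, 4), avg_enemy_deaths=0.0)   # fresh cell
second = exploration_bonus((3, 4), avg_enemy_deaths=0.0)  # revisit, smaller bonus
```

This gives full bonus on the first visit to a cell, halves it on the second, and vanishes once the agent reliably kills opponents, which matches the observed trade of camping reward for robustness.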

4.3 Phase 3: Curriculum Self-Play (PENDING)

  • Script: phase3_curriculum.py (ready on Hub)
  • Plan: 5-stage rule-based curriculum, static → random → simple_bomb → evasive → mixed
  • Duration: 1M steps
  • Advancement gate: >55% win rate per stage
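The advancement rule can be sketched in a few lines (a sketch of the gate described above, not the actual phase3_curriculum.py code):

```python
STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]
WIN_RATE_GATE = 0.55   # advance when eval win rate exceeds 55%

def next_stage(stage_idx, win_rate):
    """Move to the next curriculum stage once the win-rate gate is
    cleared; hold the current (or final) stage otherwise."""
    if win_rate > WIN_RATE_GATE and stage_idx < len(STAGES) - 1:
        return stage_idx + 1
    return stage_idx

stage = next_stage(0, win_rate=0.62)   # "static" cleared, advance to "random"
```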


5. Results

5.1 Phase 1 Results

| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | 92.0% |
| Avg Reward (eval) | 180.1 |
| Survival Rate | 100.0% |

5.2 Phase 2 Results

| Metric | Value |
|---|---|
| Timesteps | 1,001,760 total (500,408 new) |
| FPS | 50 (A10G) |
| Wall time | ~2h 45min |
| Win Rate (eval) | 93.0% |
| Avg Reward (eval) | 153.4 |
| Avg Bombs | 20.1 |

6. Artifacts

| File | Purpose |
|---|---|
| phase1_final.zip | Phase 1 complete checkpoint |
| phase2_final.zip | Phase 2 complete checkpoint |
| phase2_ckpt_*.zip | Phase 2 intermediates (650k–1M) |
| phase2_eval_results.txt | Phase 2 evaluation metrics |
| ae_manager.py | Inference code |
| docs/ae.md | This documentation |

7. Next Steps

  • Submit Phase 3 HF Job (phase3_curriculum.py)
  • Monitor 5-stage curriculum progression
  • Evaluate final model vs mixed rule-based opponents
  • Future: CNN policy, opponent modeling, LSTM memory

Last updated: 2026-05-14 - Phase 2 complete, Phase 3 ready