Multi-Agent Training for Pommerman: Curriculum Learning and Population-based Self-Play Approach
Paper • 2407.00662 • Published
This repository contains the training pipeline for an RL agent competing in the TIL-26 Automated Exploration (AE) challenge, a competitive multi-agent Bomberman-like environment.
Environment: 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Training proceeds in three phases:
| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| 1 | MaskablePPO baseline | Random valid actions | Invalid action masking |
| 2 | Adaptive exploration | Random + visit-count bonus | Annealing: α = 1 − tanh(k·deaths) |
| 3 | Curriculum self-play | Rule-based (static β smart) | Elo-style difficulty progression |
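The Phase 2 annealing can be sketched as below. Only α = 1 − tanh(k·deaths) comes from the table; the `k` and `beta` values and the 1/√(N+1) count-based bonus form are illustrative assumptions, not the repo's actual implementation.

```python
import math

def exploration_bonus(visit_count, deaths, k=0.1, beta=0.05):
    # alpha = 1 - tanh(k * deaths): decays smoothly from 1 toward 0
    # as the agent accumulates deaths, shutting off exploration
    alpha = 1.0 - math.tanh(k * deaths)
    # count-based bonus shrinks with repeat visits (form is an assumption)
    return alpha * beta / math.sqrt(visit_count + 1)
```

Early on (few deaths, unvisited cells) the bonus is near `beta`; it vanishes as deaths accumulate, regardless of visit counts.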
Invalid action masking (MaskablePPO, sb3-contrib): invalid actions are handled by setting their logits to -∞ before the softmax, which has been shown to outperform action penalties (Huang & Ontañón, 2020).

```bash
# Download the environment (auto-bootstrapped in script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"

# Configure per-phase timesteps and launch the full pipeline
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py

# Requires HF credits; run from a Space with the script uploaded
# Hardware: cpu-upgrade or a10g-large for GPU acceleration
```
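The logits-to-−∞ masking described above can be illustrated with a minimal sketch (this mirrors what MaskablePPO does internally; the helper name is ours):

```python
import numpy as np

def masked_softmax(logits, mask):
    # Invalid actions get logit -inf, so exp(-inf) = 0 and they
    # receive exactly zero probability mass after normalization.
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max()  # shift for numerical stability
    e = np.exp(masked)
    return e / e.sum()

probs = masked_softmax(np.array([1.0, 2.0, 0.5]),
                       np.array([True, False, True]))
```

Here the highest-logit action (index 1) is masked out and gets probability exactly 0, while the remaining probability renormalizes over the valid actions.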
Trackio dashboard: E-Rong/til-26-ae-trackio
Logged metrics per phase:
- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)

Alerts trigger on:
```
train_all_phases.py           # Full 3-phase pipeline
requirements.txt              # Dependencies
bomberman_phase1_final.zip    # Saved after Phase 1
bomberman_phase2_final.zip    # Saved after Phase 2
bomberman_phase3_final.zip    # Saved after Phase 3
```
To evaluate a trained agent against random opponents:
```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

obs, _ = env.reset(seed=42)
for _ in range(200):
    action, _ = model.predict(obs, action_masks=env.action_masks())
    obs, reward, done, truncated, info = env.step(action)
    if done or truncated:
        break
env.close()
```
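For aggregate numbers rather than a single rollout, the loop above generalizes to a small helper. `evaluate` is a hypothetical utility (not part of this repo), assuming the Gymnasium five-tuple step API and the `action_masks()` method shown above:

```python
def evaluate(env, model, n_episodes=10, max_steps=200):
    """Mean episode reward of `model` on `env` over `n_episodes`."""
    totals = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=ep)  # distinct seed per episode
        total = 0.0
        for _ in range(max_steps):
            # MaskablePPO needs the current action mask at predict time
            action, _ = model.predict(obs, action_masks=env.action_masks())
            obs, reward, done, truncated, _ = env.step(action)
            total += reward
            if done or truncated:
                break
        totals.append(total)
    return sum(totals) / len(totals)
```

Seeding each episode separately keeps runs reproducible while still sampling different procedurally generated mazes.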
MIT. Based on the TIL-26 AE challenge environment.
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.