Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,44 @@
---
license: mit
tags:
- multi-agent-rl
- mappo
- reward-attribution
- representation-geometry
- smacv2
- tribal-village
---

# rl-workshop-2026: Final Policy Checkpoints

Final MAPPO policy checkpoints for the workshop paper *"Feedback Attribution
Determines Representation Geometry in Multi-Agent RL."* The checkpoints were
logged alongside the W&B project
[`tashapais/rl_workshop_2026`](https://wandb.ai/tashapais/rl_workshop_2026).

Each `.pt` file is a dict whose `"model"` key holds the PyTorch `state_dict`
for the `ActorCritic` class defined under `experiments/` in the
`tribal-village` repo.

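As a minimal sketch of the checkpoint layout described above, the snippet below saves and reloads a dict with a `"model"` key. `StandInPolicy` and `load_policy_weights` are hypothetical names used only for illustration. To load a real checkpoint, the stand-in module would have to be replaced with the actual `ActorCritic` class from the `tribal-village` repo, whose constructor arguments are not documented here.

```python
import os
import tempfile

import torch
from torch import nn


class StandInPolicy(nn.Module):
    """Hypothetical placeholder; the real class is ActorCritic in tribal-village."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4):
        super().__init__()
        self.actor = nn.Linear(obs_dim, n_actions)
        self.critic = nn.Linear(obs_dim, 1)


def load_policy_weights(path: str, policy: nn.Module) -> nn.Module:
    """Load a checkpoint dict and copy its "model" state_dict into `policy`."""
    checkpoint = torch.load(path, map_location="cpu")
    policy.load_state_dict(checkpoint["model"])
    policy.eval()  # switch to inference mode
    return policy


# Round trip with a temporary file that mimics the {"model": state_dict} layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "step_demo.pt")
    source = StandInPolicy()
    torch.save({"model": source.state_dict()}, path)
    restored = load_policy_weights(path, StandInPolicy())
```

With a real checkpoint, `path` would point at one of the `step_*.pt` files listed in the tables, and `load_state_dict` would raise an error if the module definition did not match the saved keys.
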
## Table 1: Tribal Village (12 agents, 308 actions), 4M agent-steps

Checkpoints are stored at `tribal_village/<run>/step_4002816.pt`. Each
condition uses the reward attribution
`r_i^alpha = (1-alpha) r_i + alpha * mean_j r_j`:

| Condition | alpha | Seeds |
|-----------|-------|-------|
| Individual | 0.0 | 0,1,2 |
| Mixed | 0.8 | 0,1,2 |
| Shared | 1.0 | 0,1,2 |

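The reward-attribution formula above can be sketched in a few lines of plain Python. This is an illustration only, not the training code: `mix_rewards` is a hypothetical helper, and it assumes that `mean_j` averages over all agents, including agent `i` itself.

```python
from typing import List


def mix_rewards(rewards: List[float], alpha: float) -> List[float]:
    """Apply r_i^alpha = (1 - alpha) * r_i + alpha * mean_j r_j to each agent."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    # Assumption: the team mean includes every agent, agent i among them.
    team_mean = sum(rewards) / len(rewards)
    return [(1.0 - alpha) * r + alpha * team_mean for r in rewards]


rewards = [1.0, 0.0, 2.0]             # toy per-agent rewards for one step
individual = mix_rewards(rewards, 0.0)  # each agent keeps its own reward
mixed = mix_rewards(rewards, 0.8)       # mostly team mean, some own reward
shared = mix_rewards(rewards, 1.0)      # every agent receives the team mean
```

Under this assumption the mixing leaves the team's total reward unchanged for any `alpha`; only its distribution across agents changes.
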
## Table 2: SMACv2 10gen_terran (6 terran units), 2M steps

Checkpoints are stored at `smacv2/<run>/step_2001408.pt`:

| Condition | Seeds |
|-----------|-------|
| Individual (per-agent reward) | 0,1,2 |
| Shared (team-averaged) | 0,1,2 |

## Caveats (see repo `runs.md`)

- Tribal Village runs **fail the behavior gate** (no-op/random baseline) at 4M
  steps under passive shaping. The representation geometry from these runs is
  direction-consistent (probe declines 0.75→0.50) but not yet
  behavior-grounded.
- SMAC `individual` agents are weak (~1.7% win rate) compared with `shared`
  agents (~25%). SMAC `D_act` is mask-contaminated and should not be quoted in
  the paper without a mask-aware recompute.