|
|
--- |
|
|
tags: |
|
|
- reinforcement-learning |
|
|
- game-theory |
|
|
- colonel-blotto |
|
|
- neurips-2025 |
|
|
- graph-neural-networks |
|
|
- preference-learning |
|
|
- llm-distillation |
|
|
- meta-learning |
|
|
license: mit |
|
|
--- |
|
|
|
|
|
# Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation |
|
|
|
|
|
|
|
|
|
|
This repository contains trained **Colonel Blotto agents** developed for the **NeurIPS 2025 MindGames Workshop**. |
|
|
The system integrates a compact graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, enabling improved strategic adaptation without increasing policy capacity. |
|
|
|
|
|
--- |
|
|
|
|
|
## Overview |
|
|
|
|
|
The approach combines: |
|
|
|
|
|
- **Graph Attention Networks** for structured game-state encoding |
|
|
- **Proximal Policy Optimization (PPO)** as the core learning algorithm |
|
|
- **FiLM-based opponent adaptation** for fast response to opponent behavior |
|
|
- **Rollout-grounded preference learning** using two large language models |
|
|
- **Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)** for teacher alignment
|
|
- **Knowledge distillation** from the aligned teacher into an efficient policy |
|
|
|
|
|
The goal is not to replace RL with language models, but to **inject strategic priors** learned by LLMs back into a lightweight, fast policy suitable for competitive play. |
|
|
|
|
|
--- |
|
|
|
|
|
## Game Configuration |
|
|
|
|
|
- **Game**: Colonel Blotto |
|
|
- **Battlefields**: 3 |
|
|
- **Units per round**: 20 |
|
|
- **Rounds per game**: 5 |
|
|
- **Action space size**: 231 valid allocations (every way to split 20 units across 3 battlefields; see the enumeration sketch after this list)
|
|
- **Evaluation protocol**: Fixed scripted and adaptive opponent pool |
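The 231 figure is the number of ways to split 20 identical units across 3 battlefields, i.e. the compositions of 20 into 3 nonnegative parts. A quick way to enumerate and verify it:

```python
UNITS = 20

# All compositions of 20 into 3 nonnegative parts: pick the first two
# battlefield counts; the third is determined.
actions = [
    (a, b, UNITS - a - b)
    for a in range(UNITS + 1)
    for b in range(UNITS - a + 1)
]

assert len(actions) == 231  # stars and bars: C(20 + 2, 2)
```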
|
|
|
|
|
--- |
|
|
|
|
|
## Policy Architecture |
|
|
|
|
|
### Graph-Based State Encoder |
|
|
- Heterogeneous graph with **25–40 nodes** |
|
|
- Node types include: |
|
|
- Battlefield nodes |
|
|
- Recent round summary nodes |
|
|
- Global state node |
|
|
- Node feature dimension: **32** |
|
|
- Encoder (see the sketch after this list):
|
|
- 3 Graph Attention layers |
|
|
- 6 attention heads |
|
|
- Hidden size 192 |
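A minimal sketch of an encoder with this shape, assuming PyTorch Geometric (class and argument names are illustrative, not the repo's exact modules):

```python
import torch.nn as nn
from torch_geometric.nn import GATConv

class BlottoGraphEncoder(nn.Module):
    """Three GAT layers with 6 heads, hidden size 192, 32-dim node features."""

    def __init__(self, in_dim=32, hidden=192, heads=6, num_layers=3):
        super().__init__()
        dims = [in_dim] + [hidden] * num_layers
        self.convs = nn.ModuleList([
            # concat=False averages the heads so the width stays at 192
            GATConv(dims[i], dims[i + 1], heads=heads, concat=False)
            for i in range(num_layers)
        ])
        self.act = nn.ELU()

    def forward(self, x, edge_index):
        # x: [num_nodes, 32] node features; edge_index: [2, num_edges]
        for conv in self.convs:
            x = self.act(conv(x, edge_index))
        return x  # [num_nodes, 192] node embeddings
```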
|
|
|
|
|
### Opponent Modeling and Adaptation |
|
|
- Opponent history encoded via a lightweight MLP |
|
|
- **FiLM adaptation layers** modulate policy activations based on the opponent embedding (see the sketch after this list)
|
|
- Enables rapid adjustment to non-stationary strategies |
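A minimal sketch of the mechanism (dimensions are illustrative; for instance, the last 5 rounds over 3 battlefields for both players would give a 30-dim history input):

```python
import torch.nn as nn

class OpponentFiLM(nn.Module):
    """An MLP encodes opponent history into an embedding, from which FiLM
    produces per-feature scale (gamma) and shift (beta) for the policy."""

    def __init__(self, history_dim=30, opp_dim=64, feat_dim=192):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(history_dim, 128), nn.ReLU(), nn.Linear(128, opp_dim)
        )
        self.to_gamma_beta = nn.Linear(opp_dim, 2 * feat_dim)

    def forward(self, features, opp_history):
        gamma, beta = self.to_gamma_beta(self.encoder(opp_history)).chunk(2, dim=-1)
        # Identity map when gamma = beta = 0, so adaptation starts near neutral
        return (1 + gamma) * features + beta
```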
|
|
|
|
|
### Action Head |
|
|
- Portfolio-based action head with **6 latent strategies** |
|
|
- Strategies mixed via learned attention (see the sketch after this list)
|
|
- Total policy parameters: **~6.8M** |
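A minimal sketch of such a head (names and wiring are illustrative):

```python
import torch
import torch.nn as nn

class PortfolioActionHead(nn.Module):
    """Six latent strategy heads mixed by attention weights derived from
    the state embedding; outputs logits over the 231 valid allocations."""

    def __init__(self, embed_dim=192, num_strategies=6, num_actions=231):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(embed_dim, num_actions) for _ in range(num_strategies)]
        )
        self.mixer = nn.Linear(embed_dim, num_strategies)

    def forward(self, z):
        weights = torch.softmax(self.mixer(z), dim=-1)           # [B, 6]
        logits = torch.stack([h(z) for h in self.heads], dim=1)  # [B, 6, 231]
        return (weights.unsqueeze(-1) * logits).sum(dim=1)       # [B, 231]
```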
|
|
|
|
|
--- |
|
|
|
|
|
## Training Pipeline |
|
|
|
|
|
Training follows a multi-stage curriculum (minimal sketches of the computations in steps 1–3 follow the list):
|
|
|
|
|
1. **Graph PPO Pretraining** |
|
|
- PPO with clip ratio 0.2 |
|
|
- Discount factor γ = 0.99 |
|
|
- GAE λ = 0.95 |
|
|
- Trained against a diverse scripted opponent pool |
|
|
|
|
|
2. **Preference Generation via Rollouts** |
|
|
- ~800 intermediate states sampled |
|
|
- Candidate actions proposed by: |
|
|
- Llama 3.1 Instruct |
|
|
- Qwen 2.5 Instruct |
|
|
- Each proposal evaluated with 4 stochastic rollouts |
|
|
- Higher-return actions labeled preferred |
|
|
- ~2,300 preference pairs generated |
|
|
|
|
|
3. **Teacher Alignment** |
|
|
- Supervised Fine-Tuning (SFT) on chosen actions


- Direct Preference Optimization (DPO) using a frozen reference model
|
|
|
|
|
4. **Policy Distillation** |
|
|
- Aligned teacher generates state-to-action labels |
|
|
- Graph policy trained via cross-entropy imitation |
|
|
|
|
|
5. **Final PPO Refinement** |
|
|
- PPO resumes using environment rewards |
|
|
- Stabilizes behavior after distillation |
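For step 1, advantages follow the standard GAE recursion with γ = 0.99 and λ = 0.95 (single-episode sketch):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation; `values` carries one extra
    bootstrap entry beyond the final reward (0 at terminal states)."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Step 2's rollout-grounded labeling amounts to comparing mean returns; `rollout_return` below is a hypothetical helper that plays a stochastic game to completion from a state-action pair:

```python
def label_pair(state, action_a, action_b, rollout_return, n_rollouts=4):
    """Prefer whichever candidate action earns the higher mean return
    over n stochastic rollouts."""
    mean_a = sum(rollout_return(state, action_a) for _ in range(n_rollouts)) / n_rollouts
    mean_b = sum(rollout_return(state, action_b) for _ in range(n_rollouts)) / n_rollouts
    chosen, rejected = (action_a, action_b) if mean_a >= mean_b else (action_b, action_a)
    return {"state": state, "chosen": chosen, "rejected": rejected}
```

Step 3's DPO objective is the standard one, computed against the frozen reference model (`beta` is illustrative, not necessarily the repo's setting):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Push the policy's chosen-vs-rejected log-ratio above the frozen
    reference model's, through a log-sigmoid margin."""
    margin = beta * ((pi_chosen_logp - ref_chosen_logp)
                     - (pi_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(margin).mean()
```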
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
Evaluation uses **1,000 games** against a mixture of scripted and adaptive opponents. |
|
|
|
|
|
| Agent | Win Rate | Risk Metric |
|-------|----------|-------------|
| PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
| Full curriculum | 78.4% | Exploitability proxy 0.48 |
|
|
|
|
|
- **Allocation collapse**: fraction of rounds placing more than 60% of units on a single battlefield (see the sketch after this list)
|
|
- Distillation yields a **+9.5 point** win-rate gain over PPO |
|
|
- The full curriculum yields a **+20 point** gain with reduced over-specialization
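For concreteness, a minimal sketch of the allocation-collapse metric (the helper name is hypothetical):

```python
def allocation_collapse_rate(round_allocations, units=20, threshold=0.6):
    """Fraction of rounds in which any single battlefield receives more
    than `threshold` of that round's units."""
    collapsed = sum(max(alloc) > threshold * units for alloc in round_allocations)
    return collapsed / len(round_allocations)

# e.g. allocation_collapse_rate([(14, 3, 3), (7, 7, 6)]) == 0.5
```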
|
|
|
|
|
These improvements arise from **risk calibration and opponent-aware adaptation**, not brute-force exploitation. |
|
|
|
|
|
--- |
|
|
|
|
|
## Repository Contents |
|
|
|
|
|
### Policy Checkpoints |
|
|
- `policy_models/policy_after_ppo.pt` |
|
|
- `policy_models/policy_after_distill.pt` |
|
|
- `policy_models/policy_final.pt` |
|
|
|
|
|
### LLM Teacher Models |
|
|
- `sft_model/` – supervised fine-tuned model |
|
|
- `dpo_model/` – preference-aligned model |
|
|
|
|
|
### Configuration and Logs |
|
|
- `master_config.json` – training configuration |
|
|
- `battleground_eval.json` – evaluation summaries |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Load Policy |
|
|
|
|
|
```python
import torch
from policy import GraphPolicy

# Constructor arguments should mirror the architecture settings
# recorded in master_config.json; the ellipsis is a placeholder.
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```
|
|
|
|
|
|
|
|
### Load Fine-Tuned LLM
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT (or DPO) teacher model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference (the prompt below is illustrative)
prompt = "Allocate 20 units across 3 battlefields. Reply with three numbers."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
## 🎓 Research Context |
|
|
|
|
|
This work targets the **NeurIPS 2025 MindGames Workshop**, with three central claims:
|
|
|
|
|
- Language models function effectively as strategic prior generators when grounded by rollouts |
|
|
- Graph-based representations enable cross-strategy generalization under compact policies |
|
|
- Distillation transfers high-level reasoning into fast, deployable agents |
|
|
|
|
|
### Key Innovations |
|
|
|
|
|
1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states |
|
|
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism |
|
|
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings |
|
|
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies |
|
|
|
|
|
|
|
|
## 📄 License |
|
|
|
|
|
MIT License. See the LICENSE file for details.
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
- Built for the **NeurIPS 2025 MindGames Workshop**
|
|
- Uses PyTorch, HuggingFace Transformers, and PEFT |
|
|
- Training infrastructure: NVIDIA H200 GPU |
|
|
|
|
|
|
|
|