---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
- meta-learning
license: mit
---
# Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation



This repository contains trained **Colonel Blotto agents** developed for the **NeurIPS 2025 MindGames Workshop**.
The system integrates a compact graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, enabling improved strategic adaptation without increasing policy capacity.

---
## Overview
The approach combines:
- **Graph Attention Networks** for structured game-state encoding
- **Proximal Policy Optimization (PPO)** as the core learning algorithm
- **FiLM-based opponent adaptation** for fast response to opponent behavior
- **Rollout-grounded preference learning** using two large language models
- **Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher into an efficient policy

The goal is not to replace RL with language models, but to **inject strategic priors** learned by LLMs back into a lightweight, fast policy suitable for competitive play.

---
## Game Configuration
- **Game**: Colonel Blotto
- **Battlefields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Action space size**: 231 valid allocations
- **Evaluation protocol**: Fixed scripted and adaptive opponent pool
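
The action-space size follows from stars and bars: distributing 20 indistinguishable units over 3 battlefields gives C(20 + 3 - 1, 3 - 1) = C(22, 2) = 231 compositions. A quick self-contained check (not repository code) confirms the count:

```python
from itertools import product

UNITS, FIELDS = 20, 3

# Stars and bars: C(22, 2) = 231 valid allocations of 20 units over 3 fields.
actions = [a for a in product(range(UNITS + 1), repeat=FIELDS) if sum(a) == UNITS]
assert len(actions) == 231
```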
---
## Policy Architecture
### Graph-Based State Encoder
- Heterogeneous graph with **25–40 nodes**
- Node types include:
  - Battlefield nodes
  - Recent round summary nodes
  - Global state node
- Node feature dimension: **32**
- Encoder:
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
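
The encoder implementation lives in the repository; purely as an illustration of the dimensions listed above, and assuming `torch_geometric`'s `GATConv` (which may not match the actual implementation), a 3-layer, 6-head GAT with a 192-dim hidden size looks roughly like:

```python
import torch.nn as nn
from torch_geometric.nn import GATConv

class BlottoGraphEncoder(nn.Module):
    """Sketch of the encoder described above: 32-dim node features,
    3 GAT layers, 6 heads, 192-dim hidden size (6 heads x 32 channels)."""

    def __init__(self, in_dim=32, hidden=192, heads=6):
        super().__init__()
        per_head = hidden // heads  # 192 / 6 = 32 channels per head
        self.layers = nn.ModuleList([
            GATConv(in_dim, per_head, heads=heads),   # 32 -> 192
            GATConv(hidden, per_head, heads=heads),   # 192 -> 192
            GATConv(hidden, per_head, heads=heads),   # 192 -> 192
        ])
        self.act = nn.ELU()

    def forward(self, x, edge_index):
        for layer in self.layers:
            x = self.act(layer(x, edge_index))
        return x  # one 192-dim embedding per node
```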
### Opponent Modeling and Adaptation
- Opponent history encoded via a lightweight MLP
- **FiLM adaptation layers** modulate policy activations based on opponent embedding
- Enables rapid adjustment to non-stationary strategies
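
FiLM conditioning is a small mechanism; a minimal sketch (class and argument names are hypothetical, not the repository's code) of how an opponent embedding can scale and shift policy activations:

```python
import torch.nn as nn

class FiLMAdapter(nn.Module):
    """Feature-wise linear modulation: the opponent embedding produces a
    per-channel scale (gamma) and shift (beta) applied to policy features."""

    def __init__(self, opp_dim, feat_dim):
        super().__init__()
        self.to_film = nn.Linear(opp_dim, 2 * feat_dim)  # emits gamma and beta

    def forward(self, h, opp_embedding):
        gamma, beta = self.to_film(opp_embedding).chunk(2, dim=-1)
        # (1 + gamma) keeps the modulation near-identity at initialization.
        return (1 + gamma) * h + beta
```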
### Action Head
- Portfolio-based action head with **6 latent strategies**
- Strategies mixed via learned attention
- Total policy parameters: **~6.8M**
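
One plausible reading of this head (names and wiring are assumptions, not the repository's exact code): each latent strategy scores every allocation, and attention weights computed from the state mix the six score vectors into final action logits:

```python
import torch
import torch.nn as nn

class PortfolioHead(nn.Module):
    """Six latent strategy sub-heads each score all 231 allocations;
    attention weights computed from the state mix the score vectors."""

    def __init__(self, state_dim=192, n_strategies=6, n_actions=231):
        super().__init__()
        self.n_strategies, self.n_actions = n_strategies, n_actions
        self.strategy_scores = nn.Linear(state_dim, n_strategies * n_actions)
        self.mix = nn.Linear(state_dim, n_strategies)  # attention over strategies

    def forward(self, state):                          # state: (B, 192)
        scores = self.strategy_scores(state).view(-1, self.n_strategies, self.n_actions)
        weights = torch.softmax(self.mix(state), dim=-1)        # (B, 6)
        return torch.einsum("bs,bsa->ba", weights, scores)      # (B, 231) logits
```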
---
## Training Pipeline
Training follows a multi-stage curriculum:
1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against a diverse scripted opponent pool
2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated with 4 stochastic rollouts
   - Higher-return actions labeled preferred (see the labeling sketch after this list)
   - ~2,300 preference pairs generated
3. **Teacher Alignment**
   - Supervised fine-tuning (SFT) on chosen actions
   - Direct Preference Optimization (DPO) using a frozen reference model
4. **Policy Distillation**
   - The aligned teacher generates state-to-action labels
   - The graph policy is trained via cross-entropy imitation
5. **Final PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes behavior after distillation
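
The labeling rule in step 2 reduces to comparing mean rollout returns. A minimal sketch, where `simulate` stands in for a hypothetical environment rollout function (4 rollouts per proposal, matching the numbers above) and is not part of this repository's public API:

```python
def label_preference(state, action_a, action_b, simulate, n_rollouts=4):
    """Rollout-grounded preference labeling (sketch): the candidate action
    with the higher mean rollout return is marked 'chosen'."""
    def mean_return(action):
        return sum(simulate(state, action) for _ in range(n_rollouts)) / n_rollouts

    ret_a, ret_b = mean_return(action_a), mean_return(action_b)
    chosen, rejected = (action_a, action_b) if ret_a >= ret_b else (action_b, action_a)
    return {"state": state, "chosen": chosen, "rejected": rejected}
```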
---
## Evaluation Results
Evaluation uses **1,000 games** against a mixture of scripted and adaptive opponents.

| Agent | Win Rate | Risk Metric |
|------|---------|------------|
| PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
| Full curriculum | 78.4% | Exploitability proxy 0.48 |
- **Allocation collapse**: fraction of rounds placing more than 60% of all units on a single battlefield (see the sketch below)
- Distillation yields a **+9.5 point** win-rate gain over PPO
- Full curriculum yields **+20 point** gain with reduced over-specialization
These improvements arise from **risk calibration and opponent-aware adaptation**, not brute-force exploitation.
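
The allocation-collapse metric is easy to reproduce from per-round allocation logs; a minimal sketch:

```python
def allocation_collapse(rounds, threshold=0.6):
    """Fraction of rounds placing more than `threshold` of all units
    on a single battlefield."""
    collapsed = sum(1 for alloc in rounds if max(alloc) > threshold * sum(alloc))
    return collapsed / len(rounds)

# (13, 4, 3) puts 65% of 20 units on one field; (7, 7, 6) is balanced.
assert allocation_collapse([(13, 4, 3), (7, 7, 6)]) == 0.5
```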
---
## Repository Contents
### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
- `policy_models/policy_final.pt`
### LLM Teacher Models
- `sft_model/` – supervised fine-tuned model
- `dpo_model/` – preference-aligned model
### Configuration and Logs
- `master_config.json` – training configuration
- `battleground_eval.json` – evaluation summaries
---
## Usage
### Load Policy
```python
import torch
from policy import GraphPolicy

# Constructor arguments are elided here; see master_config.json for the settings.
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```
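
Action selection then reduces to sampling from the policy's output distribution. The sketch below assumes the forward pass maps an encoded state to logits over the 231 allocations; this is an assumption, so consult `policy.py` for the real interface:

```python
import torch

with torch.no_grad():
    # `state` is the graph-encoded observation; building it depends on the
    # repository's state encoder and is omitted here.
    logits = policy(state)                                  # (231,) action logits
    action = torch.distributions.Categorical(logits=logits).sample()
    # Map the sampled index back to a 3-field allocation using the same
    # enumeration order as the action-space check above.
```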
### Load Fine-Tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher; point at "./dpo_model" for the preference-aligned model.
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Inference. This prompt is a hypothetical example; the prompt format used
# during training is not documented here.
prompt = "You command 20 units across 3 battlefields. Propose an allocation."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🎓 Research Context
This work targets the **NeurIPS 2025 MindGames Workshop** and centers on three claims:
- Language models can serve as strategic prior generators when their proposals are grounded by rollouts
- Graph-based representations enable cross-strategy generalization in compact policies
- Distillation transfers high-level strategic reasoning into fast, deployable agents
### Key Innovations
1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-Truth Counterfactual Learning**: Exploiting the game's deterministic payoffs to evaluate alternative allocations exactly
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
## 📄 License
MIT License – see the LICENSE file for details
## 🙏 Acknowledgments
- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU