---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
- meta-learning
license: mit
---

# Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation

![Status](https://img.shields.io/badge/status-trained-success)
![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
![License](https://img.shields.io/badge/license-MIT-blue)

This repository contains trained **Colonel Blotto agents** developed for the **NeurIPS 2025 MindGames Workshop**.  
The system integrates a compact graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, enabling improved strategic adaptation without increasing policy capacity.

---

## Overview

The approach combines:

- **Graph Attention Networks** for structured game-state encoding  
- **Proximal Policy Optimization (PPO)** as the core learning algorithm  
- **FiLM-based opponent adaptation** for fast response to opponent behavior  
- **Rollout-grounded preference learning** using two large language models  
- **Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO)** for teacher alignment  
- **Knowledge distillation** from the aligned teacher into an efficient policy  

The goal is not to replace RL with language models, but to **inject strategic priors** learned by LLMs back into a lightweight, fast policy suitable for competitive play.

---

## Game Configuration

- **Game**: Colonel Blotto  
- **Battlefields**: 3  
- **Units per round**: 20  
- **Rounds per game**: 5  
- **Action space size**: 231 valid allocations  
- **Evaluation protocol**: Fixed scripted and adaptive opponent pool  
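
The 231 figure follows from stars-and-bars: there are C(20 + 2, 2) = 231 ways to split 20 indivisible units across 3 battlefields. A minimal sketch that enumerates the full action space:

```python
# Enumerate every valid allocation of 20 units across 3 battlefields.
UNITS = 20
ACTIONS = [
    (a, b, UNITS - a - b)
    for a in range(UNITS + 1)
    for b in range(UNITS - a + 1)
]
assert len(ACTIONS) == 231  # stars-and-bars: C(22, 2) = 231
```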

---

## Policy Architecture

### Graph-Based State Encoder
- Heterogeneous graph with **25–40 nodes**
- Node types include:
  - Battlefield nodes
  - Recent round summary nodes
  - Global state node
- Node feature dimension: **32**
- Encoder:
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
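
A minimal sketch of this encoder, assuming PyTorch Geometric's `GATConv` (the 192-dim hidden size splits into 6 heads of 32 channels each); the class name and layer details are illustrative, not the repository's implementation:

```python
import torch.nn as nn
from torch_geometric.nn import GATConv  # assumes PyTorch Geometric

class GraphStateEncoder(nn.Module):
    """Illustrative 3-layer GAT encoder: 6 heads x 32 channels = 192."""
    def __init__(self, in_dim=32, hidden=192, heads=6, num_layers=3):
        super().__init__()
        dims = [in_dim] + [hidden] * num_layers
        self.layers = nn.ModuleList(
            GATConv(dims[i], hidden // heads, heads=heads)
            for i in range(num_layers)
        )
        self.act = nn.ELU()

    def forward(self, x, edge_index):
        # x: [num_nodes, 32] node features; edge_index: [2, num_edges]
        for layer in self.layers:
            x = self.act(layer(x, edge_index))
        return x  # [num_nodes, 192] per-node embeddings
```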

### Opponent Modeling and Adaptation
- Opponent history encoded via a lightweight MLP
- **FiLM adaptation layers** modulate policy activations based on opponent embedding
- Enables rapid adjustment to non-stationary strategies
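
FiLM conditioning is lightweight: the opponent embedding produces a per-channel scale and shift for the policy's hidden activations. A sketch (the 64-dim opponent embedding is an assumption):

```python
import torch.nn as nn

class FiLMAdapter(nn.Module):
    """Feature-wise Linear Modulation from an opponent embedding."""
    def __init__(self, opp_dim=64, hidden=192):
        super().__init__()
        self.proj = nn.Linear(opp_dim, 2 * hidden)

    def forward(self, h, opp_emb):
        # gamma/beta modulate hidden activations per channel;
        # gamma = beta = 0 recovers the unmodulated policy.
        gamma, beta = self.proj(opp_emb).chunk(2, dim=-1)
        return (1 + gamma) * h + beta
```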

### Action Head
- Portfolio-based action head with **6 latent strategies**
- Strategies mixed via learned attention
- Total policy parameters: **~6.8M**
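
One way to read the portfolio head, as a hedged sketch (layer shapes are assumptions): each latent strategy scores all 231 actions, and attention weights computed from the state embedding mix the per-strategy logits.

```python
import torch.nn as nn

class PortfolioHead(nn.Module):
    """Illustrative portfolio head: 6 latent strategies over 231 actions."""
    def __init__(self, hidden=192, n_strategies=6, n_actions=231):
        super().__init__()
        self.n_strategies, self.n_actions = n_strategies, n_actions
        self.strategy_scores = nn.Linear(hidden, n_strategies * n_actions)
        self.mix = nn.Linear(hidden, n_strategies)

    def forward(self, z):
        # z: [batch, 192] pooled state embedding
        logits = self.strategy_scores(z).view(
            -1, self.n_strategies, self.n_actions)
        w = self.mix(z).softmax(dim=-1).unsqueeze(-1)  # [batch, 6, 1]
        return (w * logits).sum(dim=1)  # [batch, 231] action logits
```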

---

## Training Pipeline

Training follows a multi-stage curriculum:

1. **Graph PPO Pretraining** (objective sketched after this list)  
   - PPO with clip ratio 0.2  
   - Discount factor γ = 0.99  
   - GAE λ = 0.95  
   - Trained against a diverse scripted opponent pool  

2. **Preference Generation via Rollouts** (labeling sketch after this list)  
   - ~800 intermediate states sampled  
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct  
   - Each proposal evaluated with 4 stochastic rollouts  
   - Higher-return actions labeled preferred  
   - ~2,300 preference pairs generated  

3. **Teacher Alignment**  
   - Supervised Fine-Tuning on the chosen actions  
   - Direct Preference Optimization against a frozen reference model  

4. **Policy Distillation**  
   - Aligned teacher generates state-to-action labels  
   - Graph policy trained via cross-entropy imitation  

5. **Final PPO Refinement**  
   - PPO resumes using environment rewards  
   - Stabilizes behavior after distillation  
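
For stage 1, the clipped surrogate objective with the listed hyperparameters reduces to the standard form below (a minimal sketch, not the training code):

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over one finished episode
    # (terminal value assumed 0), with the gamma/lambda listed above.
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        running = rewards[t] + gamma * next_v - values[t] + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_loss(logp_new, logp_old, advantages, clip=0.2):
    # Clipped surrogate objective with the stage-1 clip ratio of 0.2.
    ratio = (logp_new - logp_old).exp()
    clipped = ratio.clamp(1 - clip, 1 + clip)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```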

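Stage 2's labeling rule is equally direct: score each LLM-proposed action by the mean return of 4 stochastic rollouts and prefer the higher-scoring one. A sketch in which `simulate(state, action)` is a hypothetical helper that plays one rollout to the end of the game:

```python
def label_preference(state, action_a, action_b, simulate, n_rollouts=4):
    # Mean return over stochastic rollouts decides the preferred action;
    # ties produce no pair. `simulate` is hypothetical, not repo code.
    mean = lambda act: sum(simulate(state, act)
                           for _ in range(n_rollouts)) / n_rollouts
    ret_a, ret_b = mean(action_a), mean(action_b)
    if ret_a == ret_b:
        return None
    return (action_a, action_b) if ret_a > ret_b else (action_b, action_a)
```
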
---

## Evaluation Results

Evaluation uses **1,000 games** against a mixture of scripted and adaptive opponents.

| Agent | Win Rate | Risk Metric |
|------|---------|------------|
| PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
| Full curriculum | 78.4% | Exploitability proxy 0.48 |

- **Allocation collapse**: fraction of rounds placing more than 60% of units on a single battlefield  
- Distillation yields a **+9.5 point** win-rate gain over PPO  
- Full curriculum yields **+20 point** gain with reduced over-specialization  
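
Computing the collapse metric from logged allocations is a one-liner given the definition above:

```python
def allocation_collapse_rate(round_allocations, units=20, threshold=0.60):
    # Fraction of rounds placing more than 60% of the units
    # on a single battlefield.
    collapsed = sum(1 for alloc in round_allocations
                    if max(alloc) > threshold * units)
    return collapsed / len(round_allocations)
```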

These improvements arise from **risk calibration and opponent-aware adaptation**, not brute-force exploitation.

---

## Repository Contents

### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
- `policy_models/policy_final.pt`

### LLM Teacher Models
- `sft_model/` – supervised fine-tuned model
- `dpo_model/` – preference-aligned model

### Configuration and Logs
- `master_config.json` – training configuration
- `battleground_eval.json` – evaluation summaries

---

## Usage

### Load Policy

```python
import torch
from policy import GraphPolicy

# Constructor arguments are elided; see master_config.json for the
# architecture hyperparameters.
policy = GraphPolicy(...)
policy.load_state_dict(
    torch.load("policy_models/policy_final.pt", map_location="cpu")
)
policy.eval()
```

### Load Fine-Tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher; swap in "./dpo_model" for the DPO-aligned one
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference (illustrative prompt)
prompt = "You command 20 units across 3 battlefields. Propose an allocation."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** and rests on three claims:

- Language models function effectively as strategic prior generators when grounded by rollouts
- Graph-based representations enable cross-strategy generalization under compact policies
- Distillation transfers high-level reasoning into fast, deployable agents

### Key Innovations

1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
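
Innovation 2 leans on the fact that a Blotto round's outcome is a deterministic function of the two allocations, so any counterfactual action can be scored exactly once the opponent's allocation is revealed. Assuming the standard per-field payoff:

```python
def round_payoff(ours, theirs):
    # +1 per battlefield won, -1 per battlefield lost, 0 for ties.
    return sum((a > b) - (a < b) for a, b in zip(ours, theirs))
```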


## 📄 License

MIT License. See the LICENSE file for details.

## 🙏 Acknowledgments

- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
