---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
- meta-learning
license: mit
---

# Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation

![Status](https://img.shields.io/badge/status-trained-success) ![Framework](https://img.shields.io/badge/framework-PyTorch-orange) ![License](https://img.shields.io/badge/license-MIT-blue)

This repository contains trained **Colonel Blotto agents** developed for the **NeurIPS 2025 MindGames Workshop**. The system integrates a compact graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, improving strategic adaptation without increasing policy capacity.

---

## Overview

The approach combines:

- **Graph Attention Networks** for structured game-state encoding
- **Proximal Policy Optimization (PPO)** as the core learning algorithm
- **FiLM-based opponent adaptation** for fast response to opponent behavior
- **Rollout-grounded preference learning** using two large language models
- **Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher into an efficient policy

The goal is not to replace RL with language models, but to **inject strategic priors** learned by LLMs back into a lightweight, fast policy suitable for competitive play.

---

## Game Configuration

- **Game**: Colonel Blotto
- **Battlefields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Action space size**: 231 valid allocations (all ways to split 20 units across 3 battlefields: C(22, 2) = 231)
- **Evaluation protocol**: fixed pool of scripted and adaptive opponents

---

## Policy Architecture

### Graph-Based State Encoder

- Heterogeneous graph with **25–40 nodes**
- Node types include:
  - Battlefield nodes
  - Recent round summary nodes
  - Global state node
- Node feature dimension: **32**
- Encoder:
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192

### Opponent Modeling and Adaptation

- Opponent history encoded via a lightweight MLP
- **FiLM adaptation layers** modulate policy activations based on the opponent embedding
- Enables rapid adjustment to non-stationary strategies

### Action Head

- Portfolio-based action head with **6 latent strategies**
- Strategies mixed via learned attention
- Total policy parameters: **~6.8M**

---

## Training Pipeline

Training follows a multi-stage curriculum:

1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against a diverse scripted opponent pool
2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated with 4 stochastic rollouts
   - Higher-return actions labeled as preferred (a minimal labeling sketch appears after this pipeline)
   - ~2,300 preference pairs generated
3. **Teacher Alignment**
   - Supervised Fine-Tuning on chosen actions
   - Direct Preference Optimization using a frozen reference model
4. **Policy Distillation**
   - Aligned teacher generates state-to-action labels
   - Graph policy trained via cross-entropy imitation (see the distillation sketch below)
5. **Final PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes behavior after distillation
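Stage 2 is the part most specific to this system, so a minimal sketch may help make the labeling rule concrete. The helper names below (`env.clone`, `sim.play_out`, and the `proposal_a` / `proposal_b` allocations coming from the two teacher LLMs) are hypothetical stand-ins for whatever simulator interface the repository actually uses; only the labeling logic itself, averaging a few stochastic rollouts per candidate and marking the higher-return allocation as preferred, follows the description above.

```python
from statistics import mean

def rollout_value(env, state, action, num_rollouts=4, seed=0):
    """Estimate an action's value by averaging returns over a few stochastic
    rollouts. `env.clone` and `sim.play_out` are assumed simulator helpers."""
    returns = []
    for i in range(num_rollouts):
        sim = env.clone(state, seed=seed + i)   # independent copy of the game state
        returns.append(sim.play_out(action))    # play `action`, finish the game, return total reward
    return mean(returns)

def label_preference(env, state, proposal_a, proposal_b, num_rollouts=4):
    """Label the higher-return LLM proposal as chosen, the other as rejected."""
    value_a = rollout_value(env, state, proposal_a, num_rollouts)
    value_b = rollout_value(env, state, proposal_b, num_rollouts)
    chosen, rejected = (proposal_a, proposal_b) if value_a >= value_b else (proposal_b, proposal_a)
    return {
        "state": state,
        "chosen": chosen,                   # preferred allocation, e.g. (10, 6, 4)
        "rejected": rejected,
        "margin": abs(value_a - value_b),   # useful for filtering near-ties
    }
```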
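Stage 4 then reduces to ordinary cross-entropy imitation over the 231-way allocation space. The sketch below assumes a `policy` module that maps an encoded graph state to action logits and a batch with `graph_state` / `teacher_action` fields; these names are illustrative rather than the repository's actual interfaces. Because the teacher is only used offline to produce labels, inference cost stays that of the ~6.8M-parameter graph policy.

```python
import torch.nn.functional as F

def distillation_step(policy, optimizer, batch):
    """One imitation step: fit the policy's 231-way action logits to the
    teacher-chosen action indices (assumed batch layout, not the repo's code)."""
    logits = policy(batch["graph_state"])                     # assumed shape: [batch_size, 231]
    loss = F.cross_entropy(logits, batch["teacher_action"])   # teacher labels as class indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```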
## Evaluation Results

Evaluation uses **1,000 games** against a mixture of scripted and adaptive opponents.

| Agent | Win Rate | Risk Metric |
|-------|----------|-------------|
| PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
| Full curriculum | 78.4% | Exploitability proxy 0.48 |

- **Allocation collapse**: fraction of rounds placing more than 60% of units on a single battlefield
- Distillation yields a **+9.5 point** win-rate gain over PPO alone
- The full curriculum yields a **+20 point** gain with reduced over-specialization

These improvements arise from **risk calibration and opponent-aware adaptation**, not brute-force exploitation.

---

## Repository Contents

### Policy Checkpoints

- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
- `policy_models/policy_final.pt`

### LLM Teacher Models

- `sft_model/` – supervised fine-tuned model
- `dpo_model/` – preference-aligned model

### Configuration and Logs

- `master_config.json` – training configuration
- `battleground_eval.json` – evaluation summaries

---

## Usage

### Load Policy

```python
import torch
from policy import GraphPolicy

# GraphPolicy constructor arguments are elided here
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
policy.eval()
```

### Load Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT or DPO teacher model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference
prompt = "..."  # placeholder; replace with a game-state prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
```

## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** and makes three central claims:

- Language models function effectively as strategic prior generators when grounded by rollouts
- Graph-based representations enable cross-strategy generalization with compact policies
- Distillation transfers high-level reasoning into fast, deployable agents

### Key Innovations

1. **Heterogeneous Graph Representation**: novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: exploiting game determinism
3. **Multi-scale Representation**: field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: transferring strategic reasoning into efficient policies

## 📄 License

MIT License – see the LICENSE file for details.

## 🙏 Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU