---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
- meta-learning
license: mit
---
# Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation
![Status](https://img.shields.io/badge/status-trained-success)
![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
![License](https://img.shields.io/badge/license-MIT-blue)
This repository contains trained **Colonel Blotto agents** developed for the **NeurIPS 2025 MindGames Workshop**.
The system integrates a compact graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, enabling improved strategic adaptation without increasing policy capacity.
---
## Overview
The approach combines:
- **Graph Attention Networks** for structured game-state encoding
- **Proximal Policy Optimization (PPO)** as the core learning algorithm
- **FiLM-based opponent adaptation** for fast response to opponent behavior
- **Rollout-grounded preference learning** using two large language models
- **Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher into an efficient policy
The goal is not to replace RL with language models, but to **inject strategic priors** learned by LLMs back into a lightweight, fast policy suitable for competitive play.
---
## Game Configuration
- **Game**: Colonel Blotto
- **Battlefields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Action space size**: 231 valid allocations
- **Evaluation protocol**: Fixed scripted and adaptive opponent pool
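The action-space size follows directly from the configuration: an allocation is any split of 20 indistinguishable units across 3 battlefields, giving C(22, 2) = 231 options by stars-and-bars. A minimal sketch (illustrative, not part of the repository) that enumerates them:

```python
UNITS = 20

# Enumerate every ordered allocation (a, b, c) with a + b + c == UNITS
actions = [
    (a, b, UNITS - a - b)
    for a in range(UNITS + 1)
    for b in range(UNITS - a + 1)
]

# Stars-and-bars: C(20 + 3 - 1, 3 - 1) = C(22, 2) = 231
assert len(actions) == 231
```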
---
## Policy Architecture
### Graph-Based State Encoder
- Heterogeneous graph with **25–40 nodes**
- Node types include:
- Battlefield nodes
- Recent round summary nodes
- Global state node
- Node feature dimension: **32**
- Encoder:
- 3 Graph Attention layers
- 6 attention heads
- Hidden size 192
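The encoder source is not reproduced here, but a minimal sketch of an encoder with the stated dimensions, assuming PyTorch Geometric `GATConv` layers (the actual implementation may differ):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class BlottoGraphEncoder(nn.Module):
    """3 GAT layers, 6 heads, 32-d node features, 192-d hidden (6 heads x 32)."""
    def __init__(self, node_dim=32, heads=6, head_dim=32):
        super().__init__()
        hidden = heads * head_dim  # 192
        self.gat1 = GATConv(node_dim, head_dim, heads=heads)  # 32 -> 192
        self.gat2 = GATConv(hidden, head_dim, heads=heads)    # 192 -> 192
        self.gat3 = GATConv(hidden, head_dim, heads=heads)    # 192 -> 192

    def forward(self, x, edge_index):
        # x: (num_nodes, 32) node features for the 25-40 node game graph
        x = torch.relu(self.gat1(x, edge_index))
        x = torch.relu(self.gat2(x, edge_index))
        return self.gat3(x, edge_index)  # per-node 192-d embeddings
```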
### Opponent Modeling and Adaptation
- Opponent history encoded via a lightweight MLP
- **FiLM adaptation layers** modulate policy activations based on opponent embedding
- Enables rapid adjustment to non-stationary strategies
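FiLM conditioning is a standard recipe; a minimal sketch of how an opponent embedding could modulate policy activations (the 64-d opponent embedding and the layer placement are assumptions, not the repository's exact API):

```python
import torch.nn as nn

class FiLMAdapter(nn.Module):
    """Produce per-channel scale (gamma) and shift (beta) from an opponent embedding."""
    def __init__(self, opponent_dim=64, hidden_dim=192):
        super().__init__()
        self.film = nn.Linear(opponent_dim, 2 * hidden_dim)

    def forward(self, h, opponent_embedding):
        # h: (batch, 192) policy activations; opponent_embedding: (batch, 64)
        gamma, beta = self.film(opponent_embedding).chunk(2, dim=-1)
        return gamma * h + beta  # feature-wise affine modulation
```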
### Action Head
- Portfolio-based action head with **6 latent strategies**
- Strategies mixed via learned attention
- Total policy parameters: **~6.8M**
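One way to realize such a head (dimensions and the gating mechanism are illustrative assumptions): each latent strategy emits its own logits over the 231 allocations, and an attention-style gate computed from the state embedding mixes the resulting distributions.

```python
import torch
import torch.nn as nn

class PortfolioActionHead(nn.Module):
    """Mix 6 latent strategies into one distribution over 231 allocations."""
    def __init__(self, state_dim=192, n_strategies=6, n_actions=231):
        super().__init__()
        self.strategy_logits = nn.Linear(state_dim, n_strategies * n_actions)
        self.mix_attention = nn.Linear(state_dim, n_strategies)
        self.n_strategies, self.n_actions = n_strategies, n_actions

    def forward(self, state):
        # Per-strategy action logits: (batch, 6, 231)
        logits = self.strategy_logits(state).view(-1, self.n_strategies, self.n_actions)
        # Attention weights over the 6 strategies: (batch, 6, 1)
        weights = torch.softmax(self.mix_attention(state), dim=-1).unsqueeze(-1)
        # Weighted mixture of per-strategy distributions: (batch, 231)
        return (weights * torch.softmax(logits, dim=-1)).sum(dim=1)
```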
---
## Training Pipeline
Training follows a multi-stage curriculum (the rollout-based labeling in stage 2 is sketched after the list):
1. **Graph PPO Pretraining**
- PPO with clip ratio 0.2
- Discount factor γ = 0.99
- GAE λ = 0.95
- Trained against a diverse scripted opponent pool
2. **Preference Generation via Rollouts**
- ~800 intermediate states sampled
- Candidate actions proposed by:
- Llama 3.1 Instruct
- Qwen 2.5 Instruct
- Each proposal evaluated with 4 stochastic rollouts
- Higher-return actions labeled preferred
- ~2,300 preference pairs generated
3. **Teacher Alignment**
- Supervised Fine Tuning on chosen actions
- Direct Preference Optimization using frozen reference model
4. **Policy Distillation**
- Aligned teacher generates state-to-action labels
- Graph policy trained via cross-entropy imitation
5. **Final PPO Refinement**
- PPO resumes using environment rewards
- Stabilizes behavior after distillation
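A minimal sketch of the rollout-grounded labeling used in stage 2. The environment hook `simulate_rollout` is a hypothetical placeholder for the repository's game simulator; only the labeling logic is shown.

```python
import random
import statistics

def label_preference(state, action_a, action_b, n_rollouts=4):
    """Label the higher-return candidate action as 'chosen', the other as 'rejected'."""
    def mean_return(action):
        # simulate_rollout(...) is a hypothetical helper: plays `action` from `state`,
        # finishes the game stochastically, and returns the episode return
        return statistics.mean(
            simulate_rollout(state, action, seed=random.randrange(1 << 30))
            for _ in range(n_rollouts)
        )

    return_a, return_b = mean_return(action_a), mean_return(action_b)
    chosen, rejected = (action_a, action_b) if return_a >= return_b else (action_b, action_a)
    return {"state": state, "chosen": chosen, "rejected": rejected}
```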
---
## Evaluation Results
Evaluation uses **1,000 games** against a mixture of scripted and adaptive opponents.
| Agent | Win Rate | Risk Metric |
|------|---------|------------|
| PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
| Full curriculum | 78.4% | Exploitability proxy 0.48 |
- **Allocation collapse**: fraction of rounds placing more than 60% of units on a single battlefield
- Distillation yields a **+9.5 point** win-rate gain over PPO
- Full curriculum yields **+20 point** gain with reduced over-specialization
These improvements arise from **risk calibration and opponent-aware adaptation**, not brute-force exploitation.
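For concreteness, a minimal sketch of how the allocation-collapse metric could be computed from per-round allocations (the exact aggregation behind `battleground_eval.json` may differ):

```python
def allocation_collapse(rounds, threshold=0.60):
    """Fraction of rounds placing more than 60% of units on a single battlefield."""
    collapsed = sum(1 for alloc in rounds if max(alloc) > threshold * sum(alloc))
    return collapsed / len(rounds)

# (12, 4, 4) puts exactly 60% on one field (not collapsed); (13, 4, 3) exceeds it
print(allocation_collapse([(12, 4, 4), (13, 4, 3)]))  # 0.5
```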
---
## Repository Contents
### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
- `policy_models/policy_final.pt`
### LLM Teacher Models
- `sft_model/` – supervised fine-tuned model
- `dpo_model/` – preference-aligned model
### Configuration and Logs
- `master_config.json` – training configuration
- `battleground_eval.json` – evaluation summaries
---
## Usage
### Load Policy
```python
import torch
from policy import GraphPolicy

# Instantiate with the same architecture hyperparameters used during training
policy = GraphPolicy(...)

# Load the final checkpoint and switch to inference mode
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```
### Load Fine-Tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher; use "./dpo_model" for the preference-aligned teacher
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Query the teacher (the prompt below is an illustrative placeholder)
prompt = "Round 3 of 5. Opponent's last allocation: [8, 6, 6]. Propose an allocation of 20 units."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🎓 Research Context
This work targets the **NeurIPS 2025 MindGames Workshop**, arguing that:
- Language models function effectively as strategic prior generators when grounded by rollouts
- Graph-based representations enable cross-strategy generalization under compact policies
- Distillation transfers high-level reasoning into fast, deployable agents
### Key Innovations
1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
## 📄 License
MIT License. See the LICENSE file for details.
## 🙏 Acknowledgments
- Built for **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU