|
|
--- |
|
|
tags: |
|
|
- reinforcement-learning |
|
|
- game-theory |
|
|
- colonel-blotto |
|
|
- neurips-2025 |
|
|
- graph-neural-networks |
|
|
- preference-learning |
|
|
- llm-distillation |
|
|
- meta-learning |
|
|
license: mit |
|
|
--- |
|
|
|
|
|
# Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation |
|
|
|
|
|
|
|
|
|
|
This repository contains trained **Colonel Blotto agents** developed for the **NeurIPS 2025 MindGames Workshop**. |
|
|
The system integrates a compact graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, enabling improved strategic adaptation without increasing policy capacity. |
|
|
|
|
|
--- |
|
|
|
|
|
## Overview |
|
|
|
|
|
The approach combines: |
|
|
|
|
|
- **Graph Attention Networks** for structured game-state encoding |
|
|
- **Proximal Policy Optimization (PPO)** as the core learning algorithm |
|
|
- **FiLM-based opponent adaptation** for fast response to opponent behavior |
|
|
- **Rollout-grounded preference learning** using two large language models |
|
|
- **Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)** for teacher alignment
|
|
- **Knowledge distillation** from the aligned teacher into an efficient policy |
|
|
|
|
|
The goal is not to replace RL with language models, but to **inject strategic priors** learned by LLMs back into a lightweight, fast policy suitable for competitive play. |
|
|
|
|
|
--- |
|
|
|
|
|
## Game Configuration |
|
|
|
|
|
- **Game**: Colonel Blotto |
|
|
- **Battlefields**: 3 |
|
|
- **Units per round**: 20 |
|
|
- **Rounds per game**: 5 |
|
|
- **Action space size**: 231 valid allocations (every way to split 20 units across 3 battlefields; see the enumeration sketch after this list)
|
|
- **Evaluation protocol**: Fixed scripted and adaptive opponent pool |
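The 231 figure is the number of ways to split 20 identical units across 3 battlefields, i.e. the compositions of 20 into 3 nonnegative parts. A quick way to enumerate and verify it:

```python
UNITS = 20

# All compositions of 20 into 3 nonnegative parts: pick the first two
# battlefield counts; the third is determined.
actions = [
    (a, b, UNITS - a - b)
    for a in range(UNITS + 1)
    for b in range(UNITS - a + 1)
]

assert len(actions) == 231  # stars and bars: C(20 + 2, 2)
```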
|
|
|
|
|
--- |
|
|
|
|
|
## Policy Architecture |
|
|
|
|
|
### Graph-Based State Encoder |
|
|
- Heterogeneous graph with **25–40 nodes** |
|
|
- Node types include: |
|
|
- Battlefield nodes |
|
|
- Recent round summary nodes |
|
|
- Global state node |
|
|
- Node feature dimension: **32** |
|
|
- Encoder (see the sketch after this list):
|
|
- 3 Graph Attention layers |
|
|
- 6 attention heads |
|
|
- Hidden size 192 |
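A minimal sketch of an encoder with this shape, assuming PyTorch Geometric (class and argument names are illustrative, not the repo's exact modules):

```python
import torch.nn as nn
from torch_geometric.nn import GATConv

class BlottoGraphEncoder(nn.Module):
    """Three GAT layers with 6 heads, hidden size 192, 32-dim node features."""

    def __init__(self, in_dim=32, hidden=192, heads=6, num_layers=3):
        super().__init__()
        dims = [in_dim] + [hidden] * num_layers
        self.convs = nn.ModuleList([
            # concat=False averages the heads so the width stays at 192
            GATConv(dims[i], dims[i + 1], heads=heads, concat=False)
            for i in range(num_layers)
        ])
        self.act = nn.ELU()

    def forward(self, x, edge_index):
        # x: [num_nodes, 32] node features; edge_index: [2, num_edges]
        for conv in self.convs:
            x = self.act(conv(x, edge_index))
        return x  # [num_nodes, 192] node embeddings
```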
|
|
|
|
|
### Opponent Modeling and Adaptation |
|
|
- Opponent history encoded via a lightweight MLP |
|
|
- **FiLM adaptation layers** modulate policy activations based on the opponent embedding (see the sketch after this list)
|
|
- Enables rapid adjustment to non-stationary strategies |
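A minimal sketch of the mechanism (dimensions are illustrative; for instance, the last 5 rounds over 3 battlefields for both players would give a 30-dim history input):

```python
import torch.nn as nn

class OpponentFiLM(nn.Module):
    """An MLP encodes opponent history into an embedding, from which FiLM
    produces per-feature scale (gamma) and shift (beta) for the policy."""

    def __init__(self, history_dim=30, opp_dim=64, feat_dim=192):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(history_dim, 128), nn.ReLU(), nn.Linear(128, opp_dim)
        )
        self.to_gamma_beta = nn.Linear(opp_dim, 2 * feat_dim)

    def forward(self, features, opp_history):
        gamma, beta = self.to_gamma_beta(self.encoder(opp_history)).chunk(2, dim=-1)
        # Identity map when gamma = beta = 0, so adaptation starts near neutral
        return (1 + gamma) * features + beta
```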
|
|
|
|
|
### Action Head |
|
|
- Portfolio-based action head with **6 latent strategies** |
|
|
- Strategies mixed via learned attention (see the sketch after this list)
|
|
- Total policy parameters: **~6.8M** |
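A minimal sketch of such a head (names and wiring are illustrative):

```python
import torch
import torch.nn as nn

class PortfolioActionHead(nn.Module):
    """Six latent strategy heads mixed by attention weights derived from
    the state embedding; outputs logits over the 231 valid allocations."""

    def __init__(self, embed_dim=192, num_strategies=6, num_actions=231):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(embed_dim, num_actions) for _ in range(num_strategies)]
        )
        self.mixer = nn.Linear(embed_dim, num_strategies)

    def forward(self, z):
        weights = torch.softmax(self.mixer(z), dim=-1)           # [B, 6]
        logits = torch.stack([h(z) for h in self.heads], dim=1)  # [B, 6, 231]
        return (weights.unsqueeze(-1) * logits).sum(dim=1)       # [B, 231]
```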
|
|
|
|
|
--- |
|
|
|
|
|
## Training Pipeline |
|
|
|
|
|
Training follows a multi-stage curriculum (minimal sketches of the computations in steps 1–3 follow the list):
|
|
|
|
|
1. **Graph PPO Pretraining** |
|
|
- PPO with clip ratio 0.2 |
|
|
- Discount factor γ = 0.99 |
|
|
- GAE λ = 0.95 |
|
|
- Trained against a diverse scripted opponent pool |
|
|
|
|
|
2. **Preference Generation via Rollouts** |
|
|
- ~800 intermediate states sampled |
|
|
- Candidate actions proposed by: |
|
|
- Llama 3.1 Instruct |
|
|
- Qwen 2.5 Instruct |
|
|
- Each proposal evaluated with 4 stochastic rollouts |
|
|
- Higher-return actions labeled preferred |
|
|
- ~2,300 preference pairs generated |
|
|
|
|
|
3. **Teacher Alignment** |
|
|
- Supervised Fine-Tuning (SFT) on chosen actions


- Direct Preference Optimization (DPO) using a frozen reference model
|
|
|
|
|
4. **Policy Distillation** |
|
|
- Aligned teacher generates state-to-action labels |
|
|
- Graph policy trained via cross-entropy imitation |
|
|
|
|
|
5. **Final PPO Refinement** |
|
|
- PPO resumes using environment rewards |
|
|
- Stabilizes behavior after distillation |
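For step 1, advantages follow the standard GAE recursion with γ = 0.99 and λ = 0.95 (single-episode sketch):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation; `values` carries one extra
    bootstrap entry beyond the final reward (0 at terminal states)."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Step 2's rollout-grounded labeling amounts to comparing mean returns; `rollout_return` below is a hypothetical helper that plays a stochastic game to completion from a state-action pair:

```python
def label_pair(state, action_a, action_b, rollout_return, n_rollouts=4):
    """Prefer whichever candidate action earns the higher mean return
    over n stochastic rollouts."""
    mean_a = sum(rollout_return(state, action_a) for _ in range(n_rollouts)) / n_rollouts
    mean_b = sum(rollout_return(state, action_b) for _ in range(n_rollouts)) / n_rollouts
    chosen, rejected = (action_a, action_b) if mean_a >= mean_b else (action_b, action_a)
    return {"state": state, "chosen": chosen, "rejected": rejected}
```

Step 3's DPO objective is the standard one, computed against the frozen reference model (`beta` is illustrative, not necessarily the repo's setting):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Push the policy's chosen-vs-rejected log-ratio above the frozen
    reference model's, through a log-sigmoid margin."""
    margin = beta * ((pi_chosen_logp - ref_chosen_logp)
                     - (pi_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(margin).mean()
```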
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
Evaluation uses **1,000 games** against a mixture of scripted and adaptive opponents. |
|
|
|
|
|
| Agent | Win Rate | Risk Metric |
|-------|----------|-------------|
| PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
| Full curriculum | 78.4% | Exploitability proxy 0.48 |
|
|
|
|
|
- **Allocation collapse**: fraction of rounds placing more than 60% of units on a single battlefield (see the sketch after this list)
|
|
- Distillation yields a **+9.5 point** win-rate gain over PPO |
|
|
- The full curriculum yields a **+20 point** gain with reduced over-specialization
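For concreteness, a minimal sketch of the allocation-collapse metric (the helper name is hypothetical):

```python
def allocation_collapse_rate(round_allocations, units=20, threshold=0.6):
    """Fraction of rounds in which any single battlefield receives more
    than `threshold` of that round's units."""
    collapsed = sum(max(alloc) > threshold * units for alloc in round_allocations)
    return collapsed / len(round_allocations)

# e.g. allocation_collapse_rate([(14, 3, 3), (7, 7, 6)]) == 0.5
```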
|
|
|
|
|
These improvements arise from **risk calibration and opponent-aware adaptation**, not brute-force exploitation. |
|
|
|
|
|
--- |
|
|
|
|
|
## Repository Contents |
|
|
|
|
|
### Policy Checkpoints |
|
|
- `policy_models/policy_after_ppo.pt` |
|
|
- `policy_models/policy_after_distill.pt` |
|
|
- `policy_models/policy_final.pt` |
|
|
|
|
|
### LLM Teacher Models |
|
|
- `sft_model/` – supervised fine-tuned model |
|
|
- `dpo_model/` – preference-aligned model |
|
|
|
|
|
### Configuration and Logs |
|
|
- `master_config.json` – training configuration |
|
|
- `battleground_eval.json` – evaluation summaries |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Load Policy |
|
|
|
|
|
```python
import torch
from policy import GraphPolicy

# Constructor arguments should mirror the architecture settings
# recorded in master_config.json; the ellipsis is a placeholder.
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```
|
|
|
|
|
|
|
|
### Load Fine-Tuned LLM
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT (or DPO) teacher model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference (the prompt below is illustrative)
prompt = "Allocate 20 units across 3 battlefields. Reply with three numbers."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
## 🎓 Research Context |
|
|
|
|
|
This work targets the **NeurIPS 2025 MindGames Workshop**, with three central claims:
|
|
|
|
|
- Language models function effectively as strategic prior generators when grounded by rollouts |
|
|
- Graph-based representations enable cross-strategy generalization under compact policies |
|
|
- Distillation transfers high-level reasoning into fast, deployable agents |
|
|
|
|
|
### Key Innovations |
|
|
|
|
|
1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states |
|
|
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism |
|
|
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings |
|
|
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies |
|
|
|
|
|
|
|
|
## 📄 License |
|
|
|
|
|
MIT License. See the LICENSE file for details.
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
- Built for the **NeurIPS 2025 MindGames Workshop**
|
|
- Uses PyTorch, HuggingFace Transformers, and PEFT |
|
|
- Training infrastructure: NVIDIA H200 GPU |
|
|
|
|
|
|
|
|