---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
- meta-learning
license: mit
---

# Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation

![Status](https://img.shields.io/badge/status-trained-success) ![Framework](https://img.shields.io/badge/framework-PyTorch-orange) ![License](https://img.shields.io/badge/license-MIT-blue)

This repository contains trained **Colonel Blotto agents** developed for the **NeurIPS 2025 MindGames Workshop**. The system integrates a compact graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, improving strategic adaptation without increasing policy capacity.

---

## Overview

The approach combines:

- **Graph Attention Networks** for structured game-state encoding
- **Proximal Policy Optimization (PPO)** as the core learning algorithm
- **FiLM-based opponent adaptation** for fast response to opponent behavior
- **Rollout-grounded preference learning** using two large language models
- **Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher into an efficient policy

The goal is not to replace RL with language models, but to **inject strategic priors** learned by LLMs back into a lightweight, fast policy suitable for competitive play.

---

## Game Configuration

- **Game**: Colonel Blotto
- **Battlefields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Action space size**: 231 valid allocations (all ways to split 20 units across 3 battlefields: C(22, 2) = 231)
- **Evaluation protocol**: fixed pool of scripted and adaptive opponents

---

## Policy Architecture

### Graph-Based State Encoder

- Heterogeneous graph with **25–40 nodes**
- Node types include:
  - Battlefield nodes
  - Recent round summary nodes
  - Global state node
- Node feature dimension: **32**
- Encoder:
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192

### Opponent Modeling and Adaptation

- Opponent history encoded via a lightweight MLP
- **FiLM adaptation layers** modulate policy activations based on the opponent embedding
- Enables rapid adjustment to non-stationary strategies

### Action Head

- Portfolio-based action head with **6 latent strategies**
- Strategies mixed via learned attention
- Total policy parameters: **~6.8M**

---

## Training Pipeline

Training follows a multi-stage curriculum:

1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against a diverse scripted opponent pool
2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated with 4 stochastic rollouts
   - Higher-return actions labeled as preferred (a minimal labeling sketch appears after this pipeline)
   - ~2,300 preference pairs generated
3. **Teacher Alignment**
   - Supervised Fine-Tuning on chosen actions
   - Direct Preference Optimization using a frozen reference model
4. **Policy Distillation**
   - Aligned teacher generates state-to-action labels
   - Graph policy trained via cross-entropy imitation (see the distillation sketch below)
5. **Final PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes behavior after distillation
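Stage 2 is the part most specific to this system, so a minimal sketch may help make the labeling rule concrete. The helper names below (`env.clone`, `sim.play_out`, and the `proposal_a` / `proposal_b` allocations coming from the two teacher LLMs) are hypothetical stand-ins for whatever simulator interface the repository actually uses; only the labeling logic itself, averaging a few stochastic rollouts per candidate and marking the higher-return allocation as preferred, follows the description above.

```python
from statistics import mean

def rollout_value(env, state, action, num_rollouts=4, seed=0):
    """Estimate an action's value by averaging returns over a few stochastic
    rollouts. `env.clone` and `sim.play_out` are assumed simulator helpers."""
    returns = []
    for i in range(num_rollouts):
        sim = env.clone(state, seed=seed + i)   # independent copy of the game state
        returns.append(sim.play_out(action))    # play `action`, finish the game, return total reward
    return mean(returns)

def label_preference(env, state, proposal_a, proposal_b, num_rollouts=4):
    """Label the higher-return LLM proposal as chosen, the other as rejected."""
    value_a = rollout_value(env, state, proposal_a, num_rollouts)
    value_b = rollout_value(env, state, proposal_b, num_rollouts)
    chosen, rejected = (proposal_a, proposal_b) if value_a >= value_b else (proposal_b, proposal_a)
    return {
        "state": state,
        "chosen": chosen,                   # preferred allocation, e.g. (10, 6, 4)
        "rejected": rejected,
        "margin": abs(value_a - value_b),   # useful for filtering near-ties
    }
```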
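Stage 4 then reduces to ordinary cross-entropy imitation over the 231-way allocation space. The sketch below assumes a `policy` module that maps an encoded graph state to action logits and a batch with `graph_state` / `teacher_action` fields; these names are illustrative rather than the repository's actual interfaces. Because the teacher is only used offline to produce labels, inference cost stays that of the ~6.8M-parameter graph policy.

```python
import torch.nn.functional as F

def distillation_step(policy, optimizer, batch):
    """One imitation step: fit the policy's 231-way action logits to the
    teacher-chosen action indices (assumed batch layout, not the repo's code)."""
    logits = policy(batch["graph_state"])                     # assumed shape: [batch_size, 231]
    loss = F.cross_entropy(logits, batch["teacher_action"])   # teacher labels as class indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```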
## Evaluation Results

Evaluation uses **1,000 games** against a mixture of scripted and adaptive opponents.

| Agent | Win Rate | Risk Metric |
|-------|----------|-------------|
| PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
| Full curriculum | 78.4% | Exploitability proxy 0.48 |

- **Allocation collapse**: fraction of rounds placing more than 60% of units on a single battlefield
- Distillation yields a **+9.5 point** win-rate gain over PPO alone
- The full curriculum yields a **+20 point** gain with reduced over-specialization

These improvements arise from **risk calibration and opponent-aware adaptation**, not brute-force exploitation.

---

## Repository Contents

### Policy Checkpoints

- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
- `policy_models/policy_final.pt`

### LLM Teacher Models

- `sft_model/` – supervised fine-tuned model
- `dpo_model/` – preference-aligned model

### Configuration and Logs

- `master_config.json` – training configuration
- `battleground_eval.json` – evaluation summaries

---

## Usage

### Load Policy

```python
import torch
from policy import GraphPolicy

# GraphPolicy constructor arguments are elided here
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
policy.eval()
```

### Load Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT or DPO teacher model
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference
prompt = "..."  # placeholder; replace with a game-state prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
```

## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** and makes three central claims:

- Language models function effectively as strategic prior generators when grounded by rollouts
- Graph-based representations enable cross-strategy generalization with compact policies
- Distillation transfers high-level reasoning into fast, deployable agents

### Key Innovations

1. **Heterogeneous Graph Representation**: novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: exploiting game determinism
3. **Multi-scale Representation**: field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: transferring strategic reasoning into efficient policies

## 📄 License

MIT License – see the LICENSE file for details.

## 🙏 Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU