GOVINDFROM committed on
Commit 2136269 · verified · 1 Parent(s): 00011f2

Upload model card

Files changed (1): README.md +179 −0
---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- meta-learning
license: mit
---

# Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025

![Status](https://img.shields.io/badge/status-trained-success)
![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
![License](https://img.shields.io/badge/license-MIT-blue)

This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines reinforcement learning with large language model fine-tuning.

## 🎯 Model Overview

The system achieves strong play in Colonel Blotto through:

- **Graph Neural Networks** for game state representation
- **FiLM layers** for fast opponent adaptation
- **Meta-learning** over a portfolio of strategies
- **LLM fine-tuning** (SFT + DPO) for strategic reasoning
- **Distillation** from LLMs back to efficient RL policies

### Game Configuration

- **Fields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Training episodes**: 1000

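With 3 fields and 20 units per round, an action is one way to split the 20 units across the 3 fields, which by stars and bars gives C(22, 2) = 231 possible allocations. A small sketch (the enumeration order here is illustrative; the trained checkpoint's action indexing may differ):

```python
from math import comb

F, U = 3, 20  # fields, units per round

# Stars and bars: ways to split U identical units across F fields
n_actions = comb(U + F - 1, F - 1)

# Explicit enumeration (ordering is an assumption, not necessarily
# the indexing used by the trained policy)
allocations = [
    (a, b, U - a - b)
    for a in range(U + 1)
    for b in range(U - a + 1)
]
assert len(allocations) == n_actions  # 231
```

This 231 matches the `n_actions=231` passed to the policy network in the Usage section.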

## 📊 Performance Results

Full evaluation results are provided in `battleground_eval.json` and `eval_scripted_after_ppo.json` (see Repository Contents below).

## 🏗️ Architecture

### Policy Network

The core policy network consists of four components:

1. **Graph Encoder**: multi-layer Graph Attention Networks (GAT)
   - Heterogeneous nodes: field nodes, round nodes, summary node
   - Multi-head attention with 6 heads
   - 3 layers of message passing

2. **Opponent Encoder**: MLP-based encoder for opponent modeling
   - Processes opponent history
   - Learns behavioral patterns

3. **FiLM Layers**: Feature-wise Linear Modulation
   - Fast adaptation to opponent behavior
   - Conditioned on the opponent encoding

4. **Portfolio Head**: multi-strategy selection
   - 6 specialist strategy heads
   - Soft attention-based mixing

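The FiLM conditioning (component 3) amounts to a per-feature affine transform of the state features, with the scale and shift predicted from the opponent encoding. A minimal NumPy sketch of the operation (layer sizes and variable names are illustrative, not the checkpoint's actual shapes):

```python
import numpy as np

rng = np.random.default_rng(0)

hidden, opp_dim = 8, 4                  # illustrative sizes
x = rng.normal(size=(1, hidden))        # state features from the graph encoder
opp = rng.normal(size=(1, opp_dim))     # opponent encoding

# A linear layer maps the opponent encoding to per-feature (gamma, beta)
W = rng.normal(size=(opp_dim, 2 * hidden))
gamma, beta = np.split(opp @ W, 2, axis=-1)

# FiLM: feature-wise scale and shift conditioned on the opponent
modulated = gamma * x + beta
assert modulated.shape == x.shape
```

Because gamma and beta are recomputed from the opponent encoding every step, the same trunk can behave differently against different opponents without any weight updates.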
### Training Pipeline

The models were trained through a 7-phase pipeline:

1. **Phase A**: Environment setup and action space generation
2. **Phase B**: PPO training against diverse scripted opponents
3. **Phase C**: Preference dataset generation (LLM vs. LLM rollouts)
4. **Phase D**: Supervised Fine-Tuning (SFT) of the base LLM
5. **Phase E**: Direct Preference Optimization (DPO)
6. **Phase F**: Knowledge distillation from the LLM to the policy
7. **Phase G**: PPO refinement after distillation

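Phase E (DPO) trains directly on the Phase C preference pairs: it widens the log-probability margin of the chosen response over the rejected one, measured relative to the frozen SFT reference model. A minimal sketch of the standard per-pair DPO loss (the log-probability values and β here are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probs
    under the trained policy (pi_*) and the frozen SFT reference (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy still matches the reference, the margin is 0
# and the loss is log 2; it falls as the chosen margin grows.
loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```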
## 📦 Repository Contents

### Policy Models

- `policy_models/policy_after_ppo.pt`: checkpoint after Phase B PPO training
- `policy_models/policy_after_distill.pt`: checkpoint after Phase F distillation
- `policy_models/policy_final.pt`: final checkpoint, after Phase G PPO refinement

### Fine-tuned LLM Models

- `sft_model/`: SFT model (HuggingFace Transformers compatible)
- `dpo_model/`: DPO model (HuggingFace Transformers compatible)

### Configuration & Results

- `master_config.json`: Complete training configuration
- `battleground_eval.json`: Comprehensive evaluation results
- `eval_scripted_after_ppo.json`: Post-PPO evaluation

## 🚀 Usage

### Loading Policy Model

```python
import json

import torch
from your_policy_module import PolicyNet

# Load configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize policy
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # C(22, 2) ways to split 20 units over 3 fields
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"],
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```

### Loading Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT model (use "./dpo_model" for the DPO model)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Inference (the prompt format here is illustrative)
prompt = "You have 20 units and 3 fields. Allocate your units."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop**, with a focus on:

- **Strategic game AI** beyond traditional game-theoretic approaches
- **Hybrid systems** combining neural RL and LLM reasoning
- **Fast adaptation** to diverse opponents through meta-learning
- **Efficient deployment** via distillation

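The distillation step (Phase F) can be read as standard policy distillation: the student policy is trained to match the teacher's action distribution, typically by minimizing the cross-entropy (equivalently, KL divergence up to the teacher's entropy) between them. The exact loss used in this pipeline is not specified in the card; a minimal NumPy sketch of the usual formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_probs):
    """Cross-entropy H(teacher, student) between the teacher's action
    distribution and the student policy; minimized when they match."""
    log_p = np.log(softmax(student_logits))
    return -(teacher_probs * log_p).sum()

teacher = np.array([0.7, 0.2, 0.1])     # teacher's (LLM-derived) action probs
matched = np.log(teacher)               # student logits that reproduce the teacher
mismatched = np.zeros(3)                # uniform student

assert distill_loss(matched, teacher) < distill_loss(mismatched, teacher)
```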
### Key Innovations

1. **Heterogeneous Graph Representation**: novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: exploiting the game's determinism
3. **Multi-scale Representation**: field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: transferring strategic reasoning to efficient policies

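The counterfactual point (innovation 2) rests on Blotto's deterministic payoff: once the opponent's allocation is revealed, the outcome of *any* alternative action can be computed exactly, with no extra environment rollouts. A sketch of the round payoff under the standard majority rule (the exact scoring used in training is an assumption):

```python
def round_payoff(mine, theirs):
    """Fields won minus fields lost for one Blotto round (ties count zero)."""
    score = 0
    for m, t in zip(mine, theirs):
        if m > t:
            score += 1
        elif m < t:
            score -= 1
    return score

# Deterministic payoffs make counterfactuals exact: evaluate any
# alternative allocation against the opponent's revealed move.
opponent = (10, 5, 5)
assert round_payoff((11, 6, 3), opponent) == 1   # win two fields, lose one
assert round_payoff((0, 10, 10), opponent) == 1  # counterfactual: also +1
```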

## 📝 Citation

If you use this work, please cite:

```bibtex
@misc{colonelblotto2025neurips,
  title={Advanced Reinforcement Learning System for Colonel Blotto Games},
  author={NeurIPS 2025 MindGames Submission},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}},
}
```

## 📄 License

MIT License. See the LICENSE file for details.

## 🙏 Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU

---

**Generated**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
**Uploaded from**: Notebook Environment