---
tags:
- colonel-blotto
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
- meta-learning
license: mit
---

# Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation




This repository contains trained **Colonel Blotto agents** developed for the **NeurIPS 2025 MindGames Workshop**. The system integrates a compact graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, enabling improved strategic adaptation without increasing policy capacity.

---

## Overview

The approach combines:

- **Graph Attention Networks** for structured game-state encoding
- **Proximal Policy Optimization (PPO)** as the core learning algorithm
- **FiLM-based opponent adaptation** for fast response to opponent behavior
- **Rollout-grounded preference learning** using two large language models
- **Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher into an efficient policy

The goal is not to replace RL with language models, but to **inject strategic priors** learned by LLMs back into a lightweight, fast policy suitable for competitive play.

---

## Game Configuration

- **Game**: Colonel Blotto
- **Battlefields**: 3
- **Units per round**: 20
- **Rounds per game**: 5
- **Action space size**: 231 valid allocations (verified in the sketch below)
- **Evaluation protocol**: fixed pool of scripted and adaptive opponents
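
The action count follows from stars and bars: splitting 20 indistinguishable units across 3 battlefields gives C(20+2, 2) = C(22, 2) = 231 allocations. A quick standalone check (not code from this repository):

```python
# Enumerate every allocation (a, b, c) with a + b + c = 20 and a, b, c >= 0.
allocations = [
    (a, b, 20 - a - b)
    for a in range(21)
    for b in range(21 - a)
]
assert len(allocations) == 231  # matches C(22, 2)
```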

---

## Policy Architecture

### Graph-Based State Encoder

- Heterogeneous graph with **25–40 nodes**
- Node types include:
  - Battlefield nodes
  - Recent round summary nodes
  - Global state node
- Node feature dimension: **32**
- Encoder (sketched below):
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
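
As a rough illustration of those dimensions, a three-layer Graph Attention encoder with 6 heads and hidden size 192 (32 dimensions per head) might look like the following. The class name and the use of PyTorch Geometric's `GATConv` are assumptions for illustration, not the repository's actual code:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv  # assumes PyTorch Geometric is available

class GATEncoder(nn.Module):
    """3 GAT layers, 6 heads, hidden size 192 (6 heads x 32 dims, concatenated)."""

    def __init__(self, in_dim=32, hidden=192, heads=6, n_layers=3):
        super().__init__()
        dims = [in_dim] + [hidden] * n_layers
        self.convs = nn.ModuleList(
            GATConv(dims[i], dims[i + 1] // heads, heads=heads)
            for i in range(n_layers)
        )

    def forward(self, x, edge_index):
        # x: [n_nodes, 32] node features; edge_index: [2, n_edges] graph edges
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
        return x  # [n_nodes, 192] node embeddings
```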

### Opponent Modeling and Adaptation

- Opponent history encoded via a lightweight MLP
- **FiLM adaptation layers** modulate policy activations based on the opponent embedding (see the sketch below)
- Enables rapid adjustment to non-stationary strategies
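
FiLM conditioning computes a per-feature scale and shift from the opponent embedding. A generic sketch, with illustrative dimensions rather than the repository's own:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation of policy activations."""

    def __init__(self, feat_dim=192, cond_dim=64):  # dims are illustrative
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, h, opp_emb):
        # h: policy activations; opp_emb: opponent-history embedding
        gamma, beta = self.to_gamma_beta(opp_emb).chunk(2, dim=-1)
        return (1 + gamma) * h + beta  # identity when gamma = beta = 0
```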

### Action Head

- Portfolio-based action head with **6 latent strategies**
- Strategies mixed via learned attention (see the sketch below)
- Total policy parameters: **~6.8M**
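
A minimal sketch of a portfolio head that mixes six specialist heads with soft attention weights; names and exact wiring are assumptions:

```python
import torch
import torch.nn as nn

class PortfolioHead(nn.Module):
    """Six specialist heads over the 231-action space, mixed by soft attention."""

    def __init__(self, hidden=192, n_strat=6, n_actions=231):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, n_actions) for _ in range(n_strat))
        self.mixer = nn.Linear(hidden, n_strat)

    def forward(self, z):
        # z: [batch, hidden] state embedding from the encoder
        logits = torch.stack([h(z) for h in self.heads], dim=1)       # [batch, 6, 231]
        weights = torch.softmax(self.mixer(z), dim=-1).unsqueeze(-1)  # [batch, 6, 1]
        return (weights * logits).sum(dim=1)                          # [batch, 231]
```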

---

## Training Pipeline

Training follows a multi-stage curriculum:

1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2 (clipped objective sketched below)
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against a diverse scripted opponent pool
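
For reference, the standard PPO clipped surrogate with clip ratio 0.2 (a textbook sketch, not the repository's training loop):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_ratio=0.2):
    """Clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```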

2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated with 4 stochastic rollouts
   - Higher-return actions labeled preferred (labeling rule sketched below)
   - ~2,300 preference pairs generated
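
The labeling rule reduces to comparing Monte-Carlo return estimates. A schematic version, where `rollout_return` is a hypothetical helper that plays the action and finishes the game with the current policy:

```python
def mean_return(state, action, n_rollouts=4):
    """Average return over stochastic rollouts (rollout_return is hypothetical)."""
    return sum(rollout_return(state, action) for _ in range(n_rollouts)) / n_rollouts

def label_preference(state, action_a, action_b):
    """Return a (chosen, rejected) pair ranked by rollout-estimated value."""
    if mean_return(state, action_a) >= mean_return(state, action_b):
        return action_a, action_b
    return action_b, action_a
```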

3. **Teacher Alignment**
   - Supervised fine-tuning on the chosen actions
   - Direct Preference Optimization against a frozen reference model (loss sketched below)
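
The DPO objective scores each preference pair against the frozen reference model. A standard formulation, where β = 0.1 is an illustrative default rather than the repository's setting:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss from summed log-probs under the policy (pi_*) and the
    frozen reference (ref_*)."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```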

4. **Policy Distillation**
   - The aligned teacher generates state-to-action labels
   - The graph policy is trained via cross-entropy imitation (sketched below)
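
A minimal distillation step, assuming batches of graph states paired with teacher-chosen action indices; the batch layout is an assumption:

```python
import torch.nn.functional as F

def distill_step(policy, optimizer, states, teacher_actions):
    """One imitation update: cross-entropy between policy logits and
    the teacher's chosen actions (batch layout is illustrative)."""
    logits = policy(states)  # [batch, 231] action logits
    loss = F.cross_entropy(logits, teacher_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```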

5. **Final PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes behavior after distillation

---

## Evaluation Results

Evaluation uses **1,000 games** against a mixture of scripted and adaptive opponents.

| Agent | Win Rate | Risk Metric |
|-------|----------|-------------|
| PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
| PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
| Full curriculum | 78.4% | Exploitability proxy 0.48 |

- **Allocation collapse**: fraction of rounds placing more than 60% of units on a single field (computed as in the sketch below)
- Distillation yields a **+9.5 point** win-rate gain over PPO alone
- The full curriculum yields a **+20 point** gain with reduced over-specialization

These improvements arise from **risk calibration and opponent-aware adaptation**, not brute-force exploitation.
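
Concretely, the collapse metric can be computed per game as follows (a standalone sketch):

```python
def allocation_collapse(rounds, threshold=0.6):
    """Fraction of rounds placing more than `threshold` of units on one field.
    `rounds` holds per-round allocations, e.g. [(14, 3, 3), (7, 7, 6)]."""
    collapsed = sum(max(r) > threshold * sum(r) for r in rounds)
    return collapsed / len(rounds)

# (14, 3, 3) collapses (14 > 12 of 20 units); (7, 7, 6) does not.
assert allocation_collapse([(14, 3, 3), (7, 7, 6)]) == 0.5
```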

---

## Repository Contents

### Policy Checkpoints

- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`
- `policy_models/policy_final.pt`

### LLM Teacher Models

- `sft_model/` – supervised fine-tuned model
- `dpo_model/` – preference-aligned model

### Configuration and Logs

- `master_config.json` – training configuration
- `battleground_eval.json` – evaluation summaries

---

## Usage

### Load Policy

```python
import torch
from policy import GraphPolicy

# Constructor arguments are elided here; the architecture settings
# (hidden size, GNN layers/heads, strategy count) live in master_config.json.
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
policy.eval()
```

### Loading Fine-tuned LLM
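
A minimal loading sketch, assuming the standard Hugging Face `transformers` API and the `dpo_model/` directory listed above; the prompt text is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# dpo_model/ is the preference-aligned teacher from Repository Contents.
tokenizer = AutoTokenizer.from_pretrained("dpo_model/")
model = AutoModelForCausalLM.from_pretrained("dpo_model/")

prompt = "Round 3 of 5. Allocate 20 units across 3 battlefields. Opponent history: ..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```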

---

This work targets the **NeurIPS 2025 MindGames Workshop**, with three central findings:

- Language models function effectively as strategic prior generators when grounded by rollouts
- Graph-based representations enable cross-strategy generalization with compact policies
- Distillation transfers high-level reasoning into fast, deployable agents

### Key Innovations

3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies

## 📄 License

Released under the MIT License.