GOVINDFROM committed on
Commit 4217294 · verified · 1 Parent(s): 2136269

Update README.md

Files changed (1)
  1. README.md +111 -89
README.md CHANGED
@@ -5,119 +5,155 @@ tags:
5
  - colonel-blotto
6
  - neurips-2025
7
  - graph-neural-networks
 
 
8
  - meta-learning
9
  license: mit
10
  ---
11
 
12
- # Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025
13
 
14
  ![Status](https://img.shields.io/badge/status-trained-success)
15
  ![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
16
  ![License](https://img.shields.io/badge/license-MIT-blue)
17
 
18
- This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines cutting-edge reinforcement learning with large language model fine-tuning.
 
19
 
20
- ## 🎯 Model Overview
 
 
 
 
21
 
22
- This is an advanced system that achieves strong performance on Colonel Blotto through:
 
 
 
 
 
23
 
24
- - **Graph Neural Networks** for game state representation
25
- - **FiLM layers** for fast opponent adaptation
26
- - **Meta-learning** for strategy portfolios
27
- - **LLM fine-tuning** (SFT + DPO) for strategic reasoning
28
- - **Distillation** from LLMs back to efficient RL policies
29
 
30
- ### Game Configuration
31
 
32
- - **Fields**: 3
33
- - **Units per round**: 20
34
- - **Rounds per game**: 5
35
- - **Training episodes**: 1000
 
 
36
 
37
- ## 📊 Performance Results
38
 
 
39
 
40
- ## 🏗️ Architecture
41
 
42
- ### Policy Network
 
 
 
 
43
 
44
- The core policy network uses a sophisticated architecture:
 
 
 
 
 
 
 
45
 
46
- 1. **Graph Encoder**: Multi-layer Graph Attention Networks (GAT)
47
- - Heterogeneous nodes: field nodes, round nodes, summary node
48
- - Multi-head attention with 6 heads
49
- - 3 layers of message passing
50
 
51
- 2. **Opponent Encoder**: MLP-based encoder for opponent modeling
52
- - Processes opponent history
53
- - Learns behavioral patterns
54
 
55
- 3. **FiLM Layers**: Feature-wise Linear Modulation
56
- - Fast adaptation to opponent behavior
57
- - Conditioned on opponent encoding
58
 
59
- 4. **Portfolio Head**: Multi-strategy selection
60
- - 6 specialist strategy heads
61
- - Soft attention-based mixing
62
 
63
- ### Training Pipeline
64
 
65
- The models were trained through a comprehensive 7-phase pipeline:
66
 
67
- 1. **Phase A**: Environment setup and action space generation
68
- 2. **Phase B**: PPO training against diverse scripted opponents
69
- 3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
70
- 4. **Phase D**: Supervised Fine-Tuning (SFT) of base LLM
71
- 5. **Phase E**: Direct Preference Optimization (DPO)
72
- 6. **Phase F**: Knowledge distillation from LLM to policy
73
- 7. **Phase G**: PPO refinement after distillation
74
 
75
- ## 📦 Repository Contents
 
 
76
 
77
- ### Policy Models
78
 
79
- - `policy_models/policy_final.pt`: PyTorch checkpoint
80
- - `policy_models/policy_after_distill.pt`: PyTorch checkpoint
81
- - `policy_models/policy_after_ppo.pt`: PyTorch checkpoint
82
 
83
- ### Fine-tuned LLM Models
84
 
85
- - `sft_model/`: SFT model (HuggingFace Transformers compatible)
86
- - `dpo_model/`: DPO model (HuggingFace Transformers compatible)
 
 
87
 
 
 
 
88
 
89
- ### Configuration & Results
 
 
90
 
91
- - `master_config.json`: Complete training configuration
92
- - `battleground_eval.json`: Comprehensive evaluation results
93
- - `eval_scripted_after_ppo.json`: Post-PPO evaluation
94
 
95
- ## 🚀 Usage
96
 
97
- ### Loading Policy Model
98
 
99
  ```python
100
  import torch
101
- from your_policy_module import PolicyNet
102
-
103
- # Load configuration
104
- with open("master_config.json", "r") as f:
105
-     config = json.load(f)
106
-
107
- # Initialize policy
108
- policy = PolicyNet(
109
-     Ff=config["F"],
110
-     n_actions=231,  # For F=3, U=20
111
-     hidden=config["hidden"],
112
-     gnn_layers=config["gnn_layers"],
113
-     gnn_heads=config["gnn_heads"],
114
-     n_strat=config["n_strat"]
115
- )
116
-
117
- # Load trained weights
118
  policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
119
  policy.eval()
120
- ```
121
 
122
  ### Loading Fine-tuned LLM
123
 
@@ -137,10 +173,9 @@ outputs = model.generate(**inputs, max_new_tokens=32)
137
 
138
  This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:
139
 
140
- - **Strategic game AI** beyond traditional game-theoretic approaches
141
- - **Hybrid systems** combining neural RL and LLM reasoning
142
- - **Fast adaptation** to diverse opponents through meta-learning
143
- - **Efficient deployment** via distillation
144
 
145
  ### Key Innovations
146
 
@@ -149,19 +184,6 @@ This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:
149
  3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
150
  4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
151
 
152
- ## 📝 Citation
153
-
154
- If you use this work, please cite:
155
-
156
- ```bibtex
157
- @misc{colonelblotto2025neurips,
158
- title={{Advanced Reinforcement Learning System for Colonel Blotto Games}},
159
- author={{NeurIPS 2025 MindGames Submission}},
160
- year={2025},
161
- publisher={HuggingFace Hub},
162
- howpublished={{\url{{https://huggingface.co/{repo_id}}}}},
163
- }
164
- ```
165
 
166
  ## 📄 License
167
 
 
5
  - colonel-blotto
6
  - neurips-2025
7
  - graph-neural-networks
8
+ - preference-learning
9
+ - llm-distillation
10
  - meta-learning
11
  license: mit
12
  ---
13
 
14
+ # Colonel Blotto: Graph-Based RL with LLM-Guided Preference Distillation
15
 
16
  ![Status](https://img.shields.io/badge/status-trained-success)
17
  ![Framework](https://img.shields.io/badge/framework-PyTorch-orange)
18
  ![License](https://img.shields.io/badge/license-MIT-blue)
19
 
20
+ This repository contains trained **Colonel Blotto agents** developed for the **NeurIPS 2025 MindGames Workshop**.
21
+ The system integrates a compact graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, enabling improved strategic adaptation without increasing policy capacity.
22
 
23
+ ---
24
+
25
+ ## Overview
26
+
27
+ The approach combines:
28
 
29
+ - **Graph Attention Networks** for structured game-state encoding
30
+ - **Proximal Policy Optimization (PPO)** as the core learning algorithm
31
+ - **FiLM-based opponent adaptation** for fast response to opponent behavior
32
+ - **Rollout-grounded preference learning** using two large language models
33
+ - **Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO)** for teacher alignment
34
+ - **Knowledge distillation** from the aligned teacher into an efficient policy
35
 
36
+ The goal is not to replace RL with language models, but to **inject strategic priors** learned by LLMs back into a lightweight, fast policy suitable for competitive play.
37
+
38
+ ---
 
 
39
 
40
+ ## Game Configuration
41
 
42
+ - **Game**: Colonel Blotto
43
+ - **Battlefields**: 3
44
+ - **Units per round**: 20
45
+ - **Rounds per game**: 5
46
+ - **Action space size**: 231 valid allocations
47
+ - **Evaluation protocol**: Fixed scripted and adaptive opponent pool
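
The action-space size can be checked by enumeration. The sketch below assumes that every round must allocate all 20 units, that units are indivisible, and that a battlefield may receive zero units; under those assumptions the count is C(22, 2) = 231, matching the figure above. The function name `enumerate_allocations` is illustrative and not taken from the repository.

```python
from itertools import combinations
from math import comb


def enumerate_allocations(units: int, fields: int) -> list[tuple[int, ...]]:
    """List every way to split `units` indivisible units across `fields` battlefields."""
    allocations = []
    # Stars and bars: pick positions for (fields - 1) dividers among
    # (units + fields - 1) slots; the gaps between dividers are the field sizes.
    slots = units + fields - 1
    for bars in combinations(range(slots), fields - 1):
        edges = (-1, *bars, slots)
        allocations.append(
            tuple(edges[i + 1] - edges[i] - 1 for i in range(fields))
        )
    return allocations


ACTIONS = enumerate_allocations(units=20, fields=3)
print(len(ACTIONS), comb(22, 2))
```

Indexing into `ACTIONS` gives one possible mapping from a discrete action id to a concrete allocation; the ordering used by the trained checkpoints is not documented here and may differ.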
48
 
49
+ ---
50
+
51
+ ## Policy Architecture
52
+
53
+ ### Graph-Based State Encoder
54
+ - Heterogeneous graph with **25–40 nodes**
55
+ - Node types include:
56
+   - Battlefield nodes
57
+   - Recent round summary nodes
58
+   - Global state node
59
+ - Node feature dimension: **32**
60
+ - Encoder:
61
+   - 3 Graph Attention layers
62
+   - 6 attention heads
63
+   - Hidden size 192
64
+
65
+ ### Opponent Modeling and Adaptation
66
+ - Opponent history encoded via a lightweight MLP
67
+ - **FiLM adaptation layers** modulate policy activations based on opponent embedding
68
+ - Enables rapid adjustment to non-stationary strategies
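
As a rough illustration of FiLM (feature-wise linear modulation), the opponent embedding is projected to a per-feature scale `gamma` and shift `beta`, which are then applied to the policy activations. The sketch below is dependency-free plain Python with hypothetical names (`linear`, `film`) and toy dimensions; the actual model presumably uses PyTorch layers, and its exact wiring is not shown in this README.

```python
def linear(x, weight, bias):
    """Dense layer: `weight` holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features, opponent_embedding, gamma_layer, beta_layer):
    """Apply gamma(z) * h + beta(z) element by element."""
    gamma = linear(opponent_embedding, *gamma_layer)
    beta = linear(opponent_embedding, *beta_layer)
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


hidden, embed = 4, 3
# Zero weights with gamma bias 1 and beta bias 0 make FiLM start as the identity.
gamma_layer = ([[0.0] * embed for _ in range(hidden)], [1.0] * hidden)
beta_layer = ([[0.0] * embed for _ in range(hidden)], [0.0] * hidden)

h = [0.5, -1.0, 2.0, 0.0]  # policy activations (toy values)
z = [0.1, 0.2, 0.3]        # opponent embedding (toy values)
print(film(h, z, gamma_layer, beta_layer))
```

Initializing the modulation as the identity is a common design choice rather than something stated in the source: it lets the policy behave as if unconditioned until the opponent encoder learns a useful signal.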
69
+
70
+ ### Action Head
71
+ - Portfolio-based action head with **6 latent strategies**
72
+ - Strategies mixed via learned attention
73
+ - Total policy parameters: **~6.8M**
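
One way to read "strategies mixed via learned attention" is as a mixture of per-strategy action distributions. The sketch below assumes the mixing happens at the probability level and that the attention weights come from a softmax over six scores; both are assumptions, and `mix_portfolio` is a hypothetical name. Random numbers stand in for the outputs of the trained heads.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_portfolio(strategy_logits, mixing_scores):
    """Blend per-strategy action distributions using attention weights."""
    weights = softmax(mixing_scores)
    strategy_probs = [softmax(row) for row in strategy_logits]
    n_actions = len(strategy_logits[0])
    return [
        sum(w * probs[a] for w, probs in zip(weights, strategy_probs))
        for a in range(n_actions)
    ]


rng = random.Random(0)
# 6 latent strategies, each scoring all 231 allocations (placeholder values).
strategy_logits = [[rng.gauss(0, 1) for _ in range(231)] for _ in range(6)]
mixing_scores = [rng.gauss(0, 1) for _ in range(6)]
action_probs = mix_portfolio(strategy_logits, mixing_scores)
print(len(action_probs), round(sum(action_probs), 6))
```

Because each strategy row and the mixing weights are normalized separately, the blended result is itself a valid distribution over actions.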
74
+
75
+ ---
76
 
77
+ ## Training Pipeline
78
 
79
+ Training follows a multi-stage curriculum:
80
 
81
+ 1. **Graph PPO Pretraining**
82
+ - PPO with clip ratio 0.2
83
+ - Discount factor γ = 0.99
84
+ - GAE λ = 0.95
85
+ - Trained against a diverse scripted opponent pool
86
 
87
+ 2. **Preference Generation via Rollouts**
88
+ - ~800 intermediate states sampled
89
+ - Candidate actions proposed by:
90
+   - Llama 3.1 Instruct
91
+   - Qwen 2.5 Instruct
92
+ - Each proposal evaluated with 4 stochastic rollouts
93
+ - Higher-return actions labeled preferred
94
+ - ~2,300 preference pairs generated
95
 
96
+ 3. **Teacher Alignment**
97
+ - Supervised fine-tuning (SFT) on chosen actions
98
+ - Direct Preference Optimization (DPO) using a frozen reference model
 
99
 
100
+ 4. **Policy Distillation**
101
+ - Aligned teacher generates state-to-action labels
102
+ - Graph policy trained via cross-entropy imitation
103
 
104
+ 5. **Final PPO Refinement**
105
+ - PPO resumes using environment rewards
106
+ - Stabilizes behavior after distillation
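
The preference-generation stage can be sketched as follows: two candidate actions for the same state are each scored with 4 stochastic rollouts, and the higher-scoring action becomes the "chosen" side of the pair. Averaging the rollout returns and discarding ties are assumptions; the README only says that higher-return actions are labeled preferred. `label_preference` and `toy_rollout` are hypothetical names, and `toy_rollout` is a stand-in for the real environment, not the repository's rollout code.

```python
import random
from statistics import mean


def toy_rollout(state, action, rng):
    """Hypothetical rollout: one round against a uniformly random opponent split."""
    cuts = sorted(rng.randint(0, 20) for _ in range(2))
    opponent = (cuts[0], cuts[1] - cuts[0], 20 - cuts[1])
    wins = sum(a > o for a, o in zip(action, opponent))
    losses = sum(a < o for a, o in zip(action, opponent))
    return wins - losses


def label_preference(state, action_a, action_b, rollout_fn, n_rollouts=4, rng=None):
    """Return (chosen, rejected) by mean rollout return, or None on a tie."""
    rng = rng or random.Random(0)
    score_a = mean([rollout_fn(state, action_a, rng) for _ in range(n_rollouts)])
    score_b = mean([rollout_fn(state, action_b, rng) for _ in range(n_rollouts)])
    if score_a == score_b:
        return None  # assumption: ties produce no training pair
    return (action_a, action_b) if score_a > score_b else (action_b, action_a)


rng = random.Random(0)
pair = label_preference(None, (7, 7, 6), (20, 0, 0), toy_rollout, rng=rng)
print(pair)
```

The resulting `(chosen, rejected)` tuples are the shape of data that SFT (chosen only) and DPO (both sides) consume in the teacher-alignment stage.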
107
 
108
+ ---
 
 
109
 
110
+ ## Evaluation Results
111
 
112
+ Evaluation uses **1,000 games** against a mixture of scripted and adaptive opponents.
113
 
114
+ | Agent | Win Rate | Risk Metric |
115
+ |------|---------|------------|
116
+ | PPO only | 58.4% ± 2.1 | Allocation collapse 14.2% |
117
+ | PPO + Distillation | 67.9% ± 1.8 | Allocation collapse 8.8% |
118
+ | Full curriculum | 78.4% | Exploitability proxy 0.48 |
 
 
119
 
120
+ - **Allocation collapse**: fraction of rounds placing >60% units on one field
121
+ - Distillation yields a **+9.5 point** win-rate gain over PPO
122
+ - Full curriculum yields **+20 point** gain with reduced over-specialization
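
The allocation-collapse metric defined above is straightforward to compute from logged allocations. The sketch below follows the stated definition (more than 60% of a round's units on a single field); the function name and the sample rounds are illustrative, not taken from `battleground_eval.json`.

```python
def allocation_collapse_rate(allocations, threshold=0.60):
    """Fraction of rounds in which one field gets more than `threshold` of the units."""
    if not allocations:
        raise ValueError("need at least one round")
    collapsed = sum(
        1 for alloc in allocations if max(alloc) / sum(alloc) > threshold
    )
    return collapsed / len(allocations)


# With 20 units, "more than 60%" means 13 or more units on one field.
rounds = [(7, 7, 6), (13, 4, 3), (12, 4, 4), (20, 0, 0)]
print(allocation_collapse_rate(rounds))
```

Note that the comparison is strict, so a 12-unit stack (exactly 60%) does not count as collapse under this reading of the definition.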
123
 
124
+ These improvements arise from **risk calibration and opponent-aware adaptation**, not brute-force exploitation.
125
 
126
+ ---
 
 
127
 
128
+ ## Repository Contents
129
 
130
+ ### Policy Checkpoints
131
+ - `policy_models/policy_after_ppo.pt`
132
+ - `policy_models/policy_after_distill.pt`
133
+ - `policy_models/policy_final.pt`
134
 
135
+ ### LLM Teacher Models
136
+ - `sft_model/` – supervised fine-tuned model
137
+ - `dpo_model/` – preference-aligned model
138
 
139
+ ### Configuration and Logs
140
+ - `master_config.json` – training configuration
141
+ - `battleground_eval.json` – evaluation summaries
142
 
143
+ ---
 
 
144
 
145
+ ## Usage
146
 
147
+ ### Load Policy
148
 
149
  ```python
150
  import torch
151
+ from policy import GraphPolicy
152
+
153
+ policy = GraphPolicy(...)
154
  policy.load_state_dict(torch.load("policy_models/policy_final.pt"))
155
  policy.eval()
156
+
157
 
158
  ### Loading Fine-tuned LLM
159
 
 
173
 
174
  This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:
175
 
176
+ - Language models as strategic prior generators, grounded by rollouts
177
+ - Graph-based representations for cross-strategy generalization under compact policies
178
+ - Distillation of high-level reasoning into fast, deployable agents
 
179
 
180
  ### Key Innovations
181
 
 
184
  3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
185
  4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
186
 
187
 
188
  ## 📄 License
189