ARKAISW committed on
Commit
84ccd7d
·
1 Parent(s): 9a6d252

Clean up unused hackathon markdown files and update setup script link

Files changed (10)
  1. README.md +1 -1
  2. blog_temp.md +69 -0
  3. fix.md +0 -336
  4. guidetofollow.md +0 -367
  5. more.md +0 -557
  6. plan.md +0 -63
  7. requirements.md +0 -150
  8. themes.md +0 -134
  9. train_hf.py +0 -438
  10. visualization.md +0 -316
README.md CHANGED
@@ -31,7 +31,7 @@ QuantHive is a PettingZoo AEC (Agent-Environment Cycle) environment where **thre
31
  | 📓 Kaggle Run | [Kaggle Notebook](https://www.kaggle.com/code/arka2930/notebook24ed9f9bff) |
32
  | 📔 **Colab Demo** | [Google Colab Notebook](https://colab.research.google.com/drive/1B-KIlGL9kHLMD1RLhgLV94-modKzPzfy?usp=sharing) |
33
  | 📝 **Submission Blog** | [QuantHive: Multi-Agent Governance (HF)](https://huggingface.co/spaces/ARKAISW/QuantHive/blob/main/blog.md) |
34
- | 🐍 Setup Script | [QuantHive Training Script](https://github.com/ARKAISW/multi-agent-trading-env/blob/master/train_hf.py) |
35
 
36
  ---
37
 
 
31
  | 📓 Kaggle Run | [Kaggle Notebook](https://www.kaggle.com/code/arka2930/notebook24ed9f9bff) |
32
  | 📔 **Colab Demo** | [Google Colab Notebook](https://colab.research.google.com/drive/1B-KIlGL9kHLMD1RLhgLV94-modKzPzfy?usp=sharing) |
33
  | 📝 **Submission Blog** | [QuantHive: Multi-Agent Governance (HF)](https://huggingface.co/spaces/ARKAISW/QuantHive/blob/main/blog.md) |
34
+ | 🐍 Setup Script | [QuantHive Training Notebook](https://github.com/ARKAISW/multi-agent-trading-env/blob/master/mate_training.ipynb) |
35
 
36
  ---
37
 
blog_temp.md ADDED
@@ -0,0 +1,69 @@
1
+ ---
2
+ title: "QuantHive: Teaching AI to Survive Being Wrong"
3
+ emoji: "🏛️"
4
+ colorFrom: "blue"
5
+ colorTo: "indigo"
6
+ sdk: "docker"
7
+ pinned: false
8
+ ---
9
+
10
+ # QuantHive: Teaching AI to Survive Being Wrong
11
+
12
+ Most people think trading is about predicting the next price movement.
13
+
14
+ The first lesson I learned from observing a real risk quant was that professional trading isn't primarily about prediction. It's mostly about surviving being wrong.
15
+
16
+ ### The Origin
17
+
18
+ I’m a Grade 12 student in India, and my older cousin is a risk quant. Early on, I got to see what real institutional finance looks like. It wasn’t about chaotic chart reading or betting on the next big breakout. It was a strict, highly disciplined system of constraints and balances. I learned early that real trading is not prediction; it's about controlled risk.
19
+
20
+ When I started experimenting with AI and Reinforcement Learning, I became fascinated by disciplined decision systems. Most AI trading environments in the open-source world are simple single-agent setups. They provide a model with price history and reward it solely for maximizing profit and loss.
21
+
22
+ But that's not how a hedge fund operates. If a human trader goes rogue, the risk desk intervenes forcefully. I wondered how AI would handle that if it were trained properly.
23
+
24
+ ### The Insight
25
+
26
+ That changed my perspective. The intriguing question was not whether AI could predict the next price movement.
27
+
28
+ **It was whether AI could learn institutional discipline.**
29
+
30
+ Could we train an AI not only to pursue profits but also to negotiate, comply, and adjust to shifting oversight? Could we create a system where governance isn’t a rigid rule but a conversation?
31
+
32
+ ### Entering the QuantHive
33
+
34
+ To address this, I built **QuantHive**—a governance-first trading environment that incorporates a multi-agent setup centered around PettingZoo’s AEC model. Instead of one reckless AI, I divided institutional trading into three opposing roles:
35
+
36
+ 1. **The Trader:** Aims to maximize profit and find alpha.
37
+ 2. **The Portfolio Manager:** Controls capital allocation and seeks steady growth without significant drawdowns.
38
+ 3. **The Risk Manager:** Has the authority to limit position sizes and reduce exposure forcefully if risks arise.
39
+
40
+ They interact through structured message passing and governance limits within the environment loop. The environment rewards survival, not recklessness. The Risk Manager is rewarded for limiting trades during risky drawdowns, while the Trader must figure out how to make money within the changing limits set by the others.
41
+
42
+ ### From Floats to Thoughts: Semantic Reasoning
43
+
44
+ The most valuable change came when training the Qwen 2.5 1.5B model with GRPO (Group Relative Policy Optimization).
45
+
46
+ At first, the agents received raw float arrays (e.g., `0.284`). But to truly achieve "Auditable AI," I shifted the environment to use **Semantic Reasoning**. Instead of a vector of 24 numbers, the AI "reads" the market state in human terms: *"RSI is 28.4 (oversold).”*
47
+
48
+ This simple change made the most of the LLM's pre-trained world knowledge. I trained the model against five reward verifiers, enforcing not only profit but also *Format, Alignment, Risk, and Governance.*
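The shift from raw floats to readable state can be pictured with a minimal sketch. The helper names `describe_rsi` and `describe_state`, the 30/70 RSI thresholds, and the `drawdown` feature are assumptions for illustration, not the project's actual code.

```python
def describe_rsi(rsi: float) -> str:
    """Render an RSI reading as a short sentence an LLM can read."""
    # 30 and 70 are the conventional oversold/overbought cut-offs (assumed here).
    if rsi < 30.0:
        label = "oversold"
    elif rsi > 70.0:
        label = "overbought"
    else:
        label = "neutral"
    return f"RSI is {rsi:.1f} ({label})."


def describe_state(features: dict) -> str:
    """Join per-feature sentences into one prompt fragment."""
    parts = []
    if "rsi" in features:
        parts.append(describe_rsi(features["rsi"]))
    if "drawdown" in features:
        # Hypothetical second feature, formatted as a percentage.
        parts.append(f"Current drawdown is {features['drawdown']:.1%}.")
    return " ".join(parts)


print(describe_state({"rsi": 28.4, "drawdown": 0.032}))
```

Each feature gets its own sentence, so a reader auditing the prompt can see exactly which numbers the model was shown.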
49
+
50
+ ### The Smoking Gun
51
+
52
+ After 250 steps of GRPO training, the most interesting result was how the Trader adapted. The Trader began anticipating interventions and made adjustments before being forced to.
53
+
54
+ Governance compliance rose from a random 7% to **88%**, and Risk Limit Adherence reached **93%** across held-out evaluation episodes in the governed environment.
55
+
56
+ But the best part is how it complies. Because I required the model to explain its actions in natural language, the trained agent now outputs statements like:
57
+
58
+ > *"...I also see that the portfolio's allocation of capital is nearing its limit (0.5). Given the Risk Manager's constraint on the size limit, I need to be cautious..."*
59
+
60
+ It doesn’t just follow the rules; it understands and explicitly references them before taking action.
61
+
62
+ ### The Broader Implication
63
+
64
+ Finance serves as a high-pressure test case. The larger question is whether autonomous systems can learn to operate under institutional oversight, justify their actions, and adapt to governance without hurting performance.
65
+
66
+ I set out to determine if AI could be taught institutional discipline. The surprising outcome was not that the model became more profitable first. It became more disciplined first.
67
+
68
+ ---
69
+ *Check out the full project on GitHub and see the live multi-agent choreography on our Hugging Face Space! All links are available in the repository [README.md](https://huggingface.co/spaces/ARKAISW/QuantHive/blob/main/README.md).*
fix.md DELETED
@@ -1,336 +0,0 @@
1
- # QuantHive Round2-Copy — Complete Fix List
2
-
3
- > **Context**: This project at `E:\Development\Round2 - Copy` is a PettingZoo AEC multi-agent trading environment for the OpenEnv Hackathon. It was forked from a working Gymnasium single-agent version (`Round2`). The core PettingZoo env (`env/multi_agent_env.py`) and a basic training script (`training/train_multi_agent.py`) have already been created, but several files still reference the old Gym version and critical deliverables are missing.
4
- >
5
- > **Goal**: Make this a complete, submission-ready hackathon entry. All edits happen in `E:\Development\Round2 - Copy`.
6
-
7
- ---
8
-
9
- ## PROJECT ARCHITECTURE (What Already Exists)
10
-
11
- - `env/multi_agent_env.py` — **NEW, DONE** — PettingZoo AECEnv with 3 agents:
12
- - `risk_manager_0`: obs=Box(24), action=Box(3) [size_limit, allow_new, force_reduce]
13
- - `portfolio_manager_0`: obs=Box(27), action=Box(2) [cap_alloc, override_strength]
14
- - `trader_0`: obs=Box(29), action=Dict{direction, size, sl, tp}
15
- - Turn order: RM → PM → Trader per market step
16
- - Inter-agent message passing: RM output → PM obs, RM+PM output → Trader obs
17
- - Adversarial rewards: RM rewarded for restricting during drawdown, Trader rewarded for PnL
18
- - `env/trading_env.py` — OLD Gymnasium env (keep for backward compat, used for data generation)
19
- - `env/state.py` — MarketState, PortfolioState, RiskState (shared by both envs)
20
- - `env/reward.py` — Reward functions + 5 GRPO verifiers (format, alignment, risk, profit, governance)
21
- - `training/train_multi_agent.py` — **NEW, DONE** — REINFORCE-style multi-agent training with rule-based policies
22
- - `training/train_grpo.py` — OLD GRPO training script for the Gym env
23
- - `api/server.py` — **PARTIALLY REWRITTEN** — imports updated to PettingZoo, `make_initial_state()` updated, but SimulationRunner still uses old Gym logic
24
- - `app.py` — Gradio/FastAPI launcher
25
- - `ui/` — React frontend (functional, shows agent messages + chart)
26
- - `openenv.yaml` — **STALE** — still points to `env.trading_env:TradingEnv`
27
- - `README.md` — **STALE** — describes the old Gym governance-in-env design
28
- - `WRITEUP.md` — **STALE** — describes single-agent architecture
29
- - `mate_training.ipynb` — **STALE** — Colab notebook for old Gym env
30
- - `Dockerfile` — Functional but missing `pettingzoo` dependency
31
- - `plots/` — Has old training plots from Gym version
32
-
33
- ---
34
-
35
- ## CHANGES NEEDED (In Priority Order)
36
-
37
- ---
38
-
39
- ### 🔴 1. Fix `openenv.yaml` — Points to Wrong Environment
40
-
41
- **File**: `openenv.yaml`
42
-
43
- Change `entry_point` from the old Gym env to the new PettingZoo env. Update observation space to reflect multi-agent structure.
44
-
45
- ```yaml
46
- # OpenEnv Manifesto
47
- version: "1.0"
48
- name: "QuantHive"
49
- description: "Decentralized multi-agent trading governance — three independent RL agents (Risk Manager, Portfolio Manager, Trader) with adversarial rewards negotiate via PettingZoo AEC turns."
50
- author: "Arka Sarkar"
51
-
52
- # Environment Specification
53
- environment:
54
-   entry_point: "env.multi_agent_env:MultiAgentTradingEnv"
55
-   type: "pettingzoo_aec"
56
-   agents:
57
-     - risk_manager_0
58
-     - portfolio_manager_0
59
-     - trader_0
60
-   observation_space:
61
-     risk_manager_0: { shape: [24], dtype: "float32", description: "Market + portfolio + risk state" }
62
-     portfolio_manager_0: { shape: [27], dtype: "float32", description: "Base obs + RM constraints [size_limit, allow_new, force_reduce]" }
63
-     trader_0: { shape: [29], dtype: "float32", description: "Base obs + RM constraints + PM allocation [cap_alloc, override_strength]" }
64
-   action_space:
65
-     risk_manager_0:
66
-       type: "box"
67
-       shape: [3]
68
-       description: "[size_limit (0-1), allow_new_positions (0-1), force_reduce (0-1)]"
69
-     portfolio_manager_0:
70
-       type: "box"
71
-       shape: [2]
72
-       description: "[capital_allocation (0-1), override_strength (0-1)]"
73
-     trader_0:
74
-       type: "dict"
75
-       items:
76
-         direction: { type: "int", low: 0, high: 2, description: "0=Hold, 1=Buy, 2=Sell" }
77
-         size: { type: "float", low: 0.0, high: 1.0 }
78
-         sl: { type: "float", description: "Stop Loss price" }
79
-         tp: { type: "float", description: "Take Profit price" }
80
-
81
- server:
82
-   port: 7860
83
-   endpoints:
84
-     reset: "/reset"
85
-     step: "/step"
86
-     state: "/state"
87
-
88
- tags:
89
- - "PettingZoo AEC"
90
- - "Multi-Agent"
91
- - "Adversarial Rewards"
92
- - "Financial Governance"
93
- - "Inter-Agent Negotiation"
94
- - "Self-Regulation"
95
- ```
96
-
97
- ---
98
-
99
- ### 🔴 2. Add `pettingzoo` to Dependencies
100
-
101
- **File**: `requirements.txt` — add `pettingzoo>=1.24.0`
102
- **File**: `requirements-space.txt` — add `pettingzoo>=1.24.0`
103
-
104
- ---
105
-
106
- ### 🔴 3. Finish `api/server.py` — Complete SimulationRunner Rewrite
107
-
108
- **File**: `api/server.py`
109
-
110
- The imports and `make_initial_state()` have been updated. The `SimulationRunner` class and the API endpoints (`/reset`, `/step`, `/state`) still use the old `TradingEnv.step()` loop. They must be rewritten to:
111
-
112
- 1. **SimulationRunner** must instantiate `MultiAgentTradingEnv` instead of `TradingEnv`
113
- 2. **Each simulation step** must run a full AEC cycle (RM → PM → Trader) using `env.agent_iter()`
114
- 3. Use the rule-based policies from `training/train_multi_agent.py` (`RuleRiskManagerPolicy`, `RulePortfolioManagerPolicy`, `RuleTraderPolicy`) as the default agent policies for the demo
115
- 4. After each AEC cycle, broadcast per-agent messages and negotiation state to the UI via `sim_state`
116
- 5. The `negotiation` field in `sim_state` must be populated with RM and PM messages each cycle
117
- 6. The `flow` field must log the per-agent turn messages (e.g., "RM: Size limit set to 0.35", "PM: Allocation capped at 0.5", "Trader: BUY 0.3 @ 50123.45")
118
-
119
- The OpenEnv facade endpoints must still work:
120
- - `POST /reset` → calls `env.reset()`, returns initial trader observation
121
- - `POST /step` → accepts a trader action dict, runs full AEC cycle (RM and PM use rule policies), returns trader's obs/reward/done/info
122
- - `GET /state` → calls `env.state()`, returns full shared state
123
-
124
- This is the most complex single change. The existing `SimulationRunner` class structure can be adapted — replace the inner loop body.
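As a rough illustration of the per-step AEC cycle described above, here is a minimal sketch. `StubAECEnv` is a hypothetical stand-in that only mimics the `agent_iter()` / `last()` / `step()` calls of a PettingZoo AEC environment, and the lambda policies are placeholders rather than the rule-based policies in `training/train_multi_agent.py`.

```python
class StubAECEnv:
    """Tiny stand-in that mimics the PettingZoo AEC calls used below."""

    agents = ["risk_manager_0", "portfolio_manager_0", "trader_0"]

    def __init__(self, max_cycles=2):
        self.max_cycles = max_cycles
        self._turn = 0

    def reset(self, seed=None):
        self._turn = 0

    def agent_iter(self):
        # Yield whichever agent acts next until the cycle budget is spent.
        while self._turn < self.max_cycles * len(self.agents):
            yield self.agents[self._turn % len(self.agents)]

    def last(self):
        # Same tuple order as PettingZoo:
        # (observation, reward, termination, truncation, info)
        return [0.0], 0.0, False, False, {}

    def step(self, action):
        self._turn += 1


def run_episode(env, policies):
    """Run RM -> PM -> Trader turns and collect a flow log for the UI."""
    env.reset(seed=0)
    flow = []
    for agent in env.agent_iter():
        obs, reward, termination, truncation, info = env.last()
        # A finished agent must be stepped with None in the AEC API.
        action = None if termination or truncation else policies[agent](obs)
        env.step(action)
        flow.append(f"{agent}: {action}")
    return flow


# Placeholder policies (assumed values, for illustration only).
policies = {
    "risk_manager_0": lambda obs: [0.35, 1.0, 0.0],
    "portfolio_manager_0": lambda obs: [0.5, 0.0],
    "trader_0": lambda obs: {"direction": 1, "size": 0.3},
}

for line in run_episode(StubAECEnv(max_cycles=1), policies):
    print(line)
```

In the real server, the `flow` list would feed the `flow` field of `sim_state`, and the stub would be replaced by `MultiAgentTradingEnv`.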
125
-
126
- ---
127
-
128
- ### 🔴 4. Generate Training Evidence (Plots)
129
-
130
- After the GRPO training pipeline (change #8) is working:
131
-
132
- - Run training for ≥100 GRPO steps
133
- - Save to `plots/`:
134
- - `reward_curve.png` — per-agent reward over training steps (RM, PM, Trader on same axes)
135
- - `loss_curve.png` — policy loss convergence
136
- - `baseline_comparison.png` — random vs trained agent performance per metric
137
- - Each plot must have labeled axes, a title, and a one-line caption
138
- - Commit these `.png` files to the repo
139
-
140
- ---
141
-
142
- ### 🔴 5. Deploy to HF Space
143
-
144
- - Update `Dockerfile` to install `pettingzoo` (add to `requirements-space.txt`)
145
- - Push to HF Space at `https://huggingface.co/spaces/ARKAISW/QuantHive`
146
- - Verify from a logged-out browser that `/reset`, `/step`, `/state` all return valid JSON
147
- - The Space must be public and cloneable
148
-
149
- ---
150
-
151
- ### 🟠 6. Rewrite `README.md`
152
-
153
- **File**: `README.md`
154
-
155
- The current README describes the old Gym-based governance-in-env design. Rewrite it to describe the PettingZoo architecture. Keep the same general structure but update all technical content:
156
-
157
- Key sections to change:
158
- - **Title/Tagline**: "Can three AI agents with conflicting goals learn to govern each other?" or similar
159
- - **The Problem**: Same framing (AI can't self-govern), but add: "Existing 'multi-agent' trading envs are single-agent with hardcoded rules pretending to be agents"
160
- - **The Solution**: Describe PettingZoo AEC with 3 independent agents, adversarial rewards, and inter-agent message passing. Remove all references to "governance lives in env.step()" — that was the old design. Now governance is emergent from agent interaction
161
- - **Environment section**: Update observation dimensions (RM=24, PM=27, Trader=29), explain message passing, show the AEC turn diagram
162
- - **Training section**: Update to reflect multi-agent GRPO, show per-agent reward curves
163
- - **Results section**: Update with new plot embeds and new metrics
164
- - **Theme alignment**: Explicitly cite Theme #1 (Multi-Agent Interactions) and sub-themes (Fleet AI Scalable Oversight, Halluminate Multi-Actor)
165
- - **Quick Launch**: Keep the same curl examples but verify they work with the new server
166
-
167
- Include a code example showing the multi-agent negotiation:
168
- ```python
169
- info["governance"] = {
170
-     "rm_message": [0.35, 1.0, 0.0],  # RM: limit 35%, allow new, don't force reduce
171
-     "pm_message": [0.50, 0.0],  # PM: 50% allocation, no override
172
-     "proposed": {"direction": 1, "size": 0.7},
173
-     "executed": {"direction": 1, "size": 0.35},  # RM clamped size from 0.7 to 0.35
174
-     "interventions": [{"agent": "RiskManager", "type": "size_clamp"}]
175
- }
176
- ```
177
-
178
- ---
179
-
180
- ### 🟠 7. Rewrite `WRITEUP.md`
181
-
182
- **File**: `WRITEUP.md`
183
-
184
- Rewrite the narrative:
185
- 1. **Problem**: Single-agent governance is fake — it's just business rules. True governance requires independent actors with conflicting incentives
186
- 2. **Insight**: PettingZoo AEC enables actual decentralized decision-making. RM is rewarded for restricting risk, Trader for profit, PM for balanced growth. Their tension creates emergent regulatory behavior
187
- 3. **Architecture**: 3-agent AEC cycle, inter-agent messages in observation space, adversarial reward structure
188
- 4. **Training**: Multi-agent GRPO with alternating optimization
189
- 5. **Results**: Per-agent reward curves, compliance rate improvement, RM learned to restrict, Trader learned to comply
190
- 6. **Why it matters**: First true PettingZoo multi-agent governance env for finance. Generalizes to healthcare/autonomous systems oversight
191
-
192
- ---
193
-
194
- ### 🟠 8. Build PettingZoo-Compatible GRPO Pipeline for Qwen 2.5
195
-
196
- **New File**: `training/train_grpo_multiagent.py`
197
-
198
- This is the most important training change. Create a GRPO trainer that:
199
-
200
- 1. Uses `MultiAgentTradingEnv` as the environment
201
- 2. Trains the Trader agent as a Qwen 2.5-1.5B model using Unsloth + TRL `GRPOTrainer`
202
- 3. RM and PM can use rule-based policies during Trader training (alternating optimization)
203
- 4. The Trader's prompt must include the RM/PM messages (constraints, allocation) as part of the state description so the LLM can reason about them
204
- 5. Adapt the 5 existing GRPO verifiers from `reward.py`:
205
- - `format_reward_func` — same (check `<thought>` + `<action>` tags)
206
- - `alignment_reward_func` — same (anti-hallucination)
207
- - `risk_reward_func` — update to use RM's `size_limit` from the message instead of hardcoded limit
208
- - `profit_reward_func` — same (direction vs price trend)
209
- - `governance_reward_func` — update to check if Trader's proposed size ≤ RM's size_limit (dynamic, not static)
210
- 6. The key differentiator: the governance verifier now checks compliance against *learned* RM constraints, not hardcoded ones. This means the Trader must learn to read and respect the RM message in its observation
211
-
212
- Example prompt format for Qwen:
213
- ```
214
- You are a trading agent in a multi-agent governance system.
215
- The Risk Manager has set the following constraints: size_limit=0.35, new_positions=allowed, force_reduce=no.
216
- The Portfolio Manager allocated: capital_cap=0.50, override=none.
217
- Market state: [... 24 values ...]
218
- Your task: Propose a trade action that maximizes profit while respecting the governance constraints.
219
- <thought>Your reasoning here</thought>
220
- <action>{"direction": 1, "size": 0.30, "sl": 49000, "tp": 52000}</action>
221
- ```
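A dynamic governance verifier of the kind item 5 describes might look like the sketch below. The function name `governance_reward`, its signature, and the reward values (1.0, -0.5, -1.0) are assumptions for illustration; the real `governance_reward_func` in `reward.py` may differ.

```python
import json
import re

# Matches the <action>...</action> block from the prompt format above.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def governance_reward(completion: str, size_limit: float) -> float:
    """Score a completion against the RM's current size_limit."""
    match = ACTION_RE.search(completion)
    if match is None:
        return -1.0  # no parseable action at all
    try:
        action = json.loads(match.group(1))
        size = float(action["size"])
    except (ValueError, KeyError, TypeError):
        return -1.0  # malformed JSON or missing/invalid size
    # Compliant only if the proposed size respects the dynamic limit.
    return 1.0 if 0.0 <= size <= size_limit else -0.5


sample = (
    "<thought>Limit is 0.35, so I stay under it.</thought>\n"
    '<action>{"direction": 1, "size": 0.30, "sl": 49000, "tp": 52000}</action>'
)
print(governance_reward(sample, size_limit=0.35))
```

Because `size_limit` is an argument rather than a constant, the same verifier can score the Trader against whatever the Risk Manager emitted on that step.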
222
-
223
- ---
224
-
225
- ### 🟠 9. Rewrite `mate_training.ipynb`
226
-
227
- **File**: `mate_training.ipynb`
228
-
229
- Rewrite the Colab notebook to:
230
- 1. Install pettingzoo, openenv, trl, unsloth
231
- 2. Import `MultiAgentTradingEnv`
232
- 3. Run GRPO training via the new `train_grpo_multiagent.py` pipeline
233
- 4. Generate and display loss/reward plots inline
234
- 5. Save plots as `.png` in the `plots/` directory
235
- 6. Must be fully re-runnable on Google Colab T4 GPU
236
-
237
- ---
238
-
239
- ### 🟡 10. Multi-Agent Reward Visualization Script
240
-
241
- **New File**: `training/plot_multiagent.py`
242
-
243
- Create a script that:
244
- - Loads training logs from the GRPO run
245
- - Plots per-agent rewards (RM, PM, Trader) on same axes
246
- - Plots governance intervention rate over training
247
- - Plots compliance rate (% of Trader actions passing without RM/PM override)
248
- - Saves all to `plots/` as `.png` with labeled axes and titles
249
-
250
- ---
251
-
252
- ### 🟡 11. Strengthen Theme #1 Alignment in README
253
-
254
- Add a dedicated section in README:
255
- ```markdown
256
- ## 🎯 Theme Alignment: Multi-Agent Interactions (Theme #1)
257
-
258
- QuantHive directly addresses Theme #1 and both sub-themes:
259
-
260
- - **Fleet AI — Scalable Oversight**: The Risk Manager and Portfolio Manager are oversight agents that monitor and constrain the Trader in real-time, creating scalable governance.
261
- - **Halluminate — Multi-Actor Environments**: Three independent actors with adversarial incentives negotiate through observation message-passing, producing emergent strategic behavior.
262
-
263
- The PettingZoo AEC architecture enables theory-of-mind reasoning: the Trader must model what constraints the Risk Manager will impose based on the current portfolio state.
264
- ```
265
-
266
- ---
267
-
268
- ### 🟡 12. Document Anti-Reward-Hacking in WRITEUP
269
-
270
- Add a section explaining how the adversarial reward structure inherently prevents gaming:
271
- - If the Trader learns to ignore RM limits → RM is rewarded for clamping → arms race
272
- - If RM always blocks → RM gets no upside from portfolio growth → it learns moderation
273
- - Multiple independent reward signals per agent (not one monolithic score)
274
- - Governance intervention log provides process-level reward, not just final outcome
275
-
276
- ---
277
-
278
- ### 🟡 13. Verify Curriculum Learning Works with PettingZoo Env
279
-
280
- Test that `MultiAgentTradingEnv(difficulty="easy")`, `"medium"`, `"hard"` all work correctly:
281
- - Run 10 episodes at each difficulty
282
- - Confirm the Trader gets non-zero reward at "easy" difficulty
283
- - Mention curriculum design in WRITEUP
284
-
285
- ---
286
-
287
- ### 🟢 14. Update UI to Show Agent Negotiation
288
-
289
- Update the React UI (`ui/src/`) to:
290
- - Show RM → PM → Trader turn order visually
291
- - Display RM message [size_limit, allow_new, force_reduce] and PM message [cap_alloc, override] each cycle
292
- - Flash when an intervention occurs (RM clamped size, PM vetoed trade)
293
- - Show per-agent reward bars
294
-
295
- ---
296
-
297
- ### 🟢 15. Prepare Slide Deck for 3-Min Pitch
298
-
299
- Create a 6-slide deck:
300
- 1. Problem: "AI agents can't govern each other"
301
- 2. Solution: PettingZoo AEC with 3 adversarial agents
302
- 3. Architecture: RM → PM → Trader cycle + message passing diagram
303
- 4. Key innovation: Adversarial rewards = emergent self-regulation
304
- 5. Results: Per-agent reward curves + compliance improvement
305
- 6. Demo: Live UI showing negotiation
306
-
307
- ---
308
-
309
- ### 🟢 16. Upload Trained Model to HF Hub
310
-
311
- After training completes:
312
- - Save the LoRA adapter for Qwen 2.5-1.5B
313
- - Upload to HF Hub (e.g., `ARKAISW/quanthive-trader-lora`)
314
- - Link from README
315
-
316
- ---
317
-
318
- ### 🟢 17. Record <2 Min Video Demo
319
-
320
- - Screen record the UI showing multi-agent negotiation
321
- - Show before/after: random Trader vs trained Trader
322
- - Upload to YouTube (URL only, no video files in repo)
323
- - Link from README
324
-
325
- ---
326
-
327
- ### 🟢 18. Run PettingZoo API Test
328
-
329
- Run PettingZoo's built-in compliance test to verify the env is properly implemented:
330
- ```python
331
- from pettingzoo.test import api_test
332
- from env.multi_agent_env import MultiAgentTradingEnv
333
- env = MultiAgentTradingEnv()
334
- api_test(env, num_cycles=50, verbose_progress=True)
335
- ```
336
- Fix any issues that arise. Mention passing this test in README as quality evidence.
guidetofollow.md DELETED
@@ -1,367 +0,0 @@
1
-
2
-
3
- Hackathon Self-Serve Guide: Build an RL
4
- Environment, Train an LLM, Ship a Demo
5
- 0) What you are building
6
- The core idea is not just to fine-tune a text model, but to build a specialized LLM system that
7
- can act inside an environment, get feedback, and improve through reinforcement learning. The
8
- practical stack discussed here is:
9
- Environment → verifier/reward functions → TRL trainer → Unsloth for efficiency →
10
- deployment on OpenEnv / Spaces.
11
- A strong project usually looks like one of these,
12
- Please refer to [External] Apr ‘26 OpenEnv Hackathon Themes
13
- for theme guidelines on
14
- selecting & forming problem statements.
15
- 1) Start with the right project idea
16
- Pick a task that has all three of these properties:
17
- - The model can act step by step
18
- - You can verify success programmatically
19
- - The task is hard enough to be interesting, but not so hard that the model never
20
- succeeds
21
- This last point matters a lot. RL only works if the probability of getting a good answer is
22
- greater than zero. If your task is so hard that the model never gets any reward, you will burn
23
- compute and learn nothing.
24
- Please refer to [External] Apr ‘26 OpenEnv Hackathon Themes
25
- for theme guidelines on
26
- selecting & forming problem statements.
27
- A useful rule: prefer tasks with crisp verification over tasks that only “look good” to a
28
- human. RL gets easier when the reward is objective.
29
-
30
- 2) Understand the minimum RL loop before you build
31
- At a high level, your loop is:
32
- - Give the model a prompt
33
- - Let it generate an action, strategy, answer, or code
34
- - Execute that output in an environment or verifier
35
- - Convert the result into a reward
36
- - Update the model so higher-reward behavior becomes more likely
37
- That is the practical mental model for RL here. The system samples many outputs, scores
38
- them, and shifts probability mass away from bad outputs and toward better ones.
39
- One especially useful framing is that RL is like a more efficient version of repeated in-context
40
- improvement. Instead of repeatedly stuffing previous examples into the context, you let
41
- backpropagation store what worked into the weights.
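The loop above can be sketched with a toy policy in plain Python. This is a simplified REINFORCE-style update over three discrete actions, not an LLM trainer; `toy_verifier`, the learning rate, and the step count are assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_verifier(action: int) -> float:
    """Stand-in for the environment: only action 2 counts as success."""
    return 1.0 if action == 2 else 0.0


def train(steps=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # the "weights" of the toy policy
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(3), weights=probs)[0]  # sample an output
        reward = toy_verifier(action)                     # score it
        for i in range(3):
            # Gradient of log pi(action) with respect to logit i.
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad               # reinforce what worked
    return softmax(logits)


final_probs = train()
print([round(p, 3) for p in final_probs])
```

Note that the update is zero whenever the reward is zero, which is the toy version of the earlier warning: if the policy never samples a rewarded output, nothing is learned.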
42
- 3) Decide whether you need SFT first
43
- Use this simple rule:
44
- ● If you have a lot of good data, use SFT
45
- ● If you do not have data but can verify outputs, use RL
46
- ● In many practical cases, do a little SFT first, then RL
47
- Why this matters:
48
- ● SFT is generally more sample-efficient
49
- ● RL is useful when you can test outcomes but cannot cheaply author ideal traces
50
- ● RL often needs some warm start, formatting priming, or easy tasks first so that good
51
- rollouts happen at all
52
- For hackathon teams, the best path is usually:
53
- - Start from a capable base/instruct model
54
- - Add light formatting or task scaffolding if needed
55
- - Use RL for improvement, not as magic from scratch
56
- 4) Design the environment before you design the trainer
57
-
58
- Treat the environment as a first-class artifact. It should define:
59
- ● reset(): start a fresh episode
60
- ● step(action): apply an action and return the next result
61
- ● state() / observation: what the agent sees
62
- ● reward: what counts as progress or success
63
- OpenEnv standardizes this so the same training code can work across many environments,
64
- instead of every team inventing a different API. That is one of the main reasons to use it in a
65
- hackathon.
66
- Think about your environment in this order:
67
- - What does the agent observe?
68
- - What actions can it take?
69
- - What ends an episode?
70
- - How do you compute reward?
71
- - How do you stop abuse, infinite loops, or cheating?
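To make the checklist concrete, here is a minimal sketch of an environment with `reset`, `step`, and `state`. `CountdownEnv` and `StepResult` are hypothetical names, and this is plain Python rather than the OpenEnv base classes; the step budget stands in for the guard against infinite loops.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: int
    reward: float
    done: bool


class CountdownEnv:
    """Toy task: move a counter to the target within a step budget."""

    def __init__(self, target: int = 3, max_steps: int = 10):
        self.target = target
        self.max_steps = max_steps
        self.count = 0
        self.steps = 0

    def reset(self) -> int:
        """Start a fresh episode and return the first observation."""
        self.count = 0
        self.steps = 0
        return self.count

    def step(self, action: int) -> StepResult:
        """Apply an action in {-1, 0, 1}; reject anything else."""
        if action not in (-1, 0, 1):
            raise ValueError(f"invalid action: {action}")
        self.steps += 1
        self.count += action
        solved = self.count == self.target
        out_of_time = self.steps >= self.max_steps  # blocks endless episodes
        return StepResult(self.count, 1.0 if solved else 0.0, solved or out_of_time)

    def state(self) -> dict:
        """Expose the full internal state for debugging or a /state endpoint."""
        return {"count": self.count, "steps": self.steps, "target": self.target}


env = CountdownEnv()
env.reset()
result = None
for _ in range(3):
    result = env.step(1)
print(result, env.state())
```

Validating actions inside `step` and capping episode length are the two cheapest answers to the last question in the list.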
72
- 5) Build the environment using OpenEnv
73
- The intended workflow is to bootstrap an environment skeleton and then fill in the behavior.
74
- OpenEnv’s CLI creates the scaffolding for you. The environment is implemented as a Python
75
- package and exposed via a FastAPI app.
76
- Your implementation typically defines:
77
- ● action dataclass
78
- ● observation dataclass
79
- ● state representation
80
- ● environment methods like reset and step
81
- ● FastAPI wrapper / client-server interface
82
- That gives you a clean separation:
83
- ● the environment handles world dynamics and scoring,
84
- ● the trainer handles optimization,
85
- ● and the model just learns to act inside the interface.
86
-
87
- 6) Keep the task simple at first
88
- Do not begin with your hardest benchmark. Start with the easiest version of your environment
89
- that still proves the concept. This is where curriculum learning helps.
90
- A good progression:
91
- - easy tasks with short horizons,
92
- - medium tasks with a little more branching,
93
- - harder tasks only after the model starts getting non-zero reward.
94
- The principle is simple: make success possible early. If the model never sees successful
95
- trajectories, learning stalls.
96
- 7) Design rewards carefully
97
- Your reward function is your task specification. If it is weak, incomplete, or easy to exploit, the
98
- model will optimize the wrong thing very efficiently.
99
- A strong reward design usually includes multiple components, for example:
100
- ● execution success,
101
- ● correctness,
102
- ● format compliance,
103
- ● timeouts,
104
- ● resource usage,
105
- ● safety constraints,
106
- ● and anti-cheating checks.
107
- One explicit recommendation was to use multiple independent reward functions, not just one.
108
- If you only have a single reward signal, it is easier for the model to hack it. Multiple
109
- independent checks reduce that risk.
110
- For example, for a coding environment:
111
- ● reward passing tests,
112
- ● penalize timeouts,
113
- ● reward format compliance,
114
- ● reject use of forbidden globals,
115
-
116
- ● and separately verify the function contract.
117
- 8) Protect yourself against reward hacking
118
- Reward hacking is one of the biggest practical failure modes. The model may learn shortcuts
119
- that maximize your reward without solving the real task. Examples mentioned include:
120
- ● editing timers,
121
- ● caching results,
122
- ● abusing globals,
123
- ● mutating protected state,
124
- ● or exploiting environment bugs.
125
- What to do:
126
- - Use multiple independent reward functions
127
- - Lock down execution where possible
128
- - Add time limits
129
- - Avoid unrestricted global state
130
- - Sample outputs frequently and inspect them
131
- - Terminate or roll back runs if behavior drifts badly
132
- A particularly practical recommendation was to use a locked-down function or restricted
133
- execution approach so the model cannot rely on undeclared globals or hidden cached state.
134
- Also, do not just let training run forever without checking generations. Periodic human
135
- inspection is still necessary.
136
- 9) Use process-aware feedback when you can
137
- Naively assigning the same final reward to every token is inefficient. If possible, use richer
138
- supervision that distinguishes good intermediate steps from bad ones. That is the idea behind
139
- process supervision.
140
- In practice, this can be approximated by:
141
- ● line-by-line checks,
142
- ● step-level verifiers,
143
- ● program trace analysis,
144
-
145
- ● or LLM-as-a-judge for intermediate reasoning.
146
- But be careful: LLM-as-a-judge can itself be gamed. Use it as one signal, not the only signal.
147
- For a hackathon, outcome-based verification plus a few lightweight process checks is usually
148
- the sweet spot.
149
- 10) Pick the right training stack
150
- The intended stack here is:
151
- ● TRL for RL training algorithms
152
- ● Unsloth to make RL training and inference more efficient
153
- ● OpenEnv to standardize environment interaction
154
- This combination works because:
155
- ● OpenEnv gives you a common environment interface
156
- ● TRL gives you RL trainers like GRPO
157
- ● Unsloth reduces memory use and improves efficiency on top of TRL
158
- One of the practical examples used the same prompt repeated many times, routed through an
159
- environment, with TRL driving training and Unsloth helping with performance.
160
- 11) Prefer GRPO / RLVR style training for verifiable tasks
161
- The RL setup discussed here leans toward RL with verifiable rewards:
162
- ● instead of a learned reward model,
163
- ● use a verifier, test harness, regex check, executor, or environment.
164
- GRPO was described as a more efficient evolution relative to older PPO-style setups,
165
- especially by simplifying away parts like the value model.
166
- For hackathon purposes, the key practical takeaway is:
167
- ● if the task is verifiable,
168
- ● build the verifier first,
169
- ● then plug that verifier into RL training.
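As a rough sketch of "verifier first, then RL": the snippet below scores a group of completions for the same prompt with a regex verifier, then converts the rewards into group-relative advantages, which is the step that lets GRPO-style training do without a value model. The `<answer>` tag format, the function names, and the use of the population standard deviation are assumptions for illustration; actual trainers may normalize differently.

```python
import re
from statistics import mean, pstdev


def verify(completion: str, expected: str) -> float:
    """Return 1.0 when the text inside <answer>...</answer> matches `expected`."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.S)
    return 1.0 if match and match.group(1).strip() == expected else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


completions = ["<answer>4</answer>", "<answer>5</answer>", "4", "<answer> 4 </answer>"]
rewards = [verify(c, "4") for c in completions]
advantages = group_advantages(rewards)
```

If every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason to avoid tasks where success probability is zero.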
170
-
171
- 12) Keep inference fast
172
- One important point: in RL for LLMs, inference can dominate total runtime. Over time, rollout
173
- generation often becomes the bottleneck, not the optimizer step.
174
- That means your project speed depends heavily on:
175
- ● fast sampling,
176
- ● tight environment loops,
177
- ● low-overhead execution,
178
- ● and efficient model runtime.
179
- This is one reason Unsloth matters in the stack, and another reason to avoid overly heavy
180
- environments early in the hackathon.
181
- 13) Deploy your environment early
182
- OpenEnv environments are designed to be deployed as Hugging Face Spaces, which provide:
183
- ● a running server,
184
- ● a Git repository,
185
- ● and a container registry.
186
- That gives you several ways to work:
187
- ● interact with the remote Space directly,
188
- ● install the client code from the repo,
189
- ● pull and run the container locally,
190
- ● or run the FastAPI app locally via Python/Uvicorn.
191
- Why this is good for a hackathon:
192
- ● one shared source of truth,
193
- ● easier collaboration,
194
- ● easier demos,
195
- ● easier switching between local and remote execution.
196
- A good habit is to deploy an early version of the environment before training seriously. That
197
- catches API and packaging issues early.
198
-
199
- 14) Scale only after the environment is stable
200
- There was a dedicated tutorial flow around:
201
- - environment,
202
- - deployment,
203
- - scaling,
204
- - training with TRL and Wordle.
205
- Follow the same order.
206
- Do not start with scale. First confirm:
207
- ● reset works,
208
- ● step works,
209
- ● rewards are sensible,
210
- ● timeouts work,
211
- ● logs are visible,
212
- ● and the environment can be run locally and remotely.
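Part of the pre-scale checklist above can be automated with a small smoke test. The sketch below uses a toy `CountingEnv` as a stand-in; the class, the tuple returned by `step`, and the `[0, 1]` reward range are assumptions for illustration and do not reflect the OpenEnv API.

```python
class CountingEnv:
    """Toy stand-in environment: reach a target count within a step budget."""

    def __init__(self, target: int = 3, max_steps: int = 5):
        self.target = target
        self.max_steps = max_steps
        self.count = 0
        self.steps = 0

    def reset(self) -> dict:
        self.count = 0
        self.steps = 0
        return {"count": self.count}

    def step(self, action: int):
        self.steps += 1
        self.count += action
        # The episode ends on success or when the step budget (timeout) is hit.
        done = self.count >= self.target or self.steps >= self.max_steps
        reward = 1.0 if self.count == self.target else 0.0
        return {"count": self.count}, reward, done


def smoke_test(env, policy, step_budget: int = 100) -> dict[str, bool]:
    """Run one episode and report basic sanity checks."""
    obs = env.reset()
    checks = {"reset_returns_observation": obs is not None}
    rewards, done, steps = [], False, 0
    while not done and steps < step_budget:
        obs, reward, done = env.step(policy(obs))
        rewards.append(reward)
        steps += 1
    checks["episode_terminates"] = done
    checks["rewards_in_range"] = all(0.0 <= r <= 1.0 for r in rewards)
    return checks
```

Running the same test against the local and the deployed copy of an environment is a cheap way to confirm both behave identically before increasing batch sizes.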
213
- Only then:
214
- ● increase batch sizes,
215
- ● duplicate prompts or tasks,
216
- ● expand task diversity,
217
- ● and benchmark throughput.
218
- 15) Monitor the right things during training
219
- Do not watch only one scalar. Monitor:
220
- ● overall reward,
221
- ● individual reward function columns,
222
- ● success indicators,
223
- ● timeout frequency,
224
- ● and generated strategies over time.
225
- A very concrete suggestion was:
226
-
227
- ● watch whether the reward is going up,
228
- ● and separately watch critical columns like “function works.”
229
- Also inspect actual generations during training. A rising reward is not enough if the model is
230
- learning to exploit bugs.
231
- 16) Save models correctly
232
- If you use QLoRA / LoRA-style training, be careful when saving. One explicit warning was:
233
- Do not upcast a 4-bit model to 16-bit and then merge the LoRA weights naively. That can
234
- badly damage model quality. Instead, use the proper merged-save path, or use the adapters
235
- directly.
236
- For participants, that means:
237
- ● keep your training save path simple,
238
- ● test post-training inference immediately,
239
- ● and do not leave export until the end.
240
- 17) How to structure your team over the hackathon
241
- A very effective team split is:
242
- ## Person A: Environment
243
- ● builds reset/step/state
244
- ● adds timeouts and safety constraints
245
- ● makes local and remote execution work
246
- ## Person B: Verifier / Rewards
247
- ● writes multiple reward functions
248
- ● adds anti-hacking checks
249
- ● makes failure cases visible
250
- ## Person C: Training
251
- ● sets up TRL + Unsloth
252
- ● runs experiments
254
- ● tracks metrics and generations
255
- ## Person D: Demo / Product
256
- ● prepares the Space demo
257
- ● creates a simple interface
258
- ● records examples and final benchmarks
259
- This split matches the way the stack naturally decomposes in practice.
260
- 18) A practical 1-day execution plan
261
- Phase 1: Pick a narrow task
262
- Choose a small, verifiable environment. Avoid huge long-horizon tasks first.
263
- Phase 2: Build the environment
264
- Use OpenEnv init, implement reset/step/state, and get a local loop working.
265
- Phase 3: Build rewards
266
- Add at least 2–4 independent reward checks, plus timeout and anti-cheat logic.
267
- Phase 4: Deploy
268
- Push to a Space or run locally via container/Uvicorn so teammates can use the same
269
- environment.
270
- Phase 5: Train small
271
- Run a tiny TRL + Unsloth experiment first. Look at outputs, not just metrics.
272
- Phase 6: Inspect for hacking
273
- Sample generations. Check for globals, hacks, environment abuse, or suspicious shortcuts.
274
- Phase 7: Add curriculum
275
- If the model gets zero reward too often, simplify tasks or add easier start states.
276
-
277
- Phase 8: Train bigger
278
- Only after the loop is stable should you increase scale, batch size, or environment diversity.
279
- Phase 9: Save and demo
280
- Export the trained model correctly, test inference, and show before/after behavior.
281
- 19) What judges or reviewers will likely find compelling
282
- The strongest hackathon projects usually show:
283
- ● a clear environment design,
284
- ● objective reward functions,
285
- ● evidence that the model improved,
286
- ● safeguards against reward hacking,
287
- ● a reproducible deployment story,
288
- ● and a sharp demo.
289
- A simple but strong demo format is:
290
- - baseline model attempt,
291
- - reward/verifier output,
292
- - trained model attempt,
293
- - measurable improvement,
294
- - short explanation of safeguards.
295
- 20) Suggested problem statement theme directions
296
- Please refer to
297
- [External] Apr ‘26 OpenEnv Hackathon Themes
298
- 21) Common mistakes to avoid
299
- ● Picking a task so hard that success probability is zero
300
- ● Using only one reward function
301
- ● Not checking for reward hacking
302
- ● Training before the environment is stable
303
- ● Relying only on average reward and not inspecting outputs
305
- ● Forgetting timeouts and sandbox limits
306
- ● Saving LoRA/QLoRA models incorrectly
307
-
308
- ## 22) Learning Resources
-
- (Recommended) RL Environment Lecture Chapters: https://openenv-india-apr-2026.lovable.app/
-
- Module 1: Why OpenEnv? (~7 min)
- ▸ Workshop (8:02–15:05): https://www.youtube.com/watch?v=1jU05MlENOI&t=482s
- ▸ Sanyam: RL loop, fragmented env APIs, OpenEnv as universal interface, Gymnasium spec + Docker
- ▸ Alt: Mega Lecture (40:01–46:00): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=2401s
-
- Module 2: Using Existing Envs (~7.5 min)
- ▸ Workshop (35:33–43:05): https://www.youtube.com/watch?v=1jU05MlENOI&t=2133s
- ▸ Ben: Hub org, env collections, 3 Space interfaces (server/repo/registry), from_hub
- ▸ Alt: Mega Lecture (1:24:11–1:30:00): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5051s
-
- Module 3: Deploying Envs (~9 min)
- ▸ Mega Lecture (1:30:00–1:39:07): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5400s
- ▸ Ben: live openenv init, scaffold, running locally, openenv push, Docker run from Space
- ▸ Alt: Workshop (43:05–48:30): https://www.youtube.com/watch?v=1jU05MlENOI&t=2585s
-
- Module 4: Building Your Own (~6.5 min)
- ▸ Workshop (43:45–50:20): https://www.youtube.com/watch?v=1jU05MlENOI&t=2625s
- ▸ Ben: scaffold files, business logic (reset/step), models, client, publishing
- ▸ Alt: Mega Lecture (1:33:30–1:39:07): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5610s
-
- Module 5: Training + TRL (~14 min)
- ▸ Mega Lecture (1:53:20–2:07:12): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=6800s
- ▸ Lewis: Wordle GRPO walkthrough: rollout function, reward shaping, GRPOTrainer, live training
- ▸ Alt: Workshop (22:24–34:12): https://www.youtube.com/watch?v=1jU05MlENOI&t=1344s
more.md DELETED
@@ -1,557 +0,0 @@
1
- # OpenEnv Hackathon: Build at the Bleeding Edge of AI
2
-
3
- **Event:** India's Biggest Mega AI Hackathon
4
- **Built on:** Meta's OpenEnv (the foundation for next-gen RL environments used by leading AI labs)
5
- **Sponsored by:** Hugging Face, PyTorch
6
-
7
- **Grand Prize:** Winners get an interview opportunity at Meta & Hugging Face AI teams
8
-
9
- **Important Dates:**
10
- - Round 1 Begins: March 25th
11
- - Grand Finale (48-hour sprint in Bangalore): April 25th - 26th
12
-
13
-
14
- ## Results Announcement
15
-
16
- - Top 100 Finalists Announced: Friday, May 1st
17
- - Winners Livestream: Friday, May 8th
18
-
19
-
20
- ## Credits & Resources
21
-
22
- Get your credits for Cursor AI and Hugging Face as early as possible.
23
-
24
- **Cursor AI Credit:** Each participant is eligible. Visit the Scaler Hackathon dashboard to avail credits:
25
- https://tinyurl.com/sclr-openenv-dashboard
26
-
27
- **Hugging Face Credits:** $30 credit per person. Avail credits at:
28
- https://huggingface.co/coupons/claim/hf-openenv-community
29
-
30
- The same links will be shared in the on-campus Discord channels.
31
-
32
-
33
- ## Meet Your Mentors
34
-
35
- **Onsite / Available:**
36
- - Sanyam Bhutani - Partner Engineer, META
37
- - Yash Khare - Partner Engineer, META
38
- - Nilesh Pandey - Partner Engineer, META
39
- - Adithya S Kolavi - Engineer, Hugging Face
40
- - Adarsh Shirawalmath - ML Engineer, Hugging Face
41
- - Arkadip Maitra - ML Engineer, Red Hat
42
- - Aashay Sachdeva - Founding Team, Sarvam
43
- - Deepa Dhevannan - Gen AI Solution Architect
44
- - Soumik Rakshit - ML Engineer, Zomato
45
- - Ayush Satyam - ML Engineer, Red Hat
46
- - Parshant Sharma - ML Engineer, Red Hat
47
-
48
- **Remotely Available:**
49
- - Ben Burtenshaw - Community Education AI, Hugging Face
50
- - Alireza Shamsoshoara - PyTorch, Meta
51
-
52
-
53
- ## Discord Guidelines
54
-
55
- Important: Since global tech leaders and executives are present, a high level of professionalism and decorum must be maintained. Failure to follow the guidelines will lead to strict action and may impact your participation in the hackathon.
56
-
57
-
58
- ## Technical Session Agenda
59
-
60
- - PyTorch Foundation Introduction
61
- - Hackathon Themes
62
- - Submission and Judging Rules
63
- - RL 101 + OpenEnv Recap
64
- - Best Practices
65
- - Q&A
66
-
67
-
68
- ## About PyTorch Foundation
69
-
70
- **Mission:** Democratizing and accelerating the adoption of accessible, high-impact AI technologies by cultivating a robust ecosystem of open-source, vendor-neutral projects spanning the entire AI lifecycle.
71
-
72
- **Hosted Projects:** Multiple open-source projects under the foundation
73
-
74
-
75
- ## Hackathon Goals
76
-
77
- - Learn reinforcement learning (RL)
78
- - Now is a great time to learn RL
79
- - Hack and create cool environments you can use to add skills to models
80
- - Showcase your work on the Hugging Face Hub
81
- - Have fun
82
-
83
- **GitHub Repository:** https://github.com/meta-pytorch/OpenEnv
84
-
85
-
86
- ## Guidelines for Problem Statement
87
-
88
- - It is NOT mandatory to choose the same problem statement as Round 1. Only choose it if it aligns with the provided hackathon themes.
89
- - Before the onsite event (April 25-26): Work on building the environment, agent behaviors, and reward model.
90
- - Onsite (April 25-26): Post-training will be done when you receive compute credits for Hugging Face.
91
-
92
-
93
- ## What Judges Look For (TL;DR)
94
-
95
- Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story.
96
-
97
- A messy but ambitious environment with real training evidence beats a polished but boring one. Pick a problem that excites you (that energy comes through in the pitch).
98
-
99
- **Note:** Only one submission per team. Submit the URL of your environment, as judges will pull the environment from that URL to evaluate it. Changes after the deadline will not be considered.
100
-
101
-
102
- ## Judging Criteria
103
-
104
- | Criterion | Weight | What It Means |
105
- |-----------|--------|----------------|
106
- | Environment Innovation | 40% | Is the environment novel, creative, or genuinely challenging? Does it meaningfully test agent behavior in a new way? |
107
- | Showing Improvement in Rewards | 20% | Is there observable evidence of training progress? Reward curves, before/after behavior, baseline comparison. |
108
- | Storytelling & Presentation | 30% | Can you clearly explain the problem, the environment, and what the agent learned? Is the demo engaging for a non-technical audience? |
109
- | Reward & Training Pipeline | 10% | Is the reward logic coherent? Does the pipeline produce meaningful improvement? |
110
-
111
-
112
- ## Minimum Submission Requirements (Non-Negotiable)
113
-
114
- Submissions missing any of these are at a serious disadvantage:
115
-
116
- 1. Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel.
117
-
118
- 2. A working training script using Unsloth or Hugging Face TRL, ideally as a Colab notebook so judges can re-run it.
119
-
120
- 3. Evidence that you actually trained: at minimum, loss and reward plots from a real run.
121
-
122
- 4. A short writeup: a mini-blog on Hugging Face, a YouTube video under 2 minutes explaining what your environment does and what you trained, or a short slide deck. All materials must be linked from your README.
123
-
124
- 5. Push your environment to a Hugging Face Space so it's discoverable and runnable.
125
-
126
- 6. A README that motivates the problem, explains how the environment works, and shows results.
127
-
128
- 7. README must have a link to the environment in the Hugging Face Space and all additional references to other materials (videos, blog posts, slides, presentations, etc.).
129
-
130
- 8. Do not include large video files in your HF Hub submission. Use URL references instead.
131
-
132
-
133
- ## What Makes a Submission Stand Out
134
-
135
- ### 1. Pick an Ambitious, Original Problem
136
-
137
- Ask yourself:
138
- - Does this environment teach an LLM something it currently can't do well?
139
- - Is the domain underexplored in RL/LLM training?
140
- - Could a researcher write a paper about training on this?
141
-
142
- Avoid clones of chess, snake, tic-tac-toe, and grid-world.
143
-
144
- ### 2. Design a Reward Signal That Actually Teaches
145
-
146
- A great environment has a reward function that:
147
- - Provides a rich, informative signal (not just 0/1 at the end)
148
- - Captures something hard to measure in a clever way
149
- - Uses OpenEnv's Rubric system thoughtfully (composable rubrics are better than monolithic scoring)
150
- - Is hard to game (an agent that exploits the reward without solving the task should not get high scores)
151
-
152
- ### 3. Show Real Training, End to End
153
-
154
- The bar is not "training script exists." The bar is "training script runs against the environment, the agent learns, and you can show it."
155
-
156
- - Your training loop must connect to your environment (not a static dataset)
157
- - Train long enough that the curves mean something
158
- - Compare a trained agent vs. a random/untrained baseline (quantitative and/or qualitative)
159
- - Include plots and numbers in your README and writeup
160
-
161
- ### 4. Make Your Plots Readable
162
-
163
- Reviewers spend seconds, not minutes, on each plot.
164
-
165
- - Label both axes ("training step" or "episode" on x, "reward" or "loss" on y) and include units
166
- - Save plots as .png or .jpg and commit them to the repo (don't leave them only in a Colab cell or a deleted Wandb run)
167
- - If you used Wandb, include the link to that specific run
168
- - Embed key plots in your README with a one-line caption explaining what each one shows
169
- - If you have multiple runs (baseline vs. trained, ablations), put them on the same axes so comparison is obvious
170
-
171
- ### 5. Tell a Story, Not an API Doc
172
-
173
- Your README, blog, and pitch should answer:
174
-
175
- 1. **Problem:** What capability gap or interesting domain are you targeting?
176
- 2. **Environment:** What does the agent see, do, and get rewarded for?
177
- 3. **Results:** What changed after training? Show it.
178
- 4. **Why does it matter:** Who would care, and why?
179
-
180
- A reviewer should be able to read your README in 3-5 minutes and want to try your environment.
181
-
182
- ### 6. Engineer It Cleanly (Table Stakes)
183
-
184
- Engineering quality matters less than ambition, but sloppy work hurts.
185
-
186
- - Use OpenEnv's Environment or MCPEnvironment base classes properly
187
- - Respect client/server separation (clients should never import server internals)
188
- - Follow the standard Gym-style API (reset, step, state)
189
- - Have a valid openenv.yaml manifest
190
- - Don't use reserved tool names (reset, step, state, close) for MCP tools
191
-
192
-
193
- ## OpenEnv Technical Recap
194
-
195
- ### The RL Loop (Conceptual Example: Teaching a Dog to Sit)
196
-
197
- ```
- observation = environment.reset()        # Start a new episode
- done = False
- while not done:
-     action = agent.choose(observation)   # What does the agent do?
-     result = environment.step(action)    # Environment responds
-     observation = result.observation     # What does the agent see next?
-     reward = result.reward               # Get feedback
-     agent.learn(reward)                  # Agent learns
-     done = result.done                   # Stop when the episode is over
205
- ```
206
-
207
- ### The Four Key Concepts
208
-
209
- - **reset()** - Start a new episode. Begin a fresh training session.
210
- - **observation** - What the agent sees. The current state of the world.
211
- - **action** - What the agent does. Sit, spin, move left, etc.
212
- - **step(action)** - Execute the action. Returns three things: new observation, reward, and done flag (episode over).
213
-
214
- ### Building Your Environment in 5 Simple Steps
215
-
216
- 1. **Define Types (models.py)** - Action, Observation, State dataclasses
217
- 2. **Implement Environment (server/environment.py)** - reset(), step(), state() methods
218
- 3. **Create Client (client.py)** - HTTPEnvClient subclass
219
- 4. **Create Server (server/app.py)** - app = create_fastapi_app(env)
220
- 5. **Dockerize (Dockerfile)** - Standard container setup
221
-
222
- **Or use the CLI:** `openenv init my_env` - scaffolding ready in seconds.
223
-
224
- ### The Universal Interface
225
-
226
- Every OpenEnv environment implements these 3 methods:
227
-
228
- ```python
229
- class Environment:
230
-     def reset(self) -> Observation:
231
-         """Start a new episode"""
232
-
233
-     def step(self, action: Action) -> Observation:
234
-         """Execute action, return observation"""
235
-
236
-     def state(self) -> State:
237
-         """Get episode metadata"""
238
- ```
239
-
240
- ### Type-Safe by Design
241
-
242
- Define your data structures with Python dataclasses:
243
-
244
- - **Action:** What the agent does (move, jump, click, type, etc.)
245
- - **Observation:** What the agent sees (board state, pixels, text, etc.)
246
- - **State:** Episode metadata (ID, step count, timestamp, etc.)
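To make the three types concrete, here is a small sketch of a number-guessing environment built from plain dataclasses. All names (`GuessAction`, `GuessObservation`, `GuessState`, `GuessEnvironment`) are hypothetical, and the classes deliberately do not subclass any OpenEnv base class; a real environment would follow whatever the scaffold generated by `openenv init` provides.

```python
import uuid
from dataclasses import dataclass


@dataclass
class GuessAction:
    guess: int              # what the agent does


@dataclass
class GuessObservation:
    hint: str               # what the agent sees
    reward: float = 0.0
    done: bool = False


@dataclass
class GuessState:
    episode_id: str         # episode metadata
    step_count: int = 0


class GuessEnvironment:
    """Toy environment exposing reset(), step(), and state()."""

    def __init__(self, secret: int = 7):
        self._secret = secret
        self._state = GuessState(episode_id="")

    def reset(self) -> GuessObservation:
        self._state = GuessState(episode_id=uuid.uuid4().hex)
        return GuessObservation(hint="start")

    def step(self, action: GuessAction) -> GuessObservation:
        self._state.step_count += 1
        if action.guess == self._secret:
            return GuessObservation(hint="correct", reward=1.0, done=True)
        hint = "higher" if action.guess < self._secret else "lower"
        return GuessObservation(hint=hint)

    def state(self) -> GuessState:
        return self._state
```

Keeping the reward and done flag on the observation mirrors the `step(action) -> Observation` signature shown earlier, where a single return value carries everything the client needs.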
247
-
248
- ### Connecting to Any Environment
249
-
250
- This pattern works for Chess, Atari, Trading, Android - everything:
251
-
252
- ```python
253
- # Connect to environment (runs in Docker container)
254
- env = SomeEnv.from_docker_image("some-env:latest")
255
-
256
- # Start new episode
257
- result = env.reset()
258
-
259
- # Take action
260
- action = SomeAction(...)
261
- result = env.step(action)
262
-
263
- # Get episode metadata
264
- state = env.state()
265
-
266
- # Clean up
267
- env.close() # Container stops automatically
268
- ```
269
-
270
-
271
- ## Model Context Protocol (MCP) - Adding Tools to Your Environment
272
-
273
- **The Challenge:** Modern AI agents need access to external systems like web search APIs, file operations, database queries, Git operations, and custom integrations.
274
-
275
- **The Solution:** MCP (Model Context Protocol) - a standard protocol for AI agents to discover and call tools. It features a REST-like API (JSON-RPC), works with any AI framework, and has plug-and-play tool servers.
276
-
277
-
278
- ## Deployment Commands
279
-
280
- ```bash
281
- # Initialize a new environment
282
- openenv init my_env
283
- cd my_env
284
-
285
- # Deploy to your namespace
286
- openenv push
287
-
288
- # Deploy to specific repo
289
- openenv push --repo-id username/my-env
290
-
291
- # Deploy as private
292
- openenv push --repo-id username/my-env --private
293
- ```
294
-
295
-
296
- ## Hugging Face Spaces - Three Components
297
-
298
- Every HF Space provides three components:
299
-
300
- ### 1. Server: A Running Environment Endpoint
301
-
302
- Connect directly to the running Space (WebSocket under the hood).
303
-
304
- **Async (recommended):**
305
- ```python
306
- async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
307
-     result = await client.reset()
308
-     result = await client.step(EchoAction(message="Hello"))
309
- ```
310
-
311
- **Sync (using .sync() wrapper):**
312
- ```python
313
- with EchoEnv(base_url="https://openenv-echo-env.hf.space").sync() as client:
314
-     result = client.reset()
315
-     result = client.step(EchoAction(message="Hello"))
316
- ```
317
-
318
- **Available Endpoints:**
319
- - /ws - WebSocket persistent session (used by client)
320
- - /health - HTTP GET health check
321
- - /reset - HTTP POST reset environment (stateless)
322
- - /step - HTTP POST execute action (stateless)
323
- - /state - HTTP GET current state
324
- - /docs - HTTP GET OpenAPI documentation
325
- - /web - HTTP GET interactive web UI
326
-
327
- **Check if space is running:**
328
- ```bash
329
- curl https://openenv-echo-env.hf.space/health
330
- # Returns: {"status":"healthy"}
331
- ```
332
-
333
- ### 2. Repository: Installable Python Package
334
-
335
- Every Space is a Git repository. OpenEnv environments include a pyproject.toml, making them pip-installable directly from the Space URL.
336
-
337
- ```bash
338
- # Install client package from Space
339
- pip install git+https://huggingface.co/spaces/openenv/echo-env
340
- ```
341
-
342
- This installs: Client class (EchoEnv), Models (EchoAction, EchoObservation), and Utilities.
343
-
344
- After installation:
345
- ```python
346
- from envs.echo_env import EchoEnv, EchoAction, EchoObservation
347
- action = EchoAction(message="Hello")
348
- ```
349
-
350
- ### 3. Registry: Docker Container Image
351
-
352
- ```bash
353
- # Pull the image
354
- docker pull registry.hf.space/openenv-echo-env:latest
355
-
356
- # Run locally on port 8001
357
- docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
358
- ```
359
-
360
-
361
- ## Client Usage Examples
362
-
363
- ```python
364
- import asyncio
365
- from echo_env import EchoEnv, EchoAction
366
-
367
- async def main():
368
-     # Development: connect to remote Space
369
-     async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
370
-         result = await client.reset()
371
-
372
-     # Production: run locally for speed
373
-     # docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
374
-     async with EchoEnv(base_url="http://localhost:8001") as client:
375
-         result = await client.reset()
376
-
377
-     # Or let the client manage Docker for you
378
-     client = await EchoEnv.from_env("openenv/echo-env") # Auto-pulls and runs
379
-     async with client:
380
-         result = await client.reset()
381
-
382
- asyncio.run(main())
383
-
384
- # For sync usage, use the .sync() wrapper:
385
- with EchoEnv(base_url="http://localhost:8001").sync() as client:
386
-     result = client.reset()
387
- ```
388
-
389
-
390
- ## Clone and Run Environment Locally
391
-
392
- ```bash
393
- # Clone from HF Space
394
- git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
395
- cd openenv-benchmark
396
-
397
- # Install in editable mode
398
- uv sync
399
-
400
- # Start server
401
- uv run server
402
-
403
- # Run isolated from remote Space
404
- uv run --isolated --project https://huggingface.co/spaces/burtenshaw/openenv-benchmark server
405
- ```
406
-
407
-
408
- ## Local Development with Uvicorn
409
-
410
- ```bash
411
- # Full control over uvicorn options
412
- uvicorn benchmark.server.app:app --host "$HOST" --port "$PORT" --workers "$WORKERS"
413
-
414
- # With reload for development
415
- uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --reload
416
-
417
- # Multi-worker mode for better concurrency
418
- uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --workers 4
419
- ```
420
-
421
-
422
- ## Run Container Locally from Space
423
-
424
- ```bash
425
- # Clone from HF Space
426
- git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
427
- cd openenv-benchmark
428
-
429
- # Using OpenEnv CLI (recommended)
430
- openenv build -t openenv-benchmark:latest
431
-
432
- # Or with Docker directly
433
- docker build -t openenv-benchmark:latest -f server/Dockerfile .
434
- ```
435
-
436
-
437
- ## Environment Setup
438
-
439
- ### Using uv venv:
440
- ```bash
441
- uv venv
442
- source .venv/bin/activate
443
- uv pip install openenv-core
444
- ```
445
-
446
- ### Using conda:
447
- ```bash
448
- conda create -n openenv_hackathon python=3.12
449
- conda activate openenv_hackathon
450
- uv pip install openenv-core
451
- ```
452
-
453
- ### Initialize a New Environment:
454
- ```bash
455
- openenv init HackEnv101_AlirezaShamsoshoara
456
- ```
457
-
458
- This creates 11 files and generates uv.lock. Next steps:
459
- ```bash
460
- cd /path/to/HackEnv101_AlirezaShamsoshoara
461
- # Edit environment implementation in server/..._environment.py
462
- # Edit models in models.py
463
- # Install dependencies: uv sync
464
- ```
465
-
466
-
467
- ## Training Resources
468
-
469
- ### Training with TRL (GRPO)
470
-
471
- Hugging Face TRL integrates natively with OpenEnv environments for GRPO training.
472
-
473
- **Resources:**
474
- - TRL OpenEnv Documentation: https://huggingface.co/docs/trl/en/openenv
475
- - Sudoku Example: https://github.com/huggingface/trl/blob/main/examples/notebooks/openenv_sudoku_qrpo.ipynb
476
- - Wordle Example: https://github.com/huggingface/trl/blob/main/examples/notebooks/openenv_worldle_qrpo.ipynb
477
- - More TRL Examples: https://github.com/huggingface/trl/tree/main/examples/scripts/openenv
478
-
479
- **General Training Examples:**
480
- - Main examples directory: https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial/examples
481
- - Unsloth 2048 example: https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/examples/unsloht_2048.ipynb
482
- - Wordle example (TRL): https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/examples/worldle.py
483
-
484
- ### Training with Unsloth
485
-
486
- Unsloth provides 2x faster training and 70% less memory through custom CUDA kernels. Works as a drop-in replacement - same TRL API, just faster.
487
-
488
- **The Pattern:**
489
- 1. Load model via FastLanguageModel (with 4-bit quantization)
490
- 2. Apply LoRA adapters for parameter-efficient training
491
- 3. Use OpenEnv as the reward function
492
- 4. Train with standard GRPOTrainer
493
-
494
- **Google Colab Ready:** Run on a free T4 GPU. Unsloth + OpenEnv Colab notebook available for the 2048 game environment with 20B parameter models.
495
-
496
- **Also Compatible With:** TRL, torchforge, SkyRL, ART, Oumi, veRL
497
-
498
-
499
- ## Accessing Hugging Face Infrastructure
500
-
501
- Use HF infrastructure to run your training. Hugging Face Jobs provide compute for AI and data workflows.
502
-
503
- **Important Notes:**
504
- - Choose your GPU according to your model size
505
- - Pick hardware that lets you run training/inference for a reasonable time within your credits
506
- - A T4 GPU (small/medium) is a good choice
507
-
508
- **Methods to Run Jobs:**
509
- - hf CLI
510
- - huggingface_hub Python client
511
- - Jobs HTTP API
512
-
513
- **Pricing and Billing Resources:**
514
- - Billing settings: https://huggingface.co/settings/billing
515
- - Jobs settings: https://huggingface.co/settings/jobs
516
- - Jobs documentation: https://huggingface.co/docs/hub/jobs
517
- - Job CLI documentation: https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs
518
- - Jobs guide: https://huggingface.co/docs/huggingface_hub/guides/jobs
519
- - Jobs pricing: https://huggingface.co/docs/hub/jobs-pricing
520
- - Jobs examples: https://huggingface.co/docs/hub/jobs-examples
521
-
522
- **Check available hardware:**
523
- ```bash
524
- hf jobs hardware
525
- ```
526
-
527
- ### Example Hardware Options
528
-
529
- | Name | Pretty Name | CPU | RAM | Accelerator | Cost/Hour |
530
- |------|-------------|-----|-----|-------------|-----------|
531
- | cpu-basic | CPU Basic | 2 vCPU | 16 GB | N/A | $0.01 |
532
- | cpu-upgrade | CPU Upgrade | 8 vCPU | 32 GB | N/A | $0.03 |
533
- | t4-small | Nvidia T4 - small | 4 vCPU | 15 GB | 1x T4 (16 GB) | $0.40 |
534
- | t4-medium | Nvidia T4 - medium | 8 vCPU | 30 GB | 1x T4 (16 GB) | $0.60 |
535
- | a10g-small | Nvidia A10G - small | 4 vCPU | 15 GB | 1x A10G (24 GB) | $1.00 |
536
- | a10g-large | Nvidia A10G - large | 12 vCPU | 46 GB | 1x A10G (24 GB) | $1.50 |
537
- | a100-large | Nvidia A100 - large | 12 vCPU | 142 GB | 1x A100 (80 GB) | $2.50 |
538
- | a100x4 | 4x Nvidia A100 | 48 vCPU | 568 GB | 4x A100 (320 GB) | $10.00 |
539
- | a100x8 | 8x Nvidia A100 | 96 vCPU | 1136 GB | 8x A100 (640 GB) | $20.00 |
540
- | h200 | Nvidia H200 | 23 vCPU | 256 GB | 1x H200 (141 GB) | $5.00 |
541
- | h200x2 | Nvidia H200 (2x) | 46 vCPU | 512 GB | 2x H200 (282 GB) | $10.00 |
542
- | h200x4 | Nvidia H200 (4x) | 92 vCPU | 1024 GB | 4x H200 (564 GB) | $20.00 |
543
- | h200x8 | Nvidia H200 (8x) | 184 vCPU | 2048 GB | 8x H200 (1128 GB) | $40.00 |
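A quick way to sanity-check a hardware choice is to divide the available credit by the hourly price. The sketch below uses the $30 per-person Hugging Face credit mentioned earlier and a few prices copied from the table above; the function name and the integer-cents representation are choices made for this example, and actual pricing should be confirmed on the Jobs pricing page.

```python
# Hourly prices in cents, copied from the example table above.
COST_CENTS_PER_HOUR = {
    "t4-small": 40,
    "t4-medium": 60,
    "a10g-small": 100,
    "a100-large": 250,
}


def affordable_hours(credit_usd: int, flavor: str) -> float:
    """Return how many hours of `flavor` a credit balance covers."""
    return (credit_usd * 100) / COST_CENTS_PER_HOUR[flavor]


for flavor in COST_CENTS_PER_HOUR:
    print(f"{flavor}: {affordable_hours(30, flavor):.1f} h")
# t4-small: 75.0 h
# t4-medium: 50.0 h
# a10g-small: 30.0 h
# a100-large: 12.0 h
```

The gap between 75 hours on a T4 and 12 hours on an A100 is why the notes above suggest starting with a small or medium T4.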
544
-
545
-
546
- ## Still Have Questions?
547
-
548
- Please mention them in the Discord India OpenEnv Hackathon channels, and the team will do their best to answer.
549
-
550
- ---
551
-
552
- ## Example Model Reference
553
-
554
- - model_name_or_path: Qwen/Qwen2-0.5B (and similar models)
555
- ```
556
-
557
- This is plain markdown text. Just copy everything between the triple backticks and paste it into any markdown editor or document.
plan.md DELETED
@@ -1,63 +0,0 @@
1
- # Goal Description
2
-
3
- To elevate QuantHive into a definitive Top 15 Hackathon submission, we need to transition from a "single-agent Gym environment with pipeline functions" to a **"decentralized society of interacting agents"** using true multi-agent RL principles.
4
-
5
- This requires rewriting the core environment to use the **PettingZoo** AEC (Agent Environment Cycle) API, giving each agent independent observation/action spaces, conflicting reward functions (emergent behavior), and communication channels.
6
-
7
- ## User Review Required
8
-
9
- > [!WARNING]
10
- > **Massive Architectural Rewrite**
11
- > This change will rip out the foundation of your current, working project.
12
- >
13
- > 1. We will replace [trading_env.py](file:///e:/Development/Round2/env/trading_env.py) (Gym) with `multi_agent_env.py` (PettingZoo).
14
- > 2. The API server ([server.py](file:///e:/Development/Round2/api/server.py)) and UI will break and need to be rewritten to support asynchronous agent steps.
15
- > 3. The current GRPO training script ([train_grpo.py](file:///e:/Development/Round2/training/train_grpo.py)) trains a single policy on JSON. In a true multi-agent setup, we need an *online* RL loop. We will build a multi-agent rollout collector connecting to Unsloth/TRL, but it is experimental and computationally heavy.
16
- >
17
- > If you are close to the submission deadline, doing this is extremely risky. If you proceed, the repository will be in a broken state until all components are rewired.
18
-
19
- ## Proposed Changes
20
-
21
- ### Core Environment (PettingZoo)
22
- Replace the single-agent Gym environment with a multi-agent PettingZoo environment.
23
- #### [NEW] `env/multi_agent_env.py`
24
- - Inherits from `pettingzoo.utils.env.AECEnv`.
25
- - Agents: `["risk_manager_0", "portfolio_manager_0", "trader_0"]`.
26
- - [step()](file:///e:/Development/Round2/api/server.py#108-227) and `observe()` functions that alternate execution between agents.
27
- - **Agent Negotiation:** The observation space of the Trader includes the output messages/constraints from the Risk Manager and PM.
28
- - **Adversarial Rewards:**
29
-   - Trader: Rewarded for PnL.
30
-   - Risk Manager: Rewarded for capping size when volatility/drawdown is high, penalized when Trader loses money.
31
- #### [MODIFY] [env/trading_env.py](file:///e:/Development/Round2/env/trading_env.py)
32
- - Deprecate or refactor to wrap the PettingZoo environment for legacy compatibility.
33
-
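The adversarial reward split described above could be sketched roughly as follows. This is an illustration under assumptions, not the project's reward code: `StepOutcome`, `stress_threshold`, `loss_penalty`, and the specific formulas are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Hypothetical per-step summary shared by both reward functions."""
    pnl: float         # Trader's profit or loss for this step
    volatility: float  # market volatility estimate in [0, 1]
    drawdown: float    # current portfolio drawdown in [0, 1]
    size_limit: float  # position cap set by the Risk Manager in [0, 1]


def trader_reward(outcome: StepOutcome) -> float:
    """Trader is rewarded purely for PnL."""
    return outcome.pnl


def risk_manager_reward(
    outcome: StepOutcome,
    stress_threshold: float = 0.5,  # assumed cut-off for "high" risk
    loss_penalty: float = 1.0,      # assumed weight on Trader losses
) -> float:
    """Reward tight caps under stress and penalize Trader losses."""
    stressed = max(outcome.volatility, outcome.drawdown) >= stress_threshold
    cap_bonus = (1.0 - outcome.size_limit) if stressed else 0.0
    loss_term = loss_penalty * min(outcome.pnl, 0.0)  # zero when PnL >= 0
    return cap_bonus + loss_term
```

With these definitions the two agents can disagree about the same step: a tight cap in a volatile market earns the Risk Manager a bonus even when the Trader's PnL is negative.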
34
- ### Agents & Governance
35
- Modify the agent definitions to act as independent RL policies within the PettingZoo loop.
36
- #### [MODIFY] [agents/risk_model.py](file:///e:/Development/Round2/agents/risk_model.py)
37
- #### [MODIFY] [agents/portfolio_manager.py](file:///e:/Development/Round2/agents/portfolio_manager.py)
38
- #### [MODIFY] [agents/trader.py](file:///e:/Development/Round2/agents/trader.py)
39
- - Refactor agents to accept PettingZoo observations (which include multi-agent messages) and output PettingZoo actions.
40
-
41
- ### Training Loop (Online Multi-Agent RL)
42
- Connect the LLMs to the PettingZoo environment for online rollout collection.
43
- #### [NEW] `training/train_multi_agent.py`
44
- - An online RL loop that steps the `multi_agent_env`.
45
- - Collects trajectories (Observation, Action, Reward) for multiple agents.
46
- - Feeds collected rollout buffers into the GRPO/PPO trainer. Note: Full multi-agent online LLM training is extremely heavy; we may implement it as alternating optimization (freeze RM, train Trader, freeze Trader, train RM).
47
-
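The alternating optimization mentioned above (freeze one agent, train the other, then swap) can be sketched as a simple schedule. `alternating_schedule` is a hypothetical helper; it only decides which policy is trainable in each round and leaves the actual GRPO/PPO update to the caller.

```python
from typing import Iterator, Sequence, Tuple


def alternating_schedule(
    agents: Sequence[str], rounds: int
) -> Iterator[Tuple[str, Tuple[str, ...]]]:
    """Yield (trainable_agent, frozen_agents) for each training round."""
    if not agents:
        raise ValueError("at least one agent is required")
    for round_idx in range(rounds):
        # Rotate through the agents so exactly one is trainable per round.
        trainable = agents[round_idx % len(agents)]
        frozen = tuple(a for a in agents if a != trainable)
        yield trainable, frozen


if __name__ == "__main__":
    for trainable, frozen in alternating_schedule(["trader_0", "risk_manager_0"], 4):
        print(f"train {trainable}, freeze {', '.join(frozen)}")
```

In a real loop, each round would collect rollouts with all agents acting but apply gradient updates only to the trainable one.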
48
- ### API Server and UI
49
- Update the server to orchestrate a PettingZoo AEC loop.
50
- #### [MODIFY] [api/server.py](file:///e:/Development/Round2/api/server.py)
51
- - Rewrite [SimulationRunner](file:///e:/Development/Round2/api/server.py#72-227) to step through the PettingZoo `agent_iter()`.
52
- - Broadcast state updates to the UI, showing the negotiation and adversarial interactions.
53
-
54
- ## Verification Plan
55
-
56
- ### Automated Tests
57
- 1. Initialize `MultiAgentEnv` and run the `pettingzoo.test.api_test()`.
58
- 2. Verify that taking actions with `risk_manager_0` updates the observation space of `trader_0`.
59
- 3. Verify that the adversarial reward functions independently return conflicting scores (e.g., RM gets +1 for restricting, Trader gets -1 for missing a trade).
60
-
61
- ### Manual Verification
62
- 1. Run the new API server and step through the UI to see the multi-agent negotiation in real-time.
63
- 2. Run `train_multi_agent.py` for 50 steps to ensure trajectories build correctly and gradients update.
requirements.md DELETED
@@ -1,150 +0,0 @@
1
-
2
-
3
- **What the automated round checks**
4
- These are the items the validation pass looks for. If any is missing or broken at the deadline, the submission won't make it to a human judge, regardless of how strong the underlying idea is. Verify each one explicitly before you submit.
5
-
6
- - Public, cloneable Hugging Face Space at the submitted URL. Test from a logged-out browser. Private spaces, dead links, or 404s are an automatic out.
7
- - Valid OpenEnv structure: proper Environment / MCPEnvironment base class, Gym-style reset / step / state, and a parseable openenv.yaml.
8
- - Training evidence committed to the repo as image files (.png / .jpg): At minimum a loss curve and a reward curve. Wandb-only links and plots that live only in a Colab cell don't count: they may not be reachable when validation runs.
9
- - A runnable training script (Unsloth, HF TRL, or other frameworks), preferably linked as a Colab notebook so it can be re-executed end to end (Python script is acceptable as well).
10
- - A README that links every deliverable: HF Space, training notebook, and your writeup (blog / video / slides), with the key plots embedded inline. If validation can't reach a deliverable from the README, it counts as missing.
11
-
12
-
13
- **TL;DR**
14
-
15
- Build an environment that an LLM could actually be trained on to get measurably better at
16
-
17
- something interesting. Then show that training. Then tell the story.
18
-
19
- A messy but ambitious environment with real training evidence beats a polished but boring one.
20
-
21
- Pick a problem that excites you (that energy comes through in the pitch).
22
-
23
- **Judging Criteria**
24
-
25
- **Criterion: Environment Innovation** Weight: 40%. What it means: Is the environment novel, creative, or genuinely challenging? Does it meaningfully test agent behavior in a way that hasn't been done before?
26
-
27
- **Criterion: Storytelling & Presentation** Weight: 30%. What it means: Can you clearly explain the problem, the environment, and what the agent learned? Is the demo engaging and easy to follow for a non-technical audience?
28
-
29
- **Criterion: Showing Improvement in Rewards** Weight: 20%. What it means: Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline -- anything that proves the agent learned something.
30
-
31
- **Criterion: Reward & Training Pipeline** Weight: 10%. What it means: Is the reward logic coherent? Does the pipeline produce meaningful improvement in the trained agent's behavior?
32
-
33
- **Minimum Submission Requirements**
34
-
35
- **NOTE:** These are **non-negotiable**. Submissions missing any of these are at a serious disadvantage.
36
-
37
- * **Use OpenEnv** (latest release). Build on top of the framework; don’t reinvent the wheel.
38
-
39
- * **A working training script** using **Unsloth or Hugging Face TRL**, ideally as a Colab notebook so judges can re-run it.
40
-
41
- * **Evidence that you actually trained**; at minimum, loss and reward plots from a real run.
42
-
43
- * **A short writeup**: a mini-blog on Hugging Face or a < 2 minute video on YouTube explaining what your environment does and what you trained, or a short slide deck of presentation. Please make sure that all materials are linked from your README file so that judges can access them easily.
44
-
45
- * **Push your environment to a Hugging Face Space** so it’s discoverable and runnable.
46
-
47
- * **A README** that motivates the problem, explains how the env works, and shows results.
48
-
49
- * README should have a link to the environment in the Hugging Face Space. It should also have all additional references to other materials (e.g. videos, blog posts, slides, presentations, etc.) that you want to include.
50
-
51
- * Please do not include big video files in your Env submission on HF Hub as we would like to have a small size for each env (Please use url as reference link to additional materials).
52
-
53
-
54
- **What Makes a Submission Stand Out**
55
-
56
- _**Pick an ambitious, original problem**_
57
-
58
- The themes (problems) are deliberately open. Use them as launching pads, not boxes. Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones. To score well on innovation,
59
-
60
- you need a genuinely fresh angle. Some questions to ask yourself:
61
-
62
- * Does this environment exist to teach an LLM something it currently can’t do well?
63
-
64
- * Is the domain underexplored in RL/LLM training?
65
-
66
- * Could a researcher write a paper about training on this?
67
-
68
-
69
- _**Design a reward signal that actually teaches**_
70
-
71
- A great environment has a reward function that:
72
-
73
- * Provides a **rich, informative signal** (not just 0/1 at the end)
74
-
75
- * Captures something **hard to measure** in a clever way
76
-
77
- * Uses OpenEnv’s **Rubric system** thoughtfully (composable rubrics > monolithic scoring)
78
-
79
- * Is **hard to game**; an agent that exploits the reward without solving the task should not get high scores
80
-
81
-
82
- _**Show real training, end to end**_
83
-
84
- The bar isn’t “training script exists.” The bar is “training script runs against the environment, the
85
-
86
- agent learns, and you can show it.” Concretely:
87
-
88
- * Your training loop should connect to **your** environment (not a static dataset)
89
-
90
- * Train long enough that the curves mean something
91
-
92
- * Compare a **trained agent vs. a random/untrained baseline**; quantitative and/or qualitative
93
-
94
- * Include the plots and numbers in your README and writeup
95
-
96
-
97
- _**Make your plots readable**_
98
-
99
- Reviewers spend seconds, not minutes, on each plot. Help them out:
100
-
101
- * **Label both axes** (e.g. “training step” / “episode” on x, “reward” / “loss” on y) and include units where they apply
102
-
103
- * Save plots as _.png_ or _.jpg_ and **commit them to the repo** (don’t leave them only in a Colab cell or a deleted Wandb run) (if you ran via Wandb, please include the link to that specific run of your plots)
104
-
105
- * **Embed the key plots in your README** with a one-line caption explaining what each one shows. If you have multiple runs (baseline vs. trained, ablations, etc.), put them on the same axes so the comparison is obvious.
106
-
107
-
108
- _**Tell a story, not an API doc**_
109
-
110
- Your README, blog, and pitch should answer:
111
-
112
- 1. **Problem)** what capability gap or interesting domain are you targeting?
113
-
114
- 2. **Environment)** what does the agent see, do, and get rewarded for?
115
-
116
- 3. **Results)** what changed after training? Show it.
117
-
118
- 4. **Why does it matter)** who would care, and why?
119
-
120
-
121
- _A reviewer should be able to read your README in 3~5 minutes and want to try your_
122
-
123
- _environment._
124
-
125
- **NOTE:** If you have a video, HF post, or anything else interesting, please make sure that it’s linked
126
-
127
-   from your README as a link.
128
-
129
- _**Engineer it cleanly (table stakes)**_
130
-
131
- Engineering quality matters less than ambition, but sloppy work hurts. Make sure you:
132
-
133
- * Use OpenEnv’s Environment / MCPEnvironment base classes properly
134
-
135
- * Respect the **client / server separation** (clients should never import server internals)
136
-
137
- * Follow the standard Gym-style API (reset, step, state)
138
-
139
- * Have a valid openenv.yaml manifest
140
-
141
- * Don’t use reserved tool names (reset, step, state, close) for MCP tools
142
-
143
-
144
- **Final Note**
145
-
146
- Judges are looking for environments that push the frontier of what we can train LLMs to do. Be
147
-
148
- ambitious. Pick a problem you find genuinely interesting; that almost always produces better
149
-
150
- work than chasing what you think judges want. Good luck.
themes.md DELETED
@@ -1,134 +0,0 @@
1
-
2
-
3
- Theme #1 - Multi-Agent Interactions
4
- Environments for this theme involve cooperation, competition, negotiation, and
5
- coalition formation. Learning from these environments will enable agents to model the
6
- beliefs and incentives of others in partially observable settings. This drives
7
- theory-of-mind reasoning and emergent strategic behavior.
8
- Expected Outcome: an environment that can be used to train multi-agent task
9
- handling in an LLM
10
- Example environments: Market simulations, compute-allocation negotiations,
11
- collaborative puzzle worlds, mixed cooperative/competitive strategy games.
12
- Sub-themes with bonus prizes.
13
- - Fleet AI. Scalable Oversight: Environments that train oversight agents to
14
- monitor, analyze, and explain the behavior of other AI agents operating in
15
- complex, multi-agent settings.
16
- - Halluminate. Multi-Actor Environments: Build a realistic environment where an
17
- agent interacts with and manages multiple actors (agents) to discover and
18
- achieve the task
19
- Theme #2 - (Super) Long-Horizon Planning & Instruction Following
20
-
21
- You will build environments that require deep, multi-step reasoning with sparse or
22
- delayed rewards. After using these environments, the goal is to enable agents to
23
- decompose goals, track state over extended trajectories, and recover from early
24
- mistakes. The aim is to push beyond shallow next-token reasoning toward structured
25
- planning and durable internal representations.
26
- Expected Outcome: an environment that can capture and improve LLM behaviour on
27
- challenging long horizon tasks that need long running sessions beyond context
28
- memory limits.
29
-
30
- Example environments: Research-planning simulators, large-scale codebase
31
- refactoring tasks, strategic resource management worlds, long-horizon logistics
32
- optimization, extremely complicated long-horizon instruction following (e.g., 300
33
- instructions scattered around).
34
- Sub-themes with bonus prizes.
35
- - Scale AI. Environments for long horizon workflows for non-code use cases
36
- within a business setting: focusing on either Sales, Project management, or HR & IT.
37
-
38
- - Mercor. Make an environment with capped/uncapped rewards where frontier
39
- model rewards scale with token output.
40
- ## Theme #3 - World Modeling
41
- ## #3.1 Professional Tasks
42
- Here you will develop environments that require real interaction with tools, APIs, or dynamic
43
- systems where the model is expected to do real hard work instead of exploiting short-cuts to
44
- arrive at the desired outcome. Learning from these environments will enable agents to
45
- maintain consistent internal state, update beliefs based on outcomes, and orchestrate
46
- multi-step workflows. The goal is to strengthen causal reasoning and persistent world models.
47
- Expected Outcome: an environment capturing nuances of a defined partially observable world
48
- and improve LLM interaction with it
49
- Example environments: Dynamic browser/API ecosystems, enterprise applications, scientific
50
- workflow loops (papers → code → experiments), economic simulations with feedback,
51
- tool-discovery benchmarks.
52
- Sub-themes with bonus prizes.
53
- - Scaler AI Labs. Multi-App RL Environment for Enterprise Workflows: Create RL
54
- environments to demonstrate complex workflows, business rule nuances etc in
55
- a large enterprise
56
-
57
-
58
-
59
- ## #3.2 Personalized Tasks
60
- Here we will develop an environment that offers real personalized task handling,
61
- imagine replying to personal messages or handling dinner conflicts due to work
62
- conflicts, replying to tough emails. Think any personal assistant tasks
63
-
64
- Expected Outcome: An environment that gives the model a realistic simulation of
65
- handling personal tasks, conflicts and managing them as delegations
66
-
67
- Example environments: Executive Assistant Meeting Planner, Dinner and drive
68
- planning, email and message replying, shopping, etc
69
-
70
- Sub-themes with bonus prizes.
71
- - Patronus AI. Consumer Workflows with Schema Drift: Multi-step consumer
72
- workflow environments where the underlying data schemas, API contracts, and
73
- t&cs/policies/rules change.
74
-
75
- Theme #4 - Self-Improvement
76
- The focus here is to create environments where agents can learn to generate new
77
- challenges, escalate difficulty, and improve through self-play or adaptive curricula.
78
- Rather than optimizing fixed tasks, the goal is for agents to learn to drive their own
79
- capability growth. The objective is recursive skill amplification.
80
- Expected Outcome: an environment for improving self-play of an LLM over a defined
81
- set of tasks
82
- Example environments: Self-play negotiation arenas, auto-generated math/proof
83
- tasks, evolving coding competitions, adaptive RL curricula.
84
- Sub-themes with bonus prizes.
85
- - Snorkel AI. Simulated Experts-in-the-Loop: Environment that simulates
86
- interactions with real subject-matter experts, with changing requirements /
87
- preferences.
88
-
89
-
90
- ## Theme #5: Wild Card - Impress Us!
91
- We do not want to limit your focus if your idea doesn’t fit the boxes above, we want
92
- and WILL reward out of box tasks, please be creative but remember to add
93
- submissions that meaningfully add value to LLM training on a certain task.
94
-
95
- Guidelines for Problem Statement
96
- ● It is NOT mandatory to choose the same problem statement as Round 1. Only
97
- choose the same problem statement if it aligns with the above provided
98
- Hackathon themes.
99
- ● You can start working on your problem statement once you have finalized it.
100
- Post-training can be done onsite on 25th & 26th when you receive compute
101
- credits for HuggingFace.
102
- ● Before the onsite, we suggest you work on building the environment, agent
103
- behaviours, reward model and evaluate if your work aligns with the judging
104
- criteria given below.
105
-
106
-
107
- ## Judging Criteria
108
- Minimum requirements:
109
- ● Usage of OpenEnv (latest release)
110
- ● Show a minimal training script for your environment using Unsloth or HF TRL in Colab
111
-
112
- ● Write a mini-blog on HuggingFace or mini-video on YouTube talking about your
113
- submission, <2 minutes
114
-
115
- ## First Round Judging Overview
116
-
117
- ● Pitch Format: Each team has 3 minutes to pitch, followed by 2 minutes for
118
- Q&A (5 minutes total).
119
- ● Evaluation: Teams will be scored based on the following criteria:
120
- - Environment Innovation (40%): Is the environment novel, creative, or
121
- challenging? Does it meaningfully test the agent’s behavior?
122
- - Storytelling (30%): Does the team clearly explain the problem,
123
- environment, and agent behavior? Is the demo engaging and easy to
124
- follow?
125
- - Showing Improvement in Rewards (20%): Does the demo provide
126
- observable evidence of training progress (reward curves, metrics, or
127
- before/after behavior)?
128
- - Reward and Training Script/Pipeline Setup (10%): Is the reward logic
129
- coherent, and does the pipeline produce meaningful improvement in
130
- the agent’s inference (how it acts in the environment)?
131
- Each evaluator will judge about 10-15 teams during the judging process,
132
- submitting scores individually for each team. Once scores are submitted, the
133
- Cerebral Valley team will aggregate your scores with the other judge's scores to
134
- determine the top 15 finalist projects.
train_hf.py DELETED
@@ -1,438 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- QuantHive — HF Jobs GRPO Training Script
4
- =========================================
5
- Standalone script to fine-tune Qwen 2.5-1.5B on the multi-agent trading
6
- environment using GRPO. Designed to run on HuggingFace Jobs (A10G / A100).
7
-
8
- Usage (local):
9
- python train_hf.py
10
-
11
- Usage (HF Jobs):
12
- hf jobs run --hardware a10g-small -- python train_hf.py
13
-
14
- The script:
15
- 1. Generates scenarios from the PettingZoo multi-agent env
16
- 2. Trains with GRPO + 5 governance-aware verifiers
17
- 3. Saves LoRA adapters + merged model
18
- 4. Logs sample outputs so you can see the <thought> reasoning
19
- 5. Generates training plots and pushes everything to the HF Hub
20
- """
21
-
22
- from __future__ import annotations
23
-
24
- import inspect
25
- import json
26
- import os
27
- import random
28
- import shutil
29
- import sys
30
- from pathlib import Path
31
-
32
- import numpy as np
33
-
34
- # ── Unsloth JIT-compilation bypass (prevents AttributeError on cloud) ─────────
35
- os.environ["UNSLOTH_DISABLE_COMPILE"] = "1"
36
- os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
37
- os.environ["DISABLE_UNSLOTH_COMPILE"] = "1"
38
- os.environ["OPENBLAS_NUM_THREADS"] = "1"
39
- os.environ["OMP_NUM_THREADS"] = "1"
40
-
41
- # Delete compiled cache if it exists
42
- cache_dir = Path("unsloth_compiled_cache")
43
- if cache_dir.exists():
44
-     shutil.rmtree(cache_dir, ignore_errors=True)
45
-     print("🗑️ Deleted unsloth_compiled_cache/")
46
-
47
- # ── Ensure project root is importable ─────────────────────────────────────────
48
- ROOT = Path(__file__).resolve().parent
49
- if str(ROOT) not in sys.path:
50
-     sys.path.insert(0, str(ROOT))
51
-
52
-
53
- # ═══════════════════════════════════════════════════════════════════════════════
54
- # CONFIGURATION — Edit these for your run
55
- # ═══════════════════════════════════════════════════════════════════════════════
56
- MODEL_NAME = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"
57
- OUTPUT_DIR = "models/grpo_hf_trained"
58
- HF_REPO_ID = "ARKAISW/QuantHive-GRPO-Trader" # Where to push the model
59
-
60
- # Training hyperparameters
61
- NUM_SCENARIOS = 800 # More diverse scenarios
62
- MAX_STEPS = 500 # 2x longer than Kaggle run
63
- BATCH_SIZE = 4
64
- GRAD_ACCUM_STEPS = 2
65
- NUM_GENERATIONS = 8 # 8 candidates per prompt (better GRPO signal)
66
- LEARNING_RATE = 1e-5
67
- MAX_SEQ_LENGTH = 1024
68
- MAX_PROMPT_LENGTH = 768
69
- MAX_COMPLETION_LENGTH = 64
70
- SAVE_STEPS = 100
71
- LOGGING_STEPS = 1
72
- DIFFICULTY = "easy" # "easy", "medium", "hard"
73
- SEED = 3407
74
-
75
- # Sample output logging
76
- NUM_SAMPLE_OUTPUTS = 10 # How many sample outputs to log after training
77
-
78
-
79
- def main():
80
-     random.seed(SEED)
81
-     np.random.seed(SEED)
82
-
83
-     # ── Step 1: Install deps if missing ───────────────────────────────────────
84
-     print("=" * 60)
85
-     print(" QuantHive — Multi-Agent GRPO Training (HF Jobs)")
86
-     print("=" * 60)
87
-
88
-     import torch
89
-     if not torch.cuda.is_available():
90
-         raise SystemExit("❌ CUDA not available. Use GPU hardware.")
91
-     print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
92
-     print(f" VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
93
-
94
-     # ── Step 2: Generate scenarios ────────────────────────────────────────────
95
-     from training.prompt_utils import (
96
-         SYSTEM_PROMPT,
97
-         build_prompt_multiagent,
98
-         generate_pz_scenarios,
99
-     )
100
-
101
-     print(f"\n📊 Generating {NUM_SCENARIOS} scenarios (difficulty={DIFFICULTY})...")
102
-     scenarios = generate_pz_scenarios(
103
-         n=NUM_SCENARIOS, difficulty=DIFFICULTY, max_env_steps=100
104
-     )
105
-     print(f" Generated {len(scenarios)} scenarios.")
106
-
107
-     from datasets import Dataset
108
-     prompts = [{"prompt": build_prompt_multiagent(sc)} for sc in scenarios]
109
-     dataset = Dataset.from_list(prompts)
110
-
111
-     # ── Step 3: Load model natively via Transformers/PEFT ─────────────────────
112
-     print(f"\n🤖 Loading model natively: {MODEL_NAME}")
113
-     from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
114
-     from peft import get_peft_model, LoraConfig, TaskType
115
-
116
-     tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
117
-     if tokenizer.pad_token is None:
118
-         tokenizer.pad_token = tokenizer.eos_token
119
-
120
-     bnb_config = BitsAndBytesConfig(
121
-         load_in_4bit=True,
122
-         bnb_4bit_quant_type="nf4",
123
-         bnb_4bit_use_double_quant=True,
124
-         bnb_4bit_compute_dtype=torch.float16,
125
-     )
126
-
127
-     model = AutoModelForCausalLM.from_pretrained(
128
-         MODEL_NAME,
129
-         quantization_config=bnb_config,
130
-         device_map="auto",
131
-         dtype=torch.float16,
132
-         trust_remote_code=True,
133
-     )
134
-
135
-     peft_config = LoraConfig(
136
-         task_type=TaskType.CAUSAL_LM,
137
-         r=16,
138
-         lora_alpha=16,
139
-         target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
140
-         bias="none",
141
-     )
142
-     model = get_peft_model(model, peft_config)
143
-
144
-     # 🚀 Surgical DType Lock (Hard Force FP16 after PEFT wrap)
145
-     model = model.to(torch.float16)
146
-     if hasattr(model, "lm_head"):
147
-         model.lm_head.weight.data = model.lm_head.weight.data.to(torch.float16)
148
-         if getattr(model.lm_head, "bias", None) is not None:
149
-             model.lm_head.bias.data = model.lm_head.bias.data.to(torch.float16)
150
-         model.lm_head.to(torch.float16)
151
-
152
-     if hasattr(model, "model") and hasattr(model.model, "embed_tokens"):
153
-         model.model.embed_tokens.weight.data = model.model.embed_tokens.weight.data.to(torch.float16)
154
-         model.model.embed_tokens.to(torch.float16)
155
-
156
-     # 🐛 Fix GRPOTrainer crash by injecting warnings_issued dict
157
-     if not hasattr(model, "warnings_issued"):
158
-         model.warnings_issued = {}
159
-
160
-     print(" Native model loaded + LoRA applied.")
161
-
162
-     # ── Step 5: Build trainer ─────────────────────────────────────────────────
163
-     from trl.trainer.grpo_config import GRPOConfig
164
-
165
-     # 🐛 Fix llm_blender crashing on modern transformers by injecting missing cache var
166
-     import transformers.utils.hub
167
-     if not hasattr(transformers.utils.hub, "TRANSFORMERS_CACHE"):
168
-         try:
169
-             transformers.utils.hub.TRANSFORMERS_CACHE = transformers.utils.hub.constants.HF_HUB_CACHE
170
-         except AttributeError:
171
-             transformers.utils.hub.TRANSFORMERS_CACHE = "/tmp"
172
-
173
-     from trl.trainer.grpo_trainer import GRPOTrainer
174
-
175
-     from env.reward import (
176
-         alignment_reward_func,
177
-         format_reward_func,
178
-         profit_reward_func,
179
-     )
180
-     from training.grpo_verifiers_multiagent import (
181
-         governance_reward_func_multiagent,
182
-         risk_reward_func_multiagent,
183
-     )
184
-
185
-     training_args = GRPOConfig(
186
-         output_dir=OUTPUT_DIR,
187
-         learning_rate=LEARNING_RATE,
188
-         per_device_train_batch_size=BATCH_SIZE,
189
-         gradient_accumulation_steps=GRAD_ACCUM_STEPS,
190
-         num_train_epochs=1,
191
-         max_steps=MAX_STEPS,
192
-         save_steps=SAVE_STEPS,
193
-         logging_steps=LOGGING_STEPS,
194
-         bf16=False,
195
-         fp16=False,
196
-         max_grad_norm=0.5,
197
-         max_prompt_length=MAX_PROMPT_LENGTH,
198
-         max_completion_length=MAX_COMPLETION_LENGTH,
199
-         num_generations=NUM_GENERATIONS,
200
-         report_to="none",
201
-     )
202
-
203
-     reward_funcs = [
204
-         format_reward_func,
205
-         alignment_reward_func,
206
-         risk_reward_func_multiagent,
207
-         profit_reward_func,
208
-         governance_reward_func_multiagent,
209
-     ]
210
-
211
-     trainer_kwargs = {
212
-         "model": model,
213
-         "reward_funcs": reward_funcs,
214
-         "args": training_args,
215
-         "train_dataset": dataset,
216
-     }
217
-     sig = inspect.signature(GRPOTrainer.__init__)
218
-     if "processing_class" in sig.parameters:
219
-         trainer_kwargs["processing_class"] = tokenizer
220
-     elif "tokenizer" in sig.parameters:
221
-         trainer_kwargs["tokenizer"] = tokenizer
222
-
223
-     # ── Step 5.5: Verify DTypes before Trainer ───────────────────────────────
224
-     print(f"📊 DType Check: lm_head={model.lm_head.weight.dtype}, embed={model.model.embed_tokens.weight.dtype}")
225
-
226
-     trainer = GRPOTrainer(**trainer_kwargs)
227
-
228
-     # ── Step 6: Train! ────────────────────────────────────────────────────────
229
-     print(f"\n🚀 Starting GRPO training — {MAX_STEPS} steps, {NUM_GENERATIONS} generations/prompt")
230
-     print(f" Effective batch size: {BATCH_SIZE} × {GRAD_ACCUM_STEPS} × 1 GPU = {BATCH_SIZE * GRAD_ACCUM_STEPS}")
231
-     print()
232
-     trainer.train()
233
-     print("\n✅ Training complete!")
234
-
235
-     # ── Step 7: Extract metrics ───────────────────────────────────────────────
236
-     history = trainer.state.log_history
237
-     rewards = [x["reward"] for x in history if "reward" in x]
238
-     losses = [x.get("loss", 0.0) for x in history if "reward" in x]
239
-
240
-     os.makedirs(OUTPUT_DIR, exist_ok=True)
241
-     metrics_path = Path(OUTPUT_DIR) / "training_metrics.json"
242
-     with open(metrics_path, "w") as f:
243
-         json.dump({"rewards": rewards, "losses": losses, "log_history": history}, f, indent=2, default=str)
244
-     print(f"📈 Metrics saved to {metrics_path}")
245
-
246
-     # ── Step 8: Generate sample outputs (CRITICAL for judge review) ───────────
247
-     print(f"\n📝 Generating {NUM_SAMPLE_OUTPUTS} sample outputs from trained model...")
248
-     model.eval()
249
-
250
-     sample_outputs = []
251
-     for i in range(min(NUM_SAMPLE_OUTPUTS, len(scenarios))):
252
-         prompt_text = build_prompt_multiagent(scenarios[i])
253
-         messages = [
254
-             {"role": "system", "content": SYSTEM_PROMPT},
255
-             {"role": "user", "content": prompt_text},
256
-         ]
257
-         input_ids = tokenizer.apply_chat_template(
258
-             messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
259
-         ).to(model.device)
260
-
261
-         output_ids = model.generate(
262
-             input_ids=input_ids,
263
-             max_new_tokens=MAX_COMPLETION_LENGTH,
264
-             temperature=0.7,
265
-             top_p=0.9,
266
-             do_sample=True,
267
-         )
268
-         response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
269
-
270
-         sample_outputs.append({
271
-             "scenario_idx": i,
272
-             "rm_size_limit": scenarios[i]["rm_size_limit"],
273
-             "pm_cap_alloc": scenarios[i]["pm_cap_alloc"],
274
-             "model_output": response,
275
-         })
276
-
277
-         print(f"\n{'─' * 60}")
278
-         print(f" Sample {i+1} | RM limit={scenarios[i]['rm_size_limit']:.2f} | PM cap={scenarios[i]['pm_cap_alloc']:.2f}")
279
-         print(f"{'─' * 60}")
280
-         print(response[:500])
281
-
282
-     samples_path = Path(OUTPUT_DIR) / "sample_outputs.json"
283
-     with open(samples_path, "w") as f:
284
-         json.dump(sample_outputs, f, indent=2, ensure_ascii=False)
285
-     print(f"\n💾 Sample outputs saved to {samples_path}")
286
-
287
- # ── Step 9: Generate plots ────────────────────────────────────────────────
288
- print("\n📊 Generating training plots...")
289
- try:
290
- import matplotlib
291
- matplotlib.use("Agg")
292
- import matplotlib.pyplot as plt
293
-
294
- os.makedirs("plots", exist_ok=True)
295
-
296
- fig, axes = plt.subplots(1, 2, figsize=(14, 5))
297
- fig.suptitle("QuantHive Multi-Agent GRPO Training — Qwen 2.5 1.5B", fontsize=14)
298
-
299
- # Loss curve
300
- steps = list(range(1, len(losses) + 1))
301
- axes[0].plot(steps, losses, alpha=0.4, color="salmon", label="Raw")
302
- if len(losses) >= 20:
303
- ma = np.convolve(losses, np.ones(20)/20, mode="valid")
304
- axes[0].plot(range(20, len(losses)+1), ma, color="red", linewidth=2, label="MA-20")
305
- axes[0].set_xlabel("Training Step")
306
- axes[0].set_ylabel("Loss")
307
- axes[0].set_title("GRPO Training Loss")
308
- axes[0].legend()
309
-
310
- # Reward curve
311
- axes[1].plot(steps, rewards, alpha=0.4, color="lightgreen", label="Raw")
312
- if len(rewards) >= 20:
313
- ma = np.convolve(rewards, np.ones(20)/20, mode="valid")
314
- axes[1].plot(range(20, len(rewards)+1), ma, color="green", linewidth=2, label="MA-20")
315
- axes[1].set_xlabel("Training Step")
316
- axes[1].set_ylabel("Mean Reward")
317
- axes[1].set_title("GRPO Mean Reward (5 Verifiers)")
318
- axes[1].legend()
319
-
320
- plt.tight_layout()
321
- fig.savefig("plots/hf_training_curves.png", dpi=150, bbox_inches="tight")
322
- plt.close()
323
- print(" Saved plots/hf_training_curves.png")
324
-
-         # ── Baseline comparison bar chart ─────────────────────────────────────
-         # Evaluate trained model vs random baseline on 20 scenarios
-         print(" Generating baseline comparison...")
-         eval_scenarios = scenarios[:20]
-         trained_scores = {
-             "Format": [], "Alignment": [], "Risk": [], "Profit": [], "Governance": []
-         }
-         baseline_scores = {
-             "Format": [], "Alignment": [], "Risk": [], "Profit": [], "Governance": []
-         }
-
-         for sc in eval_scenarios:
-             prompt_text = build_prompt_multiagent(sc)
-
-             # Trained model output
-             messages = [
-                 {"role": "system", "content": SYSTEM_PROMPT},
-                 {"role": "user", "content": prompt_text},
-             ]
-             input_ids = tokenizer.apply_chat_template(
-                 messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
-             ).to(model.device)
-             out = model.generate(input_ids=input_ids, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True)
-             completion = tokenizer.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True)
-
-             # Random baseline: gibberish output
-             random_completion = '{"direction": ' + str(random.choice([0,1,2])) + ', "size": ' + f"{random.random():.2f}" + ', "sl": 0, "tp": 0}'
-
-             # Score both
-             for name, func in zip(
-                 ["Format", "Alignment", "Risk", "Profit", "Governance"],
-                 reward_funcs
-             ):
-                 t_score = func([prompt_text], [completion])[0]
-                 b_score = func([prompt_text], [random_completion])[0]
-                 trained_scores[name].append(t_score)
-                 baseline_scores[name].append(b_score)
-
-         # Plot
-         fig2, ax2 = plt.subplots(figsize=(10, 6))
-         verifiers = list(trained_scores.keys())
-         x = np.arange(len(verifiers))
-         width = 0.35
-
-         trained_means = [np.mean(trained_scores[v]) for v in verifiers]
-         baseline_means = [np.mean(baseline_scores[v]) for v in verifiers]
-
-         bars1 = ax2.bar(x - width/2, baseline_means, width, label="Random Baseline", color="#ff6b6b", alpha=0.85)
-         bars2 = ax2.bar(x + width/2, trained_means, width, label="GRPO-Trained", color="#51cf66", alpha=0.85)
-
-         ax2.set_ylabel("Mean Score")
-         ax2.set_xlabel("Reward Verifier")
-         ax2.set_title("QuantHive: Trained Agent vs Random Baseline")
-         ax2.set_xticks(x)
-         ax2.set_xticklabels(verifiers)
-         ax2.legend()
-         ax2.set_ylim(0, 1.1)
-
-         for bar in bars1:
-             ax2.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,
-                      f'{bar.get_height():.2f}', ha='center', va='bottom', fontsize=10)
-         for bar in bars2:
-             ax2.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,
-                      f'{bar.get_height():.2f}', ha='center', va='bottom', fontsize=10)
-
-         fig2.savefig("plots/hf_baseline_vs_trained.png", dpi=150, bbox_inches="tight")
-         plt.close()
-         print(" Saved plots/hf_baseline_vs_trained.png")
-
-     except Exception as e:
-         print(f" ⚠️ Could not generate plots: {e}")
-
-     # ── Step 10: Save model ───────────────────────────────────────────────────
-     print(f"\n💾 Saving model to {OUTPUT_DIR}...")
-     model.save_pretrained(OUTPUT_DIR)
-     tokenizer.save_pretrained(OUTPUT_DIR)
-
-     # ── Step 11: Push to HF Hub (optional) ────────────────────────────────────
-     try:
-         from huggingface_hub import HfApi
-         api = HfApi()
-         print(f"\n🚀 Pushing model to {HF_REPO_ID}...")
-         api.upload_folder(
-             folder_path=OUTPUT_DIR,
-             repo_id=HF_REPO_ID,
-             repo_type="model",
-             create_pr=False,
-         )
-         print(f" ✅ Model pushed to https://huggingface.co/{HF_REPO_ID}")
-
-         # Also push the plots
-         for plot_file in Path("plots").glob("hf_*.png"):
-             api.upload_file(
-                 path_or_fileobj=str(plot_file),
-                 path_in_repo=f"plots/{plot_file.name}",
-                 repo_id=HF_REPO_ID,
-                 repo_type="model",
-             )
-             print(f" 📊 Uploaded {plot_file.name}")
-
-     except Exception as e:
-         print(f" ⚠️ Could not push to HF Hub: {e}")
-         print(f" You can manually push later with: huggingface-cli upload {HF_REPO_ID} {OUTPUT_DIR}")
-
-     print("\n" + "=" * 60)
-     print(" ✅ QuantHive GRPO Training Complete!")
-     print(f" 📁 Model: {OUTPUT_DIR}")
-     print(f" 📊 Plots: plots/hf_training_curves.png, plots/hf_baseline_vs_trained.png")
-     print(f" 📝 Samples: {OUTPUT_DIR}/sample_outputs.json")
-     print("=" * 60)
-
-
- if __name__ == "__main__":
-     main()

visualization.md DELETED
@@ -1,316 +0,0 @@
- # 🎮 UI Design Specification — Cutesy Quant Firm Simulation
-
- ## Overview
-
- This module defines a **2D indie-style visualization layer** for the Multi-Agent RL Trading Environment.
-
- The goal is to transform abstract agent interactions, trading decisions, and reward signals into a **visually intuitive, engaging simulation** resembling a small quant firm office.
-
- This directly supports:
-
- * Multi-agent interaction clarity
- * Reward and learning visualization
- * Storytelling for demo and judging
-
- ---
-
- ## 🧠 Core Concept
-
- A **“living office” simulation** where:
-
- * Each AI agent is represented as a character
- * Agents communicate via visible messages
- * Decisions affect a shared portfolio
- * Learning is visualized over time
-
- ---
-
- ## 🎨 Art Style Specification
-
- ### Style Choice
-
- * 2D pixel-art / stylized indie aesthetic
- * Soft pastel color palette
- * Minimal but expressive character design
-
- ### Rationale
-
- * Pixel art is widely used for clarity and simplicity in 2D systems ([gamemaker.io][1])
- * It provides a **cozy, interpretable visual layer** rather than overwhelming realism
- * Low-resolution sprites enhance readability and system understanding
-
- ### Style Rules
-
- Define consistently:
-
- * Resolution (e.g., 32x32 or 64x64 sprites)
- * Color palette (role-based colors)
- * Outline thickness
- * Animation frame count
- * Character proportions
-
- A consistent style guide improves visual coherence and scalability ([Sprite-AI][2])
-
- ---
-
- ## 🏢 Office Layout
-
- ### Structure
-
- ```
- ┌────────────────────────────┐
- │ 📈 Balance Panel           │
- │                            │
- │ 🧠 Researcher    💻 Trader │
- │                            │
- │ 📊 Risk Modeler  👑 PM     │
- │                            │
- │ 📉 Chart Panel             │
- └────────────────────────────┘
- ```
-
- ---
-
- ### Zones
-
- 1. **Top Panel**
-
-    * Portfolio balance
-    * Live PnL indicator
-
- 2. **Agent Floor**
-
-    * Each agent at a fixed workstation
-    * Communication visible between agents
-
- 3. **Bottom Panel**
-
-    * Market chart
-    * Trade markers
-
- ---
-
- ## 🤖 Agent Representation
-
- Each agent is visualized as:
-
- * A small animated character (sprite)
- * A workstation (desk + monitor)
- * A role-specific color theme
-
- ---
-
- ### Agent Roles
-
- #### Quant Researcher
-
- * Visual cues: charts, floating indicators
- * Behavior: signal generation
-
- #### Trader
-
- * Visual cues: multiple monitors
- * Behavior: executes trades
-
- #### Risk Modeler
-
- * Visual cues: warning icons
- * Behavior: restricts exposure
-
- #### Portfolio Manager
-
- * Visual cues: elevated seat / calm posture
- * Behavior: override authority
-
- ---
-
- ## 💬 Communication System
-
- ### Objective
-
- To visually demonstrate **multi-agent reasoning and coordination**, as required by the theme.
-
- ---
-
- ### Implementation
-
- * Speech bubbles above agents
- * Message transitions between agents
- * Short-term visible history
-
- ---
-
- ### Example Flow
-
- ```
- Researcher → "RSI oversold, bullish bias"
- Risk → "Volatility high, reduce size"
- Trader → "Executing reduced position"
- PM → "Approved"
- ```
-
- ---
-
- ### Design Notes
-
- * Messages should be concise
- * Fade after a short duration
- * Color-coded by agent
-
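Editor's sketch (not part of the deleted file): the design notes above ask for short messages that fade and are colour-coded per agent. One way this could be modelled is a small log with a time-to-live per bubble. The class names `Bubble` and `BubbleLog`, the colour values, and the 3-second default are all assumptions made for illustration; the spec does not define them.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-colour mapping; the spec only says "color-coded by agent".
AGENT_COLORS = {
    "Researcher": "#74c0fc",
    "Risk": "#ffa94d",
    "Trader": "#69db7c",
    "PM": "#b197fc",
}


@dataclass
class Bubble:
    agent: str
    text: str
    created_at: float  # timestamp in seconds


@dataclass
class BubbleLog:
    ttl_seconds: float = 3.0  # assumed fade duration
    bubbles: list = field(default_factory=list)

    def post(self, agent, text, now):
        """Record a new speech bubble at time `now`."""
        self.bubbles.append(Bubble(agent, text, now))

    def visible(self, now):
        """Return (agent, text, colour, opacity) for bubbles that have not fully faded."""
        out = []
        for b in self.bubbles:
            age = now - b.created_at
            if 0 <= age < self.ttl_seconds:
                opacity = 1.0 - age / self.ttl_seconds  # linear fade-out
                colour = AGENT_COLORS.get(b.agent, "#ced4da")
                out.append((b.agent, b.text, colour, round(opacity, 2)))
        return out
```

A renderer would call `visible(now)` once per frame and draw each tuple above the matching agent; the linear fade is only one possible easing.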
- ---
-
- ## 📈 Trading Visualization
-
- ### Balance Panel (Top Right)
-
- Displays:
-
- * Portfolio value (live)
- * PnL change
-
- Animations:
-
- * Green pulse → profit
- * Red flash → loss
-
- ---
-
- ### Chart Panel
-
- Displays:
-
- * Price time series
- * Trade markers:
-
-   * Buy → green marker
-   * Sell → red marker
-
- ---
-
- ### Metrics Panel
-
- All values normalized to [0, 1]:
-
- * Reward
- * Grade
- * Drawdown
- * Sharpe proxy
-
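Editor's sketch (not part of the deleted file): the Metrics Panel states that all values are normalized to [0, 1] but does not say how. A clipped min-max scaling is one common choice and is shown here purely as an assumption; the raw ranges in `RANGES` are invented for illustration and are not taken from the project.

```python
def normalize(value, lo, hi):
    """Min-max scale `value` from [lo, hi] into [0, 1], clipping anything outside."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


# Assumed raw ranges per metric (hypothetical values).
RANGES = {
    "reward": (-1.0, 1.0),
    "grade": (0.0, 1.0),
    "drawdown": (0.0, 0.5),
    "sharpe_proxy": (-2.0, 3.0),
}


def normalize_metrics(raw):
    """Map each raw metric onto [0, 1] using its assumed range."""
    return {name: normalize(raw[name], *RANGES[name]) for name in RANGES}
```

Clipping keeps an outlier step from pushing a gauge off the panel, at the cost of hiding how far outside the range the raw value was.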
- ---
-
- ## 🧠 Learning Visualization
-
- ### Objective
-
- Clearly demonstrate **agent improvement over time**
-
- ---
-
- ### Features
-
- #### 1. Before vs After Toggle
-
- * Pre-training behavior
- * Post-training behavior
-
- ---
-
- #### 2. Performance Graphs
-
- * Reward vs episode
- * Grade vs episode
- * Drawdown trend
-
- ---
-
- #### 3. Feedback Animation
-
- * Good trade → green highlight
- * Bad trade → red highlight
-
- ---
-
- ## ⚙️ System Modes
-
- ### Fast Mode
-
- * No animations
- * No API calls
- * Used for debugging
-
- ---
-
- ### Demo Mode
-
- * Full UI enabled
- * All agents active
- * Communication visible
-
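Editor's sketch (not part of the deleted file): the two modes above could be expressed as a small settings lookup. `Mode`, `UISettings`, and `settings_for` are hypothetical names. The spec only states that Fast Mode has no animations and no API calls and that Demo Mode enables the full UI with communication visible; treating API calls as enabled in Demo Mode and messages as hidden in Fast Mode is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FAST = "fast"
    DEMO = "demo"


@dataclass(frozen=True)
class UISettings:
    animations: bool
    api_calls: bool
    show_messages: bool


def settings_for(mode):
    """Return the UI switches for a given mode (assumed mapping, see note above)."""
    if mode is Mode.FAST:
        # Fast Mode: no animations, no API calls; intended for debugging.
        return UISettings(animations=False, api_calls=False, show_messages=False)
    # Demo Mode: full UI, communication visible.
    return UISettings(animations=True, api_calls=True, show_messages=True)
```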
- ---
-
- ## 🔌 Backend → UI Interface
-
- ### API Contract
-
- ```json
- {
-   "agents": [
-     {
-       "name": "Trader",
-       "message": "Executing buy",
-       "confidence": 0.78
-     }
-   ],
-   "portfolio": {
-     "value": 102000,
-     "pnl": 2000
-   },
-   "metrics": {
-     "reward": 0.72,
-     "grade": 0.68
-   },
-   "trades": []
- }
- ```
-
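Editor's sketch (not part of the deleted file): a UI consuming the contract above would likely want to parse and sanity-check each payload before rendering. The field names come from the JSON example; the `AgentState` and `Frame` types, the `parse_frame` helper, and the range check on metrics are assumptions added for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class AgentState:
    name: str
    message: str
    confidence: float


@dataclass
class Frame:
    agents: list
    portfolio_value: float
    pnl: float
    reward: float
    grade: float
    trades: list


def parse_frame(payload):
    """Parse one backend payload (a JSON string) into a Frame, validating metrics."""
    data = json.loads(payload)
    agents = [
        AgentState(a["name"], a["message"], float(a["confidence"]))
        for a in data["agents"]
    ]
    metrics = data["metrics"]
    for key in ("reward", "grade"):
        # The spec says metrics are normalized to [0, 1]; reject anything else.
        if not 0.0 <= metrics[key] <= 1.0:
            raise ValueError(f"{key} out of range: {metrics[key]}")
    return Frame(
        agents=agents,
        portfolio_value=float(data["portfolio"]["value"]),
        pnl=float(data["portfolio"]["pnl"]),
        reward=float(metrics["reward"]),
        grade=float(metrics["grade"]),
        trades=list(data["trades"]),
    )
```

Failing fast on out-of-range metrics keeps a backend bug from silently drawing a misleading gauge.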
- ---
-
- ## 🧠 Design Principles
-
- 1. **Clarity over realism**
-    Visuals must explain behavior, not just look good
-
- 2. **State visibility**
-    Every important decision should be observable
-
- 3. **Agent identity**
-    Each agent must feel distinct
-
- 4. **Learning visibility**
-    Improvement must be obvious without explanation
-
- ---
-
- ## 🎯 Success Criteria
-
- The UI is successful if:
-
- * A viewer can understand agent interaction without reading code
- * Decisions and conflicts are visually clear
- * Learning progression is observable
- * The system feels alive and coordinated
-
- ---
-
- ## 🚀 Final Note
-
- The UI is not just decoration — it is a **core storytelling layer**.
-
- It should communicate:
-
- > “This is a system of agents learning, collaborating, and improving under constraints.”
-
- ---
-
- [1]: https://gamemaker.io/en/blog/2d-game-art-styles?utm_source=chatgpt.com "The Ultimate Guide To 2D Video Game Art Styles"
- [2]: https://www.sprite-ai.art/blog/2d-pixel-art-style-guide?utm_source=chatgpt.com "2D pixel art style guide for games [with examples]"