Spaces: ARKAISW/QuantHive

ARKAISW committed on Apr 26

Commit

84ccd7d

1 Parent(s): 9a6d252

Clean up unused hackathon markdown files and update setup script link

Files changed (10)

README.md +1 -1
blog_temp.md +69 -0
fix.md +0 -336
guidetofollow.md +0 -367
more.md +0 -557
plan.md +0 -63
requirements.md +0 -150
themes.md +0 -134
train_hf.py +0 -438
visualization.md +0 -316

README.md CHANGED

@@ -31,7 +31,7 @@ QuantHive is a PettingZoo AEC (Agent-Environment Cycle) environment where **thre
 | 📓 Kaggle Run | [Kaggle Notebook](https://www.kaggle.com/code/arka2930/notebook24ed9f9bff) |
 | 📔 **Colab Demo** | [Google Colab Notebook](https://colab.research.google.com/drive/1B-KIlGL9kHLMD1RLhgLV94-modKzPzfy?usp=sharing) |
 | 📝 **Submission Blog** | [QuantHive: Multi-Agent Governance (HF)](https://huggingface.co/spaces/ARKAISW/QuantHive/blob/main/blog.md) |
-| 🐍 Setup Script | [QuantHive Training Script](https://github.com/ARKAISW/multi-agent-trading-env/blob/master/train_hf.py) |
+| 🐍 Setup Script | [QuantHive Training Notebook](https://github.com/ARKAISW/multi-agent-trading-env/blob/master/mate_training.ipynb) |
 ---

blog_temp.md ADDED

@@ -0,0 +1,69 @@
+---
+title: "QuantHive: Teaching AI to Survive Being Wrong"
+emoji: "🏛️"
+colorFrom: "blue"
+colorTo: "indigo"
+sdk: "docker"
+pinned: false
+---
+# QuantHive: Teaching AI to Survive Being Wrong
+Most people think trading is about predicting the next price movement.
+The first lesson I learned from observing a real risk quant was that professional trading isn't primarily about prediction. It's mostly about surviving being wrong.
+### The Origin
+I’m a Grade 12 student in India, and my older cousin is a risk quant. Early on, I got to see what real institutional finance looks like. It wasn’t about chaotic chart reading or betting on the next big breakout. It was a strict, highly disciplined system of constraints and balances. I learned early that real trading is not prediction; it's about controlled risk.
+When I started experimenting with AI and Reinforcement Learning, I became fascinated by disciplined decision systems. Most AI trading environments in the open-source world are simple single-agent setups. They provide a model with price history and reward it solely for maximizing profit and loss.
+But that's not how a hedge fund operates. If a human trader goes rogue, the risk desk intervenes forcefully. I wondered how AI would handle that if it were trained properly.
+### The Insight
+That changed my perspective. The intriguing question was not whether AI could predict the next price movement.
+**It was whether AI could learn institutional discipline.**
+Could we train an AI not only to pursue profits but also to negotiate, comply, and adjust to shifting oversight? Could we create a system where governance isn’t a rigid rule but a conversation?
+### Entering the QuantHive
+To address this, I built **QuantHive**—a governance-first trading environment that incorporates a multi-agent setup centered around PettingZoo’s AEC model. Instead of one reckless AI, I divided institutional trading into three opposing roles:
+1. **The Trader:** Aims to maximize profit and find alpha.
+2. **The Portfolio Manager:** Controls capital allocation and seeks steady growth without significant drawdowns.
+3. **The Risk Manager:** Has the authority to limit position sizes and reduce exposure forcefully if risks arise.
+They interact through structured message passing and governance limits within the environment loop. The environment rewards survival, not recklessness. The Risk Manager is rewarded for limiting trades during risky drawdowns, while the Trader must figure out how to make money within the changing limits set by the others.
+### From Floats to Thoughts: Semantic Reasoning
+The most valuable change came when training the Qwen 2.5 1.5B model with GRPO (Group Relative Policy Optimization).
+At first, the agents received raw float arrays (e.g., `0.284`). But to truly achieve "Auditable AI," I shifted the environment to use **Semantic Reasoning**. Instead of a vector of 24 numbers, the AI "reads" the market state in human terms: *"RSI is 28.4 (oversold)."*
+This simple change made the most of the LLM's pre-trained world knowledge. I trained the model against five reward verifiers, enforcing not only profit but also *Format, Alignment, Risk, and Governance.*
+### The Smoking Gun
+After 250 steps of GRPO training, the most interesting result was how the Trader adapted. The Trader began anticipating interventions and made adjustments before being forced to.
+Governance compliance rose from a random 7% to **88%**, and Risk Limit Adherence reached **93%** across held-out evaluation episodes in the governed environment.
+But the best part is how it complies. Because I required the model to explain its actions in natural language, the trained agent now outputs statements like:
+> *"...I also see that the portfolio's allocation of capital is nearing its limit (0.5). Given the Risk Manager's constraint on the size limit, I need to be cautious..."*
+It doesn’t just follow the rules; it understands and explicitly references them before taking action.
+### The Broader Implication
+Finance serves as a high-pressure test case. The larger question is whether autonomous systems can learn to operate under institutional oversight, justify their actions, and adapt to governance without hurting performance.
+I set out to determine if AI could be taught institutional discipline. The surprising outcome was not that the model became more profitable first. It became more disciplined first.
+---
+*Check out the full project on GitHub and see the live multi-agent choreography on our Hugging Face Space! All links are available in the repository [README.md](https://huggingface.co/spaces/ARKAISW/QuantHive/blob/main/README.md).*
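
Note on the "Semantic Reasoning" step in blog_temp.md: the post describes replacing raw floats with text such as "RSI is 28.4 (oversold)". A minimal sketch of that kind of rendering follows. The function name `describe_rsi` and the 30/70 thresholds are assumptions for illustration (30 and 70 are the conventional RSI bands); nothing here is taken from the QuantHive code.

```python
def describe_rsi(rsi: float, oversold: float = 30.0, overbought: float = 70.0) -> str:
    """Render an RSI value as a short sentence an LLM can read.

    Hypothetical helper: thresholds are the conventional 30/70 bands,
    not values confirmed by the repository.
    """
    if rsi < oversold:
        label = "oversold"
    elif rsi > overbought:
        label = "overbought"
    else:
        label = "neutral"
    return f"RSI is {rsi:.1f} ({label})."


print(describe_rsi(28.4))
```

The same pattern would extend to other indicators: one small, testable function per feature, joined into the prompt text.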

fix.md DELETED

@@ -1,336 +0,0 @@
-# QuantHive Round2-Copy — Complete Fix List
-> **Context**: This project at `E:\Development\Round2 - Copy` is a PettingZoo AEC multi-agent trading environment for the OpenEnv Hackathon. It was forked from a working Gymnasium single-agent version (`Round2`). The core PettingZoo env (`env/multi_agent_env.py`) and a basic training script (`training/train_multi_agent.py`) have already been created, but several files still reference the old Gym version and critical deliverables are missing.
->
-> **Goal**: Make this a complete, submission-ready hackathon entry. All edits happen in `E:\Development\Round2 - Copy`.
----
-## PROJECT ARCHITECTURE (What Already Exists)
-- `env/multi_agent_env.py` — **NEW, DONE** — PettingZoo AECEnv with 3 agents:
-  - `risk_manager_0`: obs=Box(24), action=Box(3) [size_limit, allow_new, force_reduce]
-  - `portfolio_manager_0`: obs=Box(27), action=Box(2) [cap_alloc, override_strength]
-  - `trader_0`: obs=Box(29), action=Dict{direction, size, sl, tp}
-  - Turn order: RM → PM → Trader per market step
-  - Inter-agent message passing: RM output → PM obs, RM+PM output → Trader obs
-  - Adversarial rewards: RM rewarded for restricting during drawdown, Trader rewarded for PnL
-- `env/trading_env.py` — OLD Gymnasium env (keep for backward compat, used for data generation)
-- `env/state.py` — MarketState, PortfolioState, RiskState (shared by both envs)
-- `env/reward.py` — Reward functions + 5 GRPO verifiers (format, alignment, risk, profit, governance)
-- `training/train_multi_agent.py` — **NEW, DONE** — REINFORCE-style multi-agent training with rule-based policies
-- `training/train_grpo.py` — OLD GRPO training script for the Gym env
-- `api/server.py` — **PARTIALLY REWRITTEN** — imports updated to PettingZoo, `make_initial_state()` updated, but SimulationRunner still uses old Gym logic
-- `app.py` — Gradio/FastAPI launcher
-- `ui/` — React frontend (functional, shows agent messages + chart)
-- `openenv.yaml` — **STALE** — still points to `env.trading_env:TradingEnv`
-- `README.md` — **STALE** — describes the old Gym governance-in-env design
-- `WRITEUP.md` — **STALE** — describes single-agent architecture
-- `mate_training.ipynb` — **STALE** — Colab notebook for old Gym env
-- `Dockerfile` — Functional but missing `pettingzoo` dependency
-- `plots/` — Has old training plots from Gym version
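Note on the observation sizes listed above: they are consistent with each agent seeing the base state plus the actions of the agents that moved before it (24, then 24 + 3 = 27, then 27 + 2 = 29). A small sketch of that arithmetic, assuming simple concatenation; `build_observations` is a hypothetical helper, not a function from `env/multi_agent_env.py`.

```python
BASE_OBS_DIM = 24  # market + portfolio + risk state, as seen by the Risk Manager


def build_observations(base_obs, rm_action, pm_action):
    """Append upstream agents' actions to the base observation (assumed layout)."""
    rm_obs = list(base_obs)
    pm_obs = rm_obs + list(rm_action)      # + [size_limit, allow_new, force_reduce]
    trader_obs = pm_obs + list(pm_action)  # + [cap_alloc, override_strength]
    return rm_obs, pm_obs, trader_obs


rm_obs, pm_obs, trader_obs = build_observations(
    [0.0] * BASE_OBS_DIM, [0.35, 1.0, 0.0], [0.50, 0.0]
)
print(len(rm_obs), len(pm_obs), len(trader_obs))
```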
----
-## CHANGES NEEDED (In Priority Order)
----
-### 🔴 1. Fix `openenv.yaml` — Points to Wrong Environment
-**File**: `openenv.yaml`
-Change `entry_point` from the old Gym env to the new PettingZoo env. Update observation space to reflect multi-agent structure.
-```yaml
-# OpenEnv Manifest
-version: "1.0"
-name: "QuantHive"
-description: "Decentralized multi-agent trading governance — three independent RL agents (Risk Manager, Portfolio Manager, Trader) with adversarial rewards negotiate via PettingZoo AEC turns."
-author: "Arka Sarkar"
-# Environment Specification
-environment:
-  entry_point: "env.multi_agent_env:MultiAgentTradingEnv"
-  type: "pettingzoo_aec"
-  agents:
-    - risk_manager_0
-    - portfolio_manager_0
-    - trader_0
-  observation_space:
-    risk_manager_0: { shape: [24], dtype: "float32", description: "Market + portfolio + risk state" }
-    portfolio_manager_0: { shape: [27], dtype: "float32", description: "Base obs + RM constraints [size_limit, allow_new, force_reduce]" }
-    trader_0: { shape: [29], dtype: "float32", description: "Base obs + RM constraints + PM allocation [cap_alloc, override_strength]" }
-  action_space:
-    risk_manager_0:
-      type: "box"
-      shape: [3]
-      description: "[size_limit (0-1), allow_new_positions (0-1), force_reduce (0-1)]"
-    portfolio_manager_0:
-      type: "box"
-      shape: [2]
-      description: "[capital_allocation (0-1), override_strength (0-1)]"
-    trader_0:
-      type: "dict"
-      items:
-        direction: { type: "int", low: 0, high: 2, description: "0=Hold, 1=Buy, 2=Sell" }
-        size: { type: "float", low: 0.0, high: 1.0 }
-        sl: { type: "float", description: "Stop Loss price" }
-        tp: { type: "float", description: "Take Profit price" }
-server:
-  port: 7860
-  endpoints:
-    reset: "/reset"
-    step: "/step"
-    state: "/state"
-tags:
-  - "PettingZoo AEC"
-  - "Multi-Agent"
-  - "Adversarial Rewards"
-  - "Financial Governance"
-  - "Inter-Agent Negotiation"
-  - "Self-Regulation"
-```
----
-### 🔴 2. Add `pettingzoo` to Dependencies
-**File**: `requirements.txt` — add `pettingzoo>=1.24.0`
-**File**: `requirements-space.txt` — add `pettingzoo>=1.24.0`
----
-### 🔴 3. Finish `api/server.py` — Complete SimulationRunner Rewrite
-**File**: `api/server.py`
-The imports and `make_initial_state()` have been updated. The `SimulationRunner` class and the API endpoints (`/reset`, `/step`, `/state`) still use the old `TradingEnv.step()` loop. They must be rewritten to:
-1. **SimulationRunner** must instantiate `MultiAgentTradingEnv` instead of `TradingEnv`
-2. **Each simulation step** must run a full AEC cycle (RM → PM → Trader) using `env.agent_iter()`
-3. Use the rule-based policies from `training/train_multi_agent.py` (`RuleRiskManagerPolicy`, `RulePortfolioManagerPolicy`, `RuleTraderPolicy`) as the default agent policies for the demo
-4. After each AEC cycle, broadcast per-agent messages and negotiation state to the UI via `sim_state`
-5. The `negotiation` field in `sim_state` must be populated with RM and PM messages each cycle
-6. The `flow` field must log the per-agent turn messages (e.g., "RM: Size limit set to 0.35", "PM: Allocation capped at 0.5", "Trader: BUY 0.3 @ 50123.45")
-The OpenEnv facade endpoints must still work:
-- `POST /reset` → calls `env.reset()`, returns initial trader observation
-- `POST /step` → accepts a trader action dict, runs full AEC cycle (RM and PM use rule policies), returns trader's obs/reward/done/info
-- `GET /state` → calls `env.state()`, returns full shared state
-This is the most complex single change. The existing `SimulationRunner` class structure can be adapted — replace the inner loop body.
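Note on item 3: one full RM → PM → Trader cycle can be driven with the standard PettingZoo AEC calls (`agent_iter`, `last`, `step`). The sketch below is duck-typed so it runs without the real environment; `_StubAECEnv` and the three lambda policies are stand-ins invented for this example, and `run_aec_cycle` is a hypothetical name, not part of `api/server.py`.

```python
def run_aec_cycle(env, policies):
    """Step each agent once, in the env's turn order, and collect the actions taken."""
    taken = {}
    for agent in env.agent_iter(max_iter=len(env.possible_agents)):
        obs, reward, termination, truncation, info = env.last()
        # PettingZoo expects None as the action for an agent that is done.
        action = None if (termination or truncation) else policies[agent](obs)
        env.step(action)
        taken[agent] = action
    return taken


class _StubAECEnv:
    """Tiny stand-in that mimics only the AEC calls used above; not the real env."""

    possible_agents = ["risk_manager_0", "portfolio_manager_0", "trader_0"]

    def __init__(self):
        self._turn = 0

    def agent_iter(self, max_iter):
        for _ in range(max_iter):
            yield self.possible_agents[self._turn % len(self.possible_agents)]

    def last(self):
        return [0.0], 0.0, False, False, {}

    def step(self, action):
        self._turn += 1


policies = {
    "risk_manager_0": lambda obs: [0.35, 1.0, 0.0],
    "portfolio_manager_0": lambda obs: [0.50, 0.0],
    "trader_0": lambda obs: {"direction": 1, "size": 0.30},
}
actions = run_aec_cycle(_StubAECEnv(), policies)
print(list(actions))
```

In the `/step` endpoint described above, the Trader's policy would be replaced by the action received in the request body, while RM and PM keep their rule policies.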
----
-### 🔴 4. Generate Training Evidence (Plots)
-After the GRPO training pipeline (change #8) is working:
-- Run training for ≥100 GRPO steps
-- Save to `plots/`:
-  - `reward_curve.png` — per-agent reward over training steps (RM, PM, Trader on same axes)
-  - `loss_curve.png` — policy loss convergence
-  - `baseline_comparison.png` — random vs trained agent performance per metric
-- Each plot must have labeled axes, a title, and a one-line caption
-- Commit these `.png` files to the repo
----
-### 🔴 5. Deploy to HF Space
-- Update `Dockerfile` to install `pettingzoo` (add to `requirements-space.txt`)
-- Push to HF Space at `https://huggingface.co/spaces/ARKAISW/QuantHive`
-- Verify from a logged-out browser that `/reset`, `/step`, `/state` all return valid JSON
-- The Space must be public and cloneable
----
-### 🟠 6. Rewrite `README.md`
-**File**: `README.md`
-The current README describes the old Gym-based governance-in-env design. Rewrite it to describe the PettingZoo architecture. Keep the same general structure but update all technical content:
-Key sections to change:
-- **Title/Tagline**: "Can three AI agents with conflicting goals learn to govern each other?" or similar
-- **The Problem**: Same framing (AI can't self-govern), but add: "Existing 'multi-agent' trading envs are single-agent with hardcoded rules pretending to be agents"
-- **The Solution**: Describe PettingZoo AEC with 3 independent agents, adversarial rewards, and inter-agent message passing. Remove all references to "governance lives in env.step()" — that was the old design. Now governance is emergent from agent interaction
-- **Environment section**: Update observation dimensions (RM=24, PM=27, Trader=29), explain message passing, show the AEC turn diagram
-- **Training section**: Update to reflect multi-agent GRPO, show per-agent reward curves
-- **Results section**: Update with new plot embeds and new metrics
-- **Theme alignment**: Explicitly cite Theme #1 (Multi-Agent Interactions) and sub-themes (Fleet AI Scalable Oversight, Halluminate Multi-Actor)
-- **Quick Launch**: Keep the same curl examples but verify they work with the new server
-Include a code example showing the multi-agent negotiation:
-```python
-info["governance"] = {
-    "rm_message": [0.35, 1.0, 0.0],      # RM: limit 35%, allow new, don't force reduce
-    "pm_message": [0.50, 0.0],             # PM: 50% allocation, no override
-    "proposed": {"direction": 1, "size": 0.7},
-    "executed": {"direction": 1, "size": 0.35},  # RM clamped size from 0.7 to 0.35
-    "interventions": [{"agent": "RiskManager", "type": "size_clamp"}]
-}
-```
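Note on the `info["governance"]` example: a rough sketch of clamping logic that would produce that record. It assumes the executed size is the proposed size capped first by the RM's `size_limit` and then by the PM's `cap_alloc`; the real rules in the environment may differ. `apply_governance` and the `"allocation_cap"` label are hypothetical.

```python
def apply_governance(proposed, rm_message, pm_message):
    """Clamp the Trader's proposed size and log who intervened (illustrative logic)."""
    size_limit = rm_message[0]  # RM: [size_limit, allow_new, force_reduce]
    cap_alloc = pm_message[0]   # PM: [cap_alloc, override_strength]
    executed = dict(proposed)
    interventions = []
    if executed["size"] > size_limit:
        executed["size"] = size_limit
        interventions.append({"agent": "RiskManager", "type": "size_clamp"})
    if executed["size"] > cap_alloc:
        executed["size"] = cap_alloc
        # "allocation_cap" is a made-up label for this sketch.
        interventions.append({"agent": "PortfolioManager", "type": "allocation_cap"})
    return {
        "rm_message": list(rm_message),
        "pm_message": list(pm_message),
        "proposed": dict(proposed),
        "executed": executed,
        "interventions": interventions,
    }


governance = apply_governance(
    {"direction": 1, "size": 0.7}, [0.35, 1.0, 0.0], [0.50, 0.0]
)
print(governance["executed"], governance["interventions"])
```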
----
-### 🟠 7. Rewrite `WRITEUP.md`
-**File**: `WRITEUP.md`
-Rewrite the narrative:
-1. **Problem**: Single-agent governance is fake — it's just business rules. True governance requires independent actors with conflicting incentives
-2. **Insight**: PettingZoo AEC enables actual decentralized decision-making. RM is rewarded for restricting risk, Trader for profit, PM for balanced growth. Their tension creates emergent regulatory behavior
-3. **Architecture**: 3-agent AEC cycle, inter-agent messages in observation space, adversarial reward structure
-4. **Training**: Multi-agent GRPO with alternating optimization
-5. **Results**: Per-agent reward curves, compliance rate improvement, RM learned to restrict, Trader learned to comply
-6. **Why it matters**: First true PettingZoo multi-agent governance env for finance. Generalizes to healthcare/autonomous systems oversight
----
-### 🟠 8. Build PettingZoo-Compatible GRPO Pipeline for Qwen 2.5
-**New File**: `training/train_grpo_multiagent.py`
-This is the most important training change. Create a GRPO trainer that:
-1. Uses `MultiAgentTradingEnv` as the environment
-2. Trains the Trader agent as a Qwen 2.5-1.5B model using Unsloth + TRL `GRPOTrainer`
-3. RM and PM can use rule-based policies during Trader training (alternating optimization)
-4. The Trader's prompt must include the RM/PM messages (constraints, allocation) as part of the state description so the LLM can reason about them
-5. Adapt the 5 existing GRPO verifiers from `reward.py`:
-   - `format_reward_func` — same (check `<thought>` + `<action>` tags)
-   - `alignment_reward_func` — same (anti-hallucination)
-   - `risk_reward_func` — update to use RM's `size_limit` from the message instead of hardcoded limit
-   - `profit_reward_func` — same (direction vs price trend)
-   - `governance_reward_func` — update to check if Trader's proposed size ≤ RM's size_limit (dynamic, not static)
-6. The key differentiator: the governance verifier now checks compliance against *learned* RM constraints, not hardcoded ones. This means the Trader must learn to read and respect the RM message in its observation
-Example prompt format for Qwen:
-```
-You are a trading agent in a multi-agent governance system.
-The Risk Manager has set the following constraints: size_limit=0.35, new_positions=allowed, force_reduce=no.
-The Portfolio Manager allocated: capital_cap=0.50, override=none.
-Market state: [... 24 values ...]
-Your task: Propose a trade action that maximizes profit while respecting the governance constraints.
-<thought>Your reasoning here</thought>
-<action>{"direction": 1, "size": 0.30, "sl": 49000, "tp": 52000}</action>
-```
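Note on item 8: two of the verifiers described above (format, and governance against a dynamic `size_limit`) could look roughly like this. These are plain functions over a list of completion strings, written for illustration; they are not the implementations in `env/reward.py`, and the exact signature a trainer expects is an assumption to check against the TRL documentation.

```python
import json
import re

# <thought>...</thought> followed by <action>{json}</action>
_PATTERN = re.compile(r"<thought>(.*?)</thought>\s*<action>(.*?)</action>", re.DOTALL)


def parse_action(completion):
    """Return the action dict if the completion is well-formed, else None."""
    match = _PATTERN.search(completion)
    if match is None:
        return None
    try:
        action = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    return action if isinstance(action, dict) else None


def format_reward(completions):
    """1.0 when both tags are present and the action parses as a JSON object."""
    return [1.0 if parse_action(c) is not None else 0.0 for c in completions]


def governance_reward(completions, size_limits):
    """1.0 when the proposed size respects the RM's size_limit for that sample."""
    rewards = []
    for completion, limit in zip(completions, size_limits):
        action = parse_action(completion)
        size = action.get("size") if action else None
        ok = isinstance(size, (int, float)) and size <= limit
        rewards.append(1.0 if ok else 0.0)
    return rewards


good = '<thought>Stay under the limit.</thought>\n<action>{"direction": 1, "size": 0.30, "sl": 49000, "tp": 52000}</action>'
too_big = '<thought>Go large.</thought><action>{"direction": 1, "size": 0.70}</action>'
broken = "BUY 0.3"
print(format_reward([good, too_big, broken]))
print(governance_reward([good, too_big, broken], [0.35, 0.35, 0.35]))
```

Keeping the two checks separate means a well-formatted but non-compliant answer still earns the format reward, which matches the guide's advice to use several independent reward signals.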
----
-### 🟠 9. Rewrite `mate_training.ipynb`
-**File**: `mate_training.ipynb`
-Rewrite the Colab notebook to:
-1. Install pettingzoo, openenv, trl, unsloth
-2. Import `MultiAgentTradingEnv`
-3. Run GRPO training via the new `train_grpo_multiagent.py` pipeline
-4. Generate and display loss/reward plots inline
-5. Save plots as `.png` in the `plots/` directory
-6. Must be fully re-runnable on Google Colab T4 GPU
----
-### 🟡 10. Multi-Agent Reward Visualization Script
-**New File**: `training/plot_multiagent.py`
-Create a script that:
-- Loads training logs from the GRPO run
-- Plots per-agent rewards (RM, PM, Trader) on same axes
-- Plots governance intervention rate over training
-- Plots compliance rate (% of Trader actions passing without RM/PM override)
-- Saves all to `plots/` as `.png` with labeled axes and titles
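Note on item 10: the compliance rate (share of Trader actions that pass without an RM/PM override) can be computed before any plotting. The sketch assumes each log record is a dict with an `interventions` list, as in the `info["governance"]` example earlier in this file; the actual log format of the GRPO run is not shown in the source.

```python
def compliance_rate(records):
    """Fraction of records with no interventions (0.0 for an empty log)."""
    if not records:
        return 0.0
    clean = sum(1 for record in records if not record.get("interventions"))
    return clean / len(records)


log = [
    {"interventions": []},
    {"interventions": [{"agent": "RiskManager", "type": "size_clamp"}]},
    {"interventions": []},
    {"interventions": []},
]
print(compliance_rate(log))
```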
----
-### 🟡 11. Strengthen Theme #1 Alignment in README
-Add a dedicated section in README:
-```markdown
-## 🎯 Theme Alignment: Multi-Agent Interactions (Theme #1)
-QuantHive directly addresses Theme #1 and both sub-themes:
-- **Fleet AI — Scalable Oversight**: The Risk Manager and Portfolio Manager are oversight agents that monitor and constrain the Trader in real-time, creating scalable governance.
-- **Halluminate — Multi-Actor Environments**: Three independent actors with adversarial incentives negotiate through observation message-passing, producing emergent strategic behavior.
-The PettingZoo AEC architecture enables theory-of-mind reasoning: the Trader must model what constraints the Risk Manager will impose based on the current portfolio state.
-```
----
-### 🟡 12. Document Anti-Reward-Hacking in WRITEUP
-Add a section explaining how the adversarial reward structure inherently prevents gaming:
-- If the Trader learns to ignore RM limits → RM is rewarded for clamping → arms race
-- If RM always blocks → RM gets no upside from portfolio growth → it learns moderation
-- Multiple independent reward signals per agent (not one monolithic score)
-- Governance intervention log provides process-level reward, not just final outcome
----
-### 🟡 13. Verify Curriculum Learning Works with PettingZoo Env
-Test that `MultiAgentTradingEnv(difficulty="easy")`, `"medium"`, `"hard"` all work correctly:
-- Run 10 episodes at each difficulty
-- Confirm the Trader gets non-zero reward at "easy" difficulty
-- Mention curriculum design in WRITEUP
----
-### 🟢 14. Update UI to Show Agent Negotiation
-Update the React UI (`ui/src/`) to:
-- Show RM → PM → Trader turn order visually
-- Display RM message [size_limit, allow_new, force_reduce] and PM message [cap_alloc, override] each cycle
-- Flash when an intervention occurs (RM clamped size, PM vetoed trade)
-- Show per-agent reward bars
----
-### 🟢 15. Prepare Slide Deck for 3-Min Pitch
-Create a 6-slide deck:
-1. Problem: "AI agents can't govern each other"
-2. Solution: PettingZoo AEC with 3 adversarial agents
-3. Architecture: RM → PM → Trader cycle + message passing diagram
-4. Key innovation: Adversarial rewards = emergent self-regulation
-5. Results: Per-agent reward curves + compliance improvement
-6. Demo: Live UI showing negotiation
----
-### 🟢 16. Upload Trained Model to HF Hub
-After training completes:
-- Save the LoRA adapter for Qwen 2.5-1.5B
-- Upload to HF Hub (e.g., `ARKAISW/quanthive-trader-lora`)
-- Link from README
----
-### 🟢 17. Record <2 Min Video Demo
-- Screen record the UI showing multi-agent negotiation
-- Show before/after: random Trader vs trained Trader
-- Upload to YouTube (URL only, no video files in repo)
-- Link from README
----
-### 🟢 18. Run PettingZoo API Test
-Run PettingZoo's built-in compliance test to verify the env is properly implemented:
-```python
-from pettingzoo.test import api_test
-from env.multi_agent_env import MultiAgentTradingEnv
-env = MultiAgentTradingEnv()
-api_test(env, num_cycles=50, verbose_progress=True)
-```
-Fix any issues that arise. Mention passing this test in README as quality evidence.

guidetofollow.md DELETED

@@ -1,367 +0,0 @@
-Hackathon Self-Serve Guide: Build an RL Environment, Train an LLM, Ship a Demo
-0) What you are building
-The core idea is not just to fine-tune a text model, but to build a specialized LLM system that
-can act inside an environment, get feedback, and improve through reinforcement learning. The
-practical stack discussed here is:
-Environment → verifier/reward functions → TRL trainer → Unsloth for efficiency →
-deployment on OpenEnv / Spaces.
-A strong project usually looks like one of these,
-Please refer to [External] Apr ‘26 OpenEnv Hackathon Themes for theme guidelines on selecting & forming problem statements.
-1) Start with the right project idea
-Pick a task that has all three of these properties:
-- The model can act step by step
-- You can verify success programmatically
-- The task is hard enough to be interesting, but not so hard that the model never
-succeeds
-This last point matters a lot. RL only works if the probability of getting a good answer is
-greater than zero. If your task is so hard that the model never gets any reward, you will burn
-compute and learn nothing.
-Please refer to [External] Apr ‘26 OpenEnv Hackathon Themes for theme guidelines on selecting & forming problem statements.
-A useful rule: prefer tasks with crisp verification over tasks that only “look good” to a
-human. RL gets easier when the reward is objective.
-2) Understand the minimum RL loop before you build
-At a high level, your loop is:
-- Give the model a prompt
-- Let it generate an action, strategy, answer, or code
-- Execute that output in an environment or verifier
-- Convert the result into a reward
-- Update the model so higher-reward behavior becomes more likely
-That is the practical mental model for RL here. The system samples many outputs, scores
-them, and shifts probability mass away from bad outputs and toward better ones.
-One especially useful framing is that RL is like a more efficient version of repeated in-context
-improvement. Instead of repeatedly stuffing previous examples into the context, you let
-backpropagation store what worked into the weights.
-3) Decide whether you need SFT first
-Use this simple rule:
-● If you have a lot of good data, use SFT
-● If you do not have data but can verify outputs, use RL
-● In many practical cases, do a little SFT first, then RL
-Why this matters:
-● SFT is generally more sample-efficient
-● RL is useful when you can test outcomes but cannot cheaply author ideal traces
-● RL often needs some warm start, formatting priming, or easy tasks first so that good
-rollouts happen at all
-For hackathon teams, the best path is usually:
-- Start from a capable base/instruct model
-- Add light formatting or task scaffolding if needed
-- Use RL for improvement, not as magic from scratch
-4) Design the environment before you design the trainer
-Treat the environment as a first-class artifact. It should define:
-● reset(): start a fresh episode
-● step(action): apply an action and return the next result
-● state() / observation: what the agent sees
-● reward: what counts as progress or success
-OpenEnv standardizes this so the same training code can work across many environments,
-instead of every team inventing a different API. That is one of the main reasons to use it in a
-hackathon.
-Think about your environment in this order:
-- What does the agent observe?
-- What actions can it take?
-- What ends an episode?
-- How do you compute reward?
-- How do you stop abuse, infinite loops, or cheating?
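Note on section 4: a toy environment showing the reset/step/state shape and a step cap that prevents endless episodes. This is plain Python for illustration and deliberately does not use the OpenEnv base classes, whose exact names are not given in this guide; `CountingEnv` and `StepResult` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: int
    reward: float
    done: bool


class CountingEnv:
    """Toy task: reach `target` exactly by adding 1 or 2 per step."""

    def __init__(self, target: int = 5, max_steps: int = 10):
        self.target = target
        self.max_steps = max_steps
        self.total = 0
        self.steps = 0

    def reset(self) -> int:
        """Start a fresh episode and return the first observation."""
        self.total = 0
        self.steps = 0
        return self.total

    def state(self) -> dict:
        """Full internal state, for debugging and demos."""
        return {"total": self.total, "steps": self.steps}

    def step(self, action: int) -> StepResult:
        if action not in (1, 2):
            raise ValueError("action must be 1 or 2")
        self.total += action
        self.steps += 1
        reached = self.total == self.target
        # Episode ends on success, overshoot, or the step cap (anti-infinite-loop).
        done = reached or self.total > self.target or self.steps >= self.max_steps
        return StepResult(self.total, 1.0 if reached else 0.0, done)


env = CountingEnv()
env.reset()
results = [env.step(a) for a in (2, 2, 1)]
print(results[-1])
```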
-5) Build the environment using OpenEnv
-The intended workflow is to bootstrap an environment skeleton and then fill in the behavior.
-OpenEnv’s CLI creates the scaffolding for you. The environment is implemented as a Python
-package and exposed via a FastAPI app.
-Your implementation typically defines:
-● action dataclass
-● observation dataclass
-● state representation
-● environment methods like reset and step
-● FastAPI wrapper / client-server interface
-That gives you a clean separation:
-● the environment handles world dynamics and scoring,
-● the trainer handles optimization,
-● and the model just learns to act inside the interface.
-6) Keep the task simple at first
-Do not begin with your hardest benchmark. Start with the easiest version of your environment
-that still proves the concept. This is where curriculum learning helps.
-A good progression:
-- easy tasks with short horizons,
-- medium tasks with a little more branching,
-- harder tasks only after the model starts getting non-zero reward.
-The principle is simple: make success possible early. If the model never sees successful
-trajectories, learning stalls.
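Note on section 6: one simple way to implement the easy → medium → hard progression is to promote only once the model is getting non-zero reward often enough. The level names match the `difficulty` values mentioned in fix.md; the 30% threshold and the `next_difficulty` helper are arbitrary choices for this sketch.

```python
LEVELS = ["easy", "medium", "hard"]


def next_difficulty(current, recent_rewards, promote_at=0.3):
    """Move up one level when enough recent episodes earned non-zero reward."""
    if not recent_rewards:
        return current
    success = sum(1 for r in recent_rewards if r > 0) / len(recent_rewards)
    idx = LEVELS.index(current)
    if success >= promote_at and idx < len(LEVELS) - 1:
        return LEVELS[idx + 1]
    return current


print(next_difficulty("easy", [0.0, 1.0, 0.0, 0.5]))
```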
-7) Design rewards carefully
-Your reward function is your task specification. If it is weak, incomplete, or easy to exploit, the
-model will optimize the wrong thing very efficiently.
-A strong reward design usually includes multiple components, for example:
-● execution success,
-● correctness,
-● format compliance,
-● timeouts,
-● resource usage,
-● safety constraints,
-● and anti-cheating checks.
-One explicit recommendation was to use multiple independent reward functions, not just one.
-If you only have a single reward signal, it is easier for the model to hack it. Multiple
-independent checks reduce that risk.
-For example, for a coding environment:
-● reward passing tests,
-● penalize timeouts,
-● reward format compliance,
-● reject use of forbidden globals,
-● and separately verify the function contract.
-8) Protect yourself against reward hacking
-Reward hacking is one of the biggest practical failure modes. The model may learn shortcuts
-that maximize your reward without solving the real task. Examples mentioned include:
-● editing timers,
-● caching results,
-● abusing globals,
-● mutating protected state,
-● or exploiting environment bugs.
-What to do:
-- Use multiple independent reward functions
-- Lock down execution where possible
-- Add time limits
-- Avoid unrestricted global state
-- Sample outputs frequently and inspect them
-- Terminate or roll back runs if behavior drifts badly
-A particularly practical recommendation was to use a locked-down function or restricted
-execution approach so the model cannot rely on undeclared globals or hidden cached state.
-Also, do not just let training run forever without checking generations. Periodic human
-inspection is still necessary.
-9) Use process-aware feedback when you can
-Naively assigning the same final reward to every token is inefficient. If possible, use richer
-supervision that distinguishes good intermediate steps from bad ones. That is the idea behind
-process supervision.
-In practice, this can be approximated by:
-● line-by-line checks,
-● step-level verifiers,
-● program trace analysis,
-● or LLM-as-a-judge for intermediate reasoning.
-But be careful: LLM-as-a-judge can itself be gamed. Use it as one signal, not the only signal.
-For a hackathon, outcome-based verification plus a few lightweight process checks is usually
-the sweet spot.
-10) Pick the right training stack
-The intended stack here is:
-● TRL for RL training algorithms
-● Unsloth to make RL training and inference more efficient
-● OpenEnv to standardize environment interaction
-This combination works because:
-● OpenEnv gives you a common environment interface
-● TRL gives you RL trainers like GRPO
-● Unsloth reduces memory use and improves efficiency on top of TRL
-One of the practical examples used the same prompt repeated many times, routed through an
-environment, with TRL driving training and Unsloth helping with performance.
-11) Prefer GRPO / RLVR style training for verifiable tasks
-The RL setup discussed here leans toward RL with verifiable rewards:
-● instead of a learned reward model,
-● use a verifier, test harness, regex check, executor, or environment.
-GRPO was described as a more efficient evolution relative to older PPO-style setups,
-especially by simplifying away parts like the value model.
-For hackathon purposes, the key practical takeaway is:
-● if the task is verifiable,
-● build the verifier first,
-● then plug that verifier into RL training.
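Note on section 11: GRPO can drop the value model because it scores each sampled completion relative to the other completions for the same prompt. A minimal sketch of that group-relative advantage, using the population standard deviation; library implementations may differ in the exact normalization, so treat this as an illustration rather than TRL's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

If every completion in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is why the guide stresses tasks where success probability is above zero.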
-12) Keep inference fast
-One important point: in RL for LLMs, inference can dominate total runtime. Over time, rollout
-generation often becomes the bottleneck, not the optimizer step.
-That means your project speed depends heavily on:
-● fast sampling,
-● tight environment loops,
-● low-overhead execution,
-● and efficient model runtime.
-This is one reason Unsloth matters in the stack, and another reason to avoid overly heavy
-environments early in the hackathon.
-13) Deploy your environment early
-OpenEnv environments are designed to be deployed as Hugging Face Spaces, which provide:
-● a running server,
-● a Git repository,
-● and a container registry.
-That gives you several ways to work:
-● interact with the remote Space directly,
-● install the client code from the repo,
-● pull and run the container locally,
-● or run the FastAPI app locally via Python/Uvicorn.
-Why this is good for a hackathon:
-● one shared source of truth,
-● easier collaboration,
-● easier demos,
-● easier switching between local and remote execution.
-A good habit is to deploy an early version of the environment before training seriously. That
-catches API and packaging issues early.
-14) Scale only after the environment is stable
-There was a dedicated tutorial flow around:
-- environment,
-- deployment,
-- scaling,
-- training with TRL and Wordle.
-Follow the same order.
-Do not start with scale. First confirm:
-● reset works,
-● step works,
-● rewards are sensible,
-● timeouts work,
-● logs are visible,
-● and the environment can be run locally and remotely.
-Only then:
-● increase batch sizes,
-● duplicate prompts or tasks,
-● expand task diversity,
-● and benchmark throughput.
-15) Monitor the right things during training
-Do not watch only one scalar. Monitor:
-● overall reward,
-● individual reward function columns,
-● success indicators,
-● timeout frequency,
-● and generated strategies over time.
-A very concrete suggestion was:
-● watch whether the reward is going up,
-● and separately watch critical columns like “function works.”
-Also inspect actual generations during training. A rising reward is not enough if the model is
-learning to exploit bugs.
-16) Save models correctly
-If you use QLoRA / LoRA-style training, be careful when saving. One explicit warning was:
-Do not upcast a 4-bit model to 16-bit and then merge the LoRA weights naively. That can
-badly damage model quality. Instead, use the proper merged-save path, or use the adapters
-directly.
-For participants, that means:
-● keep your training save path simple,
-● test post-training inference immediately,
-● and do not leave export until the end.
-17) How to structure your team over the hackathon
-A very effective team split is:
-## Person A: Environment
-● builds reset/step/state
-● adds timeouts and safety constraints
-● makes local and remote execution work
-## Person B: Verifier / Rewards
-● writes multiple reward functions
-● adds anti-hacking checks
-● makes failure cases visible
-## Person C: Training
-● sets up TRL + Unsloth
-● runs experiments
-● tracks metrics and generations
-## Person D: Demo / Product
-● prepares the Space demo
-● creates a simple interface
-● records examples and final benchmarks
-This split matches the way the stack naturally decomposes in practice.
-18) A practical 1-day execution plan
-Phase 1: Pick a narrow task
-Choose a small, verifiable environment. Avoid huge long-horizon tasks first.
-Phase 2: Build the environment
-Use OpenEnv init, implement reset/step/state, and get a local loop working.
-Phase 3: Build rewards
-Add at least 2–4 independent reward checks, plus timeout and anti-cheat logic.
-Phase 4: Deploy
-Push to a Space or run locally via container/Uvicorn so teammates can use the same
-environment.
-Phase 5: Train small
-Run a tiny TRL + Unsloth experiment first. Look at outputs, not just metrics.
-Phase 6: Inspect for hacking
-Sample generations. Check for globals, hacks, environment abuse, or suspicious shortcuts.
-Phase 7: Add curriculum
-If the model gets zero reward too often, simplify tasks or add easier start states.
-Phase 8: Train bigger
-Only after the loop is stable should you increase scale, batch size, or environment diversity.
-Phase 9: Save and demo
-Export the trained model correctly, test inference, and show before/after behavior.
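Phase 3 of the plan above asks for several independent reward checks plus anti-cheat logic. A minimal sketch of that idea follows, assuming a hypothetical toy task where the model must reply with `answer: <int>`; the function names, banned tokens, and weights are illustrative assumptions, not part of OpenEnv or TRL.

```python
import re

ANSWER_RE = re.compile(r"^answer: (-?\d+)$")


def check_format(output: str) -> float:
    """1.0 if the output is exactly 'answer: <int>'."""
    return 1.0 if ANSWER_RE.match(output.strip()) else 0.0


def check_correct(output: str, expected: int) -> float:
    """1.0 if the parsed integer equals the expected value."""
    match = ANSWER_RE.match(output.strip())
    return 1.0 if match and int(match.group(1)) == expected else 0.0


def check_no_cheating(output: str) -> float:
    """0.0 if the output contains tokens that suggest environment abuse."""
    banned = ("import ", "globals(", "__")
    return 0.0 if any(token in output for token in banned) else 1.0


def score(output: str, expected: int) -> dict:
    """Keep every check as its own column; the anti-cheat check gates the total."""
    rewards = {
        "format": check_format(output),
        "correct": check_correct(output, expected),
        "no_cheating": check_no_cheating(output),
    }
    if rewards["no_cheating"] == 0.0:
        rewards["total"] = 0.0
    else:
        rewards["total"] = 0.25 * rewards["format"] + 0.75 * rewards["correct"]
    return rewards
```

Keeping the columns separate makes it visible when the model is only learning the format and not the task.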
-19) What judges or reviewers will likely find compelling
-The strongest hackathon projects usually show:
-● a clear environment design,
-● objective reward functions,
-● evidence that the model improved,
-● prevention against reward hacking,
-● a reproducible deployment story,
-● and a sharp demo.
-A simple but strong demo format is:
-- baseline model attempt,
-- reward/verifier output,
-- trained model attempt,
-- measurable improvement,
-- short explanation of safeguards.
-20) Suggested problem statement theme directions
-Please refer to the "[External] Apr '26 OpenEnv Hackathon Themes" document.
-21) Common mistakes to avoid
-● Picking a task so hard that success probability is zero
-● Using only one reward function
-● Not checking for reward hacking
-● Training before the environment is stable
-● Relying only on average reward and not inspecting outputs
-● Forgetting timeouts and sandbox limits
-● Saving LoRA/QLoRA models incorrectly
-22) Learning Resources
-(Recommended) RL Environment Lecture Chapters:
-https://openenv-india-apr-2026.lovable.app/
-Module 1: Why OpenEnv? (~7 min)
-▸ Workshop (8:02–15:05): https://www.youtube.com/watch?v=1jU05MlENOI&t=482s
-▸ Sanyam: RL loop, fragmented env APIs, OpenEnv as universal interface, Gymnasium spec + Docker
-▸ Alt: Mega Lecture (40:01–46:00): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=2401s
-Module 2: Using Existing Envs (~7.5 min)
-▸ Workshop (35:33–43:05): https://www.youtube.com/watch?v=1jU05MlENOI&t=2133s
-▸ Ben: Hub org, env collections, 3 Space interfaces (server/repo/registry), from_hub
-▸ Alt: Mega Lecture (1:24:11–1:30:00): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5051s
-Module 3: Deploying Envs (~9 min)
-▸ Mega Lecture (1:30:00–1:39:07): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5400s
-▸ Ben: live openenv init, scaffold, running locally, openenv push, Docker run from Space
-▸ Alt: Workshop (43:05–48:30): https://www.youtube.com/watch?v=1jU05MlENOI&t=2585s
-Module 4: Building Your Own (~6.5 min)
-▸ Workshop (43:45–50:20): https://www.youtube.com/watch?v=1jU05MlENOI&t=2625s
-▸ Ben: scaffold files, business logic (reset/step), models, client, publishing
-▸ Alt: Mega Lecture (1:33:30–1:39:07): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=5610s
-Module 5: Training + TRL (~14 min)
-▸ Mega Lecture (1:53:20–2:07:12): https://www.youtube.com/watch?v=Jew4lhAiqnw&t=6800s
-▸ Lewis: Wordle GRPO walkthrough: rollout function, reward shaping, GRPOTrainer, live training
-▸ Alt: Workshop (22:24–34:12): https://www.youtube.com/watch?v=1jU05MlENOI&t=1344s

more.md DELETED Viewed

@@ -1,557 +0,0 @@
-# OpenEnv Hackathon: Build at the Bleeding Edge of AI
-**Event:** India's Biggest Mega AI Hackathon
-**Built on:** Meta's OpenEnv (the foundation for next-gen RL environments used by leading AI labs)
-**Sponsored by:** Hugging Face, PyTorch
-**Grand Prize:** Winners get an interview opportunity at Meta & Hugging Face AI teams
-**Important Dates:**
-- Round 1 Begins: March 25th
-- Grand Finale (48-hour sprint in Bangalore): April 25th - 26th
-## Results Announcement
-- Top 100 Finalists Announced: Friday, May 1st
-- Winners Livestream: Friday, May 8th
-## Credits & Resources
-Get your credits for Cursor AI and Hugging Face as early as possible.
-**Cursor AI Credit:** Each participant is eligible. Visit the Scaler Hackathon dashboard to avail credits:
-https://tinyurl.com/sclr-openenv-dashboard
-**Hugging Face Credits:** $30 credit per person. Avail credits at:
-https://huggingface.co/coupons/claim/hf-openenv-community
-The same links will be shared in the on-campus Discord channels.
-## Meet Your Mentors
-**Onsite / Available:**
-- Sanyam Bhutani - Partner Engineer, META
-- Yash Khare - Partner Engineer, META
-- Nilesh Pandey - Partner Engineer, META
-- Adithya S Kolavi - Engineer, Hugging Face
-- Adarsh Shirawalmath - ML Engineer, Hugging Face
-- Arkadip Maitra - ML Engineer, Red Hat
-- Aashay Sachdeva - Founding Team, Sarvam
-- Deepa Dhevannan - Gen AI Solution Architect
-- Soumik Rakshit - ML Engineer, Zomato
-- Ayush Satyam - ML Engineer, Red Hat
-- Parshant Sharma - ML Engineer, Red Hat
-**Remotely Available:**
-- Ben Burtenshaw - Community Education AI, Hugging Face
-- Alireza Shamsoshoara - PyTorch, Meta
-## Discord Guidelines
-Important: Since global tech leaders and executives are present, a high level of professionalism and decorum must be maintained. Failure to follow the guidelines will lead to strict action and may impact your participation in the hackathon.
-## Technical Session Agenda
-- PyTorch Foundation Introduction
-- Hackathon Themes
-- Submission and Judging Rules
-- RL 101 + OpenEnv Recap
-- Best Practices
-- Q&A
-## About PyTorch Foundation
-**Mission:** Democratizing and accelerating the adoption of accessible, high-impact AI technologies by cultivating a robust ecosystem of open-source, vendor-neutral projects spanning the entire AI lifecycle.
-**Hosted Projects:** Multiple open-source projects under the foundation
-## Hackathon Goals
-- Learn reinforcement learning (RL)
-- Now is a great time to learn RL
-- Hack and create cool environments you can use to add skills to models
-- Showcase your work on the Hugging Face Hub
-- Have fun
-**GitHub Repository:** https://github.com/meta-pytorch/OpenEnv
-## Guidelines for Problem Statement
-- It is NOT mandatory to choose the same problem statement as Round 1. Only choose it if it aligns with the provided hackathon themes.
-- Before the onsite event (April 25-26): Work on building the environment, agent behaviors, and reward model.
-- Onsite (April 25-26): Post-training will be done when you receive compute credits for Hugging Face.
-## What Judges Look For (TL;DR)
-Build an environment that an LLM could actually be trained on to get measurably better at something interesting. Then show that training. Then tell the story.
-A messy but ambitious environment with real training evidence beats a polished but boring one. Pick a problem that excites you (that energy comes through in the pitch).
-**Note:** Only one submission per team. The URL link of your environment must be submitted as judges will pull the environment from the URL to evaluate it. Changes after the deadline will not be considered.
-## Judging Criteria
-| Criterion | Weight | What It Means |
-|-----------|--------|----------------|
-| Environment Innovation | 40% | Is the environment novel, creative, or genuinely challenging? Does it meaningfully test agent behavior in a new way? |
-| Showing Improvement in Rewards | 20% | Is there observable evidence of training progress? Reward curves, before/after behavior, baseline comparison. |
-| Storytelling & Presentation | 30% | Can you clearly explain the problem, the environment, and what the agent learned? Is the demo engaging for a non-technical audience? |
-| Reward & Training Pipeline | 10% | Is the reward logic coherent? Does the pipeline produce meaningful improvement? |
-## Minimum Submission Requirements (Non-Negotiable)
-Submissions missing any of these are at a serious disadvantage:
-1. Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel.
-2. A working training script using Unsloth or Hugging Face TRL, ideally as a Colab notebook so judges can re-run it.
-3. Evidence that you actually trained: at minimum, loss and reward plots from a real run.
-4. A short writeup: a mini-blog on Hugging Face, a less than 2 minute video on YouTube explaining what your environment does and what you trained, or a short slide deck. All materials must be linked from your README.
-5. Push your environment to a Hugging Face Space so it's discoverable and runnable.
-6. A README that motivates the problem, explains how the environment works, and shows results.
-7. README must have a link to the environment in the Hugging Face Space and all additional references to other materials (videos, blog posts, slides, presentations, etc.).
-8. Do not include large video files in your HF Hub submission. Use URL references instead.
-## What Makes a Submission Stand Out
-### 1. Pick an Ambitious, Original Problem
-Ask yourself:
-- Does this environment teach an LLM something it currently can't do well?
-- Is the domain underexplored in RL/LLM training?
-- Could a researcher write a paper about training on this?
-Avoid clones of chess, snake, tic-tac-toe, and grid-world.
-### 2. Design a Reward Signal That Actually Teaches
-A great environment has a reward function that:
-- Provides a rich, informative signal (not just 0/1 at the end)
-- Captures something hard to measure in a clever way
-- Uses OpenEnv's Rubric system thoughtfully (composable rubrics are better than monolithic scoring)
-- Is hard to game (an agent that exploits the reward without solving the task should not get high scores)
-### 3. Show Real Training, End to End
-The bar is not "training script exists." The bar is "training script runs against the environment, the agent learns, and you can show it."
-- Your training loop must connect to your environment (not a static dataset)
-- Train long enough that the curves mean something
-- Compare a trained agent vs. a random/untrained baseline (quantitative and/or qualitative)
-- Include plots and numbers in your README and writeup
-### 4. Make Your Plots Readable
-Reviewers spend seconds, not minutes, on each plot.
-- Label both axes ("training step" or "episode" on x, "reward" or "loss" on y) and include units
-- Save plots as .png or .jpg and commit them to the repo (don't leave them only in a Colab cell or a deleted Wandb run)
-- If you used Wandb, include the link to that specific run
-- Embed key plots in your README with a one-line caption explaining what each one shows
-- If you have multiple runs (baseline vs. trained, ablations), put them on the same axes so comparison is obvious
-### 5. Tell a Story, Not an API Doc
-Your README, blog, and pitch should answer:
-1. **Problem:** What capability gap or interesting domain are you targeting?
-2. **Environment:** What does the agent see, do, and get rewarded for?
-3. **Results:** What changed after training? Show it.
-4. **Why does it matter:** Who would care, and why?
-A reviewer should be able to read your README in 3-5 minutes and want to try your environment.
-### 6. Engineer It Cleanly (Table Stakes)
-Engineering quality matters less than ambition, but sloppy work hurts.
-- Use OpenEnv's Environment or MCPEnvironment base classes properly
-- Respect client/server separation (clients should never import server internals)
-- Follow the standard Gym-style API (reset, step, state)
-- Have a valid openenv.yaml manifest
-- Don't use reserved tool names (reset, step, state, close) for MCP tools
-## OpenEnv Technical Recap
-### The RL Loop (Conceptual Example: Teaching a Dog to Sit)
-```
-observation = environment.reset()        # Start a new episode
-done = False
-while not done:
-    action = agent.choose(observation)   # What does the agent do?
-    result = environment.step(action)    # Environment responds
-    observation = result.observation     # What does the agent see next?
-    reward = result.reward               # Get feedback
-    done = result.done                   # Has the episode ended?
-    agent.learn(reward)                  # Agent learns
-```
-### The Four Key Concepts
-- **reset()** - Start a new episode. Begin a fresh training session.
-- **observation** - What the agent sees. The current state of the world.
-- **action** - What the agent does. Sit, spin, move left, etc.
-- **step(action)** - Execute the action. Returns three things: new observation, reward, and done flag (episode over).
-### Building Your Environment in 5 Simple Steps
-1. **Define Types (models.py)** - Action, Observation, State dataclasses
-2. **Implement Environment (server/environment.py)** - reset(), step(), state() methods
-3. **Create Client (client.py)** - HTTPEnvClient subclass
-4. **Create Server (server/app.py)** - app = create_fastapi_app(env)
-5. **Dockerize (Dockerfile)** - Standard container setup
-**Or use the CLI:** `openenv init my_env` - scaffolding ready in seconds.
-### The Universal Interface
-Every OpenEnv environment implements these 3 methods:
-```python
-class Environment:
-    def reset(self) -> Observation:
-        """Start a new episode"""
-    def step(self, action: Action) -> Observation:
-        """Execute action, return observation"""
-    def state(self) -> State:
-        """Get episode metadata"""
-```
-### Type-Safe by Design
-Define your data structures with Python dataclasses:
-- **Action:** What the agent does (move, jump, click, type, etc.)
-- **Observation:** What the agent sees (board state, pixels, text, etc.)
-- **State:** Episode metadata (ID, step count, timestamp, etc.)
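As an illustration of the three types listed above, here is a minimal sketch using plain Python dataclasses for a hypothetical word-guessing environment. The class and field names are assumptions made for the example, and the sketch deliberately does not import OpenEnv, so it omits whatever base types the framework itself provides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GuessAction:
    """What the agent does: submit one guess."""
    guess: str


@dataclass
class GuessObservation:
    """What the agent sees after each step."""
    feedback: str
    attempts_left: int
    done: bool = False
    reward: float = 0.0


@dataclass
class GuessState:
    """Episode metadata kept by the environment."""
    episode_id: str
    step_count: int = 0
    # default_factory gives every episode its own list instead of a shared one
    history: List[str] = field(default_factory=list)


first = GuessState(episode_id="ep-1")
first.history.append("crane")
second = GuessState(episode_id="ep-2")
```

Using `field(default_factory=list)` avoids the shared-mutable-default pitfall, so `second.history` stays empty after `first.history` is modified.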
-### Connecting to Any Environment
-This pattern works for Chess, Atari, Trading, Android - everything:
-```python
-# Connect to environment (runs in Docker container)
-env = SomeEnv.from_docker_image("some-env:latest")
-# Start new episode
-result = env.reset()
-# Take action
-action = SomeAction(...)
-result = env.step(action)
-# Get episode metadata
-state = env.state()
-# Clean up
-env.close()  # Container stops automatically
-```
-## Model Context Protocol (MCP) - Adding Tools to Your Environment
-**The Challenge:** Modern AI agents need access to external systems like web search APIs, file operations, database queries, Git operations, and custom integrations.
-**The Solution:** MCP (Model Context Protocol) - a standard protocol for AI agents to discover and call tools. It features a REST-like API (JSON-RPC), works with any AI framework, and has plug-and-play tool servers.
-## Deployment Commands
-```bash
-# Initialize a new environment
-openenv init my_env
-cd my_env
-# Deploy to your namespace
-openenv push
-# Deploy to specific repo
-openenv push --repo-id username/my-env
-# Deploy as private
-openenv push --repo-id username/my-env --private
-```
-## Hugging Face Spaces - Three Components
-Every HF Space provides three components:
-### 1. Server: A Running Environment Endpoint
-Connect directly to the running Space (WebSocket under the hood).
-**Async (recommended):**
-```python
-async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
-    result = await client.reset()
-    result = await client.step(EchoAction(message="Hello"))
-```
-**Sync (using .sync() wrapper):**
-```python
-with EchoEnv(base_url="https://openenv-echo-env.hf.space").sync() as client:
-    result = client.reset()
-    result = client.step(EchoAction(message="Hello"))
-```
-**Available Endpoints:**
-- /ws - WebSocket persistent session (used by client)
-- /health - HTTP GET health check
-- /reset - HTTP POST reset environment (stateless)
-- /step - HTTP POST execute action (stateless)
-- /state - HTTP GET current state
-- /docs - HTTP GET OpenAPI documentation
-- /web - HTTP GET interactive web UI
-**Check if space is running:**
-```bash
-curl https://openenv-echo-env.hf.space/health
-# Returns: {"status":"healthy"}
-```
-### 2. Repository: Installable Python Package
-Every Space is a Git repository. OpenEnv environments include a pyproject.toml, making them pip-installable directly from the Space URL.
-```bash
-# Install client package from Space
-pip install git+https://huggingface.co/spaces/openenv/echo-env
-```
-This installs: Client class (EchoEnv), Models (EchoAction, EchoObservation), and Utilities.
-After installation:
-```python
-from envs.echo_env import EchoEnv, EchoAction, EchoObservation
-action = EchoAction(message="Hello")
-```
-### 3. Registry: Docker Container Image
-```bash
-# Pull the image
-docker pull registry.hf.space/openenv-echo-env:latest
-# Run locally on port 8001
-docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
-```
-## Client Usage Examples
-```python
-import asyncio
-from echo_env import EchoEnv, EchoAction
-async def main():
-    # Development: connect to remote Space
-    async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
-        result = await client.reset()
-    # Production: run locally for speed
-    # docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
-    async with EchoEnv(base_url="http://localhost:8001") as client:
-        result = await client.reset()
-    # Or let the client manage Docker for you
-    client = await EchoEnv.from_env("openenv/echo-env")  # Auto-pulls and runs
-    async with client:
-        result = await client.reset()
-asyncio.run(main())
-# For sync usage, use the .sync() wrapper:
-with EchoEnv(base_url="http://localhost:8001").sync() as client:
-    result = client.reset()
-```
-## Clone and Run Environment Locally
-```bash
-# Clone from HF Space
-git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
-cd openenv-benchmark
-# Install in editable mode
-uv sync
-# Start server
-uv run server
-# Run isolated from remote Space
-uv run --isolated --project https://huggingface.co/spaces/burtenshaw/openenv-benchmark server
-```
-## Local Development with Uvicorn
-```bash
-# Full control over uvicorn options
-uvicorn benchmark.server.app:app --host "$HOST" --port "$PORT" --workers "$WORKERS"
-# With reload for development
-uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --reload
-# Multi-worker mode for better concurrency
-uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --workers 4
-```
-## Run Container Locally from Space
-```bash
-# Clone from HF Space
-git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
-cd openenv-benchmark
-# Using OpenEnv CLI (recommended)
-openenv build -t openenv-benchmark:latest
-# Or with Docker directly
-docker build -t openenv-benchmark:latest -f server/Dockerfile .
-```
-## Environment Setup
-### Using uv venv:
-```bash
-uv venv
-source .venv/bin/activate
-uv pip install openenv-core
-```
-### Using conda:
-```bash
-conda create -n openenv_hackathon python=3.12
-conda activate openenv_hackathon
-uv pip install openenv-core
-```
-### Initialize a New Environment:
-```bash
-openenv init HackEnv101_AlirezaShamsoshoara
-```
-This creates 11 files and generates uv.lock. Next steps:
-```bash
-cd /path/to/HackEnv101_AlirezaShamsoshoara
-# Edit environment implementation in server/..._environment.py
-# Edit models in models.py
-# Install dependencies: uv sync
-```
-## Training Resources
-### Training with TRL (GRPO)
-Hugging Face TRL integrates natively with OpenEnv environments for GRPO training.
-**Resources:**
-- TRL OpenEnv Documentation: https://huggingface.co/docs/trl/en/openenv
-- Sudoku Example: https://github.com/huggingface/trl/blob/main/examples/notebooks/openenv_sudoku_qrpo.ipynb
-- Wordle Example: https://github.com/huggingface/trl/blob/main/examples/notebooks/openenv_worldle_qrpo.ipynb
-- More TRL Examples: https://github.com/huggingface/trl/tree/main/examples/scripts/openenv
-**General Training Examples:**
-- Main examples directory: https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial/examples
-- Unsloth 2048 example: https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/examples/unsloht_2048.ipynb
-- Wordle example (TRL): https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/examples/worldle.py
-### Training with Unsloth
-Unsloth provides 2x faster training and 70% less memory through custom CUDA kernels. Works as a drop-in replacement - same TRL API, just faster.
-**The Pattern:**
-1. Load model via FastLanguageModel (with 4-bit quantization)
-2. Apply LoRA adapters for parameter-efficient training
-3. Use OpenEnv as the reward function
-4. Train with standard GRPOTrainer
-**Google Colab Ready:** Run on a free T4 GPU. Unsloth + OpenEnv Colab notebook available for the 2048 game environment with 20B parameter models.
-**Also Compatible With:** TRL, torchforge, SkyRL, ART, Oumi, veRL
-## Accessing Hugging Face Infrastructure
-Use HF infrastructure to run your training. Hugging Face Jobs provide compute for AI and data workflows.
-**Important Notes:**
-- Choose your GPU based on your model size
-- Pick hardware that lets you run training/inference for a reasonable time within your credits
-- A T4 GPU (small/medium) is a good choice
-**Methods to Run Jobs:**
-- hf CLI
-- huggingface_hub Python client
-- Jobs HTTP API
-**Pricing and Billing Resources:**
-- Billing settings: https://huggingface.co/settings/billing
-- Jobs settings: https://huggingface.co/settings/jobs
-- Jobs documentation: https://huggingface.co/docs/hub/jobs
-- Job CLI documentation: https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs
-- Jobs guide: https://huggingface.co/docs/huggingface_hub/guides/jobs
-- Jobs pricing: https://huggingface.co/docs/hub/jobs-pricing
-- Jobs examples: https://huggingface.co/docs/hub/jobs-examples
-**Check available hardware:**
-```bash
-hf jobs hardware
-```
-### Example Hardware Options
-| Name | Pretty Name | CPU | RAM | Accelerator | Cost/Hour |
-|------|-------------|-----|-----|-------------|-----------|
-| cpu-basic | CPU Basic | 2 vCPU | 16 GB | N/A | $0.01 |
-| cpu-upgrade | CPU Upgrade | 8 vCPU | 32 GB | N/A | $0.03 |
-| t4-small | Nvidia T4 - small | 4 vCPU | 15 GB | 1x T4 (16 GB) | $0.40 |
-| t4-medium | Nvidia T4 - medium | 8 vCPU | 30 GB | 1x T4 (16 GB) | $0.60 |
-| a10g-small | Nvidia A10G - small | 4 vCPU | 15 GB | 1x A10G (24 GB) | $1.00 |
-| a10g-large | Nvidia A10G - large | 12 vCPU | 46 GB | 1x A10G (24 GB) | $1.50 |
-| a100-large | Nvidia A100 - large | 12 vCPU | 142 GB | 1x A100 (80 GB) | $2.50 |
-| a100x4 | 4x Nvidia A100 | 48 vCPU | 568 GB | 4x A100 (320 GB) | $10.00 |
-| a100x8 | 8x Nvidia A100 | 96 vCPU | 1136 GB | 8x A100 (640 GB) | $20.00 |
-| h200 | Nvidia H200 | 23 vCPU | 256 GB | 1x H200 (141 GB) | $5.00 |
-| h200x2 | Nvidia H200 (2x) | 46 vCPU | 512 GB | 2x H200 (282 GB) | $10.00 |
-| h200x4 | Nvidia H200 (4x) | 92 vCPU | 1024 GB | 4x H200 (564 GB) | $20.00 |
-| h200x8 | Nvidia H200 (8x) | 184 vCPU | 2048 GB | 8x H200 (1128 GB) | $40.00 |
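To make the credit budgeting concrete, the following sketch estimates how many hours a credit balance buys on a few of the flavors in the table above. The prices are copied from that table and the $30 figure is the per-person Hugging Face credit mentioned earlier; the helper name is hypothetical, and actual billing may differ from this simple division.

```python
# Hourly prices in USD, taken from the example hardware table.
HOURLY_COST_USD = {
    "t4-small": 0.40,
    "t4-medium": 0.60,
    "a10g-small": 1.00,
    "a100-large": 2.50,
}


def hours_available(credit_usd: float, flavor: str) -> float:
    """Rough number of compute hours a credit balance covers on one flavor."""
    return credit_usd / HOURLY_COST_USD[flavor]


budget = {flavor: hours_available(30.0, flavor) for flavor in HOURLY_COST_USD}
```

With a $30 credit this works out to roughly 75 hours on `t4-small` versus about 12 hours on `a100-large`, which is why a T4 is a reasonable default for small models.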
-## Still Have Questions?
-Please mention them in the Discord India OpenEnv Hackathon channels, and the team will do their best to answer.
----
-## Example Model Reference
-- model_name_or_path: Qwen/Qwen2-0.5B (and similar models)

plan.md DELETED Viewed

@@ -1,63 +0,0 @@
-# Goal Description
-To elevate QuantHive into a definitive Top 15 Hackathon submission, we need to transition from a "single-agent Gym environment with pipeline functions" to a **"decentralized society of interacting agents"** using true multi-agent RL principles.
-This requires rewriting the core environment to use the **PettingZoo** AEC (Agent Environment Cycle) API, giving each agent independent observation/action spaces, conflicting reward functions (emergent behavior), and communication channels.
-## User Review Required
-> [!WARNING]
-> **Massive Architectural Rewrite**
-> This change will rip out the foundation of your current, working project.
->
-> 1. We will replace [trading_env.py](file:///e:/Development/Round2/env/trading_env.py) (Gym) with `multi_agent_env.py` (PettingZoo).
-> 2. The API server ([server.py](file:///e:/Development/Round2/api/server.py)) and UI will break and need to be rewritten to support asynchronous agent steps.
-> 3. The current GRPO training script ([train_grpo.py](file:///e:/Development/Round2/training/train_grpo.py)) trains a single policy on JSON. In a true multi-agent setup, we need an *online* RL loop. We will build a multi-agent rollout collector connecting to Unsloth/TRL, but it is experimental and computationally heavy.
->
-> If you are close to the submission deadline, doing this is extremely risky. If you proceed, the repository will be in a broken state until all components are rewired.
-## Proposed Changes
-### Core Environment (PettingZoo)
-Replace the single-agent Gym environment with a multi-agent PettingZoo environment.
-#### [NEW] `env/multi_agent_env.py`
-- Inherits from `pettingzoo.utils.env.AECEnv`.
-- Agents: `["risk_manager_0", "portfolio_manager_0", "trader_0"]`.
-- [step()](file:///e:/Development/Round2/api/server.py#108-227) and `observe()` functions that alternate execution between agents.
-- **Agent Negotiation:** The observation space of the Trader includes the output messages/constraints from the Risk Manager and PM.
-- **Adversarial Rewards:**
-  - Trader: Rewarded for PnL.
-  - Risk Manager: Rewarded for capping size when volatility/drawdown is high, penalized when Trader loses money.
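The conflicting incentives described above could be sketched as two separate reward functions. This is a hedged illustration only: the thresholds, the `cap_fraction` action, and the exact shaping are assumptions for the example, not the reward logic actually used in QuantHive.

```python
def trader_reward(pnl: float) -> float:
    """Trader is rewarded directly for profit and loss."""
    return pnl


def risk_manager_reward(
    pnl: float,
    cap_fraction: float,
    volatility: float,
    drawdown: float,
    vol_limit: float = 0.3,    # hypothetical stress threshold
    dd_limit: float = 0.1,     # hypothetical stress threshold
    tight_cap: float = 0.2,    # hypothetical definition of a "capped" size
) -> float:
    """Reward tight position caps in stressed markets; penalize Trader losses."""
    stressed = volatility > vol_limit or drawdown > dd_limit
    reward = 0.0
    if stressed and cap_fraction <= tight_cap:
        reward += 1.0          # bonus for capping size when risk is high
    if pnl < 0:
        reward += pnl          # shared penalty when the Trader loses money
    return reward
```

Because the Risk Manager gains nothing from upside PnL while the Trader gains nothing from caution, the two policies pull in different directions, which is the source of the intended emergent behavior.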
-#### [MODIFY] [env/trading_env.py](file:///e:/Development/Round2/env/trading_env.py)
-- Deprecate or refactor to wrap the PettingZoo environment for legacy compatibility.
-### Agents & Governance
-Modify the agent definitions to act as independent RL policies within the PettingZoo loop.
-#### [MODIFY] [agents/risk_model.py](file:///e:/Development/Round2/agents/risk_model.py)
-#### [MODIFY] [agents/portfolio_manager.py](file:///e:/Development/Round2/agents/portfolio_manager.py)
-#### [MODIFY] [agents/trader.py](file:///e:/Development/Round2/agents/trader.py)
-- Refactor agents to accept PettingZoo observations (which include multi-agent messages) and output PettingZoo actions.
-### Training Loop (Online Multi-Agent RL)
-Connect the LLMs to the PettingZoo environment for online rollout collection.
-#### [NEW] `training/train_multi_agent.py`
-- An online RL loop that steps the `multi_agent_env`.
-- Collects trajectories (Observation, Action, Reward) for multiple agents.
-- Feeds collected rollout buffers into the GRPO/PPO trainer. Note: Full multi-agent online LLM training is extremely heavy; we may implement it as alternating optimization (freeze RM, train Trader, freeze Trader, train RM).
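The alternating optimization mentioned above (freeze one agent, train the other, then swap) can be sketched as a simple schedule generator. The function name is hypothetical and the sketch covers only the bookkeeping of which agent is trainable in each round; the actual GRPO/PPO update is out of scope here.

```python
from itertools import cycle


def alternating_schedule(agents, rounds):
    """Yield (round_index, trainable_agent, frozen_agents) for each round."""
    order = cycle(agents)
    for index in range(rounds):
        trainable = next(order)
        frozen = [agent for agent in agents if agent != trainable]
        yield index, trainable, frozen


schedule = list(alternating_schedule(["trader_0", "risk_manager_0"], rounds=3))
```

Each round would collect rollouts with all agents acting, but apply gradient updates only to the `trainable` policy, which keeps memory use close to single-agent training.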
-### API Server and UI
-Update the server to orchestrate a PettingZoo AEC loop.
-#### [MODIFY] [api/server.py](file:///e:/Development/Round2/api/server.py)
-- Rewrite [SimulationRunner](file:///e:/Development/Round2/api/server.py#72-227) to step through the PettingZoo `agent_iter()`.
-- Broadcast state updates to the UI, showing the negotiation and adversarial interactions.
-## Verification Plan
-### Automated Tests
-1. Initialize `MultiAgentEnv` and run the `pettingzoo.test.api_test()`.
-2. Verify that taking actions with `risk_manager_0` updates the observation space of `trader_0`.
-3. Verify that the adversarial reward functions independently return conflicting scores (e.g., RM gets +1 for restricting, Trader gets -1 for missing a trade).
-### Manual Verification
-1. Run the new API server and step through the UI to see the multi-agent negotiation in real-time.
-2. Run `train_multi_agent.py` for 50 steps to ensure trajectories build correctly and gradients update.

requirements.md DELETED Viewed

@@ -1,150 +0,0 @@
-**What the automated round checks**
-These are the items the validation pass looks for. If any is missing or broken at the deadline, the submission won't make it to a human judge, regardless of how strong the underlying idea is. Verify each one explicitly before you submit.
-- Public, cloneable Hugging Face Space at the submitted URL. Test from a logged-out browser. Private spaces, dead links, or 404s are an automatic out.
-- Valid OpenEnv structure: proper Environment / MCPEnvironment base class, Gym-style reset / step / state, and a parseable openenv.yaml.
-- Training evidence committed to the repo as image files (.png / .jpg): At minimum a loss curve and a reward curve. Wandb-only links and plots that live only in a Colab cell don't count: they may not be reachable when validation runs.
-- A runnable training script (Unsloth, HF TRL, or other frameworks), preferably linked as a Colab notebook so it can be re-executed end to end (Python script is acceptable as well).
-- A README that links every deliverable: HF Space, training notebook, and your writeup (blog / video / slides), with the key plots embedded inline. If validation can't reach a deliverable from the README, it counts as missing.
-**TL;DR**
-Build an environment that an LLM could actually be trained on to get measurably better at
-something interesting. Then show that training. Then tell the story.
-A messy but ambitious environment with real training evidence beats a polished but boring one.
-Pick a problem that excites you (that energy comes through in the pitch).
-**Judging Criteria**
-| Criterion | Weight | What it means |
-|-----------|--------|---------------|
-| Environment Innovation | 40% | Is the environment novel, creative, or genuinely challenging? Does it meaningfully test agent behavior in a way that hasn't been done before? |
-| Storytelling & Presentation | 30% | Can you clearly explain the problem, the environment, and what the agent learned? Is the demo engaging and easy to follow for a non-technical audience? |
-| Showing Improvement in Rewards | 20% | Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline: anything that proves the agent learned something. |
-| Reward & Training Pipeline | 10% | Is the reward logic coherent? Does the pipeline produce meaningful improvement in the trained agent's behavior? |
-**Minimum Submission Requirements**
-**NOTE:** These are **non-negotiable**. Submissions missing any of these are at a serious disadvantage.
-*   **Use OpenEnv** (latest release). Build on top of the framework; don’t reinvent the wheel.
-*   **A working training script** using **Unsloth or Hugging Face TRL**, ideally as a Colab notebook so judges can re-run it.
-*   **Evidence that you actually trained**; at minimum, loss and reward plots from a real run.
-*   **A short writeup**: a mini-blog on Hugging Face, a < 2 minute video on YouTube explaining what your environment does and what you trained, or a short slide deck. Make sure all materials are linked from your README so that judges can access them easily.
-*   **Push your environment to a Hugging Face Space** so it’s discoverable and runnable.
-*   **A README** that motivates the problem, explains how the env works, and shows results.
-    *   README should have a link to the environment in the Hugging Face Space. It should also have all additional references to other materials (e.g. videos, blog posts, slides, presentations, etc.) that you want to include.
-*   Please do not include large video files in your Env submission on HF Hub, as we would like to keep each env small. Link to additional materials by URL instead.
-**What Makes a Submission Stand Out**
-_**Pick an ambitious, original problem**_
-The themes (problems) are deliberately open. Use them as launching pads, not boxes. Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones. To score well on innovation,
-you need a genuinely fresh angle. Some questions to ask yourself:
-*   Does this environment exist to teach an LLM something it currently can’t do well?
-*   Is the domain underexplored in RL/LLM training?
-*   Could a researcher write a paper about training on this?
-_**Design a reward signal that actually teaches**_
-A great environment has a reward function that:
-*   Provides a **rich, informative signal** (not just 0/1 at the end)
-*   Captures something **hard to measure** in a clever way
-*   Uses OpenEnv’s **Rubric system** thoughtfully (composable rubrics > monolithic scoring)
-*   Is **hard to game**; an agent that exploits the reward without solving the task should not get high scores
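The properties listed above can be illustrated with a minimal sketch in plain Python. It does not use OpenEnv's Rubric API. The component names (`format_ok`, `within_risk_limit`, `pnl_score`), the weights, the action fields, and the gating rule are all hypothetical, chosen only to show how small composable checks give a graded signal and how gating on validity makes the total harder to game.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Action = Dict[str, float]  # hypothetical action format, for illustration only


@dataclass(frozen=True)
class Component:
    name: str
    weight: float
    score: Callable[[Action], float]  # expected to return a value in [0, 1]


def format_ok(action: Action) -> float:
    """Binary check: required keys are present and size lies in [0, 1]."""
    if not {"direction", "size"} <= set(action):
        return 0.0
    return 1.0 if 0.0 <= action["size"] <= 1.0 else 0.0


def within_risk_limit(action: Action, limit: float = 0.5) -> float:
    """Graded check: full credit at or under the limit, linearly less above it."""
    over = max(0.0, action["size"] - limit)
    return max(0.0, 1.0 - over / limit)


def pnl_score(action: Action) -> float:
    """Stand-in for an outcome-based score, clipped to [0, 1]."""
    return min(1.0, max(0.0, action.get("pnl", 0.0)))


COMPONENTS = (
    Component("format", 0.2, format_ok),
    Component("risk", 0.3, within_risk_limit),
    Component("profit", 0.5, pnl_score),
)


def composite_reward(action: Action, components: Sequence[Component] = COMPONENTS) -> float:
    """Weighted mean of the components, gated on the format check."""
    # Gate: a malformed action earns nothing, so the other components
    # cannot be farmed without producing a valid action first.
    if format_ok(action) == 0.0:
        return 0.0
    total_weight = sum(c.weight for c in components)
    return sum(c.weight * c.score(action) for c in components) / total_weight
```

Keeping each check as its own small function means a component can be reweighted, replaced, or plotted separately without touching the others.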
-_**Show real training, end to end**_
-The bar isn’t “training script exists.” The bar is “training script runs against the environment, the
-agent learns, and you can show it.” Concretely:
-*   Your training loop should connect to **your** environment (not a static dataset)
-*   Train long enough that the curves mean something
-*   Compare a **trained agent vs. a random/untrained baseline**; quantitative and/or qualitative
-*   Include the plots and numbers in your README and writeup
-_**Make your plots readable**_
-Reviewers spend seconds, not minutes, on each plot. Help them out:
-*   **Label both axes** (e.g. “training step” / “episode” on x, “reward” / “loss” on y) and include units where they apply
-*   Save plots as _.png_ or _.jpg_ and **commit them to the repo** (don’t leave them only in a Colab cell or a deleted Wandb run). If you ran via Wandb, please include the link to the specific run behind your plots.
-*   **Embed the key plots in your README** with a one-line caption explaining what each one shows
-*   If you have multiple runs (baseline vs. trained, ablations, etc.), put them on the same axes so the comparison is obvious
-_**Tell a story, not an API doc**_
-Your README, blog, and pitch should answer:
-1.  **Problem)** what capability gap or interesting domain are you targeting?
-2.  **Environment)** what does the agent see, do, and get rewarded for?
-3.  **Results)** what changed after training? Show it.
-4.  **Why does it matter)** who would care, and why?
-_A reviewer should be able to read your README in 3~5 minutes and want to try your environment._
-**NOTE:** If you have a video, HF post, or anything else interesting, please make sure that it’s linked from your README.
-_**Engineer it cleanly (table stakes)**_
-Engineering quality matters less than ambition, but sloppy work hurts. Make sure you:
-*   Use OpenEnv’s Environment / MCPEnvironment base classes properly
-*   Respect the **client / server separation** (clients should never import server internals)
-*   Follow the standard Gym-style API (reset, step, state)
-*   Have a valid openenv.yaml manifest
-*   Don’t use reserved tool names (reset, step, state, close) for MCP tools
-**Final Note**
-Judges are looking for environments that push the frontier of what we can train LLMs to do. Be
-ambitious. Pick a problem you find genuinely interesting; that almost always produces better
-work than chasing what you think judges want. Good luck.

themes.md DELETED Viewed

@@ -1,134 +0,0 @@
-Theme #1 - Multi-Agent Interactions
-Environments for this theme involve cooperation, competition, negotiation, and
-coalition formation. Learning from these environments will enable agents to model the
-beliefs and incentives of others in partially observable settings. This drives
-theory-of-mind reasoning and emergent strategic behavior.
-Expected Outcome: an environment that can be used to train multi-agent task
-handling in an LLM
-Example environments: Market simulations, compute-allocation negotiations,
-collaborative puzzle worlds, mixed cooperative/competitive strategy games.
-Sub-themes with bonus prizes.
-- Fleet AI. Scalable Oversight: Environments that train oversight agents to
-monitor, analyze, and explain the behavior of other AI agents operating in
-complex, multi-agent settings.
-- Halluminate. Multi-Actor Environments: Build a realistic environment where an
-agent interacts with and manages multiple actors (agents) to discover and
-achieve the task
-Theme #2 - (Super) Long-Horizon Planning & Instruction Following
-You will build environments that require deep, multi-step reasoning with sparse or
-delayed rewards. After using these environments, the goal is to enable agents to
-decompose goals, track state over extended trajectories, and recover from early
-mistakes. The aim is to push beyond shallow next-token reasoning toward structured
-planning and durable internal representations.
-Expected Outcome: an environment that can capture and improve LLM behaviour on
-challenging long horizon tasks that need long running sessions beyond context
-memory limits.
-Example environments: Research-planning simulators, large-scale codebase
-refactoring tasks, strategic resource management worlds, long-horizon logistics
-optimization, extremely complicated long-horizon instruction following (e.g., 300
-instructions scattered around).
-Sub-themes with bonus prizes.
-- Scale AI. Environments for long horizon workflows for non-code use cases
-within a business setting: focusing on either Sales, Project management, or HR & IT.
-- Mercor. Make an environment with capped/uncapped rewards where frontier
-model rewards scale with token output.
-## Theme #3 - World Modeling
-## #3.1 Professional Tasks
-Here you will develop environments that require real interaction with tools, APIs, or dynamic
-systems where the model is expected to do real hard work instead of exploiting short-cuts to
-arrive at the desired outcome. Learning from these environments will enable agents to
-maintain consistent internal state, update beliefs based on outcomes, and orchestrate
-multi-step workflows. The goal is to strengthen causal reasoning and persistent world models.
-Expected Outcome: an environment that captures the nuances of a defined partially observable world
-and improves LLM interaction with it
-Example environments: Dynamic browser/API ecosystems, enterprise applications, scientific
-workflow loops (papers → code → experiments), economic simulations with feedback,
-tool-discovery benchmarks.
-Sub-themes with bonus prizes.
-- Scaler AI Labs. Multi-App RL Environment for Enterprise Workflows: Create RL
-environments to demonstrate complex workflows, business rule nuances, etc. in
-a large enterprise
-## #3.2 Personalized Tasks
-Here we will develop an environment that offers real personalized task handling:
-imagine replying to personal messages, resolving dinner plans that clash with
-work commitments, or replying to tough emails. Think of any personal assistant task.
-Expected Outcome: An environment that gives the model a realistic simulation of
-handling personal tasks, conflicts and managing them as delegations
-Example environments: Executive Assistant Meeting Planner, Dinner and drive
-planning, email and message replying, shopping, etc
-Sub-themes with bonus prizes.
-- Patronus AI. Consumer Workflows with Schema Drift: Multi-step consumer
-workflow environments where the underlying data schemas, API contracts, and
-t&cs/policies/rules change.
-Theme #4 - Self-Improvement
-The focus here is to create environments where agents can learn to generate new
-challenges, escalate difficulty, and improve through self-play or adaptive curricula.
-Rather than optimizing fixed tasks, the goal is for agents to learn to drive their own
-capability growth. The objective is recursive skill amplification.
-Expected Outcome: an environment for improving self-play of an LLM over a defined
-set of tasks
-Example environments: Self-play negotiation arenas, auto-generated math/proof
-tasks, evolving coding competitions, adaptive RL curricula.
-Sub-themes with bonus prizes.
-- Snorkel AI. Simulated Experts-in-the-Loop: Environment that simulates
-interactions with real subject-matter experts, with changing requirements /
-preferences.
-## Theme #5: Wild Card - Impress Us!
-We do not want to limit your focus. If your idea doesn’t fit the boxes above, we want
-and WILL reward out-of-the-box tasks. Please be creative, but remember to submit
-work that meaningfully adds value to LLM training on a certain task.
-Guidelines for Problem Statement
-● It is NOT mandatory to choose the same problem statement as Round 1. Only
-choose the same problem statement if it aligns with the above provided
-Hackathon themes.
-● You can start working on your problem statement once you have finalized it.
-Post-training can be done onsite on 25th & 26th when you receive compute
-credits for HuggingFace.
-● Before the onsite, we suggest you work on building the environment, agent
-behaviours, reward model and evaluate if your work aligns with the judging
-criteria given below.
-## Judging Criteria
-Minimum requirements:
-● Usage of OpenEnv (latest release)
-● Show a minimal training script for your environment using Unsloth or HF TRL in Colab
-● Write a mini-blog on HuggingFace or mini-video on YouTube talking about your
-submission, <2 minutes
-## First Round Judging Overview
-● Pitch Format: Each team has 3 minutes to pitch, followed by 2 minutes for
-Q&A (5 minutes total).
-● Evaluation: Teams will be scored based on the following criteria:
-- Environment Innovation (40%): Is the environment novel, creative, or
-challenging? Does it meaningfully test the agent’s behavior?
-- Storytelling (30%): Does the team clearly explain the problem,
-environment, and agent behavior? Is the demo engaging and easy to
-follow?
-- Showing Improvement in Rewards (20%): Does the demo provide
-observable evidence of training progress (reward curves, metrics, or
-before/after behavior)?
-- Reward and Training Script/Pipeline Setup (10%): Is the reward logic
-coherent, and does the pipeline produce meaningful improvement in
-the agent’s inference (how it acts in the environment)?
-Each evaluator will judge about 10-15 teams during the judging process,
-submitting scores individually for each team. Once scores are submitted, the
-Cerebral Valley team will aggregate your scores with the other judges' scores to
-determine the top 15 finalist projects.
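The weighting above (40% / 30% / 20% / 10%) implies a simple weighted sum per judge. The sketch below works through that arithmetic. The 0-10 scoring scale, the dictionary keys, and the use of a plain mean to combine judges are assumptions made for illustration; the guidelines do not state how scores are scaled or aggregated.

```python
from statistics import mean
from typing import Dict, List

# Weights taken from the judging criteria; key names are illustrative.
WEIGHTS: Dict[str, float] = {
    "environment_innovation": 0.40,
    "storytelling": 0.30,
    "reward_improvement": 0.20,
    "reward_pipeline": 0.10,
}


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine one judge's per-criterion scores into a single number."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def aggregate(judge_scores: List[Dict[str, float]]) -> float:
    """Assumed aggregation: the plain mean of each judge's weighted score."""
    return mean(weighted_score(s) for s in judge_scores)
```

For example, scores of 8, 6, 5, and 10 on the four criteria give 0.4·8 + 0.3·6 + 0.2·5 + 0.1·10 = 7.0, which shows how heavily the innovation and storytelling criteria dominate the total.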

train_hf.py DELETED Viewed

@@ -1,438 +0,0 @@
-#!/usr/bin/env python3
-"""
-QuantHive — HF Jobs GRPO Training Script
-=========================================
-Standalone script to fine-tune Qwen 2.5-1.5B on the multi-agent trading
-environment using GRPO.  Designed to run on HuggingFace Jobs (A10G / A100).
-Usage (local):
-    python train_hf.py
-Usage (HF Jobs):
-    hf jobs run --hardware a10g-small -- python train_hf.py
-The script:
- 1. Generates scenarios from the PettingZoo multi-agent env
- 2. Trains with GRPO + 5 governance-aware verifiers
- 3. Saves LoRA adapters + merged model
- 4. Logs sample outputs so you can see the <thought> reasoning
- 5. Generates training plots and pushes everything to the HF Hub
-"""
-from __future__ import annotations
-import inspect
-import json
-import os
-import random
-import shutil
-import sys
-from pathlib import Path
-import numpy as np
-# ── Unsloth JIT-compilation bypass (prevents AttributeError on cloud) ─────────
-os.environ["UNSLOTH_DISABLE_COMPILE"] = "1"
-os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
-os.environ["DISABLE_UNSLOTH_COMPILE"] = "1"
-os.environ["OPENBLAS_NUM_THREADS"] = "1"
-os.environ["OMP_NUM_THREADS"] = "1"
-# Delete compiled cache if it exists
-cache_dir = Path("unsloth_compiled_cache")
-if cache_dir.exists():
-    shutil.rmtree(cache_dir, ignore_errors=True)
-    print("🗑️  Deleted unsloth_compiled_cache/")
-# ── Ensure project root is importable ─────────────────────────────────────────
-ROOT = Path(__file__).resolve().parent
-if str(ROOT) not in sys.path:
-    sys.path.insert(0, str(ROOT))
-# ═══════════════════════════════════════════════════════════════════════════════
-#  CONFIGURATION — Edit these for your run
-# ═══════════════════════════════════════════════════════════════════════════════
-MODEL_NAME       = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"
-OUTPUT_DIR       = "models/grpo_hf_trained"
-HF_REPO_ID       = "ARKAISW/QuantHive-GRPO-Trader"  # Where to push the model
-# Training hyperparameters
-NUM_SCENARIOS         = 800          # More diverse scenarios
-MAX_STEPS             = 500          # 2x longer than Kaggle run
-BATCH_SIZE            = 4
-GRAD_ACCUM_STEPS      = 2
-NUM_GENERATIONS       = 8            # 8 candidates per prompt (better GRPO signal)
-LEARNING_RATE         = 1e-5
-MAX_SEQ_LENGTH        = 1024
-MAX_PROMPT_LENGTH     = 768
-MAX_COMPLETION_LENGTH = 64
-SAVE_STEPS            = 100
-LOGGING_STEPS         = 1
-DIFFICULTY            = "easy"       # "easy", "medium", "hard"
-SEED                  = 3407
-# Sample output logging
-NUM_SAMPLE_OUTPUTS    = 10           # How many sample outputs to log after training
-def main():
-    random.seed(SEED)
-    np.random.seed(SEED)
-    # ── Step 1: Verify GPU availability ───────────────────────────────────────
-    print("=" * 60)
-    print("  QuantHive — Multi-Agent GRPO Training (HF Jobs)")
-    print("=" * 60)
-    import torch
-    if not torch.cuda.is_available():
-        raise SystemExit("❌ CUDA not available. Use GPU hardware.")
-    print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
-    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
-    # ── Step 2: Generate scenarios ────────────────────────────────────────────
-    from training.prompt_utils import (
-        SYSTEM_PROMPT,
-        build_prompt_multiagent,
-        generate_pz_scenarios,
-    )
-    print(f"\n📊 Generating {NUM_SCENARIOS} scenarios (difficulty={DIFFICULTY})...")
-    scenarios = generate_pz_scenarios(
-        n=NUM_SCENARIOS, difficulty=DIFFICULTY, max_env_steps=100
-    )
-    print(f"   Generated {len(scenarios)} scenarios.")
-    from datasets import Dataset
-    prompts = [{"prompt": build_prompt_multiagent(sc)} for sc in scenarios]
-    dataset = Dataset.from_list(prompts)
-    # ── Step 3: Load model natively via Transformers/PEFT ─────────────────────
-    print(f"\n🤖 Loading model natively: {MODEL_NAME}")
-    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-    from peft import get_peft_model, LoraConfig, TaskType
-    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
-    if tokenizer.pad_token is None:
-        tokenizer.pad_token = tokenizer.eos_token
-    bnb_config = BitsAndBytesConfig(
-        load_in_4bit=True,
-        bnb_4bit_quant_type="nf4",
-        bnb_4bit_use_double_quant=True,
-        bnb_4bit_compute_dtype=torch.float16,
-    )
-    model = AutoModelForCausalLM.from_pretrained(
-        MODEL_NAME,
-        quantization_config=bnb_config,
-        device_map="auto",
-        dtype=torch.float16,
-        trust_remote_code=True,
-    )
-    peft_config = LoraConfig(
-        task_type=TaskType.CAUSAL_LM,
-        r=16,
-        lora_alpha=16,
-        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
-        bias="none",
-    )
-    model = get_peft_model(model, peft_config)
-    # 🚀 Surgical DType Lock (Hard Force FP16 after PEFT wrap)
-    model = model.to(torch.float16)
-    if hasattr(model, "lm_head"):
-        model.lm_head.weight.data = model.lm_head.weight.data.to(torch.float16)
-        if getattr(model.lm_head, "bias", None) is not None:
-            model.lm_head.bias.data = model.lm_head.bias.data.to(torch.float16)
-        model.lm_head.to(torch.float16)
-    if hasattr(model, "model") and hasattr(model.model, "embed_tokens"):
-        model.model.embed_tokens.weight.data = model.model.embed_tokens.weight.data.to(torch.float16)
-        model.model.embed_tokens.to(torch.float16)
-    # 🐛 Fix GRPOTrainer crash by injecting warnings_issued dict
-    if not hasattr(model, "warnings_issued"):
-        model.warnings_issued = {}
-    print("   Native model loaded + LoRA applied.")
-    # ── Step 5: Build trainer ─────────────────────────────────────────────────
-    from trl.trainer.grpo_config import GRPOConfig
-    # 🐛 Fix llm_blender crashing on modern transformers by injecting missing cache var
-    import transformers.utils.hub
-    if not hasattr(transformers.utils.hub, "TRANSFORMERS_CACHE"):
-        try:
-            transformers.utils.hub.TRANSFORMERS_CACHE = transformers.utils.hub.constants.HF_HUB_CACHE
-        except AttributeError:
-            transformers.utils.hub.TRANSFORMERS_CACHE = "/tmp"
-    from trl.trainer.grpo_trainer import GRPOTrainer
-    from env.reward import (
-        alignment_reward_func,
-        format_reward_func,
-        profit_reward_func,
-    )
-    from training.grpo_verifiers_multiagent import (
-        governance_reward_func_multiagent,
-        risk_reward_func_multiagent,
-    )
-    training_args = GRPOConfig(
-        output_dir=OUTPUT_DIR,
-        learning_rate=LEARNING_RATE,
-        per_device_train_batch_size=BATCH_SIZE,
-        gradient_accumulation_steps=GRAD_ACCUM_STEPS,
-        num_train_epochs=1,
-        max_steps=MAX_STEPS,
-        save_steps=SAVE_STEPS,
-        logging_steps=LOGGING_STEPS,
-        bf16=False,
-        fp16=False,
-        max_grad_norm=0.5,
-        max_prompt_length=MAX_PROMPT_LENGTH,
-        max_completion_length=MAX_COMPLETION_LENGTH,
-        num_generations=NUM_GENERATIONS,
-        report_to="none",
-    )
-    reward_funcs = [
-        format_reward_func,
-        alignment_reward_func,
-        risk_reward_func_multiagent,
-        profit_reward_func,
-        governance_reward_func_multiagent,
-    ]
-    trainer_kwargs = {
-        "model": model,
-        "reward_funcs": reward_funcs,
-        "args": training_args,
-        "train_dataset": dataset,
-    }
-    sig = inspect.signature(GRPOTrainer.__init__)
-    if "processing_class" in sig.parameters:
-        trainer_kwargs["processing_class"] = tokenizer
-    elif "tokenizer" in sig.parameters:
-        trainer_kwargs["tokenizer"] = tokenizer
-    # ── Step 5.5: Verify DTypes before Trainer ───────────────────────────────
-    print(f"📊 DType Check: lm_head={model.lm_head.weight.dtype}, embed={model.model.embed_tokens.weight.dtype}")
-    trainer = GRPOTrainer(**trainer_kwargs)
-    # ── Step 6: Train! ────────────────────────────────────────────────────────
-    print(f"\n🚀 Starting GRPO training — {MAX_STEPS} steps, {NUM_GENERATIONS} generations/prompt")
-    print(f"   Effective batch size: {BATCH_SIZE} × {GRAD_ACCUM_STEPS} × 1 GPU = {BATCH_SIZE * GRAD_ACCUM_STEPS}")
-    print()
-    trainer.train()
-    print("\n✅ Training complete!")
-    # ── Step 7: Extract metrics ───────────────────────────────────────────────
-    history = trainer.state.log_history
-    rewards = [x["reward"] for x in history if "reward" in x]
-    losses = [x.get("loss", 0.0) for x in history if "reward" in x]
-    os.makedirs(OUTPUT_DIR, exist_ok=True)
-    metrics_path = Path(OUTPUT_DIR) / "training_metrics.json"
-    with open(metrics_path, "w") as f:
-        json.dump({"rewards": rewards, "losses": losses, "log_history": history}, f, indent=2, default=str)
-    print(f"📈 Metrics saved to {metrics_path}")
-    # ── Step 8: Generate sample outputs (CRITICAL for judge review) ───────────
-    print(f"\n📝 Generating {NUM_SAMPLE_OUTPUTS} sample outputs from trained model...")
-    model.eval()
-    sample_outputs = []
-    for i in range(min(NUM_SAMPLE_OUTPUTS, len(scenarios))):
-        prompt_text = build_prompt_multiagent(scenarios[i])
-        messages = [
-            {"role": "system", "content": SYSTEM_PROMPT},
-            {"role": "user", "content": prompt_text},
-        ]
-        input_ids = tokenizer.apply_chat_template(
-            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
-        ).to(model.device)
-        output_ids = model.generate(
-            input_ids=input_ids,
-            max_new_tokens=MAX_COMPLETION_LENGTH,
-            temperature=0.7,
-            top_p=0.9,
-            do_sample=True,
-        )
-        response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
-        sample_outputs.append({
-            "scenario_idx": i,
-            "rm_size_limit": scenarios[i]["rm_size_limit"],
-            "pm_cap_alloc": scenarios[i]["pm_cap_alloc"],
-            "model_output": response,
-        })
-        print(f"\n{'─' * 60}")
-        print(f"  Sample {i+1} | RM limit={scenarios[i]['rm_size_limit']:.2f} | PM cap={scenarios[i]['pm_cap_alloc']:.2f}")
-        print(f"{'─' * 60}")
-        print(response[:500])
-    samples_path = Path(OUTPUT_DIR) / "sample_outputs.json"
-    with open(samples_path, "w") as f:
-        json.dump(sample_outputs, f, indent=2, ensure_ascii=False)
-    print(f"\n💾 Sample outputs saved to {samples_path}")
-    # ── Step 9: Generate plots ────────────────────────────────────────────────
-    print("\n📊 Generating training plots...")
-    try:
-        import matplotlib
-        matplotlib.use("Agg")
-        import matplotlib.pyplot as plt
-        os.makedirs("plots", exist_ok=True)
-        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
-        fig.suptitle("QuantHive Multi-Agent GRPO Training — Qwen 2.5 1.5B", fontsize=14)
-        # Loss curve
-        steps = list(range(1, len(losses) + 1))
-        axes[0].plot(steps, losses, alpha=0.4, color="salmon", label="Raw")
-        if len(losses) >= 20:
-            ma = np.convolve(losses, np.ones(20)/20, mode="valid")
-            axes[0].plot(range(20, len(losses)+1), ma, color="red", linewidth=2, label="MA-20")
-        axes[0].set_xlabel("Training Step")
-        axes[0].set_ylabel("Loss")
-        axes[0].set_title("GRPO Training Loss")
-        axes[0].legend()
-        # Reward curve
-        axes[1].plot(steps, rewards, alpha=0.4, color="lightgreen", label="Raw")
-        if len(rewards) >= 20:
-            ma = np.convolve(rewards, np.ones(20)/20, mode="valid")
-            axes[1].plot(range(20, len(rewards)+1), ma, color="green", linewidth=2, label="MA-20")
-        axes[1].set_xlabel("Training Step")
-        axes[1].set_ylabel("Mean Reward")
-        axes[1].set_title("GRPO Mean Reward (5 Verifiers)")
-        axes[1].legend()
-        plt.tight_layout()
-        fig.savefig("plots/hf_training_curves.png", dpi=150, bbox_inches="tight")
-        plt.close()
-        print("   Saved plots/hf_training_curves.png")
-        # ── Baseline comparison bar chart ─────────────────────────────────────
-        # Evaluate trained model vs random baseline on 20 scenarios
-        print("   Generating baseline comparison...")
-        eval_scenarios = scenarios[:20]
-        trained_scores = {
-            "Format": [], "Alignment": [], "Risk": [], "Profit": [], "Governance": []
-        }
-        baseline_scores = {
-            "Format": [], "Alignment": [], "Risk": [], "Profit": [], "Governance": []
-        }
-        for sc in eval_scenarios:
-            prompt_text = build_prompt_multiagent(sc)
-            # Trained model output
-            messages = [
-                {"role": "system", "content": SYSTEM_PROMPT},
-                {"role": "user", "content": prompt_text},
-            ]
-            input_ids = tokenizer.apply_chat_template(
-                messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
-            ).to(model.device)
-            out = model.generate(input_ids=input_ids, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True)
-            completion = tokenizer.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True)
-            # Random baseline: well-formed JSON action with a random direction and size
-            random_completion = '{"direction": ' + str(random.choice([0,1,2])) + ', "size": ' + f"{random.random():.2f}" + ', "sl": 0, "tp": 0}'
-            # Score both
-            for name, func in zip(
-                ["Format", "Alignment", "Risk", "Profit", "Governance"],
-                reward_funcs
-            ):
-                t_score = func([prompt_text], [completion])[0]
-                b_score = func([prompt_text], [random_completion])[0]
-                trained_scores[name].append(t_score)
-                baseline_scores[name].append(b_score)
-        # Plot
-        fig2, ax2 = plt.subplots(figsize=(10, 6))
-        verifiers = list(trained_scores.keys())
-        x = np.arange(len(verifiers))
-        width = 0.35
-        trained_means = [np.mean(trained_scores[v]) for v in verifiers]
-        baseline_means = [np.mean(baseline_scores[v]) for v in verifiers]
-        bars1 = ax2.bar(x - width/2, baseline_means, width, label="Random Baseline", color="#ff6b6b", alpha=0.85)
-        bars2 = ax2.bar(x + width/2, trained_means, width, label="GRPO-Trained", color="#51cf66", alpha=0.85)
-        ax2.set_ylabel("Mean Score")
-        ax2.set_xlabel("Reward Verifier")
-        ax2.set_title("QuantHive: Trained Agent vs Random Baseline")
-        ax2.set_xticks(x)
-        ax2.set_xticklabels(verifiers)
-        ax2.legend()
-        ax2.set_ylim(0, 1.1)
-        for bar in bars1:
-            ax2.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,
-                    f'{bar.get_height():.2f}', ha='center', va='bottom', fontsize=10)
-        for bar in bars2:
-            ax2.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,
-                    f'{bar.get_height():.2f}', ha='center', va='bottom', fontsize=10)
-        fig2.savefig("plots/hf_baseline_vs_trained.png", dpi=150, bbox_inches="tight")
-        plt.close()
-        print("   Saved plots/hf_baseline_vs_trained.png")
-    except Exception as e:
-        print(f"   ⚠️  Could not generate plots: {e}")
-    # ── Step 10: Save model ───────────────────────────────────────────────────
-    print(f"\n💾 Saving model to {OUTPUT_DIR}...")
-    model.save_pretrained(OUTPUT_DIR)
-    tokenizer.save_pretrained(OUTPUT_DIR)
-    # ── Step 11: Push to HF Hub (optional) ────────────────────────────────────
-    try:
-        from huggingface_hub import HfApi
-        api = HfApi()
-        print(f"\n🚀 Pushing model to {HF_REPO_ID}...")
-        api.upload_folder(
-            folder_path=OUTPUT_DIR,
-            repo_id=HF_REPO_ID,
-            repo_type="model",
-            create_pr=False,
-        )
-        print(f"   ✅ Model pushed to https://huggingface.co/{HF_REPO_ID}")
-        # Also push the plots
-        for plot_file in Path("plots").glob("hf_*.png"):
-            api.upload_file(
-                path_or_fileobj=str(plot_file),
-                path_in_repo=f"plots/{plot_file.name}",
-                repo_id=HF_REPO_ID,
-                repo_type="model",
-            )
-            print(f"   📊 Uploaded {plot_file.name}")
-    except Exception as e:
-        print(f"   ⚠️  Could not push to HF Hub: {e}")
-        print(f"   You can manually push later with: huggingface-cli upload {HF_REPO_ID} {OUTPUT_DIR}")
-    print("\n" + "=" * 60)
-    print("  ✅ QuantHive GRPO Training Complete!")
-    print(f"  📁 Model: {OUTPUT_DIR}")
-    print(f"  📊 Plots: plots/hf_training_curves.png, plots/hf_baseline_vs_trained.png")
-    print(f"  📝 Samples: {OUTPUT_DIR}/sample_outputs.json")
-    print("=" * 60)
-if __name__ == "__main__":
-    main()
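The script imports its verifiers (for example `format_reward_func`) from modules that are not shown in this diff. As a rough illustration of the call shape it relies on, `func(prompts, completions)` returning one float per completion, here is a minimal sketch of a format check against the JSON action schema used by the random baseline (`direction`, `size`, `sl`, `tp`). The function name `sketch_format_reward`, the 1.0 / 0.5 / 0.0 scoring, and the assumption that a completion is bare JSON with `size` in [0, 1] are hypothetical and are not taken from `env.reward`.

```python
import json
from typing import Any, List

REQUIRED_KEYS = ("direction", "size", "sl", "tp")


def _is_number(value: Any) -> bool:
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def sketch_format_reward(prompts: List[str], completions: List[str], **kwargs: Any) -> List[float]:
    """Score each completion: 0.0 unparseable, 0.5 parseable but out of range, 1.0 valid."""
    scores: List[float] = []
    for text in completions:
        try:
            action = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            scores.append(0.0)  # not JSON at all
            continue
        if not isinstance(action, dict) or any(k not in action for k in REQUIRED_KEYS):
            scores.append(0.0)  # JSON, but not the expected object shape
            continue
        valid_direction = action["direction"] in (0, 1, 2)
        valid_size = _is_number(action["size"]) and 0.0 <= action["size"] <= 1.0
        scores.append(1.0 if valid_direction and valid_size else 0.5)
    return scores
```

A verifier of this shape can be scored on a single prompt and completion pair in the same way the baseline comparison loop does, by passing one-element lists and reading index 0 of the result.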

visualization.md DELETED Viewed

@@ -1,316 +0,0 @@
-# 🎮 UI Design Specification — Cutesy Quant Firm Simulation
-## Overview
-This module defines a **2D indie-style visualization layer** for the Multi-Agent RL Trading Environment.
-The goal is to transform abstract agent interactions, trading decisions, and reward signals into a **visually intuitive, engaging simulation** resembling a small quant firm office.
-This directly supports:
-* Multi-agent interaction clarity
-* Reward and learning visualization
-* Storytelling for demo and judging
----
-## 🧠 Core Concept
-A **“living office” simulation** where:
-* Each AI agent is represented as a character
-* Agents communicate via visible messages
-* Decisions affect a shared portfolio
-* Learning is visualized over time
----
-## 🎨 Art Style Specification
-### Style Choice
-* 2D pixel-art / stylized indie aesthetic
-* Soft pastel color palette
-* Minimal but expressive character design
-### Rationale
-* Pixel art is widely used for clarity and simplicity in 2D systems ([gamemaker.io][1])
-* It provides a **cozy, interpretable visual layer** rather than overwhelming realism
-* Low-resolution sprites enhance readability and system understanding
-### Style Rules
-Define consistently:
-* Resolution (e.g., 32x32 or 64x64 sprites)
-* Color palette (role-based colors)
-* Outline thickness
-* Animation frame count
-* Character proportions
-A consistent style guide improves visual coherence and scalability ([Sprite-AI][2])
----
-## 🏢 Office Layout
-### Structure
-```
-┌────────────────────────────┐
-│        📈 Balance Panel     │
-│                            │
-│  🧠 Researcher     💻 Trader │
-│                            │
-│  📊 Risk Modeler   👑 PM    │
-│                            │
-│        📉 Chart Panel       │
-└────────────────────────────┘
-```
----
-### Zones
-1. **Top Panel**
-   * Portfolio balance
-   * Live PnL indicator
-2. **Agent Floor**
-   * Each agent at a fixed workstation
-   * Communication visible between agents
-3. **Bottom Panel**
-   * Market chart
-   * Trade markers
----
-## 🤖 Agent Representation
-Each agent is visualized as:
-* A small animated character (sprite)
-* A workstation (desk + monitor)
-* A role-specific color theme
----
-### Agent Roles
-#### Quant Researcher
-* Visual cues: charts, floating indicators
-* Behavior: signal generation
-#### Trader
-* Visual cues: multiple monitors
-* Behavior: executes trades
-#### Risk Modeler
-* Visual cues: warning icons
-* Behavior: restricts exposure
-#### Portfolio Manager
-* Visual cues: elevated seat / calm posture
-* Behavior: override authority
----
-## 💬 Communication System
-### Objective
-To visually demonstrate **multi-agent reasoning and coordination**, as required by the theme.
----
-### Implementation
-* Speech bubbles above agents
-* Message transitions between agents
-* Short-term visible history
----
-### Example Flow
-```
-Researcher → "RSI oversold, bullish bias"
-Risk → "Volatility high, reduce size"
-Trader → "Executing reduced position"
-PM → "Approved"
-```
----
-### Design Notes
-* Messages should be concise
-* Fade after a short duration
-* Color-coded by agent
----
-## 📈 Trading Visualization
-### Balance Panel (Top Right)
-Displays:
-* Portfolio value (live)
-* PnL change
-Animations:
-* Green pulse → profit
-* Red flash → loss
----
-### Chart Panel
-Displays:
-* Price time series
-* Trade markers:
-  * Buy → green marker
-  * Sell → red marker
----
-### Metrics Panel
-All values normalized to [0, 1]:
-* Reward
-* Grade
-* Drawdown
-* Sharpe proxy
----
-## 🧠 Learning Visualization
-### Objective
-Clearly demonstrate **agent improvement over time**
----
-### Features
-#### 1. Before vs After Toggle
-* Pre-training behavior
-* Post-training behavior
----
-#### 2. Performance Graphs
-* Reward vs episode
-* Grade vs episode
-* Drawdown trend
----
-#### 3. Feedback Animation
-* Good trade → green highlight
-* Bad trade → red highlight
----
-## ⚙️ System Modes
-### Fast Mode
-* No animations
-* No API calls
-* Used for debugging
----
-### Demo Mode
-* Full UI enabled
-* All agents active
-* Communication visible
----
-## 🔌 Backend → UI Interface
-### API Contract
-```json
-{
-  "agents": [
-    {
-      "name": "Trader",
-      "message": "Executing buy",
-      "confidence": 0.78
-    }
-  ],
-  "portfolio": {
-    "value": 102000,
-    "pnl": 2000
-  },
-  "metrics": {
-    "reward": 0.72,
-    "grade": 0.68
-  },
-  "trades": []
-}
-```
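A UI client would likely want to check incoming payloads against this contract before rendering them. The sketch below is one way to do that under stated assumptions: the top-level keys and agent fields come from the example above, the [0, 1] bound on metrics follows the Metrics Panel note, and the same bound on `confidence` is a guess. The helper name `validate_payload` is hypothetical.

```python
import json
from typing import Any, Dict, List

TOP_LEVEL_KEYS = ("agents", "portfolio", "metrics", "trades")


def validate_payload(raw: str) -> Dict[str, Any]:
    """Parse a backend payload and raise ValueError listing every contract violation."""
    data = json.loads(raw)
    errors: List[str] = []

    for key in TOP_LEVEL_KEYS:
        if key not in data:
            errors.append(f"missing top-level key: {key}")

    for index, agent in enumerate(data.get("agents", [])):
        if not {"name", "message"} <= set(agent):
            errors.append(f"agent {index} lacks name or message")
        confidence = agent.get("confidence")
        # Assumption: confidence, when present, is a probability-like value.
        if confidence is not None and not 0.0 <= confidence <= 1.0:
            errors.append(f"agent {index} confidence out of [0, 1]: {confidence}")

    # Metrics are specified as normalized to [0, 1].
    for name, value in data.get("metrics", {}).items():
        if not 0.0 <= value <= 1.0:
            errors.append(f"metric {name} out of [0, 1]: {value}")

    if errors:
        raise ValueError("; ".join(errors))
    return data
```

Collecting all violations before raising, rather than failing on the first, makes a malformed payload easier to debug from a single log line.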
----
-## 🧠 Design Principles
-1. **Clarity over realism**
-   Visuals must explain behavior, not just look good
-2. **State visibility**
-   Every important decision should be observable
-3. **Agent identity**
-   Each agent must feel distinct
-4. **Learning visibility**
-   Improvement must be obvious without explanation
----
-## 🎯 Success Criteria
-The UI is successful if:
-* A viewer can understand agent interaction without reading code
-* Decisions and conflicts are visually clear
-* Learning progression is observable
-* The system feels alive and coordinated
----
-## 🚀 Final Note
-The UI is not just decoration — it is a **core storytelling layer**.
-It should communicate:
-> “This is a system of agents learning, collaborating, and improving under constraints.”
----
-[1]: https://gamemaker.io/en/blog/2d-game-art-styles?utm_source=chatgpt.com "The Ultimate Guide To 2D Video Game Art Styles"
-[2]: https://www.sprite-ai.art/blog/2d-pixel-art-style-guide?utm_source=chatgpt.com "2D pixel art style guide for games [with examples]"