Spaces: ARKAISW/QuantHive

ARKAISW committed on Apr 25

Commit: a3c00eb
Parent: 30a586b

Update training notebook and verifiers

Files changed (3)

mate_training.ipynb +318 -424
training/grpo_verifiers_multiagent.py +136 -0
training/train_grpo_multiagent.py +42 -165

mate_training.ipynb CHANGED

@@ -2,468 +2,415 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "header",
    "metadata": {},
    "source": [
-    "# \ud83c\udfdb\ufe0f QuantHive \u2014 Multi-Agent GRPO Training Notebook\n",
     "\n",
-    "**Re-runnable training pipeline for the OpenEnv April '26 Hackathon.**\n",
-    "\n",
-    "This notebook is the single deliverable that proves QuantHive's training story end-to-end:\n",
-    "\n",
-    "| Section | What It Does | GPU? |\n",
-    "|:---|:---|:---|\n",
-    "| \u00a71-2 | Install deps, clone repo | \u274c |\n",
-    "| \u00a73 | Validate PettingZoo AEC env (3 agents, obs shapes, governance) | \u274c |\n",
-    "| \u00a74 | Run PettingZoo API compliance test | \u274c |\n",
-    "| \u00a75 | Multi-agent REINFORCE training (rule-based warm-up) | \u274c |\n",
-    "| \u00a76 | Generate per-agent reward/loss plots | \u274c |\n",
-    "| \u00a77 | Preview governance-aware GRPO prompt | \u274c |\n",
-    "| \u00a78 | **GRPO training** \u2014 Qwen 2.5-1.5B via Unsloth | \u2705 T4 |\n",
-    "| \u00a79 | Display all committed training evidence | \u274c |\n",
-    "\n",
-    "**Architecture**: PettingZoo AEC with 3 independent RL agents:\n",
-    "- `risk_manager_0` \u2014 obs: 24D, act: Box(3) \u2014 rewarded for restricting risk\n",
-    "- `portfolio_manager_0` \u2014 obs: 27D, act: Box(2) \u2014 rewarded for portfolio grade\n",
-    "- `trader_0` \u2014 obs: 29D, act: Dict \u2014 rewarded for PnL + compliance\n",
-    "\n",
-    "Turn order: **RM \u2192 PM \u2192 Trader** per market cycle. Each agent's output becomes part of the next agent's observation."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec1_header",
    "metadata": {},
    "source": [
-    "---\n",
-    "## 1. Install Dependencies\n",
     "\n",
-    "Run on Google Colab (T4 GPU recommended for \u00a78). All other sections work on CPU."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "install_deps",
    "metadata": {},
    "outputs": [],
    "source": [
-    "%%capture\n",
-    "# Core environment\n",
-    "%pip install pettingzoo>=1.24.0 gymnasium numpy pandas matplotlib scipy\n",
-    "\n",
-    "# Data sources\n",
-    "%pip install yfinance ccxt\n",
-    "\n",
-    "# ML / Training\n",
-    "%pip install torch transformers trl peft accelerate datasets safetensors\n",
     "\n",
-    "# Server (not needed for training, but imported by some modules)\n",
-    "%pip install fastapi uvicorn python-dotenv openai aiohttp\n",
     "\n",
-    "# Unsloth for GRPO (uncomment for GPU training in \u00a78)\n",
-    "# %pip install unsloth"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec2_header",
    "metadata": {},
    "source": [
-    "---\n",
-    "## 2. Clone Repository"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "clone_repo",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import os, sys\n",
-    "from pathlib import Path\n",
-    "\n",
-    "REPO_URL = \"https://huggingface.co/spaces/ARKAISW/QuantHive\"\n",
-    "REPO_DIR = Path(\"/content/QuantHive\")\n",
-    "\n",
-    "if not REPO_DIR.exists():\n",
-    "    !git clone {REPO_URL} {REPO_DIR}\n",
-    "\n",
-    "%cd {REPO_DIR}\n",
-    "if str(REPO_DIR) not in sys.path:\n",
-    "    sys.path.insert(0, str(REPO_DIR))\n",
     "\n",
-    "print(f\"Working directory: {Path.cwd()}\")\n",
-    "print(f\"Python: {sys.version}\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec3_header",
    "metadata": {},
    "source": [
-    "---\n",
-    "## 3. Validate PettingZoo Multi-Agent Environment\n",
     "\n",
-    "The environment declared in `openenv.yaml` is `env.multi_agent_env:MultiAgentTradingEnv` \u2014 a PettingZoo AEC env with 3 agents that negotiate via inter-agent message passing."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "validate_env_setup",
    "metadata": {},
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "from env.multi_agent_env import (\n",
-    "    MultiAgentTradingEnv,\n",
-    "    RISK_MANAGER,\n",
-    "    PORTFOLIO_MGR,\n",
-    "    TRADER,\n",
     "    ALL_AGENTS,\n",
     "    BASE_OBS_SIZE,\n",
-    "    RM_MSG_SIZE,\n",
     "    PM_MSG_SIZE,\n",
     ")\n",
     "\n",
     "env = MultiAgentTradingEnv(difficulty=\"easy\", max_steps=50)\n",
-    "env.reset()\n",
-    "\n",
-    "print(\"=\"*60)\n",
-    "print(\"  QuantHive \u2014 Multi-Agent Trading Environment\")\n",
-    "print(\"=\"*60)\n",
-    "print(f\"  Agents:       {env.agents}\")\n",
-    "print(f\"  Turn order:   RM \u2192 PM \u2192 Trader\")\n",
-    "print(f\"  RM  obs:      {env.observe(RISK_MANAGER).shape}  (base: {BASE_OBS_SIZE})\")\n",
-    "print(f\"  PM  obs:      {env.observe(PORTFOLIO_MGR).shape}  (base + RM msg: {BASE_OBS_SIZE}+{RM_MSG_SIZE})\")\n",
-    "print(f\"  Trader obs:   {env.observe(TRADER).shape}  (base + RM + PM: {BASE_OBS_SIZE}+{RM_MSG_SIZE}+{PM_MSG_SIZE})\")\n",
-    "print(f\"  RM action:    Box(3) \u2014 [size_limit, allow_new, force_reduce]\")\n",
-    "print(f\"  PM action:    Box(2) \u2014 [cap_alloc, override_strength]\")\n",
-    "print(f\"  Trader action: Dict \u2014 {{direction, size, sl, tp}}\")"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "validate_aec_cycle",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Run one full AEC cycle: RM \u2192 PM \u2192 Trader\n",
-    "# RM sets a tight 20% size limit, allows new positions, no force-reduce\n",
     "rm_action = np.array([0.20, 1.0, 0.0], dtype=np.float32)\n",
     "env.step(rm_action)\n",
-    "print(f\"\u2705 RM acted: size_limit=0.20, allow_new=yes, force_reduce=no\")\n",
     "\n",
-    "# PM allocates 50% capital, no override\n",
-    "pm_action = np.array([0.50, 0.0], dtype=np.float32)\n",
     "env.step(pm_action)\n",
-    "print(f\"\u2705 PM acted: cap_alloc=0.50, override=0.0\")\n",
     "\n",
-    "# Trader proposes a RECKLESS buy: size=0.85 (way over RM's 0.20 limit)\n",
     "trader_action = {\n",
     "    \"direction\": 1,\n",
-    "    \"size\": np.array([0.85], dtype=np.float32),\n",
     "    \"sl\": np.array([0.0], dtype=np.float32),\n",
     "    \"tp\": np.array([0.0], dtype=np.float32),\n",
     "}\n",
     "env.step(trader_action)\n",
-    "print(f\"\u2705 Trader acted: proposed BUY size=0.85\")\n",
-    "\n",
-    "# Inspect governance outcome\n",
-    "info = env.infos[TRADER]\n",
-    "gov = info[\"governance\"]\n",
-    "\n",
-    "print(f\"\\n{'='*60}\")\n",
-    "print(f\"  GOVERNANCE NEGOTIATION RESULT\")\n",
-    "print(f\"{'='*60}\")\n",
-    "print(f\"  Proposed:     direction={gov['proposed']['direction']}, size={gov['proposed']['size']:.2f}\")\n",
-    "print(f\"  Executed:     direction={gov['executed']['direction']}, size={gov['executed']['size']:.2f}\")\n",
-    "print(f\"  Compliant:    {gov['was_compliant']}\")\n",
-    "print(f\"  Interventions ({len(gov['interventions'])}):\") \n",
-    "for iv in gov[\"interventions\"]:\n",
-    "    print(f\"    \u2192 {iv['agent']}: {iv['type']}\")\n",
-    "\n",
-    "print(f\"\\n  Per-Agent Rewards:\")\n",
-    "for agent, r in info[\"rewards\"].items():\n",
-    "    print(f\"    {agent:25s} {r:+.4f}\")\n",
-    "\n",
-    "print(f\"\\n  Portfolio: ${info['portfolio_value']:,.2f}  |  PnL: {info['pnl_pct']:+.2%}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "sec4_header",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "## 4. PettingZoo API Compliance Test\n",
     "\n",
-    "PettingZoo ships a built-in compliance test that verifies our env follows the AEC protocol correctly."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "pz_api_test",
    "metadata": {},
    "outputs": [],
    "source": [
     "from pettingzoo.test import api_test\n",
     "\n",
-    "test_env = MultiAgentTradingEnv(difficulty=\"easy\", max_steps=50)\n",
-    "api_test(test_env, num_cycles=50, verbose_progress=True)\n",
-    "print(\"\\n\u2705 PettingZoo API compliance test PASSED (50 cycles)\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec5_header",
    "metadata": {},
    "source": [
-    "---\n",
-    "## 5. Multi-Agent REINFORCE Training (Rule-Based Policies)\n",
-    "\n",
-    "This is the CPU-friendly training path. Three rule-based policies (RM, PM, Trader) are trained using alternating optimization with REINFORCE-style policy gradients.\n",
     "\n",
-    "The alternating schedule:\n",
-    "- Episodes 0-9: optimize Trader (RM/PM frozen)\n",
-    "- Episodes 10-19: optimize Risk Manager (Trader/PM frozen)\n",
-    "- Repeat"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "reinforce_training",
    "metadata": {},
    "outputs": [],
    "source": [
     "from training.train_multi_agent import train\n",
     "\n",
     "metrics = train(\n",
-    "    n_episodes=100,\n",
-    "    max_steps_ep=200,\n",
     "    gamma=0.99,\n",
     "    alternating_freq=10,\n",
     "    difficulty=\"easy\",\n",
     "    output_dir=\"outputs/multi_agent\",\n",
-    "    save_every=50,\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "sec5b_header",
-   "metadata": {},
-   "source": [
-    "### 5.1 Curriculum Verification\n",
     "\n",
-    "Verify that the environment works across all difficulty levels."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "curriculum_check",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from training.train_multi_agent import collect_rollout, RuleRiskManagerPolicy, RulePortfolioManagerPolicy, RuleTraderPolicy\n",
     "\n",
     "policies = {\n",
-    "    RISK_MANAGER:  RuleRiskManagerPolicy(),\n",
     "    PORTFOLIO_MGR: RulePortfolioManagerPolicy(),\n",
-    "    TRADER:        RuleTraderPolicy(),\n",
     "}\n",
     "\n",
-    "print(f\"{'Difficulty':<12} {'Episodes':<10} {'Mean Trader Return':<22} {'Mean PnL':<15} {'Mean DD':<12}\")\n",
-    "print(\"-\" * 71)\n",
     "\n",
     "for diff in [\"easy\", \"medium\", \"hard\"]:\n",
     "    returns, pnls, dds = [], [], []\n",
     "    test_env = MultiAgentTradingEnv(difficulty=diff, max_steps=100)\n",
     "    for _ in range(10):\n",
     "        buffers, info = collect_rollout(test_env, policies, max_steps=100)\n",
-    "        trader_ret = float(np.mean(buffers[TRADER].discounted_returns()))\n",
-    "        returns.append(trader_ret)\n",
-    "        pnls.append(info.get(\"pnl_pct\", 0.0))\n",
-    "        dds.append(info.get(\"max_drawdown\", 0.0))\n",
-    "    print(f\"{diff:<12} {10:<10} {np.mean(returns):+.6f}            {np.mean(pnls):+.4%}        {np.mean(dds):.4%}\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec6_header",
    "metadata": {},
    "source": [
-    "---\n",
-    "## 6. Generate Per-Agent Reward & Loss Plots\n",
     "\n",
-    "Generates the required plots from training metrics and saves them to `plots/`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "generate_plots",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import json, matplotlib\n",
     "matplotlib.use(\"Agg\")\n",
     "import matplotlib.pyplot as plt\n",
     "\n",
     "metrics_path = Path(\"outputs/multi_agent/metrics_final.json\")\n",
     "if not metrics_path.exists():\n",
-    "    # Fallback: use the metrics dict from \u00a75 directly\n",
-    "    with open(metrics_path, \"w\") as f:\n",
-    "        json.dump(dict(metrics), f, indent=2)\n",
     "\n",
-    "with open(metrics_path) as f:\n",
-    "    m = json.load(f)\n",
     "\n",
     "episodes = m[\"episode\"]\n",
     "n_eps = len(episodes)\n",
     "print(f\"Loaded {n_eps} episodes from {metrics_path}\")\n",
     "\n",
-    "# \u2500\u2500 Per-Agent Reward Curves \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n",
-    "fig, ax = plt.subplots(figsize=(12, 6))\n",
     "\n",
-    "def smooth(vals, w=10):\n",
-    "    if len(vals) < w: return vals\n",
-    "    return np.convolve(vals, np.ones(w)/w, mode=\"valid\").tolist()\n",
-    "\n",
-    "w = max(1, n_eps // 15)\n",
-    "trader_s = smooth(m[\"trader_return\"], w)\n",
-    "rm_s     = smooth(m[\"rm_return\"],     w)\n",
-    "pm_s     = smooth(m[\"pm_return\"],     w)\n",
-    "ep_s     = episodes[:len(trader_s)]\n",
-    "\n",
-    "ax.plot(ep_s, trader_s, label=\"Trader\",            color=\"#2ecc71\", linewidth=2)\n",
-    "ax.plot(ep_s, rm_s,     label=\"Risk Manager\",      color=\"#e74c3c\", linewidth=2)\n",
-    "ax.plot(ep_s, pm_s,     label=\"Portfolio Manager\",  color=\"#3498db\", linewidth=2)\n",
-    "ax.set_xlabel(\"Episode\"); ax.set_ylabel(\"Discounted Return\")\n",
-    "ax.set_title(\"QuantHive: Per-Agent Reward Curves (Multi-Agent Training)\")\n",
-    "ax.legend(); ax.grid(True, alpha=0.3)\n",
     "plt.tight_layout()\n",
-    "fig.savefig(\"plots/reward_curve.png\", dpi=150)\n",
     "plt.show()\n",
-    "print(\"Saved: plots/reward_curve.png\")\n",
     "\n",
-    "# \u2500\u2500 Loss Curve (PnL convergence) \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n",
     "fig2, ax2 = plt.subplots(figsize=(12, 6))\n",
-    "pnl_s = smooth(m[\"pnl_pct\"], w)\n",
-    "ax2.plot(episodes[:len(pnl_s)], pnl_s, color=\"#e74c3c\", linewidth=2)\n",
-    "ax2.axhline(y=0, color=\"gray\", linestyle=\"--\", alpha=0.5)\n",
     "pnl_arr = np.array(pnl_s)\n",
-    "ax2.fill_between(episodes[:len(pnl_s)], 0, pnl_s,\n",
-    "                  where=pnl_arr>0, color=\"#2ecc71\", alpha=0.2)\n",
-    "ax2.fill_between(episodes[:len(pnl_s)], 0, pnl_s,\n",
-    "                  where=pnl_arr<=0, color=\"#e74c3c\", alpha=0.2)\n",
-    "ax2.set_xlabel(\"Episode\"); ax2.set_ylabel(\"PnL %\")\n",
-    "ax2.set_title(\"QuantHive: PnL Over Training (Policy Convergence)\")\n",
     "ax2.grid(True, alpha=0.3)\n",
     "plt.tight_layout()\n",
-    "fig2.savefig(\"plots/loss_curve.png\", dpi=150)\n",
     "plt.show()\n",
-    "print(\"Saved: plots/loss_curve.png\")\n",
     "\n",
-    "# \u2500\u2500 Baseline vs Trained \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n",
     "if n_eps >= 20:\n",
     "    fig3, ax3 = plt.subplots(figsize=(10, 6))\n",
     "    names = [\"Trader Return\", \"Grade\", \"Max Drawdown\", \"Sharpe\"]\n",
-    "    early = [np.mean(m[k][:20]) for k in [\"trader_return\", \"grade\", \"max_drawdown\", \"sharpe\"]]\n",
-    "    late  = [np.mean(m[k][-20:]) for k in [\"trader_return\", \"grade\", \"max_drawdown\", \"sharpe\"]]\n",
     "    x = np.arange(len(names))\n",
     "    ax3.bar(x - 0.175, early, 0.35, label=\"First 20 eps\", color=\"#e74c3c\", alpha=0.8)\n",
-    "    ax3.bar(x + 0.175, late,  0.35, label=\"Last 20 eps\",  color=\"#2ecc71\", alpha=0.8)\n",
-    "    ax3.set_ylabel(\"Value\"); ax3.set_title(\"QuantHive: Early vs Late Training\")\n",
-    "    ax3.set_xticks(x); ax3.set_xticklabels(names)\n",
-    "    ax3.legend(); ax3.grid(True, alpha=0.3, axis=\"y\")\n",
     "    plt.tight_layout()\n",
-    "    fig3.savefig(\"plots/baseline_comparison.png\", dpi=150)\n",
     "    plt.show()\n",
-    "    print(\"Saved: plots/baseline_comparison.png\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec7_header",
    "metadata": {},
    "source": [
-    "---\n",
-    "## 7. Preview Governance-Aware GRPO Prompt\n",
     "\n",
-    "The GRPO pipeline generates scenarios directly from `MultiAgentTradingEnv`.\n",
-    "Each prompt includes the RM and PM governance messages \u2014 the Trader must learn to read and respect them."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "preview_prompt",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from training.train_grpo_multiagent import generate_pz_scenarios, build_prompt_multiagent\n",
     "\n",
     "scenarios = generate_pz_scenarios(n=3, difficulty=\"easy\", max_env_steps=30)\n",
-    "print(f\"Generated {len(scenarios)} scenarios from PZ env\\n\")\n",
-    "\n",
-    "for i, sc in enumerate(scenarios):\n",
-    "    print(f\"{'='*60}\")\n",
-    "    print(f\"  Scenario {i+1}\")\n",
-    "    print(f\"  RM: size_limit={sc['rm_size_limit']:.2f}, allow_new={sc['rm_allow_new']}, force_reduce={sc['rm_force_reduce']}\")\n",
-    "    print(f\"  PM: cap_alloc={sc['pm_cap_alloc']:.2f}, override={sc['pm_override']:.2f}\")\n",
-    "    print(f\"{'='*60}\")\n",
-    "\n",
-    "# Show full prompt for one scenario\n",
-    "print(\"\\n\" + \"=\"*60)\n",
-    "print(\"  FULL PROMPT (Scenario 1)\")\n",
-    "print(\"=\"*60)\n",
-    "print(build_prompt_multiagent(scenarios[0]))"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "verify_verifiers",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Verify the updated GRPO verifiers can parse governance constraints\n",
-    "from training.train_grpo_multiagent import (\n",
-    "    risk_reward_func_multiagent,\n",
     "    governance_reward_func_multiagent,\n",
     ")\n",
-    "from env.reward import format_reward_func, alignment_reward_func, profit_reward_func\n",
     "\n",
     "test_prompt = build_prompt_multiagent(scenarios[0])\n",
     "\n",
-    "# A compliant completion\n",
     "compliant = (\n",
-    "    '<thought>\\n'\n",
-    "    'RSI is 0.28, indicating oversold conditions. EMA20 crossing above EMA50 suggests bullish momentum. '\n",
-    "    'However, the Risk Manager restricts allocation to {limit:.2f} given current market conditions. '\n",
-    "    'The Portfolio Manager allocated {cap:.0%} capital. I will propose a conservative position '\n",
-    "    'within the governance constraints to avoid intervention.\\n'\n",
-    "    '</thought>\\n'\n",
-    "    '<action>\\n'\n",
-    "    '{{\"direction\": 1, \"size\": {size:.2f}, \"sl\": 49000, \"tp\": 52000}}\\n'\n",
-    "    '</action>'\n",
-    ").format(\n",
-    "    limit=scenarios[0][\"rm_size_limit\"],\n",
-    "    cap=scenarios[0][\"pm_cap_alloc\"],\n",
-    "    size=min(scenarios[0][\"rm_size_limit\"] * 0.7, scenarios[0][\"pm_cap_alloc\"] * 0.7),\n",
-    ")\n",
     "\n",
-    "# A reckless completion\n",
     "reckless = (\n",
-    "    '<thought>\\nMarket is moving up. Going all in.\\n</thought>\\n'\n",
-    "    '<action>\\n{\"direction\": 1, \"size\": 0.95, \"sl\": 0, \"tp\": 0}\\n</action>'\n",
     ")\n",
     "\n",
     "prompts = [test_prompt, test_prompt]\n",
@@ -472,241 +419,188 @@
     "print(f\"{'Verifier':<25} {'Compliant':<12} {'Reckless':<12}\")\n",
     "print(\"-\" * 49)\n",
     "for name, func in [\n",
-    "    (\"Format\",            format_reward_func),\n",
-    "    (\"Alignment\",         alignment_reward_func),\n",
-    "    (\"Risk (dynamic RM)\", risk_reward_func_multiagent),\n",
-    "    (\"Profit\",            profit_reward_func),\n",
-    "    (\"Governance (RM+PM)\", governance_reward_func_multiagent),\n",
     "]:\n",
     "    scores = func(prompts, completions)\n",
-    "    print(f\"{name:<25} {scores[0]:<12.2f} {scores[1]:<12.2f}\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec8_header",
    "metadata": {},
    "source": [
-    "---\n",
-    "## 8. GRPO Training \u2014 Qwen 2.5-1.5B (GPU Required)\n",
-    "\n",
-    "This section trains the Trader agent as a language model using **GRPO** (Group Relative Policy Optimization) via Unsloth + TRL.\n",
-    "\n",
-    "**Requirements**: CUDA GPU (Colab T4 is sufficient).\n",
     "\n",
-    "The 5 verifiers:\n",
-    "1. **Format** \u2014 valid `<thought>` + `<action>` XML tags\n",
-    "2. **Alignment** \u2014 reasoning matches market signals (anti-hallucination)\n",
-    "3. **Risk** \u2014 size \u2264 RM's dynamic `size_limit` (reads from governance in prompt)\n",
-    "4. **Profit** \u2014 direction matches price trend\n",
-    "5. **Governance** \u2014 would this action pass without intervention? Checks compliance against *both* RM and PM constraints"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "grpo_training",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# \u26a1 Uncomment the entire cell to run GRPO training on a T4 GPU.\n",
-    "# This takes ~20-30 minutes for 100 steps on T4.\n",
-    "\n",
-    "import torch\n",
-    "assert torch.cuda.is_available(), \"GPU required for GRPO training\"\n",
-    "print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
-    "\n",
-    "# # Install Unsloth if not already\n",
-    "# # %pip install unsloth\n",
-    "\n",
-    "#\n",
-    "# --- Phase 1: Generate scenarios from the PZ env ---\n",
-    "from training.train_grpo_multiagent import (\n",
-    "    generate_pz_scenarios,\n",
-    "    build_prompt_multiagent,\n",
-    "    load_model,\n",
-    "    make_trainer,\n",
-    "    save_model,\n",
-    "    risk_reward_func_multiagent,\n",
-    "    governance_reward_func_multiagent,\n",
-    ")\n",
-    "from datasets import Dataset\n",
-    "import json\n",
-    "#\n",
-    "print(\"Generating 256 scenarios from MultiAgentTradingEnv...\")\n",
-    "scenarios = generate_pz_scenarios(n=256, difficulty=\"easy\", max_env_steps=50)\n",
-    "prompts = [{\"prompt\": build_prompt_multiagent(sc)} for sc in scenarios]\n",
-    "dataset = Dataset.from_list(prompts)\n",
-    "print(f\"Dataset: {len(dataset)} prompts\")\n",
-    "#\n",
-    "# --- Phase 2: Load Qwen 2.5-1.5B with LoRA ---\n",
-    "model, tokenizer = load_model(\n",
-    "    \"unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit\",\n",
-    "    max_seq_length=1024,\n",
-    ")\n",
-    "#\n",
-    "# --- Phase 3: Train with 5 verifiers ---\n",
-    "from types import SimpleNamespace\n",
-    "args = SimpleNamespace(\n",
-    "    output_dir=\"models/local_policy_grpo_multiagent\",\n",
-    "    learning_rate=5e-5,\n",
-    "    per_device_batch_size=2,\n",
-    "    gradient_accumulation_steps=2,\n",
-    "    max_steps=100,\n",
-    "    save_steps=25,\n",
-    "    logging_steps=1,\n",
-    "    max_prompt_length=768,\n",
-    "    max_completion_length=200,\n",
-    "    num_generations=4,\n",
-    ")\n",
-    "#\n",
-    "trainer = make_trainer(model, tokenizer, dataset, args, torch)\n",
-    "print(f\"Starting GRPO training ({args.max_steps} steps)...\")\n",
-    "train_result = trainer.train()\n",
-    "#\n",
-    "# --- Phase 4: Extract metrics and generate plots ---\n",
-    "history = trainer.state.log_history\n",
-    "rewards = [x[\"reward\"] for x in history if \"reward\" in x]\n",
-    "# losses  = [x[\"loss\"]   for x in history if \"loss\"   in x]\n",
-    "#\n",
-    "from utils.plotting import plot_training_results\n",
-    "plot_training_results(rewards, losses, output_dir=\"plots\")\n",
-    "#\n",
-    "# --- Phase 5: Save model ---\n",
-    "save_model(model, tokenizer, args.output_dir)\n",
-    "print(f\"Model saved to {args.output_dir}\")\n",
-    "#\n",
-    "# # Save metrics for later analysis\n",
-    "with open(\"outputs/grpo_metrics.json\", \"w\") as f:\n",
-    "    json.dump({\"rewards\": rewards, \"losses\": losses}, f, indent=2)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec9_header",
    "metadata": {},
    "source": [
-    "---\n",
-    "## 9. Display All Training Evidence\n",
-    "\n",
-    "The repository ships committed training plots. This section displays them (whether generated in this notebook run or previously committed)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "display_plots",
    "metadata": {},
    "outputs": [],
    "source": [
-    "from IPython.display import Image, display, Markdown\n",
     "\n",
     "plot_files = [\n",
-    "    (\"plots/reward_curve.png\",        \"Per-Agent Reward Curves (RM, PM, Trader)\"),\n",
-    "    (\"plots/loss_curve.png\",          \"PnL / Policy Loss Convergence\"),\n",
-    "    (\"plots/baseline_comparison.png\", \"Early vs Late Training Performance\"),\n",
-    "    (\"plots/grade_progression.png\",   \"Portfolio Grade + Sharpe Over Training\"),\n",
     "]\n",
     "\n",
     "for path, title in plot_files:\n",
     "    if Path(path).exists():\n",
     "        display(Markdown(f\"### {title}\"))\n",
-    "        display(Image(filename=path, width=800))\n",
     "    else:\n",
-    "        print(f\"\u26a0  {path} not found \u2014 run training (\u00a75/\u00a78) to generate\")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "sec10_header",
    "metadata": {},
    "source": [
-    "---\n",
-    "## 10. Submission Checklist"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "checklist",
    "metadata": {},
    "outputs": [],
    "source": [
     "checks = [\n",
-    "    (\"openenv.yaml (PettingZoo entry_point)\",  Path(\"openenv.yaml\")),\n",
-    "    (\"README.md\",                               Path(\"README.md\")),\n",
-    "    (\"WRITEUP.md\",                              Path(\"WRITEUP.md\")),\n",
-    "    (\"Multi-agent env\",                         Path(\"env/multi_agent_env.py\")),\n",
-    "    (\"REINFORCE trainer\",                       Path(\"training/train_multi_agent.py\")),\n",
-    "    (\"GRPO trainer (multi-agent)\",              Path(\"training/train_grpo_multiagent.py\")),\n",
-    "    (\"Plot generator\",                          Path(\"training/plot_multiagent.py\")),\n",
-    "    (\"Training notebook\",                       Path(\"mate_training.ipynb\")),\n",
-    "    (\"Reward curve\",                            Path(\"plots/reward_curve.png\")),\n",
-    "    (\"Loss curve\",                              Path(\"plots/loss_curve.png\")),\n",
-    "    (\"Baseline comparison\",                     Path(\"plots/baseline_comparison.png\")),\n",
-    "    (\"Dockerfile\",                              Path(\"Dockerfile\")),\n",
-    "    (\"requirements-space.txt\",                  Path(\"requirements-space.txt\")),\n",
     "]\n",
     "\n",
-    "print(f\"{'Deliverable':<40} {'Status':<10}\")\n",
-    "print(\"-\" * 50)\n",
     "for name, path in checks:\n",
-    "    status = \"\u2705 OK\" if path.exists() else \"\u274c MISSING\"\n",
-    "    print(f\"{name:<40} {status}\")\n",
-    "\n",
-    "# Verify openenv.yaml points to PettingZoo\n",
-    "import yaml\n",
-    "try:\n",
-    "    with open(\"openenv.yaml\") as f:\n",
-    "        manifest = yaml.safe_load(f)\n",
-    "    ep = manifest.get(\"environment\", {}).get(\"entry_point\", \"\")\n",
-    "    is_pz = \"multi_agent_env\" in ep\n",
-    "    print(f\"\\nopenenv.yaml entry_point: {ep}\")\n",
-    "    print(f\"{'\u2705' if is_pz else '\u274c'} Points to {'PettingZoo' if is_pz else 'WRONG'} env\")\n",
-    "except Exception:\n",
-    "    print(\"\\n\u26a0  Could not parse openenv.yaml (pyyaml may not be installed)\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "footer",
-   "metadata": {},
-   "source": [
-    "---\n",
-    "\n",
-    "## Summary\n",
-    "\n",
-    "This notebook establishes the full QuantHive training evidence chain:\n",
-    "\n",
-    "1. **Environment validated** \u2014 `MultiAgentTradingEnv` passes PettingZoo's official API test (50 cycles)\n",
-    "2. **Governance works** \u2014 RM clamped a reckless 85% trade down to 20%, logged 3 interventions\n",
-    "3. **Multi-agent training** \u2014 alternating optimization with per-agent reward curves\n",
-    "4. **GRPO ready** \u2014 prompts include RM/PM constraints; verifiers check dynamic compliance\n",
-    "5. **Curriculum works** \u2014 env runs at easy/medium/hard with decreasing Trader returns\n",
-    "\n",
-    "**Key differentiator**: GRPO verifiers #3 (Risk) and #5 (Governance) check compliance against the RM's *learned, dynamic* `size_limit` \u2014 not a hardcoded constant. This means the Trader must learn to **read and respect** governance messages.\n",
-    "\n",
-    "---\n",
-    "*Built for the OpenEnv April '26 Hackathon | Theme 1: Multi-Agent Interactions*  \n",
-    "*Author: Arka Sarkar*"
    ]
   }
  ],
  "metadata": {
-  "accelerator": "GPU",
-  "colab": {
-   "gpuType": "T4",
-   "provenance": []
-  },
   "kernelspec": {
    "display_name": "Python 3",
    "name": "python3"
   },
   "language_info": {
-   "name": "python",
-   "version": "3.12.0"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}

  "cells": [
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "# QuantHive Multi-Agent Training Notebook\n",
     "\n",
+    "Rerunnable notebook for validating the PettingZoo environment, running the rule-based multi-agent trainer, previewing governance-aware prompts, and optionally launching GRPO training on a GPU runtime.\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## 1. Bootstrap The Repo\n",
     "\n",
+    "Clone from GitHub when the notebook is running outside the repo. If the current working directory already looks like the repository, reuse it.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "import os\n",
+    "import subprocess\n",
+    "import sys\n",
+    "from pathlib import Path\n",
     "\n",
+    "REPO_URL = \"https://github.com/ARKAISW/multi-agent-trading-env.git\"\n",
+    "\n",
+    "def looks_like_repo(path: Path) -> bool:\n",
+    "    return (\n",
+    "        (path / \"openenv.yaml\").exists()\n",
+    "        and (path / \"env\").exists()\n",
+    "        and (path / \"training\").exists()\n",
+    "    )\n",
+    "\n",
+    "start_dir = Path.cwd()\n",
+    "if looks_like_repo(start_dir):\n",
+    "    REPO_DIR = start_dir\n",
+    "else:\n",
+    "    clone_parent = Path(\"/content\") if Path(\"/content\").exists() else start_dir\n",
+    "    REPO_DIR = clone_parent / \"multi-agent-trading-env\"\n",
+    "    if not REPO_DIR.exists():\n",
+    "        subprocess.check_call([\"git\", \"clone\", REPO_URL, str(REPO_DIR)])\n",
+    "\n",
+    "os.chdir(REPO_DIR)\n",
+    "if str(REPO_DIR) not in sys.path:\n",
+    "    sys.path.insert(0, str(REPO_DIR))\n",
     "\n",
+    "commit = subprocess.check_output([\"git\", \"rev-parse\", \"--short\", \"HEAD\"], text=True).strip()\n",
+    "print(f\"Working directory: {Path.cwd()}\")\n",
+    "print(f\"Repo commit: {commit}\")\n",
+    "print(f\"Python: {sys.version.split()[0]}\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## 2. Install Notebook Dependencies\n",
+    "\n",
+    "Install the lightweight stack needed for environment validation, rule-based training, plots, and prompt preview. The heavier GRPO packages stay in the optional GPU section.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "BASE_PACKAGES = [\n",
+    "    \"openenv\",\n",
+    "    \"pyyaml\",\n",
+    "    \"pettingzoo>=1.24.0\",\n",
+    "    \"gymnasium\",\n",
+    "    \"numpy\",\n",
+    "    \"pandas\",\n",
+    "    \"matplotlib\",\n",
+    "    \"scipy\",\n",
+    "    \"torch\",\n",
+    "    \"yfinance\",\n",
+    "    \"ccxt\",\n",
+    "]\n",
     "\n",
+    "subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *BASE_PACKAGES])\n",
+    "print(\"Installed base notebook dependencies.\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## 3. Validate The PettingZoo Multi-Agent Environment\n",
     "\n",
+    "Probe the constructor first so the notebook does not assume the wrong signature, then instantiate and inspect observation and action shapes.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "import inspect\n",
     "import numpy as np\n",
+    "\n",
     "from env.multi_agent_env import (\n",
     "    ALL_AGENTS,\n",
     "    BASE_OBS_SIZE,\n",
+    "    MultiAgentTradingEnv,\n",
     "    PM_MSG_SIZE,\n",
+    "    PORTFOLIO_MGR,\n",
+    "    RISK_MANAGER,\n",
+    "    RM_MSG_SIZE,\n",
+    "    TRADER,\n",
     ")\n",
     "\n",
+    "print(\"MultiAgentTradingEnv signature:\")\n",
+    "print(inspect.signature(MultiAgentTradingEnv))\n",
+    "\n",
     "env = MultiAgentTradingEnv(difficulty=\"easy\", max_steps=50)\n",
+    "reset_result = env.reset(seed=7)\n",
+    "\n",
+    "print(\"\\nEnvironment summary\")\n",
+    "print(\"-\" * 60)\n",
+    "print(f\"reset() returned: {reset_result}\")\n",
+    "print(f\"Agents: {env.agents}\")\n",
+    "print(\"Turn order: RM -> PM -> Trader\")\n",
+    "print(f\"RM obs: {env.observe(RISK_MANAGER).shape} (base={BASE_OBS_SIZE})\")\n",
+    "print(f\"PM obs: {env.observe(PORTFOLIO_MGR).shape} (base+rm={BASE_OBS_SIZE + RM_MSG_SIZE})\")\n",
+    "print(f\"Trader obs: {env.observe(TRADER).shape} (base+rm+pm={BASE_OBS_SIZE + RM_MSG_SIZE + PM_MSG_SIZE})\")\n",
+    "print(f\"RM action space: {env.action_space(RISK_MANAGER)}\")\n",
+    "print(f\"PM action space: {env.action_space(PORTFOLIO_MGR)}\")\n",
+    "print(f\"Trader action space: {env.action_space(TRADER)}\")\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "env = MultiAgentTradingEnv(difficulty=\"easy\", max_steps=10)\n",
+    "env.reset(seed=7)\n",
+    "\n",
     "rm_action = np.array([0.20, 1.0, 0.0], dtype=np.float32)\n",
     "env.step(rm_action)\n",
     "\n",
+    "pm_action = np.array([0.35, 0.0], dtype=np.float32)\n",
     "env.step(pm_action)\n",
     "\n",
     "trader_action = {\n",
     "    \"direction\": 1,\n",
+    "    \"size\": np.array([0.10], dtype=np.float32),\n",
     "    \"sl\": np.array([0.0], dtype=np.float32),\n",
     "    \"tp\": np.array([0.0], dtype=np.float32),\n",
     "}\n",
     "env.step(trader_action)\n",
     "\n",
+    "print(\"Completed one AEC cycle.\")\n",
+    "print(f\"Current step: {env._current_step}\")\n",
+    "print(f\"Current agent selection: {env.agent_selection}\")\n",
+    "print(f\"Latest rewards: {env.rewards}\")\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "from pettingzoo.test import api_test\n",
     "\n",
+    "api_env = MultiAgentTradingEnv(difficulty=\"easy\", max_steps=20)\n",
+    "api_test(api_env, num_cycles=20, verbose_progress=True)\n",
+    "print(\"PettingZoo API test passed.\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## 4. Run The Rule-Based Multi-Agent Trainer\n",
     "\n",
+    "This section exercises the multi-agent training loop without needing the GRPO stack.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "from training.train_multi_agent import train\n",
     "\n",
     "metrics = train(\n",
+    "    n_episodes=40,\n",
+    "    max_steps_ep=150,\n",
     "    gamma=0.99,\n",
     "    alternating_freq=10,\n",
     "    difficulty=\"easy\",\n",
     "    output_dir=\"outputs/multi_agent\",\n",
+    "    save_every=20,\n",
+    ")\n",
     "\n",
+    "print(f\"Saved training outputs to: {Path('outputs/multi_agent').resolve()}\")\n",
+    "if metrics.get(\"trader_return\"):\n",
+    "    print(f\"Final trader return: {metrics['trader_return'][-1]:+.4f}\")\n",
+    "if metrics.get(\"grade\"):\n",
+    "    print(f\"Final grade: {metrics['grade'][-1]:.4f}\")\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "from training.train_multi_agent import (\n",
+    "    RulePortfolioManagerPolicy,\n",
+    "    RuleRiskManagerPolicy,\n",
+    "    RuleTraderPolicy,\n",
+    "    collect_rollout,\n",
+    ")\n",
     "\n",
     "policies = {\n",
+    "    RISK_MANAGER: RuleRiskManagerPolicy(),\n",
     "    PORTFOLIO_MGR: RulePortfolioManagerPolicy(),\n",
+    "    TRADER: RuleTraderPolicy(),\n",
     "}\n",
     "\n",
+    "print(f\"{'Difficulty':<12} {'Episodes':<10} {'Mean Trader Return':<22} {'Mean PnL':<12} {'Mean DD':<12}\")\n",
+    "print(\"-\" * 74)\n",
     "\n",
     "for diff in [\"easy\", \"medium\", \"hard\"]:\n",
     "    returns, pnls, dds = [], [], []\n",
     "    test_env = MultiAgentTradingEnv(difficulty=diff, max_steps=100)\n",
     "    for _ in range(10):\n",
     "        buffers, info = collect_rollout(test_env, policies, max_steps=100)\n",
+    "        returns.append(float(np.mean(buffers[TRADER].discounted_returns())))\n",
+    "        pnls.append(float(info.get(\"pnl_pct\", 0.0)))\n",
+    "        dds.append(float(info.get(\"max_drawdown\", 0.0)))\n",
+    "    print(f\"{diff:<12} {10:<10} {np.mean(returns):+.6f}            {np.mean(pnls):+.4%}     {np.mean(dds):.4%}\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## 5. Generate Training Plots\n",
     "\n",
+    "Use the saved metrics when available, or fall back to the in-memory `metrics` object from the training cell.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "import json\n",
+    "\n",
+    "import matplotlib\n",
     "matplotlib.use(\"Agg\")\n",
     "import matplotlib.pyplot as plt\n",
     "\n",
     "metrics_path = Path(\"outputs/multi_agent/metrics_final.json\")\n",
+    "metrics_path.parent.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
     "if not metrics_path.exists():\n",
+    "    if \"metrics\" not in globals():\n",
+    "        raise RuntimeError(\"Run the training cell first so metrics are available.\")\n",
+    "    with open(metrics_path, \"w\", encoding=\"utf-8\") as handle:\n",
+    "        json.dump(dict(metrics), handle, indent=2)\n",
     "\n",
+    "with open(metrics_path, encoding=\"utf-8\") as handle:\n",
+    "    m = json.load(handle)\n",
     "\n",
+    "plots_dir = Path(\"plots\")\n",
+    "plots_dir.mkdir(parents=True, exist_ok=True)\n",
     "episodes = m[\"episode\"]\n",
     "n_eps = len(episodes)\n",
     "print(f\"Loaded {n_eps} episodes from {metrics_path}\")\n",
     "\n",
+    "def smooth(values, window=10):\n",
+    "    if len(values) < window:\n",
+    "        return values\n",
+    "    kernel = np.ones(window) / window\n",
+    "    return np.convolve(values, kernel, mode=\"valid\").tolist()\n",
     "\n",
+    "window = max(1, n_eps // 15)\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(12, 6))\n",
+    "trader_s = smooth(m[\"trader_return\"], window)\n",
+    "rm_s = smooth(m[\"rm_return\"], window)\n",
+    "pm_s = smooth(m[\"pm_return\"], window)\n",
+    "ep_s = episodes[: len(trader_s)]\n",
+    "ax.plot(ep_s, trader_s, label=\"Trader\", color=\"#2ecc71\", linewidth=2)\n",
+    "ax.plot(ep_s, rm_s, label=\"Risk Manager\", color=\"#e74c3c\", linewidth=2)\n",
+    "ax.plot(ep_s, pm_s, label=\"Portfolio Manager\", color=\"#3498db\", linewidth=2)\n",
+    "ax.set_xlabel(\"Episode\")\n",
+    "ax.set_ylabel(\"Discounted Return\")\n",
+    "ax.set_title(\"QuantHive: Per-Agent Reward Curves\")\n",
+    "ax.legend()\n",
+    "ax.grid(True, alpha=0.3)\n",
     "plt.tight_layout()\n",
+    "fig.savefig(plots_dir / \"reward_curve.png\", dpi=150)\n",
     "plt.show()\n",
     "\n",
     "fig2, ax2 = plt.subplots(figsize=(12, 6))\n",
+    "pnl_s = smooth(m[\"pnl_pct\"], window)\n",
     "pnl_arr = np.array(pnl_s)\n",
+    "ax2.plot(episodes[: len(pnl_s)], pnl_s, color=\"#e74c3c\", linewidth=2)\n",
+    "ax2.axhline(y=0, color=\"gray\", linestyle=\"--\", alpha=0.5)\n",
+    "ax2.fill_between(episodes[: len(pnl_s)], 0, pnl_s, where=pnl_arr > 0, color=\"#2ecc71\", alpha=0.2)\n",
+    "ax2.fill_between(episodes[: len(pnl_s)], 0, pnl_s, where=pnl_arr <= 0, color=\"#e74c3c\", alpha=0.2)\n",
+    "ax2.set_xlabel(\"Episode\")\n",
+    "ax2.set_ylabel(\"PnL %\")\n",
+    "ax2.set_title(\"QuantHive: PnL Over Training\")\n",
     "ax2.grid(True, alpha=0.3)\n",
     "plt.tight_layout()\n",
+    "fig2.savefig(plots_dir / \"loss_curve.png\", dpi=150)\n",
     "plt.show()\n",
     "\n",
     "if n_eps >= 20:\n",
     "    fig3, ax3 = plt.subplots(figsize=(10, 6))\n",
     "    names = [\"Trader Return\", \"Grade\", \"Max Drawdown\", \"Sharpe\"]\n",
+    "    early = [np.mean(m[key][:20]) for key in [\"trader_return\", \"grade\", \"max_drawdown\", \"sharpe\"]]\n",
+    "    late = [np.mean(m[key][-20:]) for key in [\"trader_return\", \"grade\", \"max_drawdown\", \"sharpe\"]]\n",
     "    x = np.arange(len(names))\n",
     "    ax3.bar(x - 0.175, early, 0.35, label=\"First 20 eps\", color=\"#e74c3c\", alpha=0.8)\n",
+    "    ax3.bar(x + 0.175, late, 0.35, label=\"Last 20 eps\", color=\"#2ecc71\", alpha=0.8)\n",
+    "    ax3.set_ylabel(\"Value\")\n",
+    "    ax3.set_title(\"QuantHive: Early vs Late Training\")\n",
+    "    ax3.set_xticks(x)\n",
+    "    ax3.set_xticklabels(names)\n",
+    "    ax3.legend()\n",
+    "    ax3.grid(True, alpha=0.3, axis=\"y\")\n",
     "    plt.tight_layout()\n",
+    "    fig3.savefig(plots_dir / \"baseline_comparison.png\", dpi=150)\n",
     "    plt.show()\n",
+    "\n",
+    "print(f\"Saved plots to: {plots_dir.resolve()}\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## 6. Preview Governance-Aware GRPO Prompts\n",
     "\n",
+    "Import only the lightweight prompt helpers so prompt preview does not depend on the trainer stack.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "from training.prompt_utils import build_prompt_multiagent, generate_pz_scenarios\n",
     "\n",
     "scenarios = generate_pz_scenarios(n=3, difficulty=\"easy\", max_env_steps=30)\n",
+    "print(f\"Generated {len(scenarios)} scenarios.\\n\")\n",
+    "\n",
+    "for i, scenario in enumerate(scenarios, start=1):\n",
+    "    print(\"=\" * 60)\n",
+    "    print(f\"Scenario {i}\")\n",
+    "    print(\n",
+    "        f\"RM: size_limit={scenario['rm_size_limit']:.2f}, \"\n",
+    "        f\"allow_new={scenario['rm_allow_new']}, force_reduce={scenario['rm_force_reduce']}\"\n",
+    "    )\n",
+    "    print(\n",
+    "        f\"PM: cap_alloc={scenario['pm_cap_alloc']:.2f}, \"\n",
+    "        f\"override={scenario['pm_override']:.2f}\"\n",
+    "    )\n",
+    "\n",
+    "print(\"\\nFull prompt for scenario 1\")\n",
+    "print(\"=\" * 60)\n",
+    "print(build_prompt_multiagent(scenarios[0]))\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "from env.reward import alignment_reward_func, format_reward_func, profit_reward_func\n",
+    "from training.grpo_verifiers_multiagent import (\n",
     "    governance_reward_func_multiagent,\n",
+    "    risk_reward_func_multiagent,\n",
     ")\n",
     "\n",
     "test_prompt = build_prompt_multiagent(scenarios[0])\n",
+    "effective_limit = min(scenarios[0][\"rm_size_limit\"], scenarios[0][\"pm_cap_alloc\"])\n",
     "\n",
     "compliant = (\n",
+    "    \"<thought>\\n\"\n",
+    "    \"Risk limits are active, so I will stay inside both the RM size cap and the PM allocation.\\n\"\n",
+    "    \"</thought>\\n\"\n",
+    "    \"<action>\\n\"\n",
+    "    '{\"direction\": 1, \"size\": %.2f, \"sl\": 49000, \"tp\": 52000}\\n'\n",
+    "    \"</action>\"\n",
+    ") % (effective_limit * 0.7)\n",
     "\n",
     "reckless = (\n",
+    "    \"<thought>\\nMarket is moving up. Going all in.\\n</thought>\\n\"\n",
+    "    \"<action>\\n{\\\"direction\\\": 1, \\\"size\\\": 0.95, \\\"sl\\\": 0, \\\"tp\\\": 0}\\n</action>\"\n",
     ")\n",
     "\n",
+    "completions = [compliant, reckless]\n",
     "prompts = [test_prompt, test_prompt]\n",
     "print(f\"{'Verifier':<25} {'Compliant':<12} {'Reckless':<12}\")\n",
     "print(\"-\" * 49)\n",
     "for name, func in [\n",
+    "    (\"Format\", format_reward_func),\n",
+    "    (\"Alignment\", alignment_reward_func),\n",
+    "    (\"Risk\", risk_reward_func_multiagent),\n",
+    "    (\"Profit\", profit_reward_func),\n",
+    "    (\"Governance\", governance_reward_func_multiagent),\n",
     "]:\n",
     "    scores = func(prompts, completions)\n",
+    "    print(f\"{name:<25} {scores[0]:<12.2f} {scores[1]:<12.2f}\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## 7. Optional GPU GRPO Training\n",
     "\n",
+    "Leave `RUN_GRPO` as `False` unless the runtime has a GPU and you want to install the heavier TRL and Unsloth stack.\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "RUN_GRPO = False\n",
+    "\n",
+    "if not RUN_GRPO:\n",
+    "    print(\"Set RUN_GRPO = True after enabling a GPU runtime.\")\n",
+    "else:\n",
+    "    import json\n",
+    "    import torch\n",
+    "    from types import SimpleNamespace\n",
+    "\n",
+    "    assert torch.cuda.is_available(), \"GPU required for GRPO training\"\n",
+    "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
+    "\n",
+    "    extra_packages = [\n",
+    "        \"datasets\",\n",
+    "        \"transformers\",\n",
+    "        \"trl\",\n",
+    "        \"peft\",\n",
+    "        \"accelerate\",\n",
+    "        \"safetensors\",\n",
+    "        \"unsloth\",\n",
+    "    ]\n",
+    "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *extra_packages])\n",
+    "\n",
+    "    from datasets import Dataset\n",
+    "    from training.train_grpo_multiagent import load_model, make_trainer, save_model\n",
+    "\n",
+    "    print(\"Generating 256 scenarios from MultiAgentTradingEnv...\")\n",
+    "    scenarios = generate_pz_scenarios(n=256, difficulty=\"easy\", max_env_steps=50)\n",
+    "    prompts = [{\"prompt\": build_prompt_multiagent(sc)} for sc in scenarios]\n",
+    "    dataset = Dataset.from_list(prompts)\n",
+    "    print(f\"Dataset: {len(dataset)} prompts\")\n",
+    "\n",
+    "    model, tokenizer = load_model(\n",
+    "        \"unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit\",\n",
+    "        max_seq_length=1024,\n",
+    "    )\n",
+    "\n",
+    "    args = SimpleNamespace(\n",
+    "        output_dir=\"models/local_policy_grpo_multiagent\",\n",
+    "        learning_rate=5e-5,\n",
+    "        per_device_batch_size=2,\n",
+    "        gradient_accumulation_steps=2,\n",
+    "        max_steps=100,\n",
+    "        save_steps=25,\n",
+    "        logging_steps=1,\n",
+    "        max_prompt_length=768,\n",
+    "        max_completion_length=200,\n",
+    "        num_generations=4,\n",
+    "    )\n",
+    "\n",
+    "    trainer = make_trainer(model, tokenizer, dataset, args, torch)\n",
+    "    print(f\"Starting GRPO training ({args.max_steps} steps)...\")\n",
+    "    trainer.train()\n",
+    "\n",
+    "    history = trainer.state.log_history\n",
+    "    rewards = [item[\"reward\"] for item in history if \"reward\" in item]\n",
+    "    losses = [item[\"loss\"] for item in history if \"loss\" in item]\n",
+    "    if not losses:\n",
+    "        losses = [0.0]\n",
+    "\n",
+    "    from utils.plotting import plot_training_results\n",
+    "\n",
+    "    plot_training_results(rewards, losses, output_dir=\"plots\")\n",
+    "    save_model(model, tokenizer, args.output_dir)\n",
+    "\n",
+    "    Path(\"outputs\").mkdir(parents=True, exist_ok=True)\n",
+    "    with open(\"outputs/grpo_metrics.json\", \"w\", encoding=\"utf-8\") as handle:\n",
+    "        json.dump({\"rewards\": rewards, \"losses\": losses}, handle, indent=2)\n",
+    "\n",
+    "    print(f\"Model saved to {args.output_dir}\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## 8. Display Generated Artifacts\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "from IPython.display import Image, Markdown, display\n",
     "\n",
     "plot_files = [\n",
+    "    (\"plots/reward_curve.png\", \"Per-Agent Reward Curves\"),\n",
+    "    (\"plots/loss_curve.png\", \"PnL Or Policy Loss Curve\"),\n",
+    "    (\"plots/baseline_comparison.png\", \"Early vs Late Training\"),\n",
     "]\n",
     "\n",
     "for path, title in plot_files:\n",
     "    if Path(path).exists():\n",
     "        display(Markdown(f\"### {title}\"))\n",
+    "        display(Image(filename=path, width=700))\n",
     "    else:\n",
+    "        print(f\"Missing: {path}\")\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## 9. Submission Checklist\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "import yaml\n",
+    "\n",
     "checks = [\n",
+    "    (\"openenv.yaml\", Path(\"openenv.yaml\")),\n",
+    "    (\"README.md\", Path(\"README.md\")),\n",
+    "    (\"WRITEUP.md\", Path(\"WRITEUP.md\")),\n",
+    "    (\"Multi-agent env\", Path(\"env/multi_agent_env.py\")),\n",
+    "    (\"REINFORCE trainer\", Path(\"training/train_multi_agent.py\")),\n",
+    "    (\"GRPO trainer\", Path(\"training/train_grpo_multiagent.py\")),\n",
+    "    (\"Prompt helpers\", Path(\"training/prompt_utils.py\")),\n",
+    "    (\"GRPO verifiers\", Path(\"training/grpo_verifiers_multiagent.py\")),\n",
+    "    (\"Training notebook\", Path(\"mate_training.ipynb\")),\n",
+    "    (\"Reward curve\", Path(\"plots/reward_curve.png\")),\n",
+    "    (\"Loss curve\", Path(\"plots/loss_curve.png\")),\n",
+    "    (\"Baseline comparison\", Path(\"plots/baseline_comparison.png\")),\n",
+    "    (\"Dockerfile\", Path(\"Dockerfile\")),\n",
+    "    (\"requirements-space.txt\", Path(\"requirements-space.txt\")),\n",
     "]\n",
     "\n",
+    "print(f\"{'Deliverable':<28} {'Status':<10}\")\n",
+    "print(\"-\" * 42)\n",
     "for name, path in checks:\n",
+    "    status = \"OK\" if path.exists() else \"MISSING\"\n",
+    "    print(f\"{name:<28} {status:<10}\")\n",
+    "\n",
+    "with open(\"openenv.yaml\", encoding=\"utf-8\") as handle:\n",
+    "    manifest = yaml.safe_load(handle)\n",
+    "entry_point = manifest.get(\"environment\", {}).get(\"entry_point\", \"\")\n",
+    "print(f\"\\nopenenv.yaml entry_point: {entry_point}\")\n",
+    "print(f\"PettingZoo env configured: {'multi_agent_env' in entry_point}\")\n"
    ]
   }
  ],
  "metadata": {
   "kernelspec": {
    "display_name": "Python 3",
+   "language": "python",
    "name": "python3"
   },
   "language_info": {
+   "name": "python"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 5
+}
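
The plotting cell in the notebook above smooths each series with `np.convolve(..., mode="valid")` and then trims the x-axis to `episodes[: len(trader_s)]`. The sketch below uses a toy series and hypothetical episode ids (not data from a real run) to show why the trim is needed, and how a right-aligned x-axis would differ if each average should be labelled by the last episode in its window.

```python
import numpy as np


def smooth(values, window=10):
    # Same moving average as the plotting cell: unchanged if shorter than the window.
    if len(values) < window:
        return values
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid").tolist()


episodes = list(range(1, 11))            # hypothetical episode ids 1..10
returns = [float(e) for e in episodes]   # toy series so the averages are easy to check
smoothed = smooth(returns, window=4)

# "valid" mode drops window - 1 points, so the x-axis has to be trimmed to match.
x_left = episodes[: len(smoothed)]       # left-aligned, as in the notebook
x_right = episodes[-len(smoothed):]      # alternative: label each mean by its last episode
print(len(smoothed), x_left, x_right)
```

With left alignment, the smoothed curve appears shifted toward earlier episodes by `window - 1` steps. That is harmless for a trend plot but worth keeping in mind when reading specific episode numbers off the chart.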

training/grpo_verifiers_multiagent.py ADDED

	@@ -0,0 +1,136 @@

+"""
+Lightweight verifier helpers for the multi-agent GRPO notebook and trainer.
+These functions intentionally avoid importing the training stack so notebooks can
+preview prompts and reward functions without loading model or trainer deps.
+"""
+from __future__ import annotations
+import json
+import re
+import numpy as np
+def _extract_json_action(completion: str):
+    match = re.search(r"<action>\s*({.*?})\s*</action>", completion, re.DOTALL)
+    if not match:
+        return None
+    return json.loads(match.group(1))
+def _extract_signal_value(prompt: str, key: str):
+    # Match a well-formed number only, so a trailing period ("0.25.") cannot break float().
+    number = r"(-?\d+(?:\.\d+)?)"
+    json_match = re.search(rf'"{key}"\s*:\s*{number}', prompt)
+    if json_match:
+        return float(json_match.group(1))
+    plain_match = re.search(rf"{key}\s*[:=]\s*{number}", prompt)
+    if plain_match:
+        return float(plain_match.group(1))
+    return None
+def risk_reward_func_multiagent(prompts, completions, **kwargs) -> list[float]:
+    """Read the Risk Manager limit from the prompt and reward compliant sizing."""
+    rewards = []
+    for prompt, completion in zip(prompts, completions):
+        try:
+            limit = _extract_signal_value(prompt, "rm_size_limit")
+            if limit is None:
+                limit = _extract_signal_value(prompt, "position_limit")
+            if limit is None:
+                limit = 1.0
+            data = _extract_json_action(completion)
+            if data is None:
+                rewards.append(0.0)
+                continue
+            size = float(data.get("size", 0.0))
+            score = 0.7 if size <= limit else 0.0
+            try:
+                thought = completion.split("<thought>")[1].split("</thought>")[0].lower()
+                if any(kw in thought for kw in ["risk", "limit", "constraint", "size_limit"]):
+                    score += 0.3
+            except (IndexError, AttributeError):
+                pass
+            rewards.append(score)
+        except Exception:
+            rewards.append(0.0)
+    return rewards
+def governance_reward_func_multiagent(prompts, completions, **kwargs) -> list[float]:
+    """Score compliance against both Risk Manager and Portfolio Manager limits."""
+    rewards = []
+    for prompt, completion in zip(prompts, completions):
+        try:
+            data = _extract_json_action(completion)
+            if data is None:
+                rewards.append(0.0)
+                continue
+            size = float(data.get("size", 0.0))
+            direction = int(data.get("direction", 0))
+            limit = _extract_signal_value(prompt, "rm_size_limit")
+            if limit is None:
+                limit = _extract_signal_value(prompt, "position_limit")
+            if limit is None:
+                limit = 1.0
+            pm_cap = _extract_signal_value(prompt, "pm_cap_alloc")
+            effective_limit = min(limit, pm_cap) if pm_cap is not None else limit
+            score = 0.0
+            if size <= effective_limit:
+                score += 0.40
+                if 0 < size <= effective_limit * 0.8:
+                    score += 0.20
+            else:
+                score -= 0.50
+            try:
+                thought = completion.split("<thought>")[1].split("</thought>")[0].lower()
+                governance_keywords = [
+                    "risk",
+                    "limit",
+                    "constraint",
+                    "compliance",
+                    "conservative",
+                    "governance",
+                    "restrict",
+                    "drawdown",
+                    "cap",
+                    "position limit",
+                    "size_limit",
+                    "risk manager",
+                    "portfolio manager",
+                    "allocation",
+                ]
+                if any(kw in thought for kw in governance_keywords):
+                    score += 0.20
+            except (IndexError, AttributeError):
+                pass
+            if direction != 0:
+                score += 0.20
+            rewards.append(float(np.clip(score, 0.0, 1.0)))
+        except Exception:
+            rewards.append(0.0)
+    return rewards
+__all__ = [
+    "governance_reward_func_multiagent",
+    "risk_reward_func_multiagent",
+]
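
A minimal, self-contained sketch of how the governance score in the file above is assembled for a single completion. It assumes a prompt that exposes `rm_size_limit` and `pm_cap_alloc` as plain `key: value` pairs; the real prompt layout comes from `training/prompt_utils.py`, which is not shown in this diff, so the prompt string and the helper names `extract_action` and `extract_value` are hypothetical stand-ins rather than the repository's code.

```python
import json
import re


def extract_action(completion: str):
    """Same pattern as _extract_json_action: the JSON object inside <action> tags."""
    match = re.search(r"<action>\s*({.*?})\s*</action>", completion, re.DOTALL)
    return json.loads(match.group(1)) if match else None


def extract_value(prompt: str, key: str):
    """Simplified stand-in for _extract_signal_value (plain `key: value` form only)."""
    match = re.search(rf"{re.escape(key)}\s*[:=]\s*(-?\d+(?:\.\d+)?)", prompt)
    return float(match.group(1)) if match else None


# Hypothetical prompt fragment; the real one is built by build_prompt_multiagent.
prompt = "rm_size_limit: 0.20\npm_cap_alloc: 0.35\n"
completion = (
    "<thought>Stay inside the risk limit.</thought>\n"
    '<action>{"direction": 1, "size": 0.10, "sl": 0, "tp": 0}</action>'
)

action = extract_action(completion)
effective_limit = min(
    extract_value(prompt, "rm_size_limit"),
    extract_value(prompt, "pm_cap_alloc"),
)

# Mirror the governance scoring steps for this one example.
score = 0.0
if action["size"] <= effective_limit:
    score += 0.40                                    # within both RM and PM limits
    if 0 < action["size"] <= effective_limit * 0.8:
        score += 0.20                                # comfortably below the cap
else:
    score -= 0.50                                    # limit breached
thought = completion.split("<thought>")[1].split("</thought>")[0].lower()
if any(kw in thought for kw in ["risk", "limit", "constraint"]):
    score += 0.20                                    # governance-aware reasoning
if action["direction"] != 0:
    score += 0.20                                    # activity bonus
score = min(max(score, 0.0), 1.0)                    # clip to [0, 1]
print(effective_limit, score)
```

The tighter of the two limits governs, so a size that satisfies the Risk Manager but exceeds the Portfolio Manager's allocation still takes the penalty branch.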

training/train_grpo_multiagent.py CHANGED

@@ -1,178 +1,54 @@
 """
-PettingZoo-Compatible GRPO Training Pipeline for Qwen 2.5.
-Uses MultiAgentTradingEnv to generate scenarios where RM and PM
-send governance messages that become part of the Trader's prompt.
-The Trader is trained as a Qwen 2.5-1.5B model via Unsloth + TRL GRPOTrainer.
-RM and PM use rule-based policies during Trader training (alternating opt.).
 """
 from __future__ import annotations
-import os
-os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
-os.environ.setdefault("OMP_NUM_THREADS", "1")
 import argparse
 import inspect
 import json
 import random
 import sys
 from pathlib import Path
-from typing import Dict, List
 import numpy as np
 ROOT = Path(__file__).resolve().parents[1]
 if str(ROOT) not in sys.path:
     sys.path.insert(0, str(ROOT))
-from datasets import Dataset
-from env.multi_agent_env import (
-    MultiAgentTradingEnv,
-    RISK_MANAGER,
-    PORTFOLIO_MGR,
-    TRADER,
-)
 from env.reward import (
-    format_reward_func,
     alignment_reward_func,
     profit_reward_func,
 )
-from training.train_multi_agent import (
-    RuleRiskManagerPolicy,
-    RulePortfolioManagerPolicy,
 )
-# ─── Constants ─────────────────────────────────────────────────────────────────
 DEFAULT_MODEL_NAME = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"
 DEFAULT_OUTPUT_DIR = "models/local_policy_grpo_multiagent"
-from training.prompt_utils import SYSTEM_PROMPT, generate_pz_scenarios, build_prompt_multiagent
-# ─── Updated GRPO Verifiers ───────────────────────────────────────────────────
-def _extract_json_action(completion: str):
-    import re
-    match = re.search(r"<action>\s*({.*?})\s*</action>", completion, re.DOTALL)
-    if not match:
-        return None
-    return json.loads(match.group(1))
-def _extract_signal_value(prompt: str, key: str):
-    import re
-    json_match = re.search(rf'"{key}"\s*:\s*(-?[\d\.]+)', prompt)
-    if json_match:
-        return float(json_match.group(1))
-    plain_match = re.search(rf"{key}\s*[:=]\s*(-?[\d\.]+)", prompt)
-    if plain_match:
-        return float(plain_match.group(1))
-    return None
-def risk_reward_func_multiagent(prompts, completions, **kwargs) -> list[float]:
-    """Updated risk verifier: reads RM's dynamic size_limit from the prompt."""
-    rewards = []
-    for prompt, completion in zip(prompts, completions):
-        try:
-            # Read RM's size_limit from the governance block
-            limit = _extract_signal_value(prompt, "rm_size_limit")
-            if limit is None:
-                limit = _extract_signal_value(prompt, "position_limit")
-            if limit is None:
-                limit = 1.0
-            data = _extract_json_action(completion)
-            if data is not None:
-                size = float(data.get("size", 0.0))
-                score = 0.7 if size <= limit else 0.0
-                try:
-                    thought = completion.split("<thought>")[1].split("</thought>")[0].lower()
-                    if any(kw in thought for kw in ["risk", "limit", "constraint", "size_limit"]):
-                        score += 0.3
-                except (IndexError, AttributeError):
-                    pass
-                rewards.append(score)
-            else:
-                rewards.append(0.0)
-        except Exception:
-            rewards.append(0.0)
-    return rewards
-def governance_reward_func_multiagent(prompts, completions, **kwargs) -> list[float]:
-    """Updated governance verifier: checks compliance against *learned* RM constraints.
-    The key differentiator: the governance verifier now checks compliance against
-    RM's size_limit from the prompt, not a hardcoded position_limit.
-    """
-    rewards = []
-    for prompt, completion in zip(prompts, completions):
-        try:
-            data = _extract_json_action(completion)
-            if data is None:
-                rewards.append(0.0)
-                continue
-            size = float(data.get("size", 0.0))
-            direction = int(data.get("direction", 0))
-            # Use RM's dynamic limit
-            limit = _extract_signal_value(prompt, "rm_size_limit")
-            if limit is None:
-                limit = _extract_signal_value(prompt, "position_limit")
-            if limit is None:
-                limit = 1.0
-            # Also check PM cap
-            pm_cap = _extract_signal_value(prompt, "pm_cap_alloc")
-            effective_limit = min(limit, pm_cap) if pm_cap is not None else limit
-            score = 0.0
-            # Core compliance: within both RM limit and PM cap
-            if size <= effective_limit:
-                score += 0.40
-                if 0 < size <= effective_limit * 0.8:
-                    score += 0.20
-            else:
-                score -= 0.50
-            # Reasoning quality: governance-aware language
-            try:
-                thought = completion.split("<thought>")[1].split("</thought>")[0].lower()
-                governance_keywords = [
-                    "risk", "limit", "constraint", "compliance", "conservative",
-                    "governance", "restrict", "drawdown", "cap", "position limit",
-                    "size_limit", "risk manager", "portfolio manager", "allocation",
-                ]
-                if any(kw in thought for kw in governance_keywords):
-                    score += 0.20
-            except (IndexError, AttributeError):
-                pass
-            # Activity bonus
-            if direction != 0:
-                score += 0.20
-            rewards.append(float(np.clip(score, 0.0, 1.0)))
-        except Exception:
-            rewards.append(0.0)
-    return rewards
-# ─── Model Loading ─────────────────────────────────────────────────────────────
 def require_cuda():
     import torch
     if not torch.cuda.is_available():
         raise SystemExit("GRPO training requires CUDA.")
     return torch
@@ -180,6 +56,7 @@ def require_cuda():
 def load_model(model_name: str, max_seq_length: int):
     from unsloth import FastLanguageModel, PatchFastRL
     PatchFastRL("GRPO", "unsloth")
     model, tokenizer = FastLanguageModel.from_pretrained(
@@ -191,8 +68,15 @@ def load_model(model_name: str, max_seq_length: int):
     model = FastLanguageModel.get_peft_model(
         model,
         r=16,
-        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
-                         "gate_proj", "up_proj", "down_proj"],
         lora_alpha=16,
         lora_dropout=0,
         bias="none",
@@ -205,8 +89,6 @@ def load_model(model_name: str, max_seq_length: int):
     return model, tokenizer
-# ─── Trainer ───────────────────────────────────────────────────────────────────
 def make_trainer(model, tokenizer, dataset, args, torch_module):
     from trl.trainer.grpo_config import GRPOConfig
     from trl.trainer.grpo_trainer import GRPOTrainer
@@ -231,9 +113,9 @@ def make_trainer(model, tokenizer, dataset, args, torch_module):
     reward_funcs = [
         format_reward_func,
         alignment_reward_func,
-        risk_reward_func_multiagent,      # Updated: reads RM's dynamic size_limit
         profit_reward_func,
-        governance_reward_func_multiagent, # Updated: checks compliance vs learned RM constraints
     ]
     trainer_kwargs = {
@@ -261,10 +143,8 @@ def save_model(model, tokenizer, output_dir: str) -> None:
         tokenizer.save_pretrained(output_dir)
-# ─── CLI ───────────────────────────────────────────────────────────────────────
 def parse_args():
-    parser = argparse.ArgumentParser(description="Multi-Agent GRPO Training for Trader (Qwen 2.5)")
     parser.add_argument("--model-name", default=DEFAULT_MODEL_NAME)
     parser.add_argument("--output-dir", default=DEFAULT_OUTPUT_DIR)
     parser.add_argument("--difficulty", choices=["easy", "medium", "hard"], default="easy")
@@ -288,43 +168,40 @@ def main():
     random.seed(args.seed)
     np.random.seed(args.seed)
-    # 1. Generate scenarios from the PettingZoo env
-    print(f"Generating {args.num_scenarios} scenarios from MultiAgentTradingEnv (difficulty={args.difficulty})...")
     scenarios = generate_pz_scenarios(n=args.num_scenarios, difficulty=args.difficulty)
     print(f"  Generated {len(scenarios)} scenarios.")
-    # 2. Build dataset
     prompts = [{"prompt": build_prompt_multiagent(sc)} for sc in scenarios]
     dataset = Dataset.from_list(prompts)
-    # 3. Load model
     torch_module = require_cuda()
     model, tokenizer = load_model(args.model_name, args.max_seq_length)
-    # 4. Train
     trainer = make_trainer(model, tokenizer, dataset, args, torch_module)
     print(f"Starting multi-agent GRPO training on {len(dataset)} prompts...")
-    train_result = trainer.train()
-    # 5. Generate plots
     history = trainer.state.log_history
     rewards = [x["reward"] for x in history if "reward" in x]
     losses = [x["loss"] for x in history if "loss" in x]
     try:
         from utils.plotting import plot_training_results
         plot_training_results(rewards, losses)
-    except Exception as e:
-        print(f"  Warning: could not generate plots: {e}")
-    # 6. Save model
     print(f"Saving GRPO policy to {args.output_dir}...")
     save_model(model, tokenizer, args.output_dir)
-    # 7. Save training metrics
     metrics_path = Path(args.output_dir) / "training_metrics.json"
-    with open(metrics_path, "w") as f:
-        json.dump({"rewards": rewards, "losses": losses}, f, indent=2)
     print("Multi-agent GRPO training complete.")
     print(f"  Model saved to:   {args.output_dir}")
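Reviewer note: the tail of `main()` above filters `trainer.state.log_history` for entries that carry a `reward` or `loss` key and writes both lists to `training_metrics.json`. A minimal standalone sketch of that step follows. The helper names `extract_metrics` and `save_metrics`, the sample log entries, and the `mkdir` call are assumptions made for illustration and are not part of this commit.

```python
import json
from pathlib import Path


def extract_metrics(log_history):
    """Collect reward and loss values from a list of log dicts.

    Entries without the key (for example a final summary entry) are skipped,
    mirroring the list comprehensions in main().
    """
    rewards = [entry["reward"] for entry in log_history if "reward" in entry]
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    return {"rewards": rewards, "losses": losses}


def save_metrics(metrics, output_dir):
    """Write metrics as UTF-8 JSON and return the file path.

    Assumption: the directory is created here if missing; the script in the
    diff relies on save_model() having created it already.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "training_metrics.json"
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metrics, handle, indent=2)
    return path


# Made-up log entries, shaped like the dicts the script reads.
sample_history = [
    {"loss": 0.12, "reward": 0.5, "step": 1},
    {"loss": 0.10, "reward": 0.7, "step": 2},
    {"train_runtime": 42.0},  # no reward or loss key, so it is ignored
]
metrics = extract_metrics(sample_history)
```

Keeping extraction separate from file I/O makes the filtering testable without a GPU or a trained model.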

 """
+PettingZoo-compatible GRPO training pipeline for Qwen 2.5.
+Uses MultiAgentTradingEnv-derived scenarios where the Risk Manager and
+Portfolio Manager send governance messages that become part of the Trader
+prompt. The Trader is then trained with Unsloth + TRL GRPOTrainer.
 """
 from __future__ import annotations
 import argparse
 import inspect
 import json
+import os
 import random
 import sys
 from pathlib import Path
 import numpy as np
+from datasets import Dataset
+os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
+os.environ.setdefault("OMP_NUM_THREADS", "1")
 ROOT = Path(__file__).resolve().parents[1]
 if str(ROOT) not in sys.path:
     sys.path.insert(0, str(ROOT))
 from env.reward import (
     alignment_reward_func,
+    format_reward_func,
     profit_reward_func,
 )
+from training.grpo_verifiers_multiagent import (
+    governance_reward_func_multiagent,
+    risk_reward_func_multiagent,
+)
+from training.prompt_utils import (
+    SYSTEM_PROMPT,
+    build_prompt_multiagent,
+    generate_pz_scenarios,
 )
 DEFAULT_MODEL_NAME = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"
 DEFAULT_OUTPUT_DIR = "models/local_policy_grpo_multiagent"
 def require_cuda():
     import torch
     if not torch.cuda.is_available():
         raise SystemExit("GRPO training requires CUDA.")
     return torch
 def load_model(model_name: str, max_seq_length: int):
     from unsloth import FastLanguageModel, PatchFastRL
     PatchFastRL("GRPO", "unsloth")
     model, tokenizer = FastLanguageModel.from_pretrained(
     model = FastLanguageModel.get_peft_model(
         model,
         r=16,
+        target_modules=[
+            "q_proj",
+            "k_proj",
+            "v_proj",
+            "o_proj",
+            "gate_proj",
+            "up_proj",
+            "down_proj",
+        ],
         lora_alpha=16,
         lora_dropout=0,
         bias="none",
     return model, tokenizer
 def make_trainer(model, tokenizer, dataset, args, torch_module):
     from trl.trainer.grpo_config import GRPOConfig
     from trl.trainer.grpo_trainer import GRPOTrainer
     reward_funcs = [
         format_reward_func,
         alignment_reward_func,
+        risk_reward_func_multiagent,
         profit_reward_func,
+        governance_reward_func_multiagent,
     ]
     trainer_kwargs = {
         tokenizer.save_pretrained(output_dir)
 def parse_args():
+    parser = argparse.ArgumentParser(description="Multi-agent GRPO training for Trader (Qwen 2.5)")
     parser.add_argument("--model-name", default=DEFAULT_MODEL_NAME)
     parser.add_argument("--output-dir", default=DEFAULT_OUTPUT_DIR)
     parser.add_argument("--difficulty", choices=["easy", "medium", "hard"], default="easy")
     random.seed(args.seed)
     np.random.seed(args.seed)
+    print(
+        f"Generating {args.num_scenarios} scenarios from MultiAgentTradingEnv "
+        f"(difficulty={args.difficulty})..."
+    )
     scenarios = generate_pz_scenarios(n=args.num_scenarios, difficulty=args.difficulty)
     print(f"  Generated {len(scenarios)} scenarios.")
     prompts = [{"prompt": build_prompt_multiagent(sc)} for sc in scenarios]
     dataset = Dataset.from_list(prompts)
     torch_module = require_cuda()
     model, tokenizer = load_model(args.model_name, args.max_seq_length)
     trainer = make_trainer(model, tokenizer, dataset, args, torch_module)
     print(f"Starting multi-agent GRPO training on {len(dataset)} prompts...")
+    trainer.train()
     history = trainer.state.log_history
     rewards = [x["reward"] for x in history if "reward" in x]
     losses = [x["loss"] for x in history if "loss" in x]
     try:
         from utils.plotting import plot_training_results
         plot_training_results(rewards, losses)
+    except Exception as exc:
+        print(f"  Warning: could not generate plots: {exc}")
     print(f"Saving GRPO policy to {args.output_dir}...")
     save_model(model, tokenizer, args.output_dir)
     metrics_path = Path(args.output_dir) / "training_metrics.json"
+    with open(metrics_path, "w", encoding="utf-8") as handle:
+        json.dump({"rewards": rewards, "losses": losses}, handle, indent=2)
     print("Multi-agent GRPO training complete.")
     print(f"  Model saved to:   {args.output_dir}")
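Reviewer note: `make_trainer` passes a list of reward callables to `GRPOTrainer`. TRL calls each custom reward function with `prompts` and `completions`, plus any extra dataset columns as keyword arguments, and expects one float per completion. The sketch below shows a function of that shape. It is not the code in `training/grpo_verifiers_multiagent.py`: the name `size_limit_reward`, the `size_limit` column, and the JSON `size` field are all hypothetical, chosen only to illustrate a check against a Risk Manager limit.

```python
import json


def size_limit_reward(prompts, completions, size_limit=None, **kwargs):
    """Hypothetical reward: 1.0 if the proposed size respects the limit, else 0.0.

    Assumptions: each completion is a JSON object with a numeric "size" field,
    and size_limit (if given) holds one limit per completion.
    """
    # Fall back to a permissive limit when no per-sample limit is supplied.
    limits = size_limit if size_limit is not None else [1.0] * len(completions)
    scores = []
    for completion, limit in zip(completions, limits):
        # Accept plain strings and conversational completions (list of message dicts).
        text = completion if isinstance(completion, str) else completion[0]["content"]
        try:
            size = float(json.loads(text)["size"])
        except (ValueError, KeyError, TypeError):
            # Unparseable output or a missing field earns no reward.
            scores.append(0.0)
            continue
        scores.append(1.0 if abs(size) <= limit else 0.0)
    return scores


example_scores = size_limit_reward(
    prompts=["p1", "p2", "p3"],
    completions=['{"size": 0.2}', '{"size": 0.9}', "not json"],
    size_limit=[0.5, 0.5, 0.5],
)
```

Returning 0.0 for malformed output, rather than raising, keeps a single bad generation from aborting a training step.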