Spaces: Navigam/corp-env


Navigam committed on Apr 26

Commit 5c8287c

1 Parent(s): 978c5b4

feat: add training pipeline with SFT and RLVR support for Qwen 2.5-3B-Instruct
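
The RLVR stage named in this commit is rejection-sampling fine-tuning: several completions are sampled per prompt, scored with a verifiable reward, and the winners feed another SFT round. Below is a minimal sketch of the filtering step for a single round. It rests on assumptions: the function name `filter_round`, the `min_reward` threshold, and the exact definitions of the statistics are illustrative and are not taken from `training/train_rlvr.py`. Only the field names `keep_rate`, `mean_best_reward`, `mean_sample_reward`, and `prompts_kept` come from the notebook's plotting code.

```python
from statistics import mean
from typing import Dict, List, Tuple


def filter_round(
    rewards_by_prompt: Dict[str, List[float]],
    min_reward: float = 0.5,  # assumed acceptance threshold, not from the repo
) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Keep the best-scoring sample per prompt if it clears min_reward.

    Assumes every prompt has at least one sampled reward.
    """
    winners: Dict[str, int] = {}
    best_rewards: List[float] = []
    all_rewards: List[float] = []
    for prompt_id, rewards in rewards_by_prompt.items():
        # Index of the highest-reward sample (first one wins on ties).
        best_idx = max(range(len(rewards)), key=rewards.__getitem__)
        best_rewards.append(rewards[best_idx])
        all_rewards.extend(rewards)
        if rewards[best_idx] >= min_reward:
            winners[prompt_id] = best_idx
    stats = {
        "keep_rate": len(winners) / len(rewards_by_prompt),
        "mean_best_reward": mean(best_rewards),
        "mean_sample_reward": mean(all_rewards),
        "prompts_kept": float(len(winners)),
    }
    return winners, stats


# Two prompts, four samples each (the low-memory sample count in the notebook).
rewards = {
    "p0": [0.25, 0.75, 0.5, 0.0],
    "p1": [0.0, 0.25, 0.25, 0.0],
}
winners, stats = filter_round(rewards)
print(winners)
print(stats)
```

Only `p0` clears the assumed threshold here, so a single winner would be passed to the next SFT round. A real run may use a different acceptance rule, such as a verifier pass flag instead of a numeric threshold.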


Files changed (3)

notebooks/training.ipynb +707 -46
training/train_rlvr.py +13 -3
training/train_sft.py +2 -1
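
The notebook's setup cell in this commit derives sequence lengths and RLVR sample counts from the detected GPU. Here is a minimal standalone sketch of that selection logic, assuming the same 20 GB threshold and values that appear in the notebook. The function name `select_training_config` and the dictionary keys are hypothetical and do not exist in the repository.

```python
from typing import Dict, Union

LOW_MEMORY_THRESHOLD_GB = 20.0  # same cutoff as the notebook's LOW_MEMORY check


def select_training_config(
    total_mem_gb: float, has_bf16: bool
) -> Dict[str, Union[int, bool, str]]:
    """Hypothetical helper mirroring the notebook's LOW_MEMORY / USE_FP16 switches."""
    low_memory = total_mem_gb < LOW_MEMORY_THRESHOLD_GB
    use_fp16 = not has_bf16  # fall back to FP16 when BF16 is unavailable
    return {
        "low_memory": low_memory,
        "use_fp16": use_fp16,
        "sft_max_seq_len": 3072 if low_memory else 8192,
        "rlvr_max_prompts": 32 if low_memory else 128,
        "rlvr_n_samples": 4 if low_memory else 8,
        "rlvr_max_prompt_len": 3072 if low_memory else 8192,
        "fp16_flag": "--fp16" if use_fp16 else "",
    }


# Illustrative inputs: a 16 GB card without BF16 and an 80 GB card with BF16.
t4 = select_training_config(total_mem_gb=15.8, has_bf16=False)
h100 = select_training_config(total_mem_gb=85.0, has_bf16=True)
print(t4)
print(h100)
```

In the notebook the two inputs would come from `torch.cuda.get_device_properties(0).total_memory / 1e9` and `torch.cuda.is_bf16_supported()`. Keeping the decision in a pure function makes it checkable without a GPU.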

notebooks/training.ipynb CHANGED

@@ -2,27 +2,37 @@
   "cells": [
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "# 🏢 CORP-ENV · Qwen 2.5-7B-Instruct — SFT + RLVR Training\n",
         "\n",
-        "**End-to-end reproducible notebook** for training a Qwen 2.5-7B-Instruct agent on CORP-ENV using Supervised Fine-Tuning (SFT) followed by Rejection-Sampling RL on Verifiable Rewards (RLVR).\n",
         "\n",
         "CORP-ENV is a multi-agent corporate decision environment where a Master Agent governs a **Shared Workspace Document (SWD)** across long-horizon planning episodes, coordinating frozen worker agents. Rewards measure SWD integrity, task completion, milestone adherence, reasoning density, and LLM-judge scores.\n",
         "\n",
         "| Component | Detail |\n",
         "|---|---|\n",
-        "| **Base model** | `Qwen/Qwen2.5-7B-Instruct` |\n",
         "| **SFT script** | `training/train_sft.py` |\n",
         "| **RLVR script** | `training/train_rlvr.py` |\n",
         "| **Tasks** | E1 Launch Readiness, M1 Budget Reallocation, H1 Acquisition Defence |\n",
-        "| **Runtime** | Colab GPU / Lightning AI H100 / Any CUDA GPU |\n",
         "\n",
         "---"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 1️⃣ Setup & Installation"
@@ -31,49 +41,92 @@
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
         "import os\n",
         "\n",
         "# ===== Configuration =====\n",
         "REPO_URL = \"https://huggingface.co/spaces/Navigam/corp-env\"  # Change to your repo\n",
-        "BASE_MODEL = \"Qwen/Qwen2.5-7B-Instruct\"\n",
         "HF_ORG_OR_USER = \"Navigam\"  # Your HF username/org\n",
         "\n",
-        "# SFT hyperparameters\n",
-        "SFT_MAX_STEPS = 30        # Quick judge smoke; set -1 for full-epoch training\n",
         "SFT_EPOCHS = 2.0\n",
         "SFT_LR = 2e-4\n",
         "SFT_BATCH_SIZE = 1\n",
         "SFT_GRAD_ACCUM = 8\n",
         "\n",
-        "# RLVR hyperparameters\n",
         "RLVR_ROUNDS = 3\n",
-        "RLVR_MAX_PROMPTS = 128\n",
-        "RLVR_N_SAMPLES = 8\n",
         "RLVR_TEMPERATURE = 0.7\n",
         "\n",
         "# Eval\n",
         "EVAL_EPISODES = 3\n",
-        "EVAL_MAX_STEPS = 30"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
-        "# Clone and install\n",
         "!git clone {REPO_URL} corp_gym 2>/dev/null || echo 'Repo already cloned'\n",
         "%cd corp_gym\n",
-        "!pip install -U pip\n",
-        "!pip install -e \".[training,plots]\""
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 2️⃣ Hugging Face Login (optional)"
@@ -82,6 +135,7 @@
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
@@ -91,6 +145,325 @@
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 3️⃣ Environment Validation\n",
@@ -101,6 +474,7 @@
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
@@ -110,6 +484,7 @@
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 4️⃣ Data Preparation\n",
@@ -120,6 +495,7 @@
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
@@ -148,27 +524,56 @@
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
-        "# Check data stats\n",
-        "import json\n",
-        "from pathlib import Path\n",
-        "\n",
         "sft_path = Path(\"data/sft/e1_m1_h1_examples.jsonl\")\n",
         "if sft_path.exists():\n",
         "    lines = [json.loads(l) for l in sft_path.read_text().strip().splitlines() if l.strip()]\n",
         "    print(f\"\\n✅ SFT dataset: {len(lines)} examples\")\n",
-        "    # Count by number of messages\n",
         "    turn_counts = [len(ex['messages']) for ex in lines]\n",
         "    print(f\"   Avg turns per example: {sum(turn_counts)/len(turn_counts):.1f}\")\n",
         "    print(f\"   Min/Max turns: {min(turn_counts)} / {max(turn_counts)}\")\n",
         "else:\n",
         "    print(\"❌ SFT dataset not found. Check data preparation above.\")"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 5️⃣ Baseline Evaluation\n",
@@ -179,6 +584,7 @@
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
@@ -186,39 +592,142 @@
         "!python eval.py --policy oracle --label oracle --episodes {EVAL_EPISODES}"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 6️⃣ SFT Training (Unsloth + TRL)\n",
         "\n",
-        "Fine-tune Qwen 2.5-7B-Instruct with LoRA using verified CORP-ENV trajectories.\n",
         "\n",
         "- Uses `unsloth.FastLanguageModel` for 4-bit QLoRA\n",
         "- Uses `trl.SFTTrainer` with messages-format conversational SFT\n",
-        "- LoRA `r=32`, targets all attention + MLP projections"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
         "!python training/train_sft.py \\\n",
         "  --model {BASE_MODEL} \\\n",
         "  --data data/sft/e1_m1_h1_examples.jsonl \\\n",
         "  --output outputs/sft_adapter \\\n",
         "  --max-steps {SFT_MAX_STEPS} \\\n",
         "  --epochs {SFT_EPOCHS} \\\n",
         "  --lr {SFT_LR} \\\n",
         "  --batch-size {SFT_BATCH_SIZE} \\\n",
         "  --grad-accum {SFT_GRAD_ACCUM} \\\n",
-        "  --push-to-hub {HF_ORG_OR_USER}/corp-env-sft-qwen2.5-7b"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 7️⃣ Evaluate SFT Adapter"
@@ -227,20 +736,58 @@
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
         "!python eval.py \\\n",
         "  --policy hf \\\n",
         "  --label sft \\\n",
         "  --model {BASE_MODEL} \\\n",
         "  --adapter outputs/sft_adapter \\\n",
         "  --episodes {EVAL_EPISODES} \\\n",
-        "  --max-steps {EVAL_MAX_STEPS}"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 8️⃣ RLVR Training (Rejection-Sampling FT)\n",
@@ -252,15 +799,25 @@
         "4. SFT on that curated set\n",
         "5. Repeating for multiple outer rounds\n",
         "\n",
-        "This avoids the zero-variance gradient problem seen with GRPO on CORP-ENV."
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
         "!python training/train_rlvr.py \\\n",
         "  --model {BASE_MODEL} \\\n",
         "  --adapter outputs/sft_adapter \\\n",
@@ -270,15 +827,40 @@
         "  --n-samples {RLVR_N_SAMPLES} \\\n",
         "  --temperature {RLVR_TEMPERATURE} \\\n",
         "  --max-prompts {RLVR_MAX_PROMPTS} \\\n",
         "  --strict-json \\\n",
         "  --use-stub-workers \\\n",
         "  --disable-llm-judge \\\n",
-        "  --stats-file results/runs/rlvr_qwen2.5_7b_stats.jsonl \\\n",
-        "  --push-to-hub {HF_ORG_OR_USER}/corp-env-rlvr-qwen2.5-7b"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 9️⃣ Evaluate RLVR Adapter"
@@ -287,72 +869,149 @@
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
         "!python eval.py \\\n",
         "  --policy hf \\\n",
         "  --label rlvr \\\n",
         "  --model {BASE_MODEL} \\\n",
         "  --adapter outputs/rlvr_adapter \\\n",
         "  --episodes {EVAL_EPISODES} \\\n",
-        "  --max-steps {EVAL_MAX_STEPS}"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## 📊 Generate Comparison Plots\n",
         "\n",
-        "Aggregate all eval runs and generate:\n",
-        "- Terminal reward comparison (grouped bar chart)\n",
-        "- Verifier pass rate by task\n",
-        "- Invalid action rate\n",
-        "- Reward curve over episode steps"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
         "!python plot_results.py \\\n",
         "  --inputs results/runs \\\n",
-        "  --output-dir results/model_compare_qwen25_7b"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
       "metadata": {},
       "outputs": [],
       "source": [
         "from IPython.display import Image, display, Markdown\n",
-        "from pathlib import Path\n",
         "\n",
-        "plot_dir = Path(\"results/model_compare_qwen25_7b\")\n",
         "if not plot_dir.exists():\n",
         "    plot_dir = Path(\"results/model_compare_qwen25_fresh_no_grpo_ep5rlvr\")\n",
         "\n",
-        "for png in sorted(plot_dir.glob(\"*.png\")):\n",
-        "    display(Markdown(f\"### {png.stem.replace('_', ' ').title()}\"))\n",
-        "    display(Image(filename=str(png), width=800))\n",
         "\n",
-        "# Show summary table\n",
-        "summary_md = plot_dir / \"comparison_summary.md\"\n",
-        "if summary_md.exists():\n",
-        "    display(Markdown(summary_md.read_text()))"
       ]
     },
     {
       "cell_type": "markdown",
       "metadata": {},
       "source": [
         "## 📋 Results Summary\n",
         "\n",
-        "Expected progression for Qwen 2.5-7B-Instruct on CORP-ENV:\n",
         "\n",
         "| Stage | E1 Terminal Reward | M1 Terminal Reward | H1 Terminal Reward | M1 Success |\n",
         "|-------|-------------------|-------------------|-------------------|------------|\n",
@@ -361,7 +1020,9 @@
         "| SFT | 0.910 | 0.943 | 0.889 | 100% |\n",
         "| RLVR | 0.910 | 0.932 | 0.779 | 80% |\n",
         "\n",
-        "> **Key takeaway**: SFT dramatically improves M1 (budget reallocation) from 0% to 100% success rate. RLVR maintains strong performance while reducing reliance on fixed trajectories."
       ]
     }
   ],
@@ -384,4 +1045,4 @@
   },
   "nbformat": 4,
   "nbformat_minor": 5
-}

   "cells": [
     {
       "cell_type": "markdown",
+      "id": "23a31c02",
       "metadata": {},
       "source": [
+        "# 🏢 CORP-ENV · Qwen 2.5-3B-Instruct — SFT + RLVR Training\n",
         "\n",
+        "**End-to-end reproducible notebook** for training a Qwen 2.5-3B-Instruct agent on CORP-ENV using Supervised Fine-Tuning (SFT) followed by Rejection-Sampling RL on Verifiable Rewards (RLVR).\n",
+        "\n",
+        "### ⚡ Optimized for Google Colab T4 (16 GB VRAM)\n",
+        "\n",
+        "This notebook is configured to run end-to-end on a **free-tier T4 GPU**:\n",
+        "- 4-bit QLoRA quantization to fit the 3B model in ~4 GB VRAM\n",
+        "- **FP16** precision (T4 lacks BF16 hardware support)\n",
+        "- Reduced sequence lengths (3072 tokens) and RLVR samples (4 per prompt)\n",
+        "- Inline visualizations after every training and evaluation step\n",
         "\n",
         "CORP-ENV is a multi-agent corporate decision environment where a Master Agent governs a **Shared Workspace Document (SWD)** across long-horizon planning episodes, coordinating frozen worker agents. Rewards measure SWD integrity, task completion, milestone adherence, reasoning density, and LLM-judge scores.\n",
         "\n",
         "| Component | Detail |\n",
         "|---|---|\n",
+        "| **Base model** | `Qwen/Qwen2.5-3B-Instruct` |\n",
         "| **SFT script** | `training/train_sft.py` |\n",
         "| **RLVR script** | `training/train_rlvr.py` |\n",
         "| **Tasks** | E1 Launch Readiness, M1 Budget Reallocation, H1 Acquisition Defence |\n",
+        "| **Runtime** | ✅ Google Colab T4 / Lightning AI H100 / Any CUDA GPU |\n",
         "\n",
         "---"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "15d441af",
       "metadata": {},
       "source": [
         "## 1️⃣ Setup & Installation"
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "e9394fab",
       "metadata": {},
       "outputs": [],
       "source": [
         "import os\n",
+        "import torch\n",
+        "\n",
+        "# ===== GPU Detection & Configuration =====\n",
+        "if torch.cuda.is_available():\n",
+        "    gpu_name = torch.cuda.get_device_name(0)\n",
+        "    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9\n",
+        "    has_bf16 = torch.cuda.is_bf16_supported()\n",
+        "    print(f\"🖥️  GPU: {gpu_name} ({gpu_mem:.1f} GB)\")\n",
+        "    print(f\"   BF16 support: {'✅ Yes' if has_bf16 else '❌ No (using FP16)'}\")\n",
+        "else:\n",
+        "    raise RuntimeError(\"❌ No GPU detected! Enable GPU in Colab: Runtime → Change runtime type → T4 GPU\")\n",
+        "\n",
+        "# Auto-detect hardware constraints\n",
+        "LOW_MEMORY = gpu_mem < 20.0  # e.g., T4 (16GB), RTX 4080 (16GB) need smaller batches/sequences\n",
+        "USE_FP16 = not has_bf16      # e.g., T4 and V100 don't support BF16\n",
         "\n",
         "# ===== Configuration =====\n",
         "REPO_URL = \"https://huggingface.co/spaces/Navigam/corp-env\"  # Change to your repo\n",
+        "BASE_MODEL = \"Qwen/Qwen2.5-3B-Instruct\"\n",
         "HF_ORG_OR_USER = \"Navigam\"  # Your HF username/org\n",
         "\n",
+        "# SFT hyperparameters (T4-optimized)\n",
+        "SFT_MAX_STEPS = 30         # Quick judge smoke; set -1 for full-epoch training\n",
         "SFT_EPOCHS = 2.0\n",
         "SFT_LR = 2e-4\n",
         "SFT_BATCH_SIZE = 1\n",
         "SFT_GRAD_ACCUM = 8\n",
+        "SFT_MAX_SEQ_LEN = 3072 if LOW_MEMORY else 8192  # Reduced for <20GB VRAM\n",
         "\n",
+        "# RLVR hyperparameters (T4-optimized)\n",
         "RLVR_ROUNDS = 3\n",
+        "RLVR_MAX_PROMPTS = 32 if LOW_MEMORY else 128   # Fewer prompts to fit in T4 time/memory\n",
+        "RLVR_N_SAMPLES = 4 if LOW_MEMORY else 8         # Fewer samples per prompt\n",
         "RLVR_TEMPERATURE = 0.7\n",
+        "RLVR_MAX_PROMPT_LEN = 3072 if LOW_MEMORY else 8192\n",
+        "RLVR_MAX_COMPLETION_LEN = 512\n",
         "\n",
         "# Eval\n",
         "EVAL_EPISODES = 3\n",
+        "EVAL_MAX_STEPS = 30\n",
+        "\n",
+        "# FP16 flag for training scripts\n",
+        "FP16_FLAG = \"--fp16\" if USE_FP16 else \"\"\n",
+        "\n",
+        "print(f\"\\n📋 Config: model={BASE_MODEL}, fp16={USE_FP16}, seq_len={SFT_MAX_SEQ_LEN}\")\n",
+        "print(f\"   RLVR: rounds={RLVR_ROUNDS}, prompts={RLVR_MAX_PROMPTS}, samples={RLVR_N_SAMPLES}\")"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "1fccadd9",
       "metadata": {},
       "outputs": [],
       "source": [
+        "# ===== Install dependencies (Colab-optimized) =====\n",
+        "# Unsloth requires a specific install path for Colab\n",
+        "import sys\n",
+        "\n",
+        "# Check if running in Colab\n",
+        "IN_COLAB = 'google.colab' in sys.modules\n",
+        "\n",
+        "if IN_COLAB:\n",
+        "    print(\"🔧 Installing Unsloth for Colab...\")\n",
+        "    !pip install -q --no-deps trl peft accelerate bitsandbytes triton\n",
+        "    !pip install -q \"unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git\"\n",
+        "    !pip install -q --no-deps unsloth_zoo\n",
+        "    !pip install -q xformers\n",
+        "else:\n",
+        "    print(\"🔧 Installing from pyproject.toml...\")\n",
+        "    !pip install -q -U pip\n",
+        "\n",
+        "# Clone and install CORP-ENV\n",
         "!git clone {REPO_URL} corp_gym 2>/dev/null || echo 'Repo already cloned'\n",
         "%cd corp_gym\n",
+        "!pip install -q -e \".[training,plots]\""
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "076d342b",
       "metadata": {},
       "source": [
         "## 2️⃣ Hugging Face Login (optional)"
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "df0904d7",
       "metadata": {},
       "outputs": [],
       "source": [
     },
     {
       "cell_type": "markdown",
+      "id": "7d4a001c",
+      "metadata": {},
+      "source": [
+        "## 📊 Visualization Utilities\n",
+        "\n",
+        "Helper functions for inline charts after every eval and training step."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "3930908e",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import json\n",
+        "import matplotlib.pyplot as plt\n",
+        "import matplotlib.ticker as mticker\n",
+        "import numpy as np\n",
+        "from pathlib import Path\n",
+        "from collections import defaultdict\n",
+        "from IPython.display import display, Markdown, HTML\n",
+        "\n",
+        "# ---- Plotting style ----\n",
+        "plt.rcParams.update({\n",
+        "    'figure.facecolor': '#0d1117',\n",
+        "    'axes.facecolor': '#161b22',\n",
+        "    'axes.edgecolor': '#30363d',\n",
+        "    'axes.labelcolor': '#c9d1d9',\n",
+        "    'text.color': '#c9d1d9',\n",
+        "    'xtick.color': '#8b949e',\n",
+        "    'ytick.color': '#8b949e',\n",
+        "    'grid.color': '#21262d',\n",
+        "    'font.family': 'sans-serif',\n",
+        "    'font.size': 11,\n",
+        "})\n",
+        "\n",
+        "PALETTE = {\n",
+        "    'baseline': '#8b949e',\n",
+        "    'oracle': '#a371f7',\n",
+        "    'sft': '#3fb950',\n",
+        "    'rlvr': '#f0883e',\n",
+        "    'e1_launch_readiness': '#58a6ff',\n",
+        "    'm1_budget_reallocation': '#d2a8ff',\n",
+        "    'h1_acquisition_defence': '#7ee787',\n",
+        "}\n",
+        "TASK_SHORT = {\n",
+        "    'e1_launch_readiness': 'E1 Launch',\n",
+        "    'm1_budget_reallocation': 'M1 Budget',\n",
+        "    'h1_acquisition_defence': 'H1 Acquisition',\n",
+        "}\n",
+        "\n",
+        "def load_eval_jsonl(path):\n",
+        "    \"\"\"Load evaluation rows from a JSONL file, or recursively from a directory of eval JSONL files.\"\"\"\n",
+        "    rows = []\n",
+        "    p = Path(path)\n",
+        "    if p.is_dir():\n",
+        "        for f in sorted(p.rglob('*_eval.jsonl')):\n",
+        "            rows.extend(load_eval_jsonl(f))\n",
+        "        for f in sorted(p.rglob('eval.jsonl')):\n",
+        "            rows.extend(load_eval_jsonl(f))\n",
+        "        return rows\n",
+        "    if p.exists():\n",
+        "        for line in p.read_text(encoding='utf-8').strip().splitlines():\n",
+        "            if line.strip():\n",
+        "                rows.append(json.loads(line))\n",
+        "    return rows\n",
+        "\n",
+        "def plot_eval_dashboard(rows, title=\"Evaluation Results\"):\n",
+        "    \"\"\"Create a 2x2 dashboard of evaluation metrics.\"\"\"\n",
+        "    if not rows:\n",
+        "        print(\"⚠️  No evaluation data to plot.\")\n",
+        "        return\n",
+        "\n",
+        "    # Group by task\n",
+        "    by_task = defaultdict(list)\n",
+        "    for r in rows:\n",
+        "        by_task[r['task_id']].append(r)\n",
+        "\n",
+        "    tasks = sorted(by_task.keys())\n",
+        "    task_labels = [TASK_SHORT.get(t, t) for t in tasks]\n",
+        "\n",
+        "    # Compute metrics\n",
+        "    avg_reward = [np.mean([r['terminal_reward'] for r in by_task[t]]) for t in tasks]\n",
+        "    avg_pass = [np.mean([r['verifier_pass_rate'] for r in by_task[t]]) for t in tasks]\n",
+        "    success_rate = [np.mean([1 if r.get('success') else 0 for r in by_task[t]]) for t in tasks]\n",
+        "    avg_steps = [np.mean([r.get('steps', 0) for r in by_task[t]]) for t in tasks]\n",
+        "\n",
+        "    fig, axes = plt.subplots(2, 2, figsize=(14, 10))\n",
+        "    fig.suptitle(title, fontsize=18, fontweight='bold', color='#f0f6fc', y=0.98)\n",
+        "\n",
+        "    # -- Terminal Reward --\n",
+        "    ax = axes[0, 0]\n",
+        "    colors = [PALETTE.get(t, '#58a6ff') for t in tasks]\n",
+        "    bars = ax.bar(task_labels, avg_reward, color=colors, edgecolor='#30363d', linewidth=0.8)\n",
+        "    for bar, val in zip(bars, avg_reward):\n",
+        "        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,\n",
+        "                f'{val:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold', color='#f0f6fc')\n",
+        "    ax.set_title('Terminal Reward', fontsize=13, fontweight='bold')\n",
+        "    ax.set_ylim(0, 1.15)\n",
+        "    ax.grid(axis='y', alpha=0.3)\n",
+        "\n",
+        "    # -- Verifier Pass Rate --\n",
+        "    ax = axes[0, 1]\n",
+        "    bars = ax.bar(task_labels, avg_pass, color=colors, edgecolor='#30363d', linewidth=0.8)\n",
+        "    for bar, val in zip(bars, avg_pass):\n",
+        "        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,\n",
+        "                f'{val:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold', color='#f0f6fc')\n",
+        "    ax.set_title('Verifier Pass Rate', fontsize=13, fontweight='bold')\n",
+        "    ax.set_ylim(0, 1.15)\n",
+        "    ax.grid(axis='y', alpha=0.3)\n",
+        "\n",
+        "    # -- Success Rate --\n",
+        "    ax = axes[1, 0]\n",
+        "    bars = ax.bar(task_labels, success_rate, color=colors, edgecolor='#30363d', linewidth=0.8)\n",
+        "    for bar, val in zip(bars, success_rate):\n",
+        "        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,\n",
+        "                f'{val:.0%}', ha='center', va='bottom', fontsize=10, fontweight='bold', color='#f0f6fc')\n",
+        "    ax.set_title('Success Rate', fontsize=13, fontweight='bold')\n",
+        "    ax.set_ylim(0, 1.25)\n",
+        "    ax.yaxis.set_major_formatter(mticker.PercentFormatter(1.0))\n",
+        "    ax.grid(axis='y', alpha=0.3)\n",
+        "\n",
+        "    # -- Avg Steps --\n",
+        "    ax = axes[1, 1]\n",
+        "    bars = ax.bar(task_labels, avg_steps, color=colors, edgecolor='#30363d', linewidth=0.8)\n",
+        "    for bar, val in zip(bars, avg_steps):\n",
+        "        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,\n",
+        "                f'{val:.1f}', ha='center', va='bottom', fontsize=10, fontweight='bold', color='#f0f6fc')\n",
+        "    ax.set_title('Average Steps per Episode', fontsize=13, fontweight='bold')\n",
+        "    ax.grid(axis='y', alpha=0.3)\n",
+        "\n",
+        "    for ax in axes.flat:\n",
+        "        ax.spines['top'].set_visible(False)\n",
+        "        ax.spines['right'].set_visible(False)\n",
+        "\n",
+        "    fig.tight_layout(rect=[0, 0, 1, 0.95])\n",
+        "    plt.show()\n",
+        "\n",
+        "    # Print summary table\n",
+        "    display(Markdown(\"### 📋 Summary Table\"))\n",
+        "    header = \"| Task | Terminal Reward | Verifier Pass | Success Rate | Avg Steps |\"\n",
+        "    sep    = \"|------|---------------|--------------|-------------|----------|\"\n",
+        "    lines = [header, sep]\n",
+        "    for i, t in enumerate(tasks):\n",
+        "        lines.append(f\"| {TASK_SHORT.get(t, t)} | {avg_reward[i]:.3f} | {avg_pass[i]:.3f} | {success_rate[i]:.0%} | {avg_steps[i]:.1f} |\")\n",
+        "    display(Markdown('\\n'.join(lines)))\n",
+        "\n",
+        "\n",
+        "def plot_reward_traces(rows, title=\"Reward Traces\"):\n",
+        "    \"\"\"Plot reward curves over episode steps.\"\"\"\n",
+        "    traces_by_task = defaultdict(list)\n",
+        "    for r in rows:\n",
+        "        trace = r.get('reward_trace', [])\n",
+        "        if trace:\n",
+        "            traces_by_task[r['task_id']].append([float(x) for x in trace])\n",
+        "\n",
+        "    if not traces_by_task:\n",
+        "        return\n",
+        "\n",
+        "    fig, ax = plt.subplots(figsize=(12, 5))\n",
+        "    for task_id, traces in sorted(traces_by_task.items()):\n",
+        "        max_len = max(len(t) for t in traces)\n",
+        "        means = []\n",
+        "        for idx in range(max_len):\n",
+        "            vals = [t[idx] for t in traces if idx < len(t)]\n",
+        "            means.append(np.mean(vals))\n",
+        "        xs = range(1, max_len + 1)\n",
+        "        color = PALETTE.get(task_id, '#58a6ff')\n",
+        "        ax.plot(xs, means, marker='o', linewidth=2.2, markersize=4,\n",
+        "                label=TASK_SHORT.get(task_id, task_id), color=color)\n",
+        "        if len(traces) > 1:\n",
+        "            mins = [min(t[i] for t in traces if i < len(t)) for i in range(max_len)]\n",
+        "            maxs = [max(t[i] for t in traces if i < len(t)) for i in range(max_len)]\n",
+        "            ax.fill_between(xs, mins, maxs, alpha=0.15, color=color)\n",
+        "\n",
+        "    ax.set_title(title, fontsize=15, fontweight='bold')\n",
+        "    ax.set_xlabel('Environment Step')\n",
+        "    ax.set_ylabel('Step Reward')\n",
+        "    ax.axhline(0, color='#484f58', linewidth=0.8, alpha=0.5)\n",
+        "    ax.legend(frameon=False, fontsize=10)\n",
+        "    ax.spines['top'].set_visible(False)\n",
+        "    ax.spines['right'].set_visible(False)\n",
+        "    ax.grid(axis='y', alpha=0.3)\n",
+        "    fig.tight_layout()\n",
+        "    plt.show()\n",
+        "\n",
+        "\n",
+        "def plot_stage_comparison(all_evals, metric='terminal_reward', title='Model Stage Comparison'):\n",
+        "    \"\"\"Compare multiple evaluation stages side-by-side.\"\"\"\n",
+        "    if not all_evals:\n",
+        "        return\n",
+        "\n",
+        "    stages = list(all_evals.keys())\n",
+        "    all_tasks = sorted({r['task_id'] for rows in all_evals.values() for r in rows})\n",
+        "    task_labels = [TASK_SHORT.get(t, t) for t in all_tasks]\n",
+        "\n",
+        "    x = np.arange(len(all_tasks))\n",
+        "    width = 0.8 / max(len(stages), 1)\n",
+        "\n",
+        "    fig, ax = plt.subplots(figsize=(max(10, len(all_tasks) * 3), 6))\n",
+        "    for idx, stage in enumerate(stages):\n",
+        "        rows = all_evals[stage]\n",
+        "        by_task = defaultdict(list)\n",
+        "        for r in rows:\n",
+        "            by_task[r['task_id']].append(float(r.get(metric, 0)))\n",
+        "        vals = [np.mean(by_task.get(t, [0])) for t in all_tasks]\n",
+        "        offsets = x - 0.4 + width/2 + idx * width\n",
+        "        color = PALETTE.get(stage, f'C{idx}')\n",
+        "        bars = ax.bar(offsets, vals, width, label=stage.upper(), color=color,\n",
+        "                      edgecolor='#30363d', linewidth=0.8)\n",
+        "        for bar, val in zip(bars, vals):\n",
+        "            if val > 0:\n",
+        "                ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.015,\n",
+        "                        f'{val:.2f}', ha='center', va='bottom', fontsize=9,\n",
+        "                        fontweight='bold', color='#f0f6fc')\n",
+        "\n",
+        "    ax.set_title(title, fontsize=16, fontweight='bold', color='#f0f6fc')\n",
+        "    ax.set_xticks(x)\n",
+        "    ax.set_xticklabels(task_labels)\n",
+        "    ax.set_ylabel(metric.replace('_', ' ').title())\n",
+        "    ax.set_ylim(0, 1.15)\n",
+        "    ax.legend(frameon=False, fontsize=10, loc='upper center', bbox_to_anchor=(0.5, -0.08), ncol=len(stages))\n",
+        "    ax.spines['top'].set_visible(False)\n",
+        "    ax.spines['right'].set_visible(False)\n",
+        "    ax.grid(axis='y', alpha=0.3)\n",
+        "    fig.tight_layout()\n",
+        "    plt.show()\n",
+        "\n",
+        "\n",
+        "def plot_rlvr_stats(stats_file):\n",
+        "    \"\"\"Plot RLVR training stats per round.\"\"\"\n",
+        "    p = Path(stats_file)\n",
+        "    if not p.exists():\n",
+        "        print(f\"⚠️  Stats file not found: {stats_file}\")\n",
+        "        return\n",
+        "\n",
+        "    stats = [json.loads(line) for line in p.read_text().strip().splitlines() if line.strip()]\n",
+        "    if not stats:\n",
+        "        return\n",
+        "\n",
+        "    rounds = [s['round'] for s in stats]\n",
+        "    keep_rates = [s['keep_rate'] for s in stats]\n",
+        "    mean_best = [s['mean_best_reward'] for s in stats]\n",
+        "    mean_any = [s['mean_sample_reward'] for s in stats]\n",
+        "    kept_counts = [int(s['prompts_kept']) for s in stats]\n",
+        "\n",
+        "    fig, axes = plt.subplots(1, 3, figsize=(16, 5))\n",
+        "    fig.suptitle('RLVR Training Progress', fontsize=16, fontweight='bold', color='#f0f6fc', y=1.02)\n",
+        "\n",
+        "    # Keep rate\n",
+        "    ax = axes[0]\n",
+        "    ax.plot(rounds, keep_rates, marker='o', linewidth=2.5, color='#3fb950', markersize=8)\n",
+        "    ax.fill_between(rounds, keep_rates, alpha=0.15, color='#3fb950')\n",
+        "    ax.set_title('Keep Rate per Round', fontweight='bold')\n",
+        "    ax.set_xlabel('Round')\n",
+        "    ax.set_ylabel('Keep Rate')\n",
+        "    ax.set_ylim(0, 1.05)\n",
+        "    ax.yaxis.set_major_formatter(mticker.PercentFormatter(1.0))\n",
+        "    ax.grid(alpha=0.3)\n",
+        "\n",
+        "    # Reward progression\n",
+        "    ax = axes[1]\n",
+        "    ax.plot(rounds, mean_best, marker='s', linewidth=2.5, color='#f0883e', markersize=8, label='Best')\n",
+        "    ax.plot(rounds, mean_any, marker='D', linewidth=2.5, color='#58a6ff', markersize=7, label='Any sample')\n",
+        "    ax.set_title('Mean Reward per Round', fontweight='bold')\n",
+        "    ax.set_xlabel('Round')\n",
+        "    ax.set_ylabel('Reward')\n",
+        "    ax.legend(frameon=False)\n",
+        "    ax.grid(alpha=0.3)\n",
+        "\n",
+        "    # Prompts kept\n",
+        "    ax = axes[2]\n",
+        "    ax.bar(rounds, kept_counts, color='#a371f7', edgecolor='#30363d', linewidth=0.8)\n",
+        "    for r, c in zip(rounds, kept_counts):\n",
+        "        ax.text(r, c + 0.5, str(c), ha='center', fontweight='bold', fontsize=11, color='#f0f6fc')\n",
+        "    ax.set_title('Prompts Kept (Winners)', fontweight='bold')\n",
+        "    ax.set_xlabel('Round')\n",
+        "    ax.set_ylabel('Count')\n",
+        "    ax.grid(axis='y', alpha=0.3)\n",
+        "\n",
+        "    for ax in axes:\n",
+        "        ax.spines['top'].set_visible(False)\n",
+        "        ax.spines['right'].set_visible(False)\n",
+        "        ax.xaxis.set_major_locator(mticker.MaxNLocator(integer=True))\n",
+        "\n",
+        "    fig.tight_layout()\n",
+        "    plt.show()\n",
+        "\n",
+        "    # Print per-round summary\n",
+        "    display(Markdown(\"### 📋 RLVR Round Summary\"))\n",
+        "    header = \"| Round | Keep Rate | Mean Best Reward | Mean Any Reward | Prompts Kept | Time (s) |\"\n",
+        "    sep    = \"|-------|-----------|-----------------|----------------|-------------|----------|\"\n",
+        "    lines = [header, sep]\n",
+        "    for s in stats:\n",
+        "        lines.append(f\"| {s['round']} | {s['keep_rate']:.1%} | {s['mean_best_reward']:.3f} | {s['mean_sample_reward']:.3f} | {int(s['prompts_kept'])} | {s['seconds']:.0f} |\")\n",
+        "    display(Markdown('\\n'.join(lines)))\n",
+        "\n",
+        "\n",
+        "def gpu_status():\n",
+        "    \"\"\"Print current GPU memory usage.\"\"\"\n",
+        "    if torch.cuda.is_available():\n",
+        "        alloc = torch.cuda.memory_allocated() / 1e9\n",
+        "        cached = torch.cuda.memory_reserved() / 1e9\n",
+        "        total = torch.cuda.get_device_properties(0).total_memory / 1e9\n",
+        "        pct = alloc / total * 100\n",
+        "        bar_len, filled = 20, int(pct / 5)\n",
+        "        bar = '█' * filled + '░' * (bar_len - filled)\n",
+        "        print(f\"🖥️  GPU Memory: [{bar}] {alloc:.1f}/{total:.1f} GB ({pct:.0f}%) | Cached: {cached:.1f} GB\")\n",
+        "\n",
+        "\n",
+        "# Collect all eval results for final comparison\n",
+        "ALL_EVALS = {}\n",
+        "print(\"✅ Visualization utilities loaded.\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "43b92bf5",
       "metadata": {},
       "source": [
         "## 3️⃣ Environment Validation\n",
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "71cfe355",
       "metadata": {},
       "outputs": [],
       "source": [
     },
     {
       "cell_type": "markdown",
+      "id": "0275c763",
       "metadata": {},
       "source": [
         "## 4️⃣ Data Preparation\n",
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "85901039",
       "metadata": {},
       "outputs": [],
       "source": [
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "eb6a7997",
       "metadata": {},
       "outputs": [],
       "source": [
+        "# Check data stats & visualize\n",
         "sft_path = Path(\"data/sft/e1_m1_h1_examples.jsonl\")\n",
         "if sft_path.exists():\n",
         "    lines = [json.loads(l) for l in sft_path.read_text().strip().splitlines() if l.strip()]\n",
         "    print(f\"\\n✅ SFT dataset: {len(lines)} examples\")\n",
         "    turn_counts = [len(ex['messages']) for ex in lines]\n",
         "    print(f\"   Avg turns per example: {sum(turn_counts)/len(turn_counts):.1f}\")\n",
         "    print(f\"   Min/Max turns: {min(turn_counts)} / {max(turn_counts)}\")\n",
+        "\n",
+        "    # Visualize data distribution\n",
+        "    fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
+        "    fig.suptitle('SFT Dataset Overview', fontsize=14, fontweight='bold', color='#f0f6fc')\n",
+        "\n",
+        "    # Turns histogram\n",
+        "    axes[0].hist(turn_counts, bins=range(min(turn_counts), max(turn_counts)+2),\n",
+        "                 color='#58a6ff', edgecolor='#30363d', alpha=0.85)\n",
+        "    axes[0].set_title('Message Turns per Example', fontweight='bold')\n",
+        "    axes[0].set_xlabel('Number of Turns')\n",
+        "    axes[0].set_ylabel('Count')\n",
+        "    axes[0].grid(axis='y', alpha=0.3)\n",
+        "\n",
+        "    # Role distribution\n",
+        "    role_counts = defaultdict(int)\n",
+        "    for ex in lines:\n",
+        "        for msg in ex['messages']:\n",
+        "            role_counts[msg['role']] += 1\n",
+        "    roles = list(role_counts.keys())\n",
+        "    counts = list(role_counts.values())\n",
+        "    role_colors = ['#a371f7', '#3fb950', '#58a6ff', '#f0883e'][:len(roles)]\n",
+        "    axes[1].barh(roles, counts, color=role_colors, edgecolor='#30363d')\n",
+        "    axes[1].set_title('Messages by Role', fontweight='bold')\n",
+        "    axes[1].set_xlabel('Count')\n",
+        "    axes[1].grid(axis='x', alpha=0.3)\n",
+        "\n",
+        "    for ax in axes:\n",
+        "        ax.spines['top'].set_visible(False)\n",
+        "        ax.spines['right'].set_visible(False)\n",
+        "    fig.tight_layout()\n",
+        "    plt.show()\n",
         "else:\n",
         "    print(\"❌ SFT dataset not found. Check data preparation above.\")"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "8c529b78",
       "metadata": {},
       "source": [
         "## 5️⃣ Baseline Evaluation\n",
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "9c5db0c1",
       "metadata": {},
       "outputs": [],
       "source": [
         "!python eval.py --policy oracle --label oracle --episodes {EVAL_EPISODES}"
       ]
     },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "f106aaed",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# 📊 Visualize baseline results\n",
+        "display(Markdown(\"## 📊 Baseline Results\"))\n",
+        "\n",
+        "baseline_rows = load_eval_jsonl(\"results/runs\")\n",
+        "baseline_only = [r for r in baseline_rows if r.get('model_stage') == 'baseline']\n",
+        "oracle_only = [r for r in baseline_rows if r.get('model_stage') == 'oracle']\n",
+        "\n",
+        "if baseline_only:\n",
+        "    display(Markdown(\"### 🔹 Scripted Weak Baseline\"))\n",
+        "    plot_eval_dashboard(baseline_only, title=\"Scripted Weak Baseline\")\n",
+        "    plot_reward_traces(baseline_only, title=\"Baseline Reward Traces\")\n",
+        "    ALL_EVALS['baseline'] = baseline_only\n",
+        "\n",
+        "if oracle_only:\n",
+        "    display(Markdown(\"### 🔹 Oracle Policy\"))\n",
+        "    plot_eval_dashboard(oracle_only, title=\"Oracle Policy\")\n",
+        "    plot_reward_traces(oracle_only, title=\"Oracle Reward Traces\")\n",
+        "    ALL_EVALS['oracle'] = oracle_only\n",
+        "\n",
+        "# Side-by-side comparison if both exist\n",
+        "if baseline_only and oracle_only:\n",
+        "    plot_stage_comparison(\n",
+        "        {'baseline': baseline_only, 'oracle': oracle_only},\n",
+        "        metric='terminal_reward',\n",
+        "        title='Baseline vs Oracle — Terminal Reward'\n",
+        "    )\n",
+        "gpu_status()"
+      ]
+    },
     {
       "cell_type": "markdown",
+      "id": "3011f739",
       "metadata": {},
       "source": [
         "## 6️⃣ SFT Training (Unsloth + TRL)\n",
         "\n",
+        "Fine-tune Qwen 2.5-3B-Instruct with LoRA using verified CORP-ENV trajectories.\n",
         "\n",
         "- Uses `unsloth.FastLanguageModel` for 4-bit QLoRA\n",
         "- Uses `trl.SFTTrainer` with messages-format conversational SFT\n",
+        "- LoRA `r=32`, targets all attention + MLP projections\n",
+        "- **FP16 on T4** (auto-detected), BF16 on Ampere+ GPUs"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "cb76d631",
       "metadata": {},
       "outputs": [],
       "source": [
+        "gpu_status()\n",
+        "print(f\"\\n🚀 Starting SFT training ({'fp16' if FP16_FLAG else 'bf16'} precision)...\\n\")\n",
+        "\n",
         "!python training/train_sft.py \\\n",
         "  --model {BASE_MODEL} \\\n",
         "  --data data/sft/e1_m1_h1_examples.jsonl \\\n",
         "  --output outputs/sft_adapter \\\n",
+        "  --max-seq-length {SFT_MAX_SEQ_LEN} \\\n",
         "  --max-steps {SFT_MAX_STEPS} \\\n",
         "  --epochs {SFT_EPOCHS} \\\n",
         "  --lr {SFT_LR} \\\n",
         "  --batch-size {SFT_BATCH_SIZE} \\\n",
         "  --grad-accum {SFT_GRAD_ACCUM} \\\n",
+        "  {FP16_FLAG}\n",
+        "\n",
+        "gpu_status()\n",
+        "print(\"\\n✅ SFT training complete!\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "3755df03",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# 📊 Visualize SFT training logs\n",
+        "display(Markdown(\"## 📊 SFT Training Summary\"))\n",
+        "\n",
+        "# Check for trainer_state.json\n",
+        "state_file = Path(\"outputs/sft_adapter/trainer_state.json\")\n",
+        "if state_file.exists():\n",
+        "    state = json.loads(state_file.read_text())\n",
+        "    log_history = state.get('log_history', [])\n",
+        "    if any('loss' in l for l in log_history):\n",
+        "        steps = [l['step'] for l in log_history if 'loss' in l]\n",
+        "        losses = [l['loss'] for l in log_history if 'loss' in l]\n",
+        "        lrs = [l.get('learning_rate', 0) for l in log_history if 'loss' in l]\n",
+        "\n",
+        "        fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
+        "        fig.suptitle('SFT Training Curves', fontsize=16, fontweight='bold', color='#f0f6fc')\n",
+        "\n",
+        "        # Loss curve\n",
+        "        axes[0].plot(steps, losses, linewidth=2.5, color='#f0883e', marker='o', markersize=5)\n",
+        "        axes[0].set_title('Training Loss', fontweight='bold')\n",
+        "        axes[0].set_xlabel('Step')\n",
+        "        axes[0].set_ylabel('Loss')\n",
+        "        axes[0].grid(alpha=0.3)\n",
+        "\n",
+        "        # Learning rate schedule\n",
+        "        axes[1].plot(steps, lrs, linewidth=2.5, color='#3fb950', marker='s', markersize=4)\n",
+        "        axes[1].set_title('Learning Rate Schedule', fontweight='bold')\n",
+        "        axes[1].set_xlabel('Step')\n",
+        "        axes[1].set_ylabel('Learning Rate')\n",
+        "        axes[1].ticklabel_format(axis='y', style='scientific', scilimits=(-4, -4))\n",
+        "        axes[1].grid(alpha=0.3)\n",
+        "\n",
+        "        for ax in axes:\n",
+        "            ax.spines['top'].set_visible(False)\n",
+        "            ax.spines['right'].set_visible(False)\n",
+        "        fig.tight_layout()\n",
+        "        plt.show()\n",
+        "\n",
+        "        print(f\"\\n📈 Final loss: {losses[-1]:.4f} at step {steps[-1]}\")\n",
+        "else:\n",
+        "    print(\"⚠️  No trainer_state.json found; training logs unavailable.\")\n",
+        "\n",
+        "# Check adapter files\n",
+        "adapter_dir = Path(\"outputs/sft_adapter\")\n",
+        "if adapter_dir.exists():\n",
+        "    files = list(adapter_dir.glob(\"*\"))\n",
+        "    total_mb = sum(f.stat().st_size for f in files if f.is_file()) / 1e6\n",
+        "    print(f\"💾 Adapter saved: {len(files)} files, {total_mb:.1f} MB total\")"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "cd078c28",
       "metadata": {},
       "source": [
         "## 7️⃣ Evaluate SFT Adapter"
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "50594aef",
       "metadata": {},
       "outputs": [],
       "source": [
+        "# Clear GPU memory before loading eval model\n",
+        "import gc\n",
+        "gc.collect()\n",
+        "torch.cuda.empty_cache()\n",
+        "gpu_status()\n",
+        "\n",
         "!python eval.py \\\n",
         "  --policy hf \\\n",
         "  --label sft \\\n",
         "  --model {BASE_MODEL} \\\n",
         "  --adapter outputs/sft_adapter \\\n",
         "  --episodes {EVAL_EPISODES} \\\n",
+        "  --max-steps {EVAL_MAX_STEPS}\n",
+        "\n",
+        "gpu_status()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "37bc9dd8",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# 📊 Visualize SFT evaluation results\n",
+        "display(Markdown(\"## 📊 SFT Evaluation Results\"))\n",
+        "\n",
+        "sft_rows = [r for r in load_eval_jsonl(\"results/runs\") if r.get('model_stage') == 'sft']\n",
+        "if sft_rows:\n",
+        "    plot_eval_dashboard(sft_rows, title=\"SFT Adapter Evaluation\")\n",
+        "    plot_reward_traces(sft_rows, title=\"SFT Reward Traces\")\n",
+        "    ALL_EVALS['sft'] = sft_rows\n",
+        "\n",
+        "    # Compare baseline → SFT\n",
+        "    display(Markdown(\"### 📈 Improvement: Baseline → SFT\"))\n",
+        "    comparison = {k: v for k, v in ALL_EVALS.items() if k in ('baseline', 'oracle', 'sft')}\n",
+        "    if len(comparison) > 1:\n",
+        "        plot_stage_comparison(comparison, metric='terminal_reward',\n",
+        "                             title='Baseline → SFT — Terminal Reward Comparison')\n",
+        "        plot_stage_comparison(comparison, metric='verifier_pass_rate',\n",
+        "                             title='Baseline → SFT — Verifier Pass Rate')\n",
+        "else:\n",
+        "    print(\"⚠️  No SFT eval results found.\")"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "d9671fe6",
       "metadata": {},
       "source": [
         "## 8️⃣ RLVR Training (Rejection-Sampling FT)\n",
         "4. SFT on that curated set\n",
         "5. Repeating for multiple outer rounds\n",
         "\n",
+        "This avoids the zero-variance gradient problem seen with GRPO on CORP-ENV.\n",
+        "\n",
+        "> ⚡ **T4 Note**: This run passes `--fp16` and uses reduced `--n-samples` / `--max-prompts` values to fit in 16 GB of VRAM."
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "5be0f8be",
       "metadata": {},
       "outputs": [],
       "source": [
+        "# Clear GPU memory\n",
+        "gc.collect()\n",
+        "torch.cuda.empty_cache()\n",
+        "gpu_status()\n",
+        "\n",
+        "print(f\"\\n🚀 Starting RLVR training ({RLVR_ROUNDS} rounds, {RLVR_N_SAMPLES} samples/prompt)...\\n\")\n",
+        "\n",
         "!python training/train_rlvr.py \\\n",
         "  --model {BASE_MODEL} \\\n",
         "  --adapter outputs/sft_adapter \\\n",
         "  --n-samples {RLVR_N_SAMPLES} \\\n",
         "  --temperature {RLVR_TEMPERATURE} \\\n",
         "  --max-prompts {RLVR_MAX_PROMPTS} \\\n",
+        "  --max-prompt-length {RLVR_MAX_PROMPT_LEN} \\\n",
+        "  --max-completion-length {RLVR_MAX_COMPLETION_LEN} \\\n",
         "  --strict-json \\\n",
         "  --use-stub-workers \\\n",
         "  --disable-llm-judge \\\n",
+        "  --stats-file results/runs/rlvr_qwen2.5_3b_stats.jsonl \\\n",
+        "  {FP16_FLAG}\n",
+        "\n",
+        "gpu_status()\n",
+        "print(\"\\n✅ RLVR training complete!\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "f71e3401",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# 📊 Visualize RLVR training progress\n",
+        "display(Markdown(\"## 📊 RLVR Training Progress\"))\n",
+        "plot_rlvr_stats(\"results/runs/rlvr_qwen2.5_3b_stats.jsonl\")\n",
+        "\n",
+        "# Check adapter files\n",
+        "rlvr_dir = Path(\"outputs/rlvr_adapter\")\n",
+        "if rlvr_dir.exists():\n",
+        "    files = list(rlvr_dir.glob(\"*\"))\n",
+        "    total_mb = sum(f.stat().st_size for f in files if f.is_file()) / 1e6\n",
+        "    print(f\"\\n💾 RLVR adapter saved: {len(files)} files, {total_mb:.1f} MB total\")"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "32503cf5",
       "metadata": {},
       "source": [
         "## 9️⃣ Evaluate RLVR Adapter"
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "a756f408",
       "metadata": {},
       "outputs": [],
       "source": [
+        "# Clear GPU memory\n",
+        "gc.collect()\n",
+        "torch.cuda.empty_cache()\n",
+        "gpu_status()\n",
+        "\n",
         "!python eval.py \\\n",
         "  --policy hf \\\n",
         "  --label rlvr \\\n",
         "  --model {BASE_MODEL} \\\n",
         "  --adapter outputs/rlvr_adapter \\\n",
         "  --episodes {EVAL_EPISODES} \\\n",
+        "  --max-steps {EVAL_MAX_STEPS}\n",
+        "\n",
+        "gpu_status()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "daf5526c",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# 📊 Visualize RLVR evaluation results\n",
+        "display(Markdown(\"## 📊 RLVR Evaluation Results\"))\n",
+        "\n",
+        "rlvr_rows = [r for r in load_eval_jsonl(\"results/runs\") if r.get('model_stage') == 'rlvr']\n",
+        "if rlvr_rows:\n",
+        "    plot_eval_dashboard(rlvr_rows, title=\"RLVR Adapter Evaluation\")\n",
+        "    plot_reward_traces(rlvr_rows, title=\"RLVR Reward Traces\")\n",
+        "    ALL_EVALS['rlvr'] = rlvr_rows\n",
+        "\n",
+        "    # Compare SFT → RLVR\n",
+        "    display(Markdown(\"### 📈 Improvement: SFT → RLVR\"))\n",
+        "    if 'sft' in ALL_EVALS:\n",
+        "        plot_stage_comparison(\n",
+        "            {'sft': ALL_EVALS['sft'], 'rlvr': rlvr_rows},\n",
+        "            metric='terminal_reward',\n",
+        "            title='SFT → RLVR — Terminal Reward'\n",
+        "        )\n",
+        "else:\n",
+        "    print(\"⚠️  No RLVR eval results found.\")"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "e96ce765",
+      "metadata": {},
+      "source": [
+        "## 📊 Final Comparison: All Model Stages\n",
+        "\n",
+        "Side-by-side comparison of all evaluated model stages."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "id": "d92ae920",
       "metadata": {},
+      "outputs": [],
       "source": [
+        "display(Markdown(\"## 📊 Full Pipeline Comparison: Baseline → Oracle → SFT → RLVR\"))\n",
+        "\n",
+        "if ALL_EVALS:\n",
+        "    # Terminal Reward comparison\n",
+        "    plot_stage_comparison(ALL_EVALS, metric='terminal_reward',\n",
+        "                         title='Terminal Reward — All Model Stages')\n",
         "\n",
+        "    # Verifier Pass Rate comparison\n",
+        "    plot_stage_comparison(ALL_EVALS, metric='verifier_pass_rate',\n",
+        "                         title='Verifier Pass Rate — All Model Stages')\n",
+        "\n",
+        "    # Build final comparison table\n",
+        "    display(Markdown(\"### 📋 Final Results Table\"))\n",
+        "    header = \"| Stage | Task | Terminal Reward | Verifier Pass | Success Rate |\"\n",
+        "    sep    = \"|-------|------|---------------|--------------|-------------|\"\n",
+        "    lines = [header, sep]\n",
+        "    for stage_name, stage_rows in ALL_EVALS.items():\n",
+        "        by_task = defaultdict(list)\n",
+        "        for r in stage_rows:\n",
+        "            by_task[r['task_id']].append(r)\n",
+        "        for task_id in sorted(by_task.keys()):\n",
+        "            task_rows = by_task[task_id]\n",
+        "            avg_r = np.mean([r['terminal_reward'] for r in task_rows])\n",
+        "            avg_p = np.mean([r['verifier_pass_rate'] for r in task_rows])\n",
+        "            succ = np.mean([1 if r.get('success') else 0 for r in task_rows])\n",
+        "            lines.append(f\"| {stage_name.upper()} | {TASK_SHORT.get(task_id, task_id)} | {avg_r:.3f} | {avg_p:.3f} | {succ:.0%} |\")\n",
+        "    display(Markdown('\\n'.join(lines)))\n",
+        "else:\n",
+        "    print(\"⚠️  No evaluation data collected. Run the evaluation cells above.\")"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "b37b7da9",
       "metadata": {},
       "outputs": [],
       "source": [
+        "# Also generate plots via plot_results.py for file-based output\n",
         "!python plot_results.py \\\n",
         "  --inputs results/runs \\\n",
+        "  --output-dir results/model_compare_qwen25_3b"
       ]
     },
     {
       "cell_type": "code",
       "execution_count": null,
+      "id": "3313ec66",
       "metadata": {},
       "outputs": [],
       "source": [
         "from IPython.display import Image, display, Markdown\n",
         "\n",
+        "plot_dir = Path(\"results/model_compare_qwen25_3b\")\n",
         "if not plot_dir.exists():\n",
         "    plot_dir = Path(\"results/model_compare_qwen25_fresh_no_grpo_ep5rlvr\")\n",
         "\n",
+        "if plot_dir.exists():\n",
+        "    for png in sorted(plot_dir.glob(\"*.png\")):\n",
+        "        display(Markdown(f\"### {png.stem.replace('_', ' ').title()}\"))\n",
+        "        display(Image(filename=str(png), width=800))\n",
         "\n",
+        "    # Show summary table\n",
+        "    summary_md = plot_dir / \"comparison_summary.md\"\n",
+        "    if summary_md.exists():\n",
+        "        display(Markdown(summary_md.read_text()))\n",
+        "else:\n",
+        "    print(\"⚠️  No plot directory found.\")"
       ]
     },
     {
       "cell_type": "markdown",
+      "id": "a638d546",
       "metadata": {},
       "source": [
         "## 📋 Results Summary\n",
         "\n",
+        "Expected progression for Qwen 2.5-3B-Instruct on CORP-ENV:\n",
         "\n",
         "| Stage | E1 Terminal Reward | M1 Terminal Reward | H1 Terminal Reward | M1 Success |\n",
         "|-------|-------------------|-------------------|-------------------|------------|\n",
         "| SFT | 0.910 | 0.943 | 0.889 | 100% |\n",
         "| RLVR | 0.910 | 0.932 | 0.779 | 80% |\n",
         "\n",
+        "> **Key takeaway**: SFT raises the M1 (budget reallocation) success rate from 0% to 100%. RLVR holds E1 at 0.910 and stays close on M1 (0.932 reward, 80% success), but H1 drops from 0.889 to 0.779; its stated benefit is reduced reliance on fixed trajectories.\n",
+        "\n",
+        "> **T4 Note**: Results may differ slightly on T4 due to FP16 precision (vs BF16) and reduced RLVR sampling. For best results, use the full hyperparameters on an A100/H100."
       ]
     }
   ],
   },
   "nbformat": 4,
   "nbformat_minor": 5
+}
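The notebook's SFT section says precision is auto-detected (FP16 on T4, BF16 on Ampere and newer), but the cell that sets `FP16_FLAG` is not part of this diff. The following is a minimal sketch of how such a flag could be derived, assuming the decision is based on the CUDA compute capability major version (the T4 is 7.5, Ampere starts at 8.0). `pick_precision_flag` is a hypothetical helper, not a function from the repository.

```python
def pick_precision_flag(capability_major: int) -> str:
    """Return the CLI flag to append to the training commands.

    Hypothetical helper: GPUs older than Ampere (compute capability
    major < 8, e.g. the T4 at 7.5) have no native bf16, so they get
    '--fp16'; newer GPUs get an empty string and keep the bf16 default.
    """
    return "--fp16" if capability_major < 8 else ""


# In the notebook the major version would come from
# torch.cuda.get_device_capability(0)[0]; literals are used here so the
# sketch runs without a GPU.
for name, major in [("T4", 7), ("A100", 8), ("H100", 9)]:
    flag = pick_precision_flag(major)
    print(f"{name}: {flag or '(bf16 default)'}")
```

Returning an empty string rather than `None` keeps the `{FP16_FLAG}` interpolation in the `!python ...` cells harmless when bf16 is available.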

training/train_rlvr.py CHANGED Viewed

@@ -236,6 +236,7 @@ def sft_on_winners(
     epochs: float,
     max_steps: int,
     max_seq_length: int,
 ) -> None:
     """Run a single SFT pass over the curated (prompt, best_completion) set."""
     from datasets import Dataset
@@ -267,7 +268,8 @@ def sft_on_winners(
         "save_steps": 10_000,
         "save_total_limit": 1,
         "optim": "adamw_8bit",
-        "bf16": True,
         "report_to": "none",
         "dataset_text_field": "text",
         "push_to_hub": False,
@@ -376,6 +378,11 @@ def main() -> None:
         action="store_true",
         help="Disable LLM judge scoring for deterministic verifier-only runs.",
     )
     args = parser.parse_args()
     if args.use_stub_workers:
@@ -405,10 +412,11 @@ def main() -> None:
     print(f"Built {len(full_rows)} prompts from {args.examples}")
     max_seq_len = args.max_prompt_length + args.max_completion_length
     model, tokenizer = FastLanguageModel.from_pretrained(
         model_name=args.model,
         max_seq_length=max_seq_len,
-        dtype=torch.bfloat16,
         load_in_4bit=True,
     )
     if getattr(tokenizer, "pad_token", None) is None and getattr(
@@ -438,9 +446,10 @@ def main() -> None:
             random_state=args.seed,
         )
     for p in model.parameters():
         if p.requires_grad and p.dtype == torch.float32:
-            p.data = p.data.to(torch.bfloat16)
     stats_path = Path(args.stats_file) if args.stats_file else None
     if stats_path:
@@ -490,6 +499,7 @@ def main() -> None:
             epochs=args.inner_epochs,
             max_steps=args.inner_max_steps,
             max_seq_length=max_seq_len,
         )
     Path(args.output).mkdir(parents=True, exist_ok=True)

     epochs: float,
     max_steps: int,
     max_seq_length: int,
+    use_fp16: bool = False,
 ) -> None:
     """Run a single SFT pass over the curated (prompt, best_completion) set."""
     from datasets import Dataset
         "save_steps": 10_000,
         "save_total_limit": 1,
         "optim": "adamw_8bit",
+        "bf16": (not use_fp16) and torch.cuda.is_available(),
+        "fp16": use_fp16 and torch.cuda.is_available(),
         "report_to": "none",
         "dataset_text_field": "text",
         "push_to_hub": False,
         action="store_true",
         help="Disable LLM judge scoring for deterministic verifier-only runs.",
     )
+    parser.add_argument(
+        "--fp16",
+        action="store_true",
+        help="Use fp16 instead of bf16 (required for T4 GPUs, which lack bf16 support).",
+    )
     args = parser.parse_args()
     if args.use_stub_workers:
     print(f"Built {len(full_rows)} prompts from {args.examples}")
     max_seq_len = args.max_prompt_length + args.max_completion_length
+    load_dtype = torch.float16 if args.fp16 else torch.bfloat16
     model, tokenizer = FastLanguageModel.from_pretrained(
         model_name=args.model,
         max_seq_length=max_seq_len,
+        dtype=load_dtype,
         load_in_4bit=True,
     )
     if getattr(tokenizer, "pad_token", None) is None and getattr(
             random_state=args.seed,
         )
+    cast_dtype = torch.float16 if args.fp16 else torch.bfloat16
     for p in model.parameters():
         if p.requires_grad and p.dtype == torch.float32:
+            p.data = p.data.to(cast_dtype)
     stats_path = Path(args.stats_file) if args.stats_file else None
     if stats_path:
             epochs=args.inner_epochs,
             max_steps=args.inner_max_steps,
             max_seq_length=max_seq_len,
+            use_fp16=args.fp16,
         )
     Path(args.output).mkdir(parents=True, exist_ok=True)
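The selection step that feeds `sft_on_winners` is not visible in this diff. As a rough illustration of rejection sampling on verifiable rewards, the sketch below keeps the best-scoring sample per prompt when it clears a threshold and then computes the per-round fields the notebook's summary table reads (`keep_rate`, `mean_best_reward`, `mean_sample_reward`, `prompts_kept`). The threshold value, the helper names, and the exact definitions of those fields are assumptions made for this example.

```python
from statistics import mean


def select_winners(rewards_by_prompt, threshold=0.5):
    """Keep the best-scoring sample index per prompt if it clears the threshold.

    rewards_by_prompt maps a prompt id to the verifier rewards of its
    sampled completions. The threshold value is an assumption.
    """
    winners = {}
    for prompt_id, rewards in rewards_by_prompt.items():
        best_idx = max(range(len(rewards)), key=rewards.__getitem__)
        if rewards[best_idx] >= threshold:
            winners[prompt_id] = best_idx
    return winners


def round_stats(rewards_by_prompt, winners):
    """Summarise one round using the field names from the stats table."""
    all_rewards = [r for rs in rewards_by_prompt.values() for r in rs]
    return {
        "keep_rate": len(winners) / len(rewards_by_prompt),
        "mean_best_reward": mean(max(rs) for rs in rewards_by_prompt.values()),
        "mean_sample_reward": mean(all_rewards),
        "prompts_kept": len(winners),
    }


rewards = {
    "p0": [0.2, 0.9, 0.4],  # one clear winner
    "p1": [0.1, 0.1, 0.1],  # identical rewards: nothing to prefer
    "p2": [0.6, 0.7, 0.3],
}
winners = select_winners(rewards)
stats = round_stats(rewards, winners)
print(winners)
print(stats)
```

Prompt `p1` shows the case where every sample scores the same. A group-normalised advantage would be zero there, which is presumably the zero-variance problem the notebook attributes to GRPO; rejection sampling simply drops the prompt for that round.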

training/train_sft.py CHANGED Viewed

@@ -210,10 +210,11 @@ def main() -> None:
     if args.dataset_num_proc == 0 and "dataset_num_proc" in allowed:
         args = argparse.Namespace(**{**vars(args), "dataset_num_proc": None})
     model, tokenizer = FastLanguageModel.from_pretrained(
         model_name=args.model,
         max_seq_length=args.max_seq_length,
-        dtype=None,
         load_in_4bit=True,
     )
     if getattr(tokenizer, "pad_token", None) is None and getattr(

     if args.dataset_num_proc == 0 and "dataset_num_proc" in allowed:
         args = argparse.Namespace(**{**vars(args), "dataset_num_proc": None})
+    load_dtype = torch.float16 if args.fp16 else None
     model, tokenizer = FastLanguageModel.from_pretrained(
         model_name=args.model,
         max_seq_length=args.max_seq_length,
+        dtype=load_dtype,
         load_in_4bit=True,
     )
     if getattr(tokenizer, "pad_token", None) is None and getattr(
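The two scripts resolve the load dtype with the same pattern but different fallbacks: `train_rlvr.py` falls back to `torch.bfloat16`, while `train_sft.py` falls back to `None`, which leaves the choice to the loader. The sketch below shows that pattern in isolation. It uses plain strings instead of torch dtypes so it runs without torch installed, and `build_parser` and `resolve_dtype_name` are hypothetical names introduced for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the precision flag, mirroring the --fp16 option."""
    parser = argparse.ArgumentParser(description="precision flag sketch")
    parser.add_argument(
        "--fp16",
        action="store_true",
        help="Use fp16 instead of bf16 (needed on T4 GPUs).",
    )
    return parser


def resolve_dtype_name(fp16: bool, default):
    """Mirror the diff's `X if args.fp16 else default` expression.

    Dtypes are plain strings here so the sketch runs without torch;
    the real scripts pass torch.float16, torch.bfloat16, or None.
    """
    return "float16" if fp16 else default


with_flag = build_parser().parse_args(["--fp16"])
without_flag = build_parser().parse_args([])

# RLVR-style fallback (bfloat16) and SFT-style fallback (None).
print(resolve_dtype_name(with_flag.fp16, default="bfloat16"))
print(resolve_dtype_name(without_flag.fp16, default="bfloat16"))
print(resolve_dtype_name(without_flag.fp16, default=None))
```

Because `store_true` defaults to `False`, omitting the flag preserves each script's previous behaviour.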