{ "cells": [ { "cell_type": "markdown", "id": "01751695", "metadata": {}, "source": [ "# SENTINEL GRPO Training Notebook\n", "\n", "Free T4 = smoke run (50 episodes, ~30 min). Pro/L4 = real run (200 episodes, ~1.5-2.5 hr).\n", "Run cells top-to-bottom. The \"go big\" cell at the bottom is optional and only changes `--episodes`.\n", "\n", "This notebook is the single driver that produces every artifact the rest of the repo already expects but does not have on disk:\n", "\n", "- `outputs/eval_pre.json`\n", "- `training/sentinel_qwen15_grpo/` (LoRA adapter + `trainer_state.json`)\n", "- `outputs/trained_policy_replay.jsonl` (UI replay table)\n", "- `outputs/eval_post.json` (also copied to `outputs/evaluation_results.json` for the live dashboard)\n", "- `outputs/reward_report_task3_seed42.json`\n", "- `outputs/cluster_health_history.json`\n", "- `outputs/charts/*.png` (12 charts via `training/plots.py`)\n", "\n", "It is idempotent: re-running any cell overwrites its outputs cleanly. If GRPO deps fail to install, every downstream cell still runs because the codepaths fall back to a heuristic policy / dependency-free PNGs." ], "outputs": [], "execution_count": null }, { "cell_type": "code", "execution_count": null, "id": "bf35ae51", "metadata": {}, "outputs": [], "source": [ "# Cell 2 - Setup. 
GPU check, clone, install deps, set PYTHONPATH defensively.\n", "!nvidia-smi || echo \"No GPU detected; CPU fallbacks will still produce artifacts.\"\n", "\n", "import os, sys, subprocess\n", "\n", "if not os.path.isdir(\"sentinel-env\"):\n", " subprocess.check_call([\"git\", \"clone\", \"https://github.com/ADITYAGABA1322/sentinel-env\"])\n", "if os.path.basename(os.getcwd()) != \"sentinel-env\":\n", " os.chdir(\"sentinel-env\")\n", "\n", "subprocess.check_call([\"pip\", \"install\", \"-q\", \"-r\", \"requirements.txt\"])\n", "\n", "try:\n", " subprocess.check_call([\"pip\", \"install\", \"-q\",\n", " \"unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git\",\n", " ])\n", " subprocess.check_call([\"pip\", \"install\", \"-q\", \"--no-deps\",\n", " \"trl==0.24.0\", \"transformers==4.57.6\", \"datasets==4.3.0\", \"accelerate==1.13.0\", \"peft==0.19.1\", \"bitsandbytes==0.49.2\",\n", " ])\n", "except subprocess.CalledProcessError as exc:\n", " print(f\"Training extras failed to install ({exc}); continuing with heuristic-fallback path.\")\n", "\n", "subprocess.check_call([\"pip\", \"install\", \"-q\", \"matplotlib\", \"seaborn\", \"pandas\", \"huggingface_hub\"])\n", "\n", "os.environ[\"PYTHONPATH\"] = os.getcwd()\n", "if os.getcwd() not in sys.path:\n", " sys.path.insert(0, os.getcwd())\n", "\n", "print(\"Working dir:\", os.getcwd())\n", "print(\"PYTHONPATH set to:\", os.environ[\"PYTHONPATH\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "1b4b77b5", "metadata": {}, "outputs": [], "source": [ "# Cell 3 - Hugging Face auth. Optional. 
Needed only for credit-backed inference\n", "# providers and for pushing the trained adapter back to the Hub in Cell 12.\n", "# Skip this cell if you do not want to upload anything.\n", "try:\n", " from huggingface_hub import notebook_login\n", " notebook_login()\n", "except Exception as exc:\n", " print(f\"HF login skipped: {exc}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "796bf539", "metadata": {}, "outputs": [], "source": [ "# Cell 4 - Pre-training baseline eval. Locks in the \"before\" numbers used by the\n", "# delta charts and the ablation chart in training/plots.py.\n", "!python training/evaluate.py --episodes 30 --task all \\\n", " --policies random,heuristic,oracle_lite \\\n", " --out outputs/eval_pre.json --no-plot" ] }, { "cell_type": "code", "execution_count": null, "id": "fc28625a", "metadata": {}, "outputs": [], "source": [ "# Cell 5 - Smoke GRPO (default tier; free T4).\n", "# Writes training/sentinel_qwen15_grpo/ including trainer_state.json which\n", "# the GRPO reward-curve chart reads. If training deps are missing this prints\n", "# a friendly message and exits 0; downstream cells then fall back to heuristic\n", "# policy via training/replay.py.\n", "!python training/train.py \\\n", " --episodes 50 --task all --seed 0 \\\n", " --model unsloth/Qwen2.5-1.5B-Instruct \\\n", " --epochs 1 --batch-size 2 --learning-rate 5e-6 \\\n", " --lora-rank 16 --max-seq-length 1024 \\\n", " --output-dir training/sentinel_qwen15_grpo" ] }, { "cell_type": "code", "execution_count": null, "id": "02c012d7", "metadata": {}, "outputs": [], "source": [ "# Cell 6 - Record trained-policy actions across 30 seeds x 3 tasks.\n", "# Writes outputs/trained_policy_replay.jsonl which the UI fetches at\n", "# /assets/trained_policy_replay.jsonl. 
If the LoRA adapter is missing this\n", "# automatically writes heuristic actions tagged model_source=\"heuristic_fallback\";\n", "# the replay still works end-to-end so the dashboard never 404s.\n", "from training.replay import record_trained_actions\n", "\n", "out_path = record_trained_actions(\n", " adapter_path=\"training/sentinel_qwen15_grpo\",\n", " base_model=\"unsloth/Qwen2.5-1.5B-Instruct\",\n", " tasks=[\"task1\", \"task2\", \"task3\"],\n", " seeds=range(30),\n", " out_path=\"outputs/trained_policy_replay.jsonl\",\n", ")\n", "print(f\"Wrote {out_path}\")\n", "!head -n 2 outputs/trained_policy_replay.jsonl\n", "!wc -l outputs/trained_policy_replay.jsonl" ] }, { "cell_type": "code", "execution_count": null, "id": "142c7750", "metadata": {}, "outputs": [], "source": [ "# Cell 7 - Post-training eval with the 4th \"trained\" policy. This is the\n", "# headline file the live dashboard reads at /assets/evaluation_results.json,\n", "# so we copy eval_post.json into that canonical name.\n", "import shutil\n", "\n", "!python training/evaluate.py --episodes 30 --task all \\\n", " --policies random,heuristic,oracle_lite,trained \\\n", " --replay outputs/trained_policy_replay.jsonl \\\n", " --out outputs/eval_post.json --no-plot\n", "\n", "shutil.copy(\"outputs/eval_post.json\", \"outputs/evaluation_results.json\")\n", "print(\"Copied outputs/eval_post.json -> outputs/evaluation_results.json (UI-canonical)\")" ] }, { "cell_type": "code", "execution_count": null, "id": "a0661105", "metadata": {}, "outputs": [], "source": [ "# Cell 8 - Reward report dump for task3, seed=42.\n", "# This is the input training/plots.py needs to draw trust_evolution.png,\n", "# trust_gap_over_time.png, and reward_component_stacked_area.png.\n", "import json, os, random, sys\n", "\n", "if os.getcwd() not in sys.path:\n", " sys.path.insert(0, os.getcwd())\n", "\n", "from environment import SentinelEnv\n", "from training.evaluate import heuristic_policy\n", "\n", "env = SentinelEnv()\n", 
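"# Rollout sketch -- this loop assumes the repo's dict-style env API, where\n", "# reset()/step() return {\"observation\": ..., \"done\": bool} and\n", "# env.reward_report() exposes per-step \"events\" once the episode ends;\n", "# heuristic_policy(env, observation, rng) maps an observation to the next action.\n",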
"result = env.reset(task_type=\"task3\", seed=42)\n", "rng = random.Random(42)\n", "while not result[\"done\"]:\n", "    result = env.step(heuristic_policy(env, result[\"observation\"], rng))\n", "\n", "raw_events = env.reward_report().get(\"events\", [])\n", "events = []\n", "for idx, event in enumerate(raw_events):\n", "    snap = event.get(\"trust_snapshot\", {}) or {}\n", "    action = event.get(\"action\", {}) or {}\n", "    sid = action.get(\"specialist_id\")\n", "    events.append({\n", "        \"step_count\": event.get(\"step_count\", idx + 1),\n", "        \"trust_snapshot\": snap,\n", "        \"signal_breakdown\": event.get(\"signal_breakdown\", {}),\n", "        \"specialist_id\": sid,\n", "        \"trust_after\": snap.get(sid) if sid else None,\n", "    })\n", "\n", "report = {\"task_type\": \"task3\", \"seed\": 42, \"events\": events}\n", "os.makedirs(\"outputs\", exist_ok=True)\n", "with open(\"outputs/reward_report_task3_seed42.json\", \"w\") as f:\n", "    json.dump(report, f, indent=2)\n", "print(f\"Wrote outputs/reward_report_task3_seed42.json with {len(events)} events\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1ee5da19", "metadata": {}, "outputs": [], "source": [ "# Cell 9 - Cluster health timeline dump.\n", "# Runs ClusterTrustEnv twice (random/blind allocation vs trust-aware) so the\n", "# cluster_health_timeline.png and cluster_health_policy_lines.png charts have\n", "# real series instead of plots.py's synthetic fallback data.\n", "import json, os, random, sys\n", "from typing import List\n", "\n", "if os.getcwd() not in sys.path:\n", "    sys.path.insert(0, os.getcwd())\n", "\n", "from cluster_trust_env import ClusterTrustEnv\n", "from scripts.cluster_trust_walkthrough import choose_action\n", "\n", "def run_cluster(policy_arg: str, steps: int = 80, seed: int = 42) -> List[float]:\n", "    env = ClusterTrustEnv()\n", "    res = env.reset(task_type=\"task3\", seed=seed)\n", "    rng = random.Random(seed)\n", "    series: List[float] = []\n", "    for _ in range(steps):\n", "        if 
res[\"done\"]:\n", " break\n", " action = choose_action(res[\"observation\"], policy_arg, rng)\n", " res = env.step(action)\n", " series.append(env.state()[\"cluster\"][\"cluster_health_score\"])\n", " return series\n", "\n", "series = {\n", " \"random\": run_cluster(\"blind\"),\n", " \"heuristic\": run_cluster(\"trust\"),\n", "}\n", "\n", "os.makedirs(\"outputs\", exist_ok=True)\n", "with open(\"outputs/cluster_health_history.json\", \"w\") as f:\n", " json.dump({\"task_type\": \"task3\", \"seed\": 42, \"series\": series}, f, indent=2)\n", "print({k: len(v) for k, v in series.items()})" ] }, { "cell_type": "code", "execution_count": null, "id": "a17fe36f", "metadata": {}, "outputs": [], "source": [ "# Cell 10 - Render all 12 charts via training/plots.py.\n", "# matplotlib path on Colab; falls back to dependency-free PNGs if needed.\n", "!python -m training.plots \\\n", " --pre outputs/eval_pre.json \\\n", " --post outputs/eval_post.json \\\n", " --trainer-state training/sentinel_qwen15_grpo/trainer_state.json \\\n", " --reward-report-task3 outputs/reward_report_task3_seed42.json \\\n", " --cluster-health outputs/cluster_health_history.json \\\n", " --out-dir outputs/charts\n", "\n", "!ls outputs/charts" ] }, { "cell_type": "code", "execution_count": null, "id": "3cadde73", "metadata": {}, "outputs": [], "source": [ "# Cell 11 - Inline preview of the headline charts.\n", "from IPython.display import Image, display\n", "\n", "for name in [\n", " \"baseline_grouped_bars.png\",\n", " \"grpo_reward_curve.png\",\n", " \"trust_evolution.png\",\n", " \"detection_vs_poisoning.png\",\n", " \"cluster_health_timeline.png\",\n", " \"task_radar.png\",\n", " \"ablation.png\",\n", "]:\n", " path = f\"outputs/charts/{name}\"\n", " print(path)\n", " display(Image(path))" ] }, { "cell_type": "code", "execution_count": null, "id": "7ede6dde", "metadata": {}, "outputs": [], "source": [ "# Cell 12 - (optional) Push the LoRA adapter and outputs/ to a Hub repo.\n", "# Requires Cell 3 to 
have authenticated. Change the repo id to your own namespace.\n", "REPO_ID = \"XcodeAddy/sentinel-grpo-qwen15\"\n", "\n", "from huggingface_hub import HfApi\n", "import os\n", "\n", "api = HfApi()\n", "api.create_repo(REPO_ID, exist_ok=True)\n", "\n", "if os.path.isdir(\"training/sentinel_qwen15_grpo\"):\n", " api.upload_folder(folder_path=\"training/sentinel_qwen15_grpo\", repo_id=REPO_ID)\n", "else:\n", " print(\"No adapter folder; skipping LoRA upload.\")\n", "\n", "api.upload_folder(\n", " folder_path=\"outputs\",\n", " repo_id=REPO_ID,\n", " path_in_repo=\"outputs\",\n", " allow_patterns=[\"*.json\", \"*.jsonl\", \"charts/*.png\"],\n", ")\n", "print(f\"Uploaded artifacts to https://huggingface.co/{REPO_ID}\")" ] }, { "cell_type": "markdown", "id": "f4f3522e", "metadata": {}, "source": [ "---\n", "\n", "## GO BIG TIER - only run on Pro / L4 / A100\n", "\n", "The cell below replaces the smoke run from Cell 5 with a 200-episode GRPO training run. Free T4 is unlikely to finish it in a single Colab session, so prefer Pro/L4 here. After it completes, re-run cells 6, 7, 8, 9, 10, 11 (and optionally 12) in order to refresh every artifact and chart against the better adapter." ], "outputs": [], "execution_count": null }, { "cell_type": "code", "execution_count": null, "id": "8c01f3cc", "metadata": {}, "outputs": [], "source": [ "# Cell 14 - Real-run GRPO. 
After this finishes, re-run cells 6 -> 11 (-> 12).\n", "!python training/train.py \\\n", " --episodes 200 --task all --seed 0 \\\n", " --model unsloth/Qwen2.5-1.5B-Instruct \\\n", " --epochs 1 --batch-size 2 --learning-rate 5e-6 \\\n", " --lora-rank 16 --max-seq-length 1024 \\\n", " --output-dir training/sentinel_qwen15_grpo" ] } ], "metadata": { "kernelspec": { "display_name": ".venv (3.13.7)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }