{ "cells": [ { "cell_type": "markdown", "id": "01751695", "metadata": {}, "source": [ "# SENTINEL GRPO Training Notebook\n", "\n", "Free T4 = smoke run (50 episodes, ~30 min). Pro/L4 = real run (200 episodes, ~1.5-2.5 hr).\n", "Run cells top-to-bottom. The \"go big\" cell at the bottom is optional and only changes `--episodes`.\n", "\n", "This notebook is the single driver that produces every artifact the rest of the repo already expects but does not have on disk:\n", "\n", "- `outputs/eval_pre.json`\n", "- `training/sentinel_qwen15_grpo/` (LoRA adapter + `trainer_state.json`)\n", "- `outputs/trained_policy_replay.jsonl` (UI replay table)\n", "- `outputs/eval_post.json` (also copied to `outputs/evaluation_results.json` for the live dashboard)\n", "- `outputs/reward_report_task3_seed42.json`\n", "- `outputs/cluster_health_history.json`\n", "- `outputs/charts/*.png` (12 charts via `training/plots.py`)\n", "\n", "It is idempotent: re-running any cell overwrites its outputs cleanly. If GRPO deps fail to install, every downstream cell still runs because the codepaths fall back to a heuristic policy / dependency-free PNGs." ], "outputs": [], "execution_count": null }, { "cell_type": "code", "execution_count": null, "id": "bf35ae51", "metadata": {}, "outputs": [], "source": [ "# Cell 2 - Setup. 
GPU check, clone, install deps, set PYTHONPATH defensively.\n", "!nvidia-smi || echo \"No GPU detected; CPU fallbacks will still produce artifacts.\"\n", "\n", "import os, sys, subprocess\n", "\n", "if not os.path.isdir(\"sentinel-env\"):\n", " subprocess.check_call([\"git\", \"clone\", \"https://github.com/ADITYAGABA1322/sentinel-env\"])\n", "if os.path.basename(os.getcwd()) != \"sentinel-env\":\n", " os.chdir(\"sentinel-env\")\n", "\n", "subprocess.check_call([\"pip\", \"install\", \"-q\", \"-r\", \"requirements.txt\"])\n", "\n", "try:\n", " subprocess.check_call([\"pip\", \"install\", \"-q\",\n", " \"unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git\",\n", " ])\n", " subprocess.check_call([\"pip\", \"install\", \"-q\", \"--no-deps\",\n", " \"trl==0.24.0\", \"transformers==4.57.6\", \"datasets==4.3.0\", \"accelerate==1.13.0\", \"peft==0.19.1\", \"bitsandbytes==0.49.2\",\n", " ])\n", "except subprocess.CalledProcessError as exc:\n", " print(f\"Training extras failed to install ({exc}); continuing with heuristic-fallback path.\")\n", "\n", "subprocess.check_call([\"pip\", \"install\", \"-q\", \"matplotlib\", \"seaborn\", \"pandas\", \"huggingface_hub\"])\n", "\n", "os.environ[\"PYTHONPATH\"] = os.getcwd()\n", "if os.getcwd() not in sys.path:\n", " sys.path.insert(0, os.getcwd())\n", "\n", "print(\"Working dir:\", os.getcwd())\n", "print(\"PYTHONPATH set to:\", os.environ[\"PYTHONPATH\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "1b4b77b5", "metadata": {}, "outputs": [], "source": [ "# Cell 3 - Hugging Face auth. Optional. 
Needed only for credit-backed inference\n", "# providers and for pushing the trained adapter back to the Hub in Cell 12.\n", "# Skip this cell if you do not want to upload anything.\n", "try:\n", " from huggingface_hub import notebook_login\n", " notebook_login()\n", "except Exception as exc:\n", " print(f\"HF login skipped: {exc}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "796bf539", "metadata": {}, "outputs": [], "source": [ "# Cell 4 - Pre-training baseline eval. Locks in the \"before\" numbers used by the\n", "# delta charts and the ablation chart in training/plots.py.\n", "!python training/evaluate.py --episodes 30 --task all \\\n", " --policies random,heuristic,oracle_lite \\\n", " --out outputs/eval_pre.json --no-plot" ] }, { "cell_type": "code", "execution_count": null, "id": "fc28625a", "metadata": {}, "outputs": [], "source": [ "# Cell 5 - Smoke GRPO (default tier; free T4).\n", "# Writes training/sentinel_qwen15_grpo/ including trainer_state.json which\n", "# the GRPO reward-curve chart reads. If training deps are missing this prints\n", "# a friendly message and exits 0; downstream cells then fall back to heuristic\n", "# policy via training/replay.py.\n", "!python training/train.py \\\n", " --episodes 50 --task all --seed 0 \\\n", " --model unsloth/Qwen2.5-1.5B-Instruct \\\n", " --epochs 1 --batch-size 2 --learning-rate 5e-6 \\\n", " --lora-rank 16 --max-seq-length 1024 \\\n", " --output-dir training/sentinel_qwen15_grpo" ] }, { "cell_type": "code", "execution_count": null, "id": "02c012d7", "metadata": {}, "outputs": [], "source": [ "# Cell 6 - Record trained-policy actions across 30 seeds x 3 tasks.\n", "# Writes outputs/trained_policy_replay.jsonl which the UI fetches at\n", "# /assets/trained_policy_replay.jsonl. 
If the LoRA adapter is missing this\n", "# automatically writes heuristic actions tagged model_source=\"heuristic_fallback\";\n", "# the replay still works end-to-end so the dashboard never 404s.\n", "from training.replay import record_trained_actions\n", "\n", "out_path = record_trained_actions(\n", " adapter_path=\"training/sentinel_qwen15_grpo\",\n", " base_model=\"unsloth/Qwen2.5-1.5B-Instruct\",\n", " tasks=[\"task1\", \"task2\", \"task3\"],\n", " seeds=range(30),\n", " out_path=\"outputs/trained_policy_replay.jsonl\",\n", ")\n", "print(f\"Wrote {out_path}\")\n", "!head -n 2 outputs/trained_policy_replay.jsonl\n", "!wc -l outputs/trained_policy_replay.jsonl" ] }, { "cell_type": "code", "execution_count": null, "id": "142c7750", "metadata": {}, "outputs": [], "source": [ "# Cell 7 - Post-training eval with the 4th \"trained\" policy. This is the\n", "# headline file the live dashboard reads at /assets/evaluation_results.json,\n", "# so we copy eval_post.json into that canonical name.\n", "import shutil\n", "\n", "!python training/evaluate.py --episodes 30 --task all \\\n", " --policies random,heuristic,oracle_lite,trained \\\n", " --replay outputs/trained_policy_replay.jsonl \\\n", " --out outputs/eval_post.json --no-plot\n", "\n", "shutil.copy(\"outputs/eval_post.json\", \"outputs/evaluation_results.json\")\n", "print(\"Copied outputs/eval_post.json -> outputs/evaluation_results.json (UI-canonical)\")" ] }, { "cell_type": "code", "execution_count": null, "id": "a0661105", "metadata": {}, "outputs": [], "source": [ "# Cell 8 - Reward report dump for task3, seed=42.\n", "# This is the input training/plots.py needs to draw trust_evolution.png,\n", "# trust_gap_over_time.png, and reward_component_stacked_area.png.\n", "import json, os, random, sys\n", "\n", "if os.getcwd() not in sys.path:\n", " sys.path.insert(0, os.getcwd())\n", "\n", "from environment import SentinelEnv\n", "from training.evaluate import heuristic_policy\n", "\n", "env = SentinelEnv()\n", 
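"# Rollout sketch -- this loop assumes the repo's dict-style env API, where\n", "# reset()/step() return {\"observation\": ..., \"done\": bool} and\n", "# env.reward_report() exposes per-step \"events\" once the episode ends;\n", "# heuristic_policy(env, observation, rng) maps an observation to the next action.\n",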
"result = env.reset(task_type=\"task3\", seed=42)\n", "rng = random.Random(42)\n", "while not result[\"done\"]:\n", "    result = env.step(heuristic_policy(env, result[\"observation\"], rng))\n", "\n", "raw_events = env.reward_report().get(\"events\", [])\n", "events = []\n", "for idx, event in enumerate(raw_events):\n", "    snap = event.get(\"trust_snapshot\", {}) or {}\n", "    action = event.get(\"action\", {}) or {}\n", "    sid = action.get(\"specialist_id\")\n", "    events.append({\n", "        \"step_count\": event.get(\"step_count\", idx + 1),\n", "        \"trust_snapshot\": snap,\n", "        \"signal_breakdown\": event.get(\"signal_breakdown\", {}),\n", "        \"specialist_id\": sid,\n", "        \"trust_after\": snap.get(sid) if sid else None,\n", "    })\n", "\n", "report = {\"task_type\": \"task3\", \"seed\": 42, \"events\": events}\n", "os.makedirs(\"outputs\", exist_ok=True)\n", "with open(\"outputs/reward_report_task3_seed42.json\", \"w\") as f:\n", "    json.dump(report, f, indent=2)\n", "print(f\"Wrote outputs/reward_report_task3_seed42.json with {len(events)} events\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1ee5da19", "metadata": {}, "outputs": [], "source": [ "# Cell 9 - Cluster health timeline dump.\n", "# Runs ClusterTrustEnv twice (random/blind allocation vs trust-aware) so the\n", "# cluster_health_timeline.png and cluster_health_policy_lines.png charts have\n", "# real series instead of plots.py's synthetic fallback data.\n", "import json, os, random, sys\n", "from typing import List\n", "\n", "if os.getcwd() not in sys.path:\n", "    sys.path.insert(0, os.getcwd())\n", "\n", "from cluster_trust_env import ClusterTrustEnv\n", "from scripts.cluster_trust_walkthrough import choose_action\n", "\n", "def run_cluster(policy_arg: str, steps: int = 80, seed: int = 42) -> List[float]:\n", "    env = ClusterTrustEnv()\n", "    res = env.reset(task_type=\"task3\", seed=seed)\n", "    rng = random.Random(seed)\n", "    series: List[float] = []\n", "    for _ in range(steps):\n", "        if 
res[\"done\"]:\n", " break\n", " action = choose_action(res[\"observation\"], policy_arg, rng)\n", " res = env.step(action)\n", " series.append(env.state()[\"cluster\"][\"cluster_health_score\"])\n", " return series\n", "\n", "series = {\n", " \"random\": run_cluster(\"blind\"),\n", " \"heuristic\": run_cluster(\"trust\"),\n", "}\n", "\n", "os.makedirs(\"outputs\", exist_ok=True)\n", "with open(\"outputs/cluster_health_history.json\", \"w\") as f:\n", " json.dump({\"task_type\": \"task3\", \"seed\": 42, \"series\": series}, f, indent=2)\n", "print({k: len(v) for k, v in series.items()})" ] }, { "cell_type": "code", "execution_count": null, "id": "a17fe36f", "metadata": {}, "outputs": [], "source": [ "# Cell 10 - Render all 12 charts via training/plots.py.\n", "# matplotlib path on Colab; falls back to dependency-free PNGs if needed.\n", "!python -m training.plots \\\n", " --pre outputs/eval_pre.json \\\n", " --post outputs/eval_post.json \\\n", " --trainer-state training/sentinel_qwen15_grpo/trainer_state.json \\\n", " --reward-report-task3 outputs/reward_report_task3_seed42.json \\\n", " --cluster-health outputs/cluster_health_history.json \\\n", " --out-dir outputs/charts\n", "\n", "!ls outputs/charts" ] }, { "cell_type": "code", "execution_count": null, "id": "3cadde73", "metadata": {}, "outputs": [], "source": [ "# Cell 11 - Inline preview of the headline charts.\n", "from IPython.display import Image, display\n", "\n", "for name in [\n", " \"baseline_grouped_bars.png\",\n", " \"grpo_reward_curve.png\",\n", " \"trust_evolution.png\",\n", " \"detection_vs_poisoning.png\",\n", " \"cluster_health_timeline.png\",\n", " \"task_radar.png\",\n", " \"ablation.png\",\n", "]:\n", " path = f\"outputs/charts/{name}\"\n", " print(path)\n", " display(Image(path))" ] }, { "cell_type": "code", "execution_count": null, "id": "7ede6dde", "metadata": {}, "outputs": [], "source": [ "# Cell 12 - (optional) Push the LoRA adapter and outputs/ to a Hub repo.\n", "# Requires Cell 3 to 
have authenticated. Change the repo id to your own namespace.\n", "REPO_ID = \"XcodeAddy/sentinel-grpo-qwen15\"\n", "\n", "from huggingface_hub import HfApi\n", "import os\n", "\n", "api = HfApi()\n", "api.create_repo(REPO_ID, exist_ok=True)\n", "\n", "if os.path.isdir(\"training/sentinel_qwen15_grpo\"):\n", " api.upload_folder(folder_path=\"training/sentinel_qwen15_grpo\", repo_id=REPO_ID)\n", "else:\n", " print(\"No adapter folder; skipping LoRA upload.\")\n", "\n", "api.upload_folder(\n", " folder_path=\"outputs\",\n", " repo_id=REPO_ID,\n", " path_in_repo=\"outputs\",\n", " allow_patterns=[\"*.json\", \"*.jsonl\", \"charts/*.png\"],\n", ")\n", "print(f\"Uploaded artifacts to https://huggingface.co/{REPO_ID}\")" ] }, { "cell_type": "markdown", "id": "f4f3522e", "metadata": {}, "source": [ "---\n", "\n", "## GO BIG TIER - only run on Pro / L4 / A100\n", "\n", "The cell below replaces the smoke run from Cell 5 with a 200-episode GRPO training run. Free T4 is unlikely to finish it in a single Colab session, so prefer Pro/L4 here. After it completes, re-run cells 6, 7, 8, 9, 10, 11 (and optionally 12) in order to refresh every artifact and chart against the better adapter." ], "outputs": [], "execution_count": null }, { "cell_type": "code", "execution_count": null, "id": "8c01f3cc", "metadata": {}, "outputs": [], "source": [ "# Cell 14 - Real-run GRPO. 
After this finishes, re-run cells 6 -> 11 (-> 12).\n", "!python training/train.py \\\n", " --episodes 200 --task all --seed 0 \\\n", " --model unsloth/Qwen2.5-1.5B-Instruct \\\n", " --epochs 1 --batch-size 2 --learning-rate 5e-6 \\\n", " --lora-rank 16 --max-seq-length 1024 \\\n", " --output-dir training/sentinel_qwen15_grpo" ] } ], "metadata": { "kernelspec": { "display_name": ".venv (3.13.7)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }