{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "01751695",
   "metadata": {},
   "source": [
    "# SENTINEL GRPO Training Notebook\n",
    "\n",
    "Free T4 = smoke run (50 episodes, ~30 min). Pro/L4 = real run (200 episodes, ~1.5-2.5 hr).\n",
    "Run cells top-to-bottom. The \"go big\" cell at the bottom is optional and only changes `--episodes`.\n",
    "\n",
    "This notebook is the single driver that produces every artifact the rest of the repo already expects but does not have on disk:\n",
    "\n",
    "- `outputs/eval_pre.json`\n",
    "- `training/sentinel_qwen15_grpo/` (LoRA adapter + `trainer_state.json`)\n",
    "- `outputs/trained_policy_replay.jsonl` (UI replay table)\n",
    "- `outputs/eval_post.json` (also copied to `outputs/evaluation_results.json` for the live dashboard)\n",
    "- `outputs/reward_report_task3_seed42.json`\n",
    "- `outputs/cluster_health_history.json`\n",
    "- `outputs/charts/*.png` (12 charts via `training/plots.py`)\n",
    "\n",
    "It is idempotent: re-running any cell overwrites its outputs cleanly. If the GRPO dependencies fail to install, every downstream cell still runs: the code paths fall back to a heuristic policy and to dependency-free PNGs."
   ]
  },
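  {
   "cell_type": "code",
   "execution_count": null,
   "id": "artifact-checklist",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional artifact checklist (a hypothetical helper cell, not part of the\n",
    "# original pipeline): run it at any point to see which of the artifacts\n",
    "# listed above already exist on disk. Missing entries simply mean the\n",
    "# producing cell has not run yet.\n",
    "import os\n",
    "\n",
    "expected = [\n",
    "    \"outputs/eval_pre.json\",\n",
    "    \"training/sentinel_qwen15_grpo/trainer_state.json\",\n",
    "    \"outputs/trained_policy_replay.jsonl\",\n",
    "    \"outputs/eval_post.json\",\n",
    "    \"outputs/evaluation_results.json\",\n",
    "    \"outputs/reward_report_task3_seed42.json\",\n",
    "    \"outputs/cluster_health_history.json\",\n",
    "]\n",
    "for path in expected:\n",
    "    print((\"OK  \" if os.path.exists(path) else \"MISS\"), path)"
   ]
  },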
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf35ae51",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 2 - Setup. GPU check, clone, install deps, set PYTHONPATH defensively.\n",
    "!nvidia-smi || echo \"No GPU detected; CPU fallbacks will still produce artifacts.\"\n",
    "\n",
    "import os, sys, subprocess\n",
    "\n",
    "if not os.path.isdir(\"sentinel-env\"):\n",
    "    subprocess.check_call([\"git\", \"clone\", \"https://github.com/ADITYAGABA1322/sentinel-env\"])\n",
    "if os.path.basename(os.getcwd()) != \"sentinel-env\":\n",
    "    os.chdir(\"sentinel-env\")\n",
    "\n",
    "# Install through the running interpreter so packages land in this kernel's env.\n",
    "subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"-r\", \"requirements.txt\"])\n",
    "\n",
    "try:\n",
    "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\",\n",
    "        \"unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git\",\n",
    "    ])\n",
    "    subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"--no-deps\",\n",
    "        \"trl==0.24.0\", \"transformers==4.57.6\", \"datasets==4.3.0\", \"accelerate==1.13.0\", \"peft==0.19.1\", \"bitsandbytes==0.49.2\",\n",
    "    ])\n",
    "except subprocess.CalledProcessError as exc:\n",
    "    print(f\"Training extras failed to install ({exc}); continuing with heuristic-fallback path.\")\n",
    "\n",
    "subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"matplotlib\", \"seaborn\", \"pandas\", \"huggingface_hub\"])\n",
    "\n",
    "os.environ[\"PYTHONPATH\"] = os.getcwd()\n",
    "if os.getcwd() not in sys.path:\n",
    "    sys.path.insert(0, os.getcwd())\n",
    "\n",
    "print(\"Working dir:\", os.getcwd())\n",
    "print(\"PYTHONPATH set to:\", os.environ[\"PYTHONPATH\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b4b77b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 3 - Hugging Face auth. Optional. Needed only for credit-backed inference\n",
    "# providers and for pushing the trained adapter back to the Hub in Cell 12.\n",
    "# Skip this cell if you do not want to upload anything.\n",
    "try:\n",
    "    from huggingface_hub import notebook_login\n",
    "    notebook_login()\n",
    "except Exception as exc:\n",
    "    print(f\"HF login skipped: {exc}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "796bf539",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 4 - Pre-training baseline eval. Locks in the \"before\" numbers used by the\n",
    "# delta charts and the ablation chart in training/plots.py.\n",
    "!python training/evaluate.py --episodes 30 --task all \\\n",
    "    --policies random,heuristic,oracle_lite \\\n",
    "    --out outputs/eval_pre.json --no-plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc28625a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 5 - Smoke GRPO (default tier; free T4).\n",
    "# Writes training/sentinel_qwen15_grpo/ including trainer_state.json which\n",
    "# the GRPO reward-curve chart reads. If training deps are missing this prints\n",
    "# a friendly message and exits 0; downstream cells then fall back to heuristic\n",
    "# policy via training/replay.py.\n",
    "!python training/train.py \\\n",
    "    --episodes 50 --task all --seed 0 \\\n",
    "    --model unsloth/Qwen2.5-1.5B-Instruct \\\n",
    "    --epochs 1 --batch-size 2 --learning-rate 5e-6 \\\n",
    "    --lora-rank 16 --max-seq-length 1024 \\\n",
    "    --output-dir training/sentinel_qwen15_grpo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02c012d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 6 - Record trained-policy actions across 30 seeds x 3 tasks.\n",
    "# Writes outputs/trained_policy_replay.jsonl which the UI fetches at\n",
    "# /assets/trained_policy_replay.jsonl. If the LoRA adapter is missing this\n",
    "# automatically writes heuristic actions tagged model_source=\"heuristic_fallback\";\n",
    "# the replay still works end-to-end so the dashboard never 404s.\n",
    "from training.replay import record_trained_actions\n",
    "\n",
    "out_path = record_trained_actions(\n",
    "    adapter_path=\"training/sentinel_qwen15_grpo\",\n",
    "    base_model=\"unsloth/Qwen2.5-1.5B-Instruct\",\n",
    "    tasks=[\"task1\", \"task2\", \"task3\"],\n",
    "    seeds=range(30),\n",
    "    out_path=\"outputs/trained_policy_replay.jsonl\",\n",
    ")\n",
    "print(f\"Wrote {out_path}\")\n",
    "!head -n 2 outputs/trained_policy_replay.jsonl\n",
    "!wc -l outputs/trained_policy_replay.jsonl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "142c7750",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 7 - Post-training eval with the 4th \"trained\" policy. This is the\n",
    "# headline file the live dashboard reads at /assets/evaluation_results.json,\n",
    "# so we copy eval_post.json into that canonical name.\n",
    "import shutil\n",
    "\n",
    "!python training/evaluate.py --episodes 30 --task all \\\n",
    "    --policies random,heuristic,oracle_lite,trained \\\n",
    "    --replay outputs/trained_policy_replay.jsonl \\\n",
    "    --out outputs/eval_post.json --no-plot\n",
    "\n",
    "shutil.copy(\"outputs/eval_post.json\", \"outputs/evaluation_results.json\")\n",
    "print(\"Copied outputs/eval_post.json -> outputs/evaluation_results.json (UI-canonical)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0661105",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 8 - Reward report dump for task3, seed=42.\n",
    "# This is the input training/plots.py needs to draw trust_evolution.png,\n",
    "# trust_gap_over_time.png, and reward_component_stacked_area.png.\n",
    "import json, os, random, sys\n",
    "\n",
    "if os.getcwd() not in sys.path:\n",
    "    sys.path.insert(0, os.getcwd())\n",
    "\n",
    "from environment import SentinelEnv\n",
    "from training.evaluate import heuristic_policy\n",
    "\n",
    "env = SentinelEnv()\n",
    "result = env.reset(task_type=\"task3\", seed=42)\n",
    "rng = random.Random(42)\n",
    "while not result[\"done\"]:\n",
    "    result = env.step(heuristic_policy(env, result[\"observation\"], rng))\n",
    "\n",
    "raw_events = env.reward_report().get(\"events\", [])\n",
    "events = []\n",
    "for idx, event in enumerate(raw_events):\n",
    "    snap = event.get(\"trust_snapshot\", {}) or {}\n",
    "    action = event.get(\"action\", {}) or {}\n",
    "    sid = action.get(\"specialist_id\")\n",
    "    events.append({\n",
    "        \"step_count\": event.get(\"step_count\", idx + 1),\n",
    "        \"trust_snapshot\": snap,\n",
    "        \"signal_breakdown\": event.get(\"signal_breakdown\", {}),\n",
    "        \"specialist_id\": sid,\n",
    "        \"trust_after\": snap.get(sid) if sid else None,\n",
    "    })\n",
    "\n",
    "report = {\"task_type\": \"task3\", \"seed\": 42, \"events\": events}\n",
    "os.makedirs(\"outputs\", exist_ok=True)\n",
    "with open(\"outputs/reward_report_task3_seed42.json\", \"w\") as f:\n",
    "    json.dump(report, f, indent=2)\n",
    "print(f\"Wrote outputs/reward_report_task3_seed42.json with {len(events)} events\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ee5da19",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 9 - Cluster health timeline dump.\n",
    "# Runs ClusterTrustEnv twice (random/blind allocation vs trust-aware) so the\n",
    "# cluster_health_timeline.png and cluster_health_policy_lines.png charts have\n",
    "# real series instead of plots.py's synthetic fallback data.\n",
    "import json, os, random, sys\n",
    "from typing import List\n",
    "\n",
    "if os.getcwd() not in sys.path:\n",
    "    sys.path.insert(0, os.getcwd())\n",
    "\n",
    "from cluster_trust_env import ClusterTrustEnv\n",
    "from scripts.cluster_trust_walkthrough import choose_action\n",
    "\n",
    "def run_cluster(policy_arg: str, steps: int = 80, seed: int = 42) -> List[float]:\n",
    "    env = ClusterTrustEnv()\n",
    "    res = env.reset(task_type=\"task3\", seed=seed)\n",
    "    rng = random.Random(seed)\n",
    "    series: List[float] = []\n",
    "    for _ in range(steps):\n",
    "        if res[\"done\"]:\n",
    "            break\n",
    "        action = choose_action(res[\"observation\"], policy_arg, rng)\n",
    "        res = env.step(action)\n",
    "        series.append(env.state()[\"cluster\"][\"cluster_health_score\"])\n",
    "    return series\n",
    "\n",
    "series = {\n",
    "    \"random\": run_cluster(\"blind\"),\n",
    "    \"heuristic\": run_cluster(\"trust\"),\n",
    "}\n",
    "\n",
    "os.makedirs(\"outputs\", exist_ok=True)\n",
    "with open(\"outputs/cluster_health_history.json\", \"w\") as f:\n",
    "    json.dump({\"task_type\": \"task3\", \"seed\": 42, \"series\": series}, f, indent=2)\n",
    "print({k: len(v) for k, v in series.items()})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a17fe36f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 10 - Render all 12 charts via training/plots.py.\n",
    "# Uses the matplotlib path on Colab; falls back to dependency-free PNGs when matplotlib is unavailable.\n",
    "!python -m training.plots \\\n",
    "    --pre  outputs/eval_pre.json \\\n",
    "    --post outputs/eval_post.json \\\n",
    "    --trainer-state training/sentinel_qwen15_grpo/trainer_state.json \\\n",
    "    --reward-report-task3 outputs/reward_report_task3_seed42.json \\\n",
    "    --cluster-health outputs/cluster_health_history.json \\\n",
    "    --out-dir outputs/charts\n",
    "\n",
    "!ls outputs/charts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cadde73",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 11 - Inline preview of the headline charts.\n",
    "import os\n",
    "\n",
    "from IPython.display import Image, display\n",
    "\n",
    "for name in [\n",
    "    \"baseline_grouped_bars.png\",\n",
    "    \"grpo_reward_curve.png\",\n",
    "    \"trust_evolution.png\",\n",
    "    \"detection_vs_poisoning.png\",\n",
    "    \"cluster_health_timeline.png\",\n",
    "    \"task_radar.png\",\n",
    "    \"ablation.png\",\n",
    "]:\n",
    "    path = f\"outputs/charts/{name}\"\n",
    "    if not os.path.exists(path):\n",
    "        print(f\"Missing chart: {path}\")\n",
    "        continue\n",
    "    print(path)\n",
    "    display(Image(path))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ede6dde",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 12 - (optional) Push the LoRA adapter and outputs/ to a Hub repo.\n",
    "# Requires Cell 3 to have authenticated. Change the repo id to your own namespace.\n",
    "REPO_ID = \"XcodeAddy/sentinel-grpo-qwen15\"\n",
    "\n",
    "from huggingface_hub import HfApi\n",
    "import os\n",
    "\n",
    "api = HfApi()\n",
    "api.create_repo(REPO_ID, exist_ok=True)\n",
    "\n",
    "if os.path.isdir(\"training/sentinel_qwen15_grpo\"):\n",
    "    api.upload_folder(folder_path=\"training/sentinel_qwen15_grpo\", repo_id=REPO_ID)\n",
    "else:\n",
    "    print(\"No adapter folder; skipping LoRA upload.\")\n",
    "\n",
    "api.upload_folder(\n",
    "    folder_path=\"outputs\",\n",
    "    repo_id=REPO_ID,\n",
    "    path_in_repo=\"outputs\",\n",
    "    allow_patterns=[\"*.json\", \"*.jsonl\", \"charts/*.png\"],\n",
    ")\n",
    "print(f\"Uploaded artifacts to https://huggingface.co/{REPO_ID}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4f3522e",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## GO BIG TIER - only run on Pro / L4 / A100\n",
    "\n",
    "The cell below replaces the smoke run from Cell 5 with a 200-episode GRPO training run. Free T4 is unlikely to finish it in a single Colab session, so prefer Pro/L4 here. After it completes, re-run cells 6, 7, 8, 9, 10, 11 (and optionally 12) in order to refresh every artifact and chart against the better adapter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c01f3cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cell 14 - Real-run GRPO. After this finishes, re-run cells 6 -> 11 (-> 12).\n",
    "!python training/train.py \\\n",
    "    --episodes 200 --task all --seed 0 \\\n",
    "    --model unsloth/Qwen2.5-1.5B-Instruct \\\n",
    "    --epochs 1 --batch-size 2 --learning-rate 5e-6 \\\n",
    "    --lora-rank 16 --max-seq-length 1024 \\\n",
    "    --output-dir training/sentinel_qwen15_grpo"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv (3.13.7)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}