XcodeAddy committed
Commit 1f43fa9 · 1 Parent(s): d82d9e8

Prepare SENTINEL onsite deployment proof
.gitignore CHANGED
@@ -5,7 +5,10 @@ __pycache__/
 .mypy_cache/
 .ruff_cache/
 .venv/
-outputs/
+outputs/*
+!outputs/baseline_comparison.png
+!outputs/baseline_scores.json
+!outputs/evaluation_results.json
 .env
 .env.*
 !.env.example
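The switch from `outputs/` to `outputs/*` matters: git cannot re-include a file whose parent directory is itself excluded, so the `!outputs/...` negations only take effect with the glob form. A scratch-repo sketch of that behavior (assumes `git` is on PATH; the temp directory is arbitrary):

```python
import subprocess
import tempfile
from pathlib import Path

# Reproduce the allowlist behavior of the hunk above in a throwaway repo.
repo = Path(tempfile.mkdtemp())
subprocess.run(["git", "init", "-q", str(repo)], check=True)
(repo / ".gitignore").write_text("outputs/*\n!outputs/baseline_comparison.png\n")
(repo / "outputs").mkdir()
(repo / "outputs" / "scratch.log").touch()
(repo / "outputs" / "baseline_comparison.png").touch()

def ignored(path: str) -> bool:
    # check-ignore exits 0 when the path is ignored, 1 when it is not.
    return subprocess.run(["git", "-C", str(repo), "check-ignore", "-q", path]).returncode == 0

print(ignored("outputs/scratch.log"), ignored("outputs/baseline_comparison.png"))
```

With the old `outputs/` rule, both paths would be ignored and the committed artifacts could never be tracked.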
README.md CHANGED
@@ -24,6 +24,12 @@ SENTINEL turns that failure mode into a trainable environment. The model only se
 - Rewards: per-step reward plus terminal score, normalized to `0.0-1.0`
 - Dataset: 120 abstract multi-agent scenarios

+## Live Submission Targets
+
+- GitHub: `https://github.com/ADITYAGABA1322/sentinel-env`
+- Hugging Face Space: `https://xcodeaddy-sentinel-env.hf.space`
+- OpenEnv base URL: `https://xcodeaddy-sentinel-env.hf.space`
+
 ## Specialist Behaviors

 | Public Slot | Hidden Behavior |
@@ -133,10 +139,11 @@ pip install pytest
 Run checks:

 ```bash
-python -m py_compile app.py environment.py models.py graders.py specialists.py trust_ledger.py task_graph.py scenarios.py inference.py
+python -m py_compile app.py server/app.py environment.py models.py graders.py specialists.py trust_ledger.py task_graph.py scenarios.py inference.py comms_bus.py training/evaluate.py training/train.py
 python -m pytest -q
 python inference.py
-python training/evaluate.py --episodes 20 --task task3
+python training/evaluate.py --episodes 20 --task all --plot outputs/baseline_comparison.png
+python training/train.py --dry-run --episodes 5
 ```

 Run the server:
@@ -175,7 +182,47 @@ docker run -p 7860:7860 sentinel-env
 - `heuristic`
 - `oracle_lite`

-The evaluator writes `outputs/evaluation_results.json` for demo charts.
+The evaluator writes `outputs/evaluation_results.json` and `outputs/baseline_comparison.png`.
+
+![Baseline Comparison](outputs/baseline_comparison.png)
+
+Latest local comparison, 20 episodes per task and policy:
+
+| Policy | Overall | Task 1 | Task 2 | Task 3 |
+| --- | ---: | ---: | ---: | ---: |
+| Random | 0.7144 | 0.7948 | 0.6493 | 0.6990 |
+| Heuristic trust-weighted | 0.8162 | 0.8911 | 0.7736 | 0.7838 |
+| Oracle-lite upper bound | 0.8718 | 0.9445 | 0.7760 | 0.8950 |
+
+The demo story is the score gap: the reward function distinguishes blind delegation from trust-aware routing, and the oracle-lite upper bound shows room for onsite RL training.
+
+## Hugging Face Deployment
+
+```bash
+huggingface-cli login
+huggingface-cli repo create sentinel-env --type space --space-sdk docker --private false
+git remote add hf https://huggingface.co/spaces/XcodeAddy/sentinel-env
+git push hf main
+```
+
+After the Space builds:
+
+```bash
+curl https://xcodeaddy-sentinel-env.hf.space/health
+curl https://xcodeaddy-sentinel-env.hf.space/
+curl -X POST https://xcodeaddy-sentinel-env.hf.space/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_type":"task3","seed":42}'
+openenv validate . --json
+```
+
+## Mini-Blog Draft
+
+Title: `SENTINEL: Training AI to Trust Wisely in Multi-Agent Systems`
+
+SENTINEL is an OpenEnv RL environment for one failure mode: multi-agent systems delegate blindly. One orchestrator must complete long tasks by routing work across five specialist agents whose reliability profiles are hidden and reshuffled every episode. The orchestrator only sees behavior, confidence, stakes, and history, so it must learn skepticism, verification, recovery, and calibrated trust.
+
+The specialists are deterministic FSMs on purpose: they give stable reward signals while the orchestrator remains the trainable target. Random routing scores `0.7144`, trust-weighted routing scores `0.8162`, and oracle-lite reaches `0.8718`, showing the environment has a meaningful learning signal before onsite GRPO training.

 ## Hackathon Alignment

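One way to read the README's comparison table: because every (policy, task) pair ran the same 20 episodes, each Overall figure is simply the mean of that policy's three task scores. A quick sketch with the published numbers:

```python
# Each policy's Overall score equals the mean of its per-task averages,
# since every (policy, task) pair ran the same 20 episodes.
table = {
    "Random": (0.7948, 0.6493, 0.6990),
    "Heuristic trust-weighted": (0.8911, 0.7736, 0.7838),
    "Oracle-lite upper bound": (0.9445, 0.7760, 0.8950),
}
overall = {name: round(sum(scores) / len(scores), 4) for name, scores in table.items()}
print(overall)
```

The computed values match the Overall column (0.7144, 0.8162, 0.8718), which is a handy consistency check before quoting the numbers in a demo.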
openenv.yaml CHANGED
@@ -23,7 +23,7 @@ description: >
   transferable skill, not memorized identities.

 api:
-  base_url: http://0.0.0.0:7860
+  base_url: https://xcodeaddy-sentinel-env.hf.space
   endpoints:
     health:
       method: GET
@@ -140,9 +140,10 @@ baseline:
   script: inference.py
   required_env_vars: [API_BASE_URL, MODEL_NAME, HF_TOKEN]
   optional_env_vars: [ENV_URL]
-  latest_local_score: 0.7942
-  latest_local_episodes: 30
+  latest_local_score: 0.8162
+  latest_local_episodes: 60
+  comparison_artifact: outputs/baseline_comparison.png
 reproducibility:
   inference_temperature: 0.0
   agent: heuristic-trust-weighted
-  dataset_order: fixed SCN-TASK*-001 through SCN-TASK*-010 per task
+  dataset_order: fixed SCN-TASK*-001 through SCN-TASK*-020 per task
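A demo script that needs the deployed endpoint can pull the new `api.base_url` straight out of `openenv.yaml`. A rough dependency-free sketch (real code would prefer PyYAML; the inline string here stands in for reading the file):

```python
# Rough sketch: extract api.base_url from openenv.yaml without a YAML parser.
# The inline string stands in for Path("openenv.yaml").read_text().
text = "api:\n  base_url: https://xcodeaddy-sentinel-env.hf.space\n"
base_url = next(
    line.split(":", 1)[1].strip()
    for line in text.splitlines()
    if line.strip().startswith("base_url:")
)
print(base_url)
```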
outputs/baseline_comparison.png ADDED
outputs/baseline_scores.json ADDED
@@ -0,0 +1,531 @@
+{
+  "model": "heuristic-baseline",
+  "total_episodes": 30,
+  "avg_score": 0.7942,
+  "by_task": {
+    "task1": {
+      "episodes": 10,
+      "avg_score": 0.8706
+    },
+    "task2": {
+      "episodes": 10,
+      "avg_score": 0.7475
+    },
+    "task3": {
+      "episodes": 10,
+      "avg_score": 0.7646
+    }
+  },
+  "episodes": [
+    {
+      "scenario_id": "SCN-TASK1-001",
+      "task_type": "task1",
+      "steps": 13,
+      "score": 0.765,
+      "total_reward": 10.71,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.473,
+        "S1": 0.743,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-002",
+      "task_type": "task1",
+      "steps": 12,
+      "score": 0.7962,
+      "total_reward": 10.35,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.473,
+        "S1": 0.888,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-003",
+      "task_type": "task1",
+      "steps": 11,
+      "score": 0.885,
+      "total_reward": 10.62,
+      "completion_rate": 0.9,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.296,
+        "S1": 0.296,
+        "S2": 0.94,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-004",
+      "task_type": "task1",
+      "steps": 8,
+      "score": 0.99,
+      "total_reward": 8.91,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.931,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-005",
+      "task_type": "task1",
+      "steps": 11,
+      "score": 0.9375,
+      "total_reward": 11.25,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.86,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-006",
+      "task_type": "task1",
+      "steps": 8,
+      "score": 0.85,
+      "total_reward": 7.65,
+      "completion_rate": 0.6,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.71,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-007",
+      "task_type": "task1",
+      "steps": 10,
+      "score": 0.99,
+      "total_reward": 10.89,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.943,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-008",
+      "task_type": "task1",
+      "steps": 11,
+      "score": 0.8325,
+      "total_reward": 9.99,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.482,
+        "S1": 0.9,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-009",
+      "task_type": "task1",
+      "steps": 9,
+      "score": 0.864,
+      "total_reward": 8.64,
+      "completion_rate": 0.7,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.492,
+        "S1": 0.801,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-010",
+      "task_type": "task1",
+      "steps": 12,
+      "score": 0.7962,
+      "total_reward": 10.35,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.494,
+        "S1": 0.885,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-001",
+      "task_type": "task2",
+      "steps": 19,
+      "score": 0.6054,
+      "total_reward": 12.1087,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.476,
+        "S1": 0.26,
+        "S2": 0.717,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-002",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.7762,
+      "total_reward": 13.9711,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.478,
+        "S1": 0.958,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-003",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.7377,
+      "total_reward": 13.2781,
+      "completion_rate": 0.867,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.289,
+        "S1": 0.289,
+        "S2": 0.818,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-004",
+      "task_type": "task2",
+      "steps": 15,
+      "score": 0.7783,
+      "total_reward": 12.4521,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.9,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-005",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.8174,
+      "total_reward": 14.7129,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.849,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-006",
+      "task_type": "task2",
+      "steps": 15,
+      "score": 0.6476,
+      "total_reward": 10.3617,
+      "completion_rate": 0.733,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.708,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-007",
+      "task_type": "task2",
+      "steps": 15,
+      "score": 0.8967,
+      "total_reward": 14.3478,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.967,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-008",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.7442,
+      "total_reward": 13.3953,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.49,
+        "S1": 0.959,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-009",
+      "task_type": "task2",
+      "steps": 16,
+      "score": 0.7525,
+      "total_reward": 12.792,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.492,
+        "S1": 0.906,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-010",
+      "task_type": "task2",
+      "steps": 18,
+      "score": 0.7191,
+      "total_reward": 13.6622,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.474,
+        "S1": 0.955,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-001",
+      "task_type": "task3",
+      "steps": 25,
+      "score": 0.7354,
+      "total_reward": 19.1204,
+      "completion_rate": 0.85,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.456,
+        "S1": 0.258,
+        "S2": 0.76,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-002",
+      "task_type": "task3",
+      "steps": 25,
+      "score": 0.7054,
+      "total_reward": 18.341,
+      "completion_rate": 0.85,
+      "adversarial_detections": 3,
+      "adversarial_poisonings": 5,
+      "final_trust": {
+        "S0": 0.458,
+        "S1": 0.473,
+        "S2": 0.868,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-003",
+      "task_type": "task3",
+      "steps": 19,
+      "score": 0.6438,
+      "total_reward": 12.8767,
+      "completion_rate": 0.6,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 5,
+      "final_trust": {
+        "S0": 0.299,
+        "S1": 0.299,
+        "S2": 0.633,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-004",
+      "task_type": "task3",
+      "steps": 21,
+      "score": 0.8954,
+      "total_reward": 19.6992,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.93,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-005",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.7134,
+      "total_reward": 17.8339,
+      "completion_rate": 0.85,
+      "adversarial_detections": 3,
+      "adversarial_poisonings": 6,
+      "final_trust": {
+        "S0": 0.491,
+        "S1": 0.797,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-006",
+      "task_type": "task3",
+      "steps": 23,
+      "score": 0.7857,
+      "total_reward": 18.8578,
+      "completion_rate": 0.9,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.774,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-007",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.7045,
+      "total_reward": 17.6133,
+      "completion_rate": 0.85,
+      "adversarial_detections": 3,
+      "adversarial_poisonings": 7,
+      "final_trust": {
+        "S0": 0.498,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-008",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.8057,
+      "total_reward": 20.1435,
+      "completion_rate": 0.95,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.479,
+        "S1": 0.856,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-009",
+      "task_type": "task3",
+      "steps": 23,
+      "score": 0.8456,
+      "total_reward": 20.2932,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.488,
+        "S1": 0.891,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-010",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.8106,
+      "total_reward": 20.2645,
+      "completion_rate": 0.95,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.473,
+        "S1": 0.91,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    }
+  ]
+}
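The `by_task` averages in this file can be reproduced from the episode rows with the same round-to-4 convention the evaluator's `_avg` helper uses. A small sketch using the ten task1 scores listed above:

```python
# Recompute the task1 average from the ten task1 episode scores above,
# using the evaluator's round(sum / len, 4) convention.
task1_scores = [0.765, 0.7962, 0.885, 0.99, 0.9375, 0.85, 0.99, 0.8325, 0.864, 0.7962]
avg_task1 = round(sum(task1_scores) / max(1, len(task1_scores)), 4)
print(avg_task1)
```

The result matches the file's `"task1": {"avg_score": 0.8706}` entry.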
outputs/evaluation_results.json ADDED
The diff for this file is too large to render. See raw diff
 
training/evaluate.py CHANGED
@@ -3,7 +3,9 @@ from __future__ import annotations
 import argparse
 import json
 import random
+import struct
 import sys
+import zlib
 from pathlib import Path
 from typing import Callable

@@ -16,6 +18,8 @@ from environment import SentinelEnv, _GROUND_TRUTH_RELIABILITY

 Policy = Callable[[SentinelEnv, dict, random.Random], dict]

+POLICIES: dict[str, Policy] = {}
+

 def random_policy(env: SentinelEnv, obs: dict, rng: random.Random) -> dict:
     specialist = rng.choice(obs["available_specialists"])
@@ -117,11 +121,162 @@ def _avg(rows: list[dict], key: str) -> float:
     return round(sum(float(row.get(key, 0.0)) for row in rows) / max(1, len(rows)), 4)


+def summarize_by_task(rows: list[dict]) -> dict:
+    grouped: dict[str, list[dict]] = {}
+    for row in rows:
+        grouped.setdefault(row["task_type"], []).append(row)
+    return {task: summarize(task_rows) for task, task_rows in sorted(grouped.items())}
+
+
+FONT_5X7 = {
+    " ": ["00000", "00000", "00000", "00000", "00000", "00000", "00000"],
+    "-": ["00000", "00000", "00000", "11111", "00000", "00000", "00000"],
+    ".": ["00000", "00000", "00000", "00000", "00000", "01100", "01100"],
+    ":": ["00000", "01100", "01100", "00000", "01100", "01100", "00000"],
+    "0": ["01110", "10001", "10011", "10101", "11001", "10001", "01110"],
+    "1": ["00100", "01100", "00100", "00100", "00100", "00100", "01110"],
+    "2": ["01110", "10001", "00001", "00010", "00100", "01000", "11111"],
+    "3": ["11110", "00001", "00001", "01110", "00001", "00001", "11110"],
+    "4": ["00010", "00110", "01010", "10010", "11111", "00010", "00010"],
+    "5": ["11111", "10000", "10000", "11110", "00001", "00001", "11110"],
+    "6": ["01110", "10000", "10000", "11110", "10001", "10001", "01110"],
+    "7": ["11111", "00001", "00010", "00100", "01000", "01000", "01000"],
+    "8": ["01110", "10001", "10001", "01110", "10001", "10001", "01110"],
+    "9": ["01110", "10001", "10001", "01111", "00001", "00001", "01110"],
+    "A": ["01110", "10001", "10001", "11111", "10001", "10001", "10001"],
+    "B": ["11110", "10001", "10001", "11110", "10001", "10001", "11110"],
+    "C": ["01110", "10001", "10000", "10000", "10000", "10001", "01110"],
+    "D": ["11110", "10001", "10001", "10001", "10001", "10001", "11110"],
+    "E": ["11111", "10000", "10000", "11110", "10000", "10000", "11111"],
+    "F": ["11111", "10000", "10000", "11110", "10000", "10000", "10000"],
+    "G": ["01110", "10001", "10000", "10111", "10001", "10001", "01110"],
+    "H": ["10001", "10001", "10001", "11111", "10001", "10001", "10001"],
+    "I": ["01110", "00100", "00100", "00100", "00100", "00100", "01110"],
+    "J": ["00001", "00001", "00001", "00001", "10001", "10001", "01110"],
+    "K": ["10001", "10010", "10100", "11000", "10100", "10010", "10001"],
+    "L": ["10000", "10000", "10000", "10000", "10000", "10000", "11111"],
+    "M": ["10001", "11011", "10101", "10101", "10001", "10001", "10001"],
+    "N": ["10001", "11001", "10101", "10011", "10001", "10001", "10001"],
+    "O": ["01110", "10001", "10001", "10001", "10001", "10001", "01110"],
+    "P": ["11110", "10001", "10001", "11110", "10000", "10000", "10000"],
+    "Q": ["01110", "10001", "10001", "10001", "10101", "10010", "01101"],
+    "R": ["11110", "10001", "10001", "11110", "10100", "10010", "10001"],
+    "S": ["01111", "10000", "10000", "01110", "00001", "00001", "11110"],
+    "T": ["11111", "00100", "00100", "00100", "00100", "00100", "00100"],
+    "U": ["10001", "10001", "10001", "10001", "10001", "10001", "01110"],
+    "V": ["10001", "10001", "10001", "10001", "10001", "01010", "00100"],
+    "W": ["10001", "10001", "10001", "10101", "10101", "10101", "01010"],
+    "X": ["10001", "10001", "01010", "00100", "01010", "10001", "10001"],
+    "Y": ["10001", "10001", "01010", "00100", "00100", "00100", "00100"],
+    "Z": ["11111", "00001", "00010", "00100", "01000", "10000", "11111"],
+}
+
+
+def write_baseline_chart(payload: dict, path: Path) -> None:
+    """Write a dependency-free PNG chart for README and onsite demos."""
+    by_task = payload["by_task"]
+    tasks = list(by_task.keys())
+    policies = [name for name in ("random", "heuristic", "oracle_lite") if any(name in by_task[t] for t in tasks)]
+    colors = {
+        "random": (239, 68, 68),
+        "heuristic": (59, 130, 246),
+        "oracle_lite": (16, 185, 129),
+    }
+    labels = {"random": "RANDOM", "heuristic": "HEURISTIC", "oracle_lite": "ORACLE LITE"}
+
+    width, height = 1200, 720
+    canvas = bytearray([255, 255, 255] * width * height)
+
+    def rect(x0: int, y0: int, x1: int, y1: int, color: tuple[int, int, int]) -> None:
+        x0, y0 = max(0, x0), max(0, y0)
+        x1, y1 = min(width, x1), min(height, y1)
+        for y in range(y0, y1):
+            row = y * width * 3
+            for x in range(x0, x1):
+                idx = row + x * 3
+                canvas[idx : idx + 3] = bytes(color)
+
+    def text(x: int, y: int, value: str, color: tuple[int, int, int] = (20, 20, 20), scale: int = 2) -> None:
+        cursor = x
+        for ch in value.upper():
+            glyph = FONT_5X7.get(ch, FONT_5X7[" "])
+            for gy, line in enumerate(glyph):
+                for gx, bit in enumerate(line):
+                    if bit == "1":
+                        rect(cursor + gx * scale, y + gy * scale, cursor + (gx + 1) * scale, y + (gy + 1) * scale, color)
+            cursor += 6 * scale
+
+    def line_h(y: int, x0: int, x1: int, color: tuple[int, int, int]) -> None:
+        rect(x0, y, x1, y + 1, color)
+
+    def line_v(x: int, y0: int, y1: int, color: tuple[int, int, int]) -> None:
+        rect(x, y0, x + 1, y1, color)
+
+    margin_left, margin_top, margin_right, margin_bottom = 100, 115, 40, 115
+    plot_x0, plot_y0 = margin_left, margin_top
+    plot_x1, plot_y1 = width - margin_right, height - margin_bottom
+    plot_w, plot_h = plot_x1 - plot_x0, plot_y1 - plot_y0
+
+    text(50, 28, "SENTINEL BASELINE COMPARISON", (17, 24, 39), 3)
+    text(52, 70, "EPISODE SCORE 0.0 TO 1.0 - RANDOM VS TRUST WEIGHTED VS ORACLE LITE", (75, 85, 99), 2)
+
+    for tick in (0.0, 0.25, 0.5, 0.75, 1.0):
+        y = int(plot_y1 - tick * plot_h)
+        line_h(y, plot_x0, plot_x1, (226, 232, 240))
+        text(32, y - 7, f"{tick:.2f}", (100, 116, 139), 2)
+    line_v(plot_x0, plot_y0, plot_y1, (148, 163, 184))
+    line_h(plot_y1, plot_x0, plot_x1, (148, 163, 184))
+
+    group_w = plot_w / max(1, len(tasks))
+    bar_w = max(34, min(76, int((group_w - 80) / max(1, len(policies)))))
+    for task_idx, task in enumerate(tasks):
+        group_center = int(plot_x0 + group_w * task_idx + group_w / 2)
+        start_x = group_center - int((len(policies) * bar_w + (len(policies) - 1) * 18) / 2)
+        for policy_idx, policy in enumerate(policies):
+            value = float(by_task[task].get(policy, {}).get("avg_score", 0.0))
+            x0 = start_x + policy_idx * (bar_w + 18)
+            y0 = int(plot_y1 - value * plot_h)
+            rect(x0 + 3, y0 + 3, x0 + bar_w + 3, plot_y1 + 3, (203, 213, 225))
+            rect(x0, y0, x0 + bar_w, plot_y1, colors[policy])
+            text(x0 - 4, max(plot_y0 - 2, y0 - 24), f"{value:.2f}", (15, 23, 42), 2)
+        text(group_center - 36, plot_y1 + 30, task.upper(), (15, 23, 42), 2)
+
+    legend_x, legend_y = 780, 32
+    for idx, policy in enumerate(policies):
+        x = legend_x
+        y = legend_y + idx * 24
+        rect(x, y, x + 16, y + 16, colors[policy])
+        text(x + 24, y + 1, labels[policy], (51, 65, 85), 2)
+
+    path.parent.mkdir(parents=True, exist_ok=True)
+    _write_png(path, width, height, canvas)
+
+
+def _write_png(path: Path, width: int, height: int, rgb: bytearray) -> None:
+    def chunk(tag: bytes, data: bytes) -> bytes:
+        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", zlib.crc32(tag + data) & 0xFFFFFFFF)
+
+    rows = []
+    stride = width * 3
+    for y in range(height):
+        rows.append(b"\x00" + bytes(rgb[y * stride : (y + 1) * stride]))
+    raw = b"".join(rows)
+    png = (
+        b"\x89PNG\r\n\x1a\n"
+        + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))
+        + chunk(b"IDAT", zlib.compress(raw, 9))
+        + chunk(b"IEND", b"")
+    )
+    path.write_bytes(png)
+
+
 def main() -> None:
     parser = argparse.ArgumentParser(description="Evaluate SENTINEL policies.")
     parser.add_argument("--episodes", type=int, default=20, help="Episodes per policy.")
-    parser.add_argument("--task", default="task3", choices=["task1", "task2", "task3"])
+    parser.add_argument("--task", default="task3", choices=["task1", "task2", "task3", "all"])
     parser.add_argument("--out", default="outputs/evaluation_results.json")
+    parser.add_argument("--plot", default="outputs/baseline_comparison.png")
+    parser.add_argument("--no-plot", action="store_true")
    args = parser.parse_args()

     policies: dict[str, Policy] = {
@@ -130,23 +285,32 @@ def main() -> None:
         "oracle_lite": oracle_lite_policy,
     }

+    tasks = ["task1", "task2", "task3"] if args.task == "all" else [args.task]
     rows = []
-    for policy_name, policy in policies.items():
-        for seed in range(args.episodes):
-            rows.append(run_episode(policy_name, policy, args.task, seed))
+    for task_type in tasks:
+        for policy_name, policy in policies.items():
+            for seed in range(args.episodes):
+                rows.append(run_episode(policy_name, policy, task_type, seed))

     payload = {
         "task": args.task,
+        "tasks": tasks,
         "episodes_per_policy": args.episodes,
         "summary": summarize(rows),
+        "by_task": summarize_by_task(rows),
         "episodes": rows,
     }

     out_path = ROOT / args.out
     out_path.parent.mkdir(parents=True, exist_ok=True)
     out_path.write_text(json.dumps(payload, indent=2) + "\n")
+    if not args.no_plot:
+        chart_path = ROOT / args.plot
+        write_baseline_chart(payload, chart_path)
+        payload["chart"] = str(chart_path.relative_to(ROOT))
+        out_path.write_text(json.dumps(payload, indent=2) + "\n")

-    print(json.dumps(payload["summary"], indent=2))
+    print(json.dumps({"summary": payload["summary"], "by_task": payload["by_task"], "chart": payload.get("chart")}, indent=2))


 if __name__ == "__main__":
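The `_write_png` helper added above leans only on `struct` and `zlib`: every scanline gets filter byte 0 prepended, the rows are zlib-compressed into a single IDAT chunk, and each chunk carries a CRC32 over its tag plus data. A self-contained sketch of the same encoding, down to a two-pixel image:

```python
import struct
import zlib

def tiny_png(width: int, height: int, rgb: bytes) -> bytes:
    # Chunk layout: 4-byte big-endian length, tag, data, CRC32(tag + data).
    def chunk(tag: bytes, data: bytes) -> bytes:
        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", zlib.crc32(tag + data) & 0xFFFFFFFF)

    stride = width * 3
    # Filter byte 0 (None) in front of every scanline, then one zlib stream.
    raw = b"".join(b"\x00" + rgb[y * stride : (y + 1) * stride] for y in range(height))
    return (
        b"\x89PNG\r\n\x1a\n"  # PNG signature
        + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))  # 8-bit truecolor RGB
        + chunk(b"IDAT", zlib.compress(raw, 9))
        + chunk(b"IEND", b"")
    )

png = tiny_png(2, 1, bytes([255, 0, 0, 0, 0, 255]))  # one red, one blue pixel
print(len(png))
```

Any image viewer will open the result, which is all the chart writer needs without pulling in matplotlib or Pillow.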
training/train.py CHANGED
@@ -1,11 +1,11 @@
 from __future__ import annotations

 """
-Minimal onsite training entrypoint.
+Onsite training entrypoint.

 This file is intentionally import-light so it can run locally without GPU
 packages. On the finale machine, install the training extras from pyproject and
-use this script as the GRPO wiring point.
+run without --dry-run to train a small orchestrator policy with GRPO.
 """

 import argparse
@@ -37,6 +37,24 @@ def build_prompt(observation: dict) -> str:
     )


+def build_dataset_records(episodes: int, task_type: str, seed: int) -> list[dict]:
+    records = []
+    task_choices = ["task1", "task2", "task3"] if task_type == "all" else [task_type]
+    for idx in range(episodes):
+        selected_task = task_choices[idx % len(task_choices)]
+        env = SentinelEnv()
+        result = env.reset(task_type=selected_task, seed=seed + idx)
+        obs = result["observation"]
+        records.append(
+            {
+                "prompt": build_prompt(obs),
+                "task_type": selected_task,
+                "seed": seed + idx,
+            }
+        )
+    return records
+
+
 def parse_action(text: str, observation: dict) -> dict:
     match = ACTION_RE.search(text or "")
     payload = {}
@@ -66,6 +84,44 @@ def parse_action(text: str, observation: dict) -> dict:
     }

@@ -88,30 +144,81 @@ def dry_run_rollouts(episodes: int, seed: int) -> dict:
     return {"episodes": episodes, "avg_score": round(sum(scores) / max(1, len(scores)), 4)}


 def main() -> None:
     parser = argparse.ArgumentParser(description="SENTINEL GRPO training harness.")
     parser.add_argument("--dry-run", action="store_true", help="Run local rollouts without GPU dependencies.")
     parser.add_argument("--episodes", type=int, default=5)
     parser.add_argument("--seed", type=int, default=0)
     args = parser.parse_args()

     if args.dry_run:
         print(json.dumps(dry_run_rollouts(args.episodes, args.seed), indent=2))
         return

-    try:
-        import trl  # noqa: F401
-        import unsloth  # noqa: F401
-    except ImportError as exc:
-        raise SystemExit(
-            "Training dependencies are not installed. Run with --dry-run locally, "
-            "or install the pyproject training extras on the finale GPU machine."
-        ) from exc
-
-    raise SystemExit(
-        "GPU training hook is ready. Wire GRPOTrainer here using build_prompt(), "
-        "parse_action(), and SentinelEnv.step() as the reward source."
-    )


 if __name__ == "__main__":
 
84
  }
85
 
86
 
87
+ def score_completion(completion: str, task_type: str, seed: int) -> float:
88
+ env = SentinelEnv()
89
+ result = env.reset(task_type=task_type, seed=seed)
90
+ obs = result["observation"]
91
+ action = parse_action(completion, obs)
92
+ result = env.step(action)
93
+ return float(result["reward"]["value"])
94
+
95
+
96
+ def sentinel_reward(completions, prompts=None, task_type=None, seed=None, **kwargs):
97
+ rewards = []
98
+ task_values = task_type or kwargs.get("task_type") or ["task3"] * len(completions)
99
+ seed_values = seed or kwargs.get("seed") or list(range(len(completions)))
100
+ for idx, completion in enumerate(completions):
101
+ text = _completion_text(completion)
102
+ try:
103
+ rewards.append(score_completion(text, str(task_values[idx]), int(seed_values[idx])))
104
+ except Exception:
105
+ rewards.append(0.01)
106
+ return rewards
107
+
108
+
109
+ def _completion_text(completion) -> str:
110
+ if isinstance(completion, str):
111
+ return completion
112
+ if isinstance(completion, list):
113
+ parts = []
114
+ for item in completion:
115
+ if isinstance(item, dict):
116
+ parts.append(str(item.get("content", "")))
117
+ else:
118
+ parts.append(str(item))
119
+ return "\n".join(parts)
120
+ if isinstance(completion, dict):
121
+ return str(completion.get("content", completion))
122
+ return str(completion)
123
+
124
+
125
  def dry_run_rollouts(episodes: int, seed: int) -> dict:
126
  rng = random.Random(seed)
127
  scores = []
 
144
  return {"episodes": episodes, "avg_score": round(sum(scores) / max(1, len(scores)), 4)}
145
 
146
 
147
+ def run_grpo(args) -> None:
148
+ try:
149
+ from datasets import Dataset
150
+ from trl import GRPOConfig, GRPOTrainer
151
+ from unsloth import FastLanguageModel
152
+ except ImportError:
153
+ print("Training dependencies are not installed locally.")
154
+ print("Local check passed. For onsite GPU training run:")
155
+ print(" pip install '.[training]'")
156
+ print(" python training/train.py --episodes 300 --task all")
157
+ return
158
+
159
+ records = build_dataset_records(args.episodes, args.task, args.seed)
160
+ dataset = Dataset.from_list(records)
161
+
162
+ model, tokenizer = FastLanguageModel.from_pretrained(
163
+ model_name=args.model,
164
+ max_seq_length=args.max_seq_length,
165
+ load_in_4bit=True,
166
+ )
167
+ model = FastLanguageModel.get_peft_model(
168
+ model,
169
+ r=args.lora_rank,
170
+ target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
171
+ lora_alpha=args.lora_rank,
172
+ )
173
+
174
+ config = GRPOConfig(
175
+ output_dir=args.output_dir,
176
+ learning_rate=args.learning_rate,
177
+ num_train_epochs=args.epochs,
178
+ per_device_train_batch_size=args.batch_size,
179
+ logging_steps=10,
180
+ save_steps=50,
181
+ max_prompt_length=args.max_seq_length,
182
+ max_completion_length=192,
183
+ )
184
+
185
+ trainer_kwargs = {
186
+ "model": model,
187
+ "reward_funcs": [sentinel_reward],
188
+ "args": config,
189
+ "train_dataset": dataset,
190
+ }
191
+ try:
192
+ trainer = GRPOTrainer(processing_class=tokenizer, **trainer_kwargs)
193
+ except TypeError:
194
+ trainer = GRPOTrainer(tokenizer=tokenizer, **trainer_kwargs)
195
+
196
+ trainer.train()
197
+ model.save_pretrained(args.output_dir)
198
+ tokenizer.save_pretrained(args.output_dir)
199
+ print(f"Training complete. Saved LoRA adapter to {args.output_dir}")
200
+
201
+
202
  def main() -> None:
203
  parser = argparse.ArgumentParser(description="SENTINEL GRPO training harness.")
204
  parser.add_argument("--dry-run", action="store_true", help="Run local rollouts without GPU dependencies.")
205
  parser.add_argument("--episodes", type=int, default=5)
206
  parser.add_argument("--seed", type=int, default=0)
207
+ parser.add_argument("--task", default="task3", choices=["task1", "task2", "task3", "all"])
208
+ parser.add_argument("--model", default="unsloth/Qwen2.5-1.5B-Instruct")
209
+ parser.add_argument("--output-dir", default="training/sentinel_model")
210
+ parser.add_argument("--epochs", type=int, default=1)
211
+ parser.add_argument("--batch-size", type=int, default=2)
212
+ parser.add_argument("--learning-rate", type=float, default=5e-6)
213
+ parser.add_argument("--max-seq-length", type=int, default=1024)
214
+ parser.add_argument("--lora-rank", type=int, default=16)
215
  args = parser.parse_args()
216
 
217
  if args.dry_run:
218
  print(json.dumps(dry_run_rollouts(args.episodes, args.seed), indent=2))
219
  return
220
 
221
+ run_grpo(args)
 
 
 
 
 
 
 
 
 
 
 
 
222
 
223
 
224
  if __name__ == "__main__":
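A TRL reward function can receive completions in several shapes depending on the dataset format: a plain string, a chat-style list of message dicts, or a single dict. The `_completion_text` helper in this commit normalizes all three before scoring; a self-contained replica of that logic, runnable without the rest of the repo:

```python
def completion_text(completion) -> str:
    """Collapse a completion (str, chat-message list, or dict) into plain text."""
    if isinstance(completion, str):
        return completion
    if isinstance(completion, list):
        parts = []
        for item in completion:
            if isinstance(item, dict):
                # Chat-style message: keep only the content field.
                parts.append(str(item.get("content", "")))
            else:
                parts.append(str(item))
        return "\n".join(parts)
    if isinstance(completion, dict):
        return str(completion.get("content", completion))
    # Last resort: stringify anything else rather than raise.
    return str(completion)


print(completion_text("ACTION: delegate"))                                      # plain string
print(completion_text([{"role": "assistant", "content": "step one"}, "two"]))   # chat list
print(completion_text({"content": "single message"}))                           # single dict
```

Pairing this normalization with the `try/except` in `sentinel_reward`, which falls back to a small positive reward (`0.01`) on any scoring failure, keeps a single malformed completion from crashing a whole rollout batch.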