# SENTINEL Training Runbook

This is the exact path for training SENTINEL during the hackathon without
putting GPU work inside the Hugging Face Space runtime.

## Mental Model

SENTINEL is not trained from a static CSV of prompt-answer pairs.

The loop is:

```text
reset() observation -> model emits JSON action -> step(action) -> reward -> GRPO update
```

The environment is the dataset generator and the reward engine is the teacher.
The scripted specialists/workers are not trained. The first trained model is the
orchestrator policy that chooses actions.
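
A minimal sketch of one rollout, assuming a Gym-style `reset()`/`step()` interface; `rollout_episode` and the `policy` callable are illustrative stand-ins, not the real entry points in `training/train.py`:

```python
import json

def rollout_episode(env, policy):
    """One rollout: observation in, JSON action out, reward back.
    `env` and `policy` are hypothetical stand-ins for the real objects."""
    obs = env.reset()
    episode = []
    done = False
    while not done:
        action_json = policy(obs)                # model emits a JSON action
        obs, reward, done, info = env.step(json.loads(action_json))
        episode.append((action_json, reward))    # grouped later for the GRPO update
    return episode
```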

## Data We Have

Abstract trust environment:

```text
task1: 40 scenarios x 10 subtasks = 400 nodes
task2: 40 scenarios x 15 subtasks = 600 nodes
task3: 40 scenarios x 20 subtasks = 800 nodes
total: 120 scenarios, 1,800 subtask nodes
```

GPU cluster environment:

```text
task1: 10 jobs, 8 GPUs, 30 steps
task2: 20 jobs, 12 GPUs, 60 steps
task3: 30 jobs, 16 GPUs, 120 steps
```

The cluster environment is procedural. Changing the seed creates new job
queues, hidden worker shuffles, attacks, and failure traces.
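
To make the procedural point concrete, here is an illustrative sketch of seed-driven generation; the field names and ranges are invented, not the real generator's:

```python
import random

def make_job_queue(seed: int, n_jobs: int) -> list[dict]:
    """Same seed -> same queue; new seed -> fresh jobs. Illustrative only;
    the real generator lives in the cluster environment."""
    rng = random.Random(seed)
    return [
        {"job_id": i, "gpus": rng.choice([1, 2, 4]), "steps": rng.randint(5, 20)}
        for i in range(n_jobs)
    ]
```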

## SFT vs GRPO

Use SFT when you already have ideal demonstrations:

```text
prompt -> ideal JSON action
```

Use GRPO/RL when you can verify actions programmatically:

```text
prompt -> sampled JSON action -> environment reward
```

For SENTINEL, GRPO is the right headline because the reward is objective:
completion, detection, calibration, efficiency, and anti-hack signals. A small
SFT warmup can be added later by recording heuristic/oracle actions, but it is
not required for the first demo.
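
As a sketch of what "objective reward" means here, the signals fold into one scalar. The weights below are placeholders; the real values live in the environment's reward engine:

```python
def scalar_reward(signals: dict[str, float]) -> float:
    """Combine the verifiable signals into one scalar for GRPO.
    Placeholder weights; not the environment's actual values."""
    weights = {
        "completion": 1.0,
        "detection": 0.5,
        "calibration": 0.25,
        "efficiency": 0.1,
        "anti_hack": 1.0,  # anti-hack signal is expected to be <= 0
    }
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())
```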

## Colab Free T4 Flow

1. Open `training/colab_notebook.ipynb` in Google Colab.
2. Runtime -> Change runtime type -> T4 GPU (see the GPU check below).
3. Run cells 1-4 to install dependencies and log in to Hugging Face.
4. Run a smoke training with 50-100 episodes.
5. Run the full training with 200 episodes when the smoke run looks good.
6. Generate replay JSONL and charts.
7. Commit `outputs/charts/*.png` and `outputs/trained_policy_replay.jsonl`.
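
After step 2, confirm the T4 is actually attached before burning episodes:

```python
import torch

# Should print True and "Tesla T4" on a correctly configured Colab runtime.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU")
```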

## Why Replay Exists

The live Hugging Face Space should stay cheap and deterministic. It should not
load Qwen or a LoRA adapter at runtime.

After Colab training, the notebook records the trained model's actions:

```json
{"task_type":"task3","seed":42,"step":7,"action":{"action_type":"verify","specialist_id":"S0"}}
```

The Space can replay those actions as a fourth policy called `GRPO`. If the
current seed is missing from the replay table, it falls back to the heuristic
and marks the row as a replay miss.
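
A minimal sketch of that lookup, assuming the JSONL schema shown above; `load_replay` and `grpo_action` are illustrative names:

```python
import json

def load_replay(path: str) -> dict:
    """Index replay rows by (task_type, seed, step)."""
    table = {}
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            table[(row["task_type"], row["seed"], row["step"])] = row["action"]
    return table

def grpo_action(table, task_type, seed, step, heuristic_action):
    """Replay the trained action if recorded; otherwise fall back to the
    heuristic and flag the row as a replay miss."""
    action = table.get((task_type, seed, step))
    if action is None:
        return heuristic_action, True   # (action, replay_miss)
    return action, False
```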

## Commands

Pre-training baseline:

```bash
python training/evaluate.py --episodes 30 --task all \
  --out outputs/eval_pre.json --no-plot
```

Train:

```bash
python training/train.py \
  --episodes 200 --task all --seed 0 \
  --model unsloth/Qwen2.5-1.5B-Instruct \
  --epochs 1 --batch-size 2 --learning-rate 5e-6 \
  --lora-rank 16 --max-seq-length 1024 \
  --output-dir training/sentinel_qwen15_grpo
```

Record replay:

```python
from training.replay import record_trained_actions

record_trained_actions(
    adapter_path="training/sentinel_qwen15_grpo",
    base_model="unsloth/Qwen2.5-1.5B-Instruct",
    tasks=["task1", "task2", "task3"],
    seeds=range(30),
    out_path="outputs/trained_policy_replay.jsonl",
)
```

Post-training replay eval:

```bash
python training/evaluate.py --episodes 30 --task all \
  --policies random,heuristic,oracle_lite,trained \
  --replay outputs/trained_policy_replay.jsonl \
  --out outputs/eval_post.json --no-plot
```

Generate charts:

```bash
python -m training.plots \
  --pre outputs/eval_pre.json \
  --post outputs/eval_post.json \
  --trainer-state training/sentinel_qwen15_grpo/trainer_state.json \
  --reward-report-task3 outputs/reward_report_task3_seed42.json \
  --cluster-health outputs/cluster_health_history.json \
  --out-dir outputs/charts
```

## Hugging Face Token Usage

Use a Hugging Face token in Colab for:

- downloading gated/private models if needed,
- uploading the LoRA adapter to your namespace,
- pushing final chart/replay artifacts if you commit from Colab.

The Space itself does not need GPU to run the replay demo.
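
A minimal sketch of the token flow with `huggingface_hub`; the repo id below is a placeholder, pick your own namespace:

```python
from huggingface_hub import login, upload_folder

# Prompts for the token; never hardcode it in the notebook.
login()

# Push the LoRA adapter to your namespace (placeholder repo id).
upload_folder(
    repo_id="XcodeAddy/sentinel-qwen15-grpo-lora",
    folder_path="training/sentinel_qwen15_grpo",
    repo_type="model",
)
```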

## Hugging Face App URLs

Use these two Hugging Face URLs for different jobs:

```text
https://huggingface.co/spaces/XcodeAddy/sentinel-env
```

This is the Space repository/settings page. Use it to inspect files, Settings,
hardware, build logs, variables, secrets, and commits. It is not the iframe app
URL you demo to judges.

```text
https://xcodeaddy-sentinel-env.hf.space/
```

This is the real live app URL. Use this for the dashboard, API smoke tests, and
OpenEnv base URL.

When running locally, start uvicorn with `--host 0.0.0.0`, but open the browser
at `http://127.0.0.1:7860/` or `http://localhost:7860/`. Do not browse to
`http://0.0.0.0:7860/`; `0.0.0.0` is only a bind address.
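
A quick smoke test against the live app URL (this assumes the dashboard answers on the root path):

```python
import requests

# Hit the real app URL, not the Space settings page.
resp = requests.get("https://xcodeaddy-sentinel-env.hf.space/", timeout=30)
print(resp.status_code)  # expect 200 when the Space is up
```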

## Hugging Face Credits

Best use:

- keep the Space on CPU for normal judging,
- optionally upgrade the Space to T4 only during the final live demo if the UI
  needs extra responsiveness,
- avoid doing full training inside the Space,
- use Hugging Face Jobs or Colab for the actual GRPO run.

The Space is for serving the environment and replay demo. Training belongs in
Colab or in a Hugging Face GPU Job.

HF Jobs smoke path:

```bash
.venv/bin/python training/launch_hf_job.py \
  --mode import-smoke \
  --timeout 45m

.venv/bin/python training/launch_hf_job.py \
  --mode train-smoke \
  --episodes 50 \
  --timeout 2h
```

If `import-smoke` passes, run the full job:

```bash
.venv/bin/python training/launch_hf_job.py \
  --mode train-full \
  --episodes 200 \
  --timeout 4h
```

The launcher uses `pytorch/pytorch:2.11.0-cuda12.8-cudnn9-devel` because the
current Unsloth stack pulls `torchao`, which expects torch `>=2.11`.

## Success Criteria

Before the final demo, make sure these exist:

```text
outputs/trained_policy_replay.jsonl
outputs/charts/baseline_grouped_bars.png
outputs/charts/grpo_reward_curve.png
outputs/charts/trust_evolution.png
outputs/charts/detection_vs_poisoning.png
outputs/charts/cluster_health_timeline.png
outputs/charts/task_radar.png
outputs/charts/ablation.png
outputs/charts/baseline_delta_lines.png
outputs/charts/cluster_health_policy_lines.png
outputs/charts/trust_gap_over_time.png
outputs/charts/reward_component_stacked_area.png
outputs/charts/failure_fishbone_map.png
```
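
A quick existence check over that list before you record the demo:

```python
from pathlib import Path

ARTIFACTS = [
    "outputs/trained_policy_replay.jsonl",
    *[f"outputs/charts/{name}.png" for name in (
        "baseline_grouped_bars", "grpo_reward_curve", "trust_evolution",
        "detection_vs_poisoning", "cluster_health_timeline", "task_radar",
        "ablation", "baseline_delta_lines", "cluster_health_policy_lines",
        "trust_gap_over_time", "reward_component_stacked_area",
        "failure_fishbone_map",
    )],
]

missing = [p for p in ARTIFACTS if not Path(p).exists()]
print("missing:", missing or "none")
```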

Then verify:

```bash
python -m pytest -q
python training/evaluate.py --episodes 5 --task task3 \
  --policies random,heuristic,oracle_lite,trained \
  --replay outputs/trained_policy_replay.jsonl
```