# SENTINEL Training Runbook
This runbook is the exact procedure for training SENTINEL during the hackathon
without putting GPU work inside the Hugging Face Space runtime.
## Mental Model
SENTINEL is not trained from a normal static CSV of prompt-answer pairs.
The loop is:
```text
reset() observation -> model emits JSON action -> step(action) -> reward -> GRPO update
```
The environment is the dataset generator and the reward engine is the teacher.
The scripted specialists/workers are not trained. The first trained model is the
orchestrator policy that chooses actions.
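In code, one episode of that loop looks roughly like the sketch below.
`ToySentinelEnv` and `toy_policy` are illustrative stand-ins, not the repo's
actual classes:
```python
import json

class ToySentinelEnv:
    """Toy stand-in for the real environment (illustrative, not the repo's
    API): reset() returns an observation, step() consumes a parsed action."""

    def reset(self):
        self.t = 0
        return {"step": self.t, "pending_specialists": ["S0", "S1"]}

    def step(self, action):
        self.t += 1
        reward = 1.0 if action.get("action_type") == "verify" else 0.0
        done = self.t >= 3
        return {"step": self.t}, reward, done, {}

def toy_policy(obs):
    # Stand-in for the LLM orchestrator: emits a JSON action string.
    return json.dumps({"action_type": "verify", "specialist_id": "S0"})

env = ToySentinelEnv()
obs, done = env.reset(), False
while not done:
    action = json.loads(toy_policy(obs))       # parse the emitted JSON action
    obs, reward, done, info = env.step(action)
    # GRPO scores `reward` against the group of sampled actions and updates
    # the orchestrator policy; the environment itself is never trained.
```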
## Data We Have
Abstract trust environment:
```text
task1: 40 scenarios x 10 subtasks = 400 nodes
task2: 40 scenarios x 15 subtasks = 600 nodes
task3: 40 scenarios x 20 subtasks = 800 nodes
total: 120 scenarios, 1,800 subtask nodes
```
GPU cluster environment:
```text
task1: 10 jobs, 8 GPUs, 30 steps
task2: 20 jobs, 12 GPUs, 60 steps
task3: 30 jobs, 16 GPUs, 120 steps
```
The cluster environment is procedural. Changing the seed creates new job
queues, hidden worker shuffles, attacks, and failure traces.
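Because generation is seed-driven, the same seed reproduces an episode exactly
and a new seed yields fresh content. A minimal sketch of the idea
(`generate_jobs` and its fields are illustrative, not the repo's generator):
```python
import random

def generate_jobs(seed, n_jobs=10):
    """Toy procedural generator: a new seed yields a new job queue.
    Field names are illustrative, not the repo's actual schema."""
    rng = random.Random(seed)
    return [
        {"job_id": i,
         "gpus_needed": rng.choice([1, 2, 4]),
         "duration": rng.randint(2, 8)}
        for i in range(n_jobs)
    ]

assert generate_jobs(0) == generate_jobs(0)   # same seed, same queue
assert generate_jobs(0) != generate_jobs(1)   # new seed, new queue
```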
## SFT vs GRPO
Use SFT when you already have ideal demonstrations:
```text
prompt -> ideal JSON action
```
Use GRPO/RL when you can verify actions programmatically:
```text
prompt -> sampled JSON action -> environment reward
```
For SENTINEL, GRPO is the right headline because the reward is objective:
completion, detection, calibration, efficiency, and anti-hack signals. A small
SFT warmup can be added later by recording heuristic/oracle actions, but it is
not required for the first demo.
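A hedged sketch of how such a verifiable reward can be combined into one
scalar; the component names mirror the list above, but the weights are
invented for illustration:
```python
def combine_reward(completion, detection, calibration, efficiency, anti_hack,
                   weights=(0.4, 0.25, 0.15, 0.1, 0.1)):
    """Weighted sum of the verifiable reward components.
    The weights here are illustrative, not the repo's tuned values."""
    parts = (completion, detection, calibration, efficiency, anti_hack)
    return sum(w * p for w, p in zip(weights, parts))

# e.g. a run that completes everything but misses one attack:
print(combine_reward(1.0, 0.5, 0.8, 0.9, 1.0))  # ~0.835
```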
## Colab Free T4 Flow
1. Open `training/colab_notebook.ipynb` in Google Colab.
2. Runtime -> Change runtime type -> T4 GPU.
3. Run cells 1-4 to install dependencies and log in to Hugging Face.
4. Run a smoke training with 50-100 episodes.
5. Run the full training with 200 episodes when the smoke run looks good.
6. Generate replay JSONL and charts.
7. Commit `outputs/charts/*.png` and `outputs/trained_policy_replay.jsonl`.
## Why Replay Exists
The live Hugging Face Space should stay cheap and deterministic. It should not
load Qwen or a LoRA adapter at runtime.
After Colab training, the notebook records the trained model's actions:
```json
{"task_type":"task3","seed":42,"step":7,"action":{"action_type":"verify","specialist_id":"S0"}}
```
The Space can replay those actions as a fourth policy called `GRPO`. If the
current seed is missing from the replay table, it falls back to the heuristic
and marks the row as a replay miss.
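A minimal sketch of that lookup-with-fallback, assuming the JSONL schema shown
above (`load_replay` and `grpo_action` are illustrative names):
```python
import json

def load_replay(path):
    """Index replay records by (task_type, seed, step)."""
    table = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            table[(rec["task_type"], rec["seed"], rec["step"])] = rec["action"]
    return table

def grpo_action(table, task_type, seed, step, heuristic_action):
    """Return the recorded action, or fall back and flag a replay miss."""
    action = table.get((task_type, seed, step))
    if action is None:
        return heuristic_action, True   # replay miss: mark the row
    return action, False
```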
## Commands
Pre-training baseline:
```bash
python training/evaluate.py --episodes 30 --task all \
--out outputs/eval_pre.json --no-plot
```
Train:
```bash
python training/train.py \
--episodes 200 --task all --seed 0 \
--model unsloth/Qwen2.5-1.5B-Instruct \
--epochs 1 --batch-size 2 --learning-rate 5e-6 \
--lora-rank 16 --max-seq-length 1024 \
--output-dir training/sentinel_qwen15_grpo
```
Record replay:
```python
from training.replay import record_trained_actions

record_trained_actions(
    adapter_path="training/sentinel_qwen15_grpo",
    base_model="unsloth/Qwen2.5-1.5B-Instruct",
    tasks=["task1", "task2", "task3"],
    seeds=range(30),
    out_path="outputs/trained_policy_replay.jsonl",
)
```
Post-training replay eval:
```bash
python training/evaluate.py --episodes 30 --task all \
--policies random,heuristic,oracle_lite,trained \
--replay outputs/trained_policy_replay.jsonl \
--out outputs/eval_post.json --no-plot
```
Generate charts:
```bash
python -m training.plots \
--pre outputs/eval_pre.json \
--post outputs/eval_post.json \
--trainer-state training/sentinel_qwen15_grpo/trainer_state.json \
--reward-report-task3 outputs/reward_report_task3_seed42.json \
--cluster-health outputs/cluster_health_history.json \
--out-dir outputs/charts
```
## Hugging Face Token Usage
Use a Hugging Face token in Colab for:
- downloading gated/private models if needed,
- uploading the LoRA adapter to your namespace,
- pushing final chart/replay artifacts if you commit from Colab.
The Space itself does not need GPU to run the replay demo.
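In a Colab cell, login and adapter upload are typically a couple of
`huggingface_hub` calls; the `repo_id` below is an example, not a fixed name:
```python
import os
from huggingface_hub import create_repo, login, upload_folder

# Token comes from a Colab secret or environment variable; never hard-code it.
login(token=os.environ["HF_TOKEN"])

# Push the trained LoRA adapter to your namespace.
# The repo_id is an example, not a fixed name from this project.
repo_id = "XcodeAddy/sentinel-qwen15-grpo"
create_repo(repo_id, repo_type="model", exist_ok=True)
upload_folder(
    folder_path="training/sentinel_qwen15_grpo",
    repo_id=repo_id,
    repo_type="model",
)
```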
## Hugging Face App URLs
These two Hugging Face URLs serve different purposes:
```text
https://huggingface.co/spaces/XcodeAddy/sentinel-env
```
This is the Space repository/settings page. Use it to inspect files, hardware,
build logs, variables, secrets, and commits. It is not the iframe app URL you
demo to judges.
```text
https://xcodeaddy-sentinel-env.hf.space/
```
This is the live app URL. Use it for the dashboard, for API smoke tests, and
as the OpenEnv base URL.
When running locally, start uvicorn with `--host 0.0.0.0`, but open the browser
at `http://127.0.0.1:7860/` or `http://localhost:7860/`. Do not browse to
`http://0.0.0.0:7860/`; `0.0.0.0` is only a bind address.
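For reference, the same local server started from Python instead of the CLI;
`"app:app"` is an assumed import string, so adjust it to the repo's actual
module:
```python
import uvicorn

# Bind on all interfaces so containers/Spaces can route traffic, then open
# http://127.0.0.1:7860/ in the browser (0.0.0.0 is only a bind address).
# "app:app" is an assumed import string, not confirmed by this repo.
uvicorn.run("app:app", host="0.0.0.0", port=7860)
```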
## Hugging Face Credits
Best use:
- keep the Space on CPU for normal judging,
- optionally upgrade the Space to T4 only during the final live demo if the UI
needs extra responsiveness,
- avoid doing full training inside the Space,
- use Hugging Face Jobs or Colab for the actual GRPO run.
The Space is for serving the environment and replay demo. Training belongs in
Colab or in a Hugging Face GPU Job.
HF Jobs smoke path:
```bash
.venv/bin/python training/launch_hf_job.py \
--mode import-smoke \
--timeout 45m
.venv/bin/python training/launch_hf_job.py \
--mode train-smoke \
--episodes 50 \
--timeout 2h
```
If `import-smoke` passes, run the full job:
```bash
.venv/bin/python training/launch_hf_job.py \
--mode train-full \
--episodes 200 \
--timeout 4h
```
The launcher uses `pytorch/pytorch:2.11.0-cuda12.8-cudnn9-devel` because the
current Unsloth stack pulls `torchao`, which expects torch `>=2.11`.
## Success Criteria
Before the final demo, make sure these exist:
```text
outputs/trained_policy_replay.jsonl
outputs/charts/baseline_grouped_bars.png
outputs/charts/grpo_reward_curve.png
outputs/charts/trust_evolution.png
outputs/charts/detection_vs_poisoning.png
outputs/charts/cluster_health_timeline.png
outputs/charts/task_radar.png
outputs/charts/ablation.png
outputs/charts/baseline_delta_lines.png
outputs/charts/cluster_health_policy_lines.png
outputs/charts/trust_gap_over_time.png
outputs/charts/reward_component_stacked_area.png
outputs/charts/failure_fishbone_map.png
```
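A quick preflight sketch that fails loudly if any artifact is missing (the
list mirrors the block above):
```python
from pathlib import Path

REQUIRED = [
    "outputs/trained_policy_replay.jsonl",
    *[f"outputs/charts/{name}.png" for name in (
        "baseline_grouped_bars", "grpo_reward_curve", "trust_evolution",
        "detection_vs_poisoning", "cluster_health_timeline", "task_radar",
        "ablation", "baseline_delta_lines", "cluster_health_policy_lines",
        "trust_gap_over_time", "reward_component_stacked_area",
        "failure_fishbone_map",
    )],
]

missing = [p for p in REQUIRED if not Path(p).exists()]
assert not missing, f"missing demo artifacts: {missing}"
```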
Then verify:
```bash
python -m pytest -q
python training/evaluate.py --episodes 5 --task task3 \
--policies random,heuristic,oracle_lite,trained \
--replay outputs/trained_policy_replay.jsonl
```