# SENTINEL Training Runbook

This is the exact path for training SENTINEL during the hackathon without
putting GPU work inside the Hugging Face Space runtime.

## Mental Model

SENTINEL is not trained from a normal static CSV of prompt-answer pairs.
The loop is:

```text
reset() -> observation -> model emits JSON action -> step(action) -> reward -> GRPO update
```

The environment is the dataset generator and the reward engine is the teacher.
The scripted specialists/workers are not trained. The first trained model is the
orchestrator policy that chooses actions.
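A minimal sketch of that loop, with toy stand-ins; `SentinelEnv` and
`sample_action` here are illustrative names, not the repo's actual API:

```python
# Sketch of the GRPO data loop. SentinelEnv and sample_action are
# toy stand-ins for the repo's environment and orchestrator policy.
import json
import random

class SentinelEnv:
    """Toy stand-in: the real env yields observations and verifiable rewards."""
    def reset(self):
        return {"step": 0, "pending": ["S0", "S1"]}

    def step(self, action_json):
        action = json.loads(action_json)
        reward = 1.0 if action["action_type"] == "verify" else 0.0
        return {"step": 1, "pending": []}, reward, True, {}

def sample_action(obs):
    # The real policy is the Qwen orchestrator emitting a JSON action.
    return json.dumps({"action_type": "verify",
                       "specialist_id": random.choice(obs["pending"])})

env = SentinelEnv()
obs = env.reset()
done = False
while not done:
    action = sample_action(obs)                 # model samples a JSON action
    obs, reward, done, info = env.step(action)  # env scores it
# GRPO then scores groups of sampled actions by reward and updates the policy.
```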
## Data We Have

Abstract trust environment:

```text
task1: 40 scenarios x 10 subtasks = 400 nodes
task2: 40 scenarios x 15 subtasks = 600 nodes
task3: 40 scenarios x 20 subtasks = 800 nodes
total: 120 scenarios, 1,800 subtask nodes
```

GPU cluster environment:

```text
task1: 10 jobs, 8 GPUs, 30 steps
task2: 20 jobs, 12 GPUs, 60 steps
task3: 30 jobs, 16 GPUs, 120 steps
```

The cluster environment is procedural. Changing the seed creates new job
queues, hidden worker shuffles, attacks, and failure traces.
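A toy illustration of what "procedural" means here; `make_job_queue` and its
fields are made up for this sketch, not taken from the repo:

```python
# Seeded procedural generation: same seed -> same world, new seed -> new world.
import random

def make_job_queue(seed, n_jobs):
    rng = random.Random(seed)
    return [{"job_id": i,
             "gpus": rng.choice([1, 2, 4]),
             "malicious": rng.random() < 0.2}  # hidden attacker fraction
            for i in range(n_jobs)]

print(make_job_queue(seed=0, n_jobs=3))
print(make_job_queue(seed=1, n_jobs=3))  # different jobs, attacks, failures
```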
## SFT vs GRPO

Use SFT when you already have ideal demonstrations:

```text
prompt -> ideal JSON action
```

Use GRPO/RL when you can verify actions programmatically:

```text
prompt -> sampled JSON action -> environment reward
```

For SENTINEL, GRPO is the right headline because the reward is objective:
completion, detection, calibration, efficiency, and anti-hack signals. A small
SFT warmup can be added later by recording heuristic/oracle actions, but it is
not required for the first demo.
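A sketch of how such an objective reward could be composed; the component
names come from the list above, but the weights are placeholders invented for
illustration, not the values used in the repo:

```python
# Illustrative reward composition; weights are placeholders.
def total_reward(completion, detection, calibration, efficiency, hack_penalty):
    return (0.4 * completion      # subtasks finished correctly
            + 0.3 * detection     # poisoned/malicious workers caught
            + 0.1 * calibration   # stated confidence matches outcomes
            + 0.2 * efficiency    # fewer wasted steps and verifications
            - hack_penalty)       # anti-reward-hacking deduction

print(total_reward(1.0, 0.8, 0.9, 0.7, 0.0))
```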
## Colab Free T4 Flow

1. Open `training/colab_notebook.ipynb` in Google Colab.
2. Runtime -> Change runtime type -> T4 GPU.
3. Run cells 1-4 to install dependencies and log in to Hugging Face.
4. Run a smoke training with 50-100 episodes.
5. Run the full training with 200 episodes when the smoke run looks good.
6. Generate replay JSONL and charts.
7. Commit `outputs/charts/*.png` and `outputs/trained_policy_replay.jsonl`.
## Why Replay Exists

The live Hugging Face Space should stay cheap and deterministic. It should not
load Qwen or a LoRA adapter at runtime.

After Colab training, the notebook records the trained model's actions:

```json
{"task_type":"task3","seed":42,"step":7,"action":{"action_type":"verify","specialist_id":"S0"}}
```

The Space can replay those actions as a fourth policy called `GRPO`. If the
current seed is missing from the replay table, it falls back to the heuristic
and marks the row as a replay miss.
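A minimal sketch of that lookup, keyed the same way as the JSONL row shown
above; the function names are illustrative:

```python
# Replay lookup with heuristic fallback; keys follow the JSONL example above.
import json

def load_replay(path):
    table = {}
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            table[(row["task_type"], row["seed"], row["step"])] = row["action"]
    return table

def grpo_action(table, task_type, seed, step, heuristic):
    action = table.get((task_type, seed, step))
    if action is None:
        return heuristic(), True   # replay miss: fall back and flag the row
    return action, False
```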
## Commands

Pre-training baseline:

```bash
python training/evaluate.py --episodes 30 --task all \
  --out outputs/eval_pre.json --no-plot
```

Train:

```bash
python training/train.py \
  --episodes 200 --task all --seed 0 \
  --model unsloth/Qwen2.5-1.5B-Instruct \
  --epochs 1 --batch-size 2 --learning-rate 5e-6 \
  --lora-rank 16 --max-seq-length 1024 \
  --output-dir training/sentinel_qwen15_grpo
```

Record replay:

```python
from training.replay import record_trained_actions

record_trained_actions(
    adapter_path="training/sentinel_qwen15_grpo",
    base_model="unsloth/Qwen2.5-1.5B-Instruct",
    tasks=["task1", "task2", "task3"],
    seeds=range(30),
    out_path="outputs/trained_policy_replay.jsonl",
)
```

Post-training replay eval:

```bash
python training/evaluate.py --episodes 30 --task all \
  --policies random,heuristic,oracle_lite,trained \
  --replay outputs/trained_policy_replay.jsonl \
  --out outputs/eval_post.json --no-plot
```

Generate charts:

```bash
python -m training.plots \
  --pre outputs/eval_pre.json \
  --post outputs/eval_post.json \
  --trainer-state training/sentinel_qwen15_grpo/trainer_state.json \
  --reward-report-task3 outputs/reward_report_task3_seed42.json \
  --cluster-health outputs/cluster_health_history.json \
  --out-dir outputs/charts
```
## Hugging Face Token Usage

Use a Hugging Face token in Colab for:

- downloading gated/private models if needed,
- uploading the LoRA adapter to your namespace,
- pushing final chart/replay artifacts if you commit from Colab.

The Space itself does not need GPU to run the replay demo.
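A sketch of the upload flow with `huggingface_hub`; the `repo_id` namespace is
a placeholder, substitute your own:

```python
# Log in, then push the LoRA adapter; repo_id is a placeholder namespace.
from huggingface_hub import HfApi, login

login()  # paste the HF token when prompted, or pass token="hf_..."

api = HfApi()
api.create_repo(repo_id="your-namespace/sentinel-qwen15-grpo-lora",
                repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="training/sentinel_qwen15_grpo",
    repo_id="your-namespace/sentinel-qwen15-grpo-lora",
    repo_type="model",
)
```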
## Hugging Face App URLs

Use these two Hugging Face URLs for different jobs:

```text
https://huggingface.co/spaces/XcodeAddy/sentinel-env
```

This is the Space repository/settings page. Use it to inspect files, settings,
hardware, build logs, variables, secrets, and commits. It is not the iframe app
URL you demo to judges.

```text
https://xcodeaddy-sentinel-env.hf.space/
```

This is the real live app URL. Use this for the dashboard, API smoke tests, and
as the OpenEnv base URL.

When running locally, start uvicorn with `--host 0.0.0.0`, but open the browser
at `http://127.0.0.1:7860/` or `http://localhost:7860/`. Do not browse to
`http://0.0.0.0:7860/`; `0.0.0.0` is only a bind address.
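The programmatic equivalent, as a sketch; the `app:app` module path is an
assumption about the repo layout, adjust it to the actual entrypoint:

```python
# Same as `uvicorn app:app --host 0.0.0.0 --port 7860`;
# "app:app" is an assumed module path.
import uvicorn

if __name__ == "__main__":
    # Bind on all interfaces so the container can route traffic,
    # then browse to http://127.0.0.1:7860/ (not 0.0.0.0).
    uvicorn.run("app:app", host="0.0.0.0", port=7860)
```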
## Hugging Face Credits

Best use:

- keep the Space on CPU for normal judging,
- optionally upgrade the Space to T4 only during the final live demo if the UI
  needs extra responsiveness,
- avoid doing full training inside the Space,
- use Hugging Face Jobs or Colab for the actual GRPO run.

The Space is for serving the environment and replay demo. Training belongs in
Colab or in a Hugging Face GPU Job.

HF Jobs smoke path:

```bash
.venv/bin/python training/launch_hf_job.py \
  --mode import-smoke \
  --timeout 45m
.venv/bin/python training/launch_hf_job.py \
  --mode train-smoke \
  --episodes 50 \
  --timeout 2h
```

If `import-smoke` passes, run the full job:

```bash
.venv/bin/python training/launch_hf_job.py \
  --mode train-full \
  --episodes 200 \
  --timeout 4h
```

The launcher uses `pytorch/pytorch:2.11.0-cuda12.8-cudnn9-devel` because the
current Unsloth stack pulls `torchao`, which expects torch `>=2.11`.
## Success Criteria

Before the final demo, make sure these exist (the snippet after the list
checks them in one pass):

```text
outputs/trained_policy_replay.jsonl
outputs/charts/baseline_grouped_bars.png
outputs/charts/grpo_reward_curve.png
outputs/charts/trust_evolution.png
outputs/charts/detection_vs_poisoning.png
outputs/charts/cluster_health_timeline.png
outputs/charts/task_radar.png
outputs/charts/ablation.png
outputs/charts/baseline_delta_lines.png
outputs/charts/cluster_health_policy_lines.png
outputs/charts/trust_gap_over_time.png
outputs/charts/reward_component_stacked_area.png
outputs/charts/failure_fishbone_map.png
```
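A quick existence check over that list (paths copied from above):

```python
# One-pass existence check for the demo artifacts listed above.
from pathlib import Path

artifacts = [
    "outputs/trained_policy_replay.jsonl",
    *[f"outputs/charts/{name}.png" for name in [
        "baseline_grouped_bars", "grpo_reward_curve", "trust_evolution",
        "detection_vs_poisoning", "cluster_health_timeline", "task_radar",
        "ablation", "baseline_delta_lines", "cluster_health_policy_lines",
        "trust_gap_over_time", "reward_component_stacked_area",
        "failure_fishbone_map",
    ]],
]

missing = [p for p in artifacts if not Path(p).exists()]
print("all artifacts present" if not missing else f"missing: {missing}")
```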
Then verify:

```bash
python -m pytest -q
python training/evaluate.py --episodes 5 --task task3 \
  --policies random,heuristic,oracle_lite,trained \
  --replay outputs/trained_policy_replay.jsonl
```