# SENTINEL Training Runbook

This is the exact path for training SENTINEL during the hackathon without putting GPU work inside the Hugging Face Space runtime.

## Mental Model

SENTINEL is not trained from a normal static CSV of prompt-answer pairs. The loop is:

```text
reset() -> observation -> model emits JSON action -> step(action) -> reward -> GRPO update
```

The environment is the dataset generator and the reward engine is the teacher. The scripted specialists/workers are not trained. The first trained model is the orchestrator policy that chooses actions.

## Data We Have

Abstract trust environment:

```text
task1: 40 scenarios x 10 subtasks = 400 nodes
task2: 40 scenarios x 15 subtasks = 600 nodes
task3: 40 scenarios x 20 subtasks = 800 nodes
total: 120 scenarios, 1,800 subtask nodes
```

GPU cluster environment:

```text
task1: 10 jobs, 8 GPUs, 30 steps
task2: 20 jobs, 12 GPUs, 60 steps
task3: 30 jobs, 16 GPUs, 120 steps
```

The cluster environment is procedural. Changing the seed creates new job queues, hidden worker shuffles, attacks, and failure traces.

## SFT vs GRPO

Use SFT when you already have ideal demonstrations:

```text
prompt -> ideal JSON action
```

Use GRPO/RL when you can verify actions programmatically:

```text
prompt -> sampled JSON action -> environment reward
```

For SENTINEL, GRPO is the right headline because the reward is objective: completion, detection, calibration, efficiency, and anti-hack signals. A small SFT warmup can be added later by recording heuristic/oracle actions, but it is not required for the first demo.

## Colab Free T4 Flow

1. Open `training/colab_notebook.ipynb` in Google Colab.
2. Runtime -> Change runtime type -> T4 GPU.
3. Run cells 1-4 to install dependencies and log in to Hugging Face.
4. Run a smoke training with 50-100 episodes.
5. Run the full training with 200 episodes when the smoke run looks good.
6. Generate replay JSONL and charts.
7. Commit `outputs/charts/*.png` and `outputs/trained_policy_replay.jsonl`.

## Why Replay Exists

The live Hugging Face Space should stay cheap and deterministic. It should not load Qwen or a LoRA adapter at runtime. After Colab training, the notebook records the trained model's actions:

```json
{"task_type":"task3","seed":42,"step":7,"action":{"action_type":"verify","specialist_id":"S0"}}
```

The Space can replay those actions as a fourth policy called `GRPO`. If the current seed is missing from the replay table, it falls back to the heuristic and marks the row as a replay miss.
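To make the loop and the replay format concrete, here is a minimal sketch of an episode rollout that records replay rows. The `env` interface (`reset(seed)`, `step(action)`) and the `policy_action` callable are hypothetical stand-ins, not the repo's actual API:

```python
import json


def record_episode(env, policy_action, task_type, seed, out_file):
    """Roll out one episode and append replay rows in the JSONL format above.

    Assumptions (not the repo's real API): env.reset(seed) returns an
    observation, env.step(action) returns (observation, reward, done),
    and policy_action(obs) returns a JSON-serializable action dict such
    as {"action_type": "verify", "specialist_id": "S0"}.
    """
    obs = env.reset(seed=seed)
    step, done, total_reward = 0, False, 0.0
    while not done:
        action = policy_action(obs)
        out_file.write(json.dumps({
            "task_type": task_type,
            "seed": seed,
            "step": step,
            "action": action,
        }) + "\n")
        obs, reward, done = env.step(action)  # reward is the GRPO training signal
        total_reward += reward
        step += 1
    return total_reward
```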
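On the Space side, the replay lookup with heuristic fallback described above can be as small as a dict keyed by `(task_type, seed, step)`. A sketch with illustrative names; only the miss-then-fallback behavior comes from this runbook:

```python
import json


def load_replay_table(path):
    """Index replay rows by (task_type, seed, step) for O(1) lookup."""
    table = {}
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            table[(row["task_type"], row["seed"], row["step"])] = row["action"]
    return table


def grpo_action(table, task_type, seed, step, heuristic_action, obs):
    """Return the recorded GRPO action, or fall back to the scripted heuristic.

    heuristic_action is a hypothetical stand-in for the repo's heuristic
    policy; the returned boolean marks the row as a replay miss.
    """
    action = table.get((task_type, seed, step))
    if action is not None:
        return action, False            # replay hit
    return heuristic_action(obs), True  # replay miss -> heuristic fallback
```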
## Commands

Pre-training baseline:

```bash
python training/evaluate.py --episodes 30 --task all \
  --out outputs/eval_pre.json --no-plot
```

Train:

```bash
python training/train.py \
  --episodes 200 --task all --seed 0 \
  --model unsloth/Qwen2.5-1.5B-Instruct \
  --epochs 1 --batch-size 2 --learning-rate 5e-6 \
  --lora-rank 16 --max-seq-length 1024 \
  --output-dir training/sentinel_qwen15_grpo
```

Record replay:

```python
from training.replay import record_trained_actions

record_trained_actions(
    adapter_path="training/sentinel_qwen15_grpo",
    base_model="unsloth/Qwen2.5-1.5B-Instruct",
    tasks=["task1", "task2", "task3"],
    seeds=range(30),
    out_path="outputs/trained_policy_replay.jsonl",
)
```

Post-training replay eval:

```bash
python training/evaluate.py --episodes 30 --task all \
  --policies random,heuristic,oracle_lite,trained \
  --replay outputs/trained_policy_replay.jsonl \
  --out outputs/eval_post.json --no-plot
```

Generate charts:

```bash
python -m training.plots \
  --pre outputs/eval_pre.json \
  --post outputs/eval_post.json \
  --trainer-state training/sentinel_qwen15_grpo/trainer_state.json \
  --reward-report-task3 outputs/reward_report_task3_seed42.json \
  --cluster-health outputs/cluster_health_history.json \
  --out-dir outputs/charts
```

## Hugging Face Token Usage

Use a Hugging Face token in Colab for:

- downloading gated/private models if needed,
- uploading the LoRA adapter to your namespace,
- pushing final chart/replay artifacts if you commit from Colab.

The Space itself does not need a GPU to run the replay demo.

## Hugging Face App URLs

Use these two Hugging Face URLs for different purposes:

```text
https://huggingface.co/spaces/XcodeAddy/sentinel-env
```

This is the Space repository/settings page. Use it to inspect files, Settings, hardware, build logs, variables, secrets, and commits. It is not the iframe app URL you demo to judges.

```text
https://xcodeaddy-sentinel-env.hf.space/
```

This is the real live app URL. Use it for the dashboard, API smoke tests, and the OpenEnv base URL.

When running locally, start uvicorn with `--host 0.0.0.0`, but open the browser at `http://127.0.0.1:7860/` or `http://localhost:7860/`. Do not browse to `http://0.0.0.0:7860/`; `0.0.0.0` is only a bind address.

## Hugging Face Credits

Best use:

- keep the Space on CPU for normal judging,
- optionally upgrade the Space to T4 only during the final live demo if the UI needs extra responsiveness,
- avoid doing full training inside the Space,
- use Hugging Face Jobs or Colab for the actual GRPO run.

The Space is for serving the environment and the replay demo. Training belongs in Colab or in a Hugging Face GPU Job.

HF Jobs smoke path:

```bash
.venv/bin/python training/launch_hf_job.py \
  --mode import-smoke \
  --timeout 45m

.venv/bin/python training/launch_hf_job.py \
  --mode train-smoke \
  --episodes 50 \
  --timeout 2h
```

If `import-smoke` passes, run the full job:

```bash
.venv/bin/python training/launch_hf_job.py \
  --mode train-full \
  --episodes 200 \
  --timeout 4h
```

The launcher uses `pytorch/pytorch:2.11.0-cuda12.8-cudnn9-devel` because the current Unsloth stack pulls `torchao`, which expects torch `>=2.11`.
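Before moving on to the success criteria, it is worth diffing the pre and post eval files to confirm the GRPO run actually moved the numbers. This sketch assumes each eval JSON maps policy names to a `mean_reward` under a `policies` key; adjust the keys to whatever `training/evaluate.py` actually writes:

```python
import json


def mean_rewards(path):
    """ASSUMED schema: {"policies": {"heuristic": {"mean_reward": 0.42}, ...}}.

    Adjust the keys to match the real evaluate.py output.
    """
    with open(path) as f:
        data = json.load(f)
    return {name: stats["mean_reward"] for name, stats in data["policies"].items()}


pre = mean_rewards("outputs/eval_pre.json")
post = mean_rewards("outputs/eval_post.json")
for name, post_reward in sorted(post.items()):
    if name in pre:
        delta = post_reward - pre[name]
        print(f"{name:12s} pre={pre[name]:.3f} post={post_reward:.3f} delta={delta:+.3f}")
    else:
        print(f"{name:12s} post={post_reward:.3f} (no pre-training baseline)")
```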
## Success Criteria

Before the final demo, make sure these exist:

```text
outputs/trained_policy_replay.jsonl
outputs/charts/baseline_grouped_bars.png
outputs/charts/grpo_reward_curve.png
outputs/charts/trust_evolution.png
outputs/charts/detection_vs_poisoning.png
outputs/charts/cluster_health_timeline.png
outputs/charts/task_radar.png
outputs/charts/ablation.png
outputs/charts/baseline_delta_lines.png
outputs/charts/cluster_health_policy_lines.png
outputs/charts/trust_gap_over_time.png
outputs/charts/reward_component_stacked_area.png
outputs/charts/failure_fishbone_map.png
```

Then verify:

```bash
python -m pytest -q

python training/evaluate.py --episodes 5 --task task3 \
  --policies random,heuristic,oracle_lite,trained \
  --replay outputs/trained_policy_replay.jsonl
```
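A stdlib-only pre-demo check built directly from the artifact list above:

```python
from pathlib import Path

# Chart names copied from the Success Criteria list.
CHART_NAMES = [
    "baseline_grouped_bars", "grpo_reward_curve", "trust_evolution",
    "detection_vs_poisoning", "cluster_health_timeline", "task_radar",
    "ablation", "baseline_delta_lines", "cluster_health_policy_lines",
    "trust_gap_over_time", "reward_component_stacked_area",
    "failure_fishbone_map",
]
REQUIRED = ["outputs/trained_policy_replay.jsonl"] + [
    f"outputs/charts/{name}.png" for name in CHART_NAMES
]

missing = [p for p in REQUIRED if not Path(p).exists()]
if missing:
    print("Missing demo artifacts:")
    for p in missing:
        print(" -", p)
    raise SystemExit(1)
print("All demo artifacts present.")
```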