# SENTINEL Training Runbook
This runbook is the exact procedure for training SENTINEL during the hackathon
without putting GPU work inside the Hugging Face Space runtime.
## Mental Model
SENTINEL is not trained from a normal static CSV of prompt-answer pairs.
The loop is:
```text
reset() observation -> model emits JSON action -> step(action) -> reward -> GRPO update
```
The environment is the dataset generator and the reward engine is the teacher.
The scripted specialists/workers are not trained. The first trained model is the
orchestrator policy that chooses actions.
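In code, one episode of that loop looks roughly like the sketch below.
`ToySentinelEnv` and `toy_policy` are illustrative stand-ins, not the repo's
actual classes:
```python
import json

class ToySentinelEnv:
    """Toy stand-in for the real environment (illustrative, not the repo's
    API): reset() returns an observation, step() consumes a parsed action."""

    def reset(self):
        self.t = 0
        return {"step": self.t, "pending_specialists": ["S0", "S1"]}

    def step(self, action):
        self.t += 1
        reward = 1.0 if action.get("action_type") == "verify" else 0.0
        done = self.t >= 3
        return {"step": self.t}, reward, done, {}

def toy_policy(obs):
    # Stand-in for the LLM orchestrator: emits a JSON action string.
    return json.dumps({"action_type": "verify", "specialist_id": "S0"})

env = ToySentinelEnv()
obs, done = env.reset(), False
while not done:
    action = json.loads(toy_policy(obs))       # parse the emitted JSON action
    obs, reward, done, info = env.step(action)
    # GRPO scores `reward` against the group of sampled actions and updates
    # the orchestrator policy; the environment itself is never trained.
```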
## Data We Have
Abstract trust environment:
```text
task1: 40 scenarios x 10 subtasks = 400 nodes
task2: 40 scenarios x 15 subtasks = 600 nodes
task3: 40 scenarios x 20 subtasks = 800 nodes
total: 120 scenarios, 1,800 subtask nodes
```
GPU cluster environment:
```text
task1: 10 jobs, 8 GPUs, 30 steps
task2: 20 jobs, 12 GPUs, 60 steps
task3: 30 jobs, 16 GPUs, 120 steps
```
The cluster environment is procedural. Changing the seed creates new job
queues, hidden worker shuffles, attacks, and failure traces.
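Because generation is seed-driven, the same seed reproduces an episode exactly
and a new seed yields fresh content. A minimal sketch of the idea
(`generate_jobs` and its fields are illustrative, not the repo's generator):
```python
import random

def generate_jobs(seed, n_jobs=10):
    """Toy procedural generator: a new seed yields a new job queue.
    Field names are illustrative, not the repo's actual schema."""
    rng = random.Random(seed)
    return [
        {"job_id": i,
         "gpus_needed": rng.choice([1, 2, 4]),
         "duration": rng.randint(2, 8)}
        for i in range(n_jobs)
    ]

assert generate_jobs(0) == generate_jobs(0)   # same seed, same queue
assert generate_jobs(0) != generate_jobs(1)   # new seed, new queue
```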
## SFT vs GRPO
Use SFT when you already have ideal demonstrations:
```text
prompt -> ideal JSON action
```
Use GRPO/RL when you can verify actions programmatically:
```text
prompt -> sampled JSON action -> environment reward
```
For SENTINEL, GRPO is the right headline because the reward is objective:
completion, detection, calibration, efficiency, and anti-hack signals. A small
SFT warmup can be added later by recording heuristic/oracle actions, but it is
not required for the first demo.
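A hedged sketch of how such a verifiable reward can be combined into one
scalar; the component names mirror the list above, but the weights are
invented for illustration:
```python
def combine_reward(completion, detection, calibration, efficiency, anti_hack,
                   weights=(0.4, 0.25, 0.15, 0.1, 0.1)):
    """Weighted sum of the verifiable reward components.
    The weights here are illustrative, not the repo's tuned values."""
    parts = (completion, detection, calibration, efficiency, anti_hack)
    return sum(w * p for w, p in zip(weights, parts))

# e.g. a run that completes everything but misses one attack:
print(combine_reward(1.0, 0.5, 0.8, 0.9, 1.0))  # ~0.835
```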
## Colab Free T4 Flow
1. Open `training/colab_notebook.ipynb` in Google Colab.
2. Runtime -> Change runtime type -> T4 GPU.
3. Run cells 1-4 to install dependencies and log in to Hugging Face.
4. Run a smoke training with 50-100 episodes.
5. Run the full training with 200 episodes when the smoke run looks good.
6. Generate replay JSONL and charts.
7. Commit `outputs/charts/*.png` and `outputs/trained_policy_replay.jsonl`.
## Why Replay Exists
The live Hugging Face Space should stay cheap and deterministic. It should not
load Qwen or a LoRA adapter at runtime.
After Colab training, the notebook records the trained model's actions:
```json
{"task_type":"task3","seed":42,"step":7,"action":{"action_type":"verify","specialist_id":"S0"}}
```
The Space can replay those actions as a fourth policy called `GRPO`. If the
current seed is missing from the replay table, it falls back to the heuristic
and marks the row as a replay miss.
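A minimal sketch of that lookup-with-fallback, assuming the JSONL schema shown
above (`load_replay` and `grpo_action` are illustrative names):
```python
import json

def load_replay(path):
    """Index replay records by (task_type, seed, step)."""
    table = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            table[(rec["task_type"], rec["seed"], rec["step"])] = rec["action"]
    return table

def grpo_action(table, task_type, seed, step, heuristic_action):
    """Return the recorded action, or fall back and flag a replay miss."""
    action = table.get((task_type, seed, step))
    if action is None:
        return heuristic_action, True   # replay miss: mark the row
    return action, False
```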
## Commands
Pre-training baseline:
```bash
python training/evaluate.py --episodes 30 --task all \
--out outputs/eval_pre.json --no-plot
```
Train:
```bash
python training/train.py \
--episodes 200 --task all --seed 0 \
--model unsloth/Qwen2.5-1.5B-Instruct \
--epochs 1 --batch-size 2 --learning-rate 5e-6 \
--lora-rank 16 --max-seq-length 1024 \
--output-dir training/sentinel_qwen15_grpo
```
Record replay:
```python
from training.replay import record_trained_actions

record_trained_actions(
    adapter_path="training/sentinel_qwen15_grpo",
    base_model="unsloth/Qwen2.5-1.5B-Instruct",
    tasks=["task1", "task2", "task3"],
    seeds=range(30),
    out_path="outputs/trained_policy_replay.jsonl",
)
```
Post-training replay eval:
```bash
python training/evaluate.py --episodes 30 --task all \
--policies random,heuristic,oracle_lite,trained \
--replay outputs/trained_policy_replay.jsonl \
--out outputs/eval_post.json --no-plot
```
Generate charts:
```bash
python -m training.plots \
--pre outputs/eval_pre.json \
--post outputs/eval_post.json \
--trainer-state training/sentinel_qwen15_grpo/trainer_state.json \
--reward-report-task3 outputs/reward_report_task3_seed42.json \
--cluster-health outputs/cluster_health_history.json \
--out-dir outputs/charts
```
## Hugging Face Token Usage
Use a Hugging Face token in Colab for:
- downloading gated/private models if needed,
- uploading the LoRA adapter to your namespace,
- pushing final chart/replay artifacts if you commit from Colab.
The Space itself does not need GPU to run the replay demo.
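In a Colab cell, login and adapter upload are typically a couple of
`huggingface_hub` calls; the `repo_id` below is an example, not a fixed name:
```python
import os
from huggingface_hub import create_repo, login, upload_folder

# Token comes from a Colab secret or environment variable; never hard-code it.
login(token=os.environ["HF_TOKEN"])

# Push the trained LoRA adapter to your namespace.
# The repo_id is an example, not a fixed name from this project.
repo_id = "XcodeAddy/sentinel-qwen15-grpo"
create_repo(repo_id, repo_type="model", exist_ok=True)
upload_folder(
    folder_path="training/sentinel_qwen15_grpo",
    repo_id=repo_id,
    repo_type="model",
)
```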
## Hugging Face App URLs
These two Hugging Face URLs serve different purposes:
```text
https://huggingface.co/spaces/XcodeAddy/sentinel-env
```
This is the Space repository/settings page. Use it to inspect files, hardware,
build logs, variables, secrets, and commits. It is not the iframe app URL you
demo to judges.
```text
https://xcodeaddy-sentinel-env.hf.space/
```
This is the live app URL. Use it for the dashboard, for API smoke tests, and
as the OpenEnv base URL.
When running locally, start uvicorn with `--host 0.0.0.0`, but open the browser
at `http://127.0.0.1:7860/` or `http://localhost:7860/`. Do not browse to
`http://0.0.0.0:7860/`; `0.0.0.0` is only a bind address.
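For reference, the same local server started from Python instead of the CLI;
`"app:app"` is an assumed import string, so adjust it to the repo's actual
module:
```python
import uvicorn

# Bind on all interfaces so containers/Spaces can route traffic, then open
# http://127.0.0.1:7860/ in the browser (0.0.0.0 is only a bind address).
# "app:app" is an assumed import string, not confirmed by this repo.
uvicorn.run("app:app", host="0.0.0.0", port=7860)
```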
## Hugging Face Credits
Best use:
- keep the Space on CPU for normal judging,
- optionally upgrade the Space to T4 only during the final live demo if the UI
needs extra responsiveness,
- avoid doing full training inside the Space,
- use Hugging Face Jobs or Colab for the actual GRPO run.
The Space is for serving the environment and replay demo. Training belongs in
Colab or in a Hugging Face GPU Job.
HF Jobs smoke path:
```bash
.venv/bin/python training/launch_hf_job.py \
--mode import-smoke \
--timeout 45m
.venv/bin/python training/launch_hf_job.py \
--mode train-smoke \
--episodes 50 \
--timeout 2h
```
If `import-smoke` passes, run the full job:
```bash
.venv/bin/python training/launch_hf_job.py \
--mode train-full \
--episodes 200 \
--timeout 4h
```
The launcher uses `pytorch/pytorch:2.11.0-cuda12.8-cudnn9-devel` because the
current Unsloth stack pulls `torchao`, which expects torch `>=2.11`.
## Success Criteria
Before the final demo, make sure these exist:
```text
outputs/trained_policy_replay.jsonl
outputs/charts/baseline_grouped_bars.png
outputs/charts/grpo_reward_curve.png
outputs/charts/trust_evolution.png
outputs/charts/detection_vs_poisoning.png
outputs/charts/cluster_health_timeline.png
outputs/charts/task_radar.png
outputs/charts/ablation.png
outputs/charts/baseline_delta_lines.png
outputs/charts/cluster_health_policy_lines.png
outputs/charts/trust_gap_over_time.png
outputs/charts/reward_component_stacked_area.png
outputs/charts/failure_fishbone_map.png
```
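A quick preflight sketch that fails loudly if any artifact is missing (the
list mirrors the block above):
```python
from pathlib import Path

REQUIRED = [
    "outputs/trained_policy_replay.jsonl",
    *[f"outputs/charts/{name}.png" for name in (
        "baseline_grouped_bars", "grpo_reward_curve", "trust_evolution",
        "detection_vs_poisoning", "cluster_health_timeline", "task_radar",
        "ablation", "baseline_delta_lines", "cluster_health_policy_lines",
        "trust_gap_over_time", "reward_component_stacked_area",
        "failure_fishbone_map",
    )],
]

missing = [p for p in REQUIRED if not Path(p).exists()]
assert not missing, f"missing demo artifacts: {missing}"
```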
Then verify:
```bash
python -m pytest -q
python training/evaluate.py --episodes 5 --task task3 \
--policies random,heuristic,oracle_lite,trained \
--replay outputs/trained_policy_replay.jsonl
```