Prepare SENTINEL onsite deployment proof
Changed files:

- .gitignore (+4 -1)
- README.md (+50 -3)
- openenv.yaml (+5 -4)
- outputs/baseline_comparison.png (added, binary)
- outputs/baseline_scores.json (+531 -0)
- outputs/evaluation_results.json (added)
- training/evaluate.py (+169 -5)
- training/train.py (+122 -15)
.gitignore CHANGED

@@ -5,7 +5,10 @@ __pycache__/
 .mypy_cache/
 .ruff_cache/
 .venv/
-outputs/
+outputs/*
+!outputs/baseline_comparison.png
+!outputs/baseline_scores.json
+!outputs/evaluation_results.json
 .env
 .env.*
 !.env.example
README.md CHANGED

@@ -24,6 +24,12 @@ SENTINEL turns that failure mode into a trainable environment. The model only se
 - Rewards: per-step reward plus terminal score, normalized to `0.0-1.0`
 - Dataset: 120 abstract multi-agent scenarios
 
+## Live Submission Targets
+
+- GitHub: `https://github.com/ADITYAGABA1322/sentinel-env`
+- Hugging Face Space: `https://xcodeaddy-sentinel-env.hf.space`
+- OpenEnv base URL: `https://xcodeaddy-sentinel-env.hf.space`
+
 ## Specialist Behaviors
 
 | Public Slot | Hidden Behavior |
@@ -133,10 +139,11 @@ pip install pytest
 Run checks:
 
 ```bash
-python -m py_compile app.py environment.py models.py graders.py specialists.py trust_ledger.py task_graph.py scenarios.py inference.py
+python -m py_compile app.py server/app.py environment.py models.py graders.py specialists.py trust_ledger.py task_graph.py scenarios.py inference.py comms_bus.py training/evaluate.py training/train.py
 python -m pytest -q
 python inference.py
-python training/evaluate.py --episodes 20 --task
+python training/evaluate.py --episodes 20 --task all --plot outputs/baseline_comparison.png
+python training/train.py --dry-run --episodes 5
 ```
 
 Run the server:
@@ -175,7 +182,47 @@ docker run -p 7860:7860 sentinel-env
 - `heuristic`
 - `oracle_lite`
 
-The evaluator writes `outputs/evaluation_results.json`
+The evaluator writes `outputs/evaluation_results.json` and `outputs/baseline_comparison.png`.
+
+![Baseline comparison](outputs/baseline_comparison.png)
+
+Latest local comparison, 20 episodes per task and policy:
+
+| Policy | Overall | Task 1 | Task 2 | Task 3 |
+| --- | ---: | ---: | ---: | ---: |
+| Random | 0.7144 | 0.7948 | 0.6493 | 0.6990 |
+| Heuristic trust-weighted | 0.8162 | 0.8911 | 0.7736 | 0.7838 |
+| Oracle-lite upper bound | 0.8718 | 0.9445 | 0.7760 | 0.8950 |
+
+The demo story is the score gap: the reward function distinguishes blind delegation from trust-aware routing, and the oracle-lite upper bound shows room for onsite RL training.
+
+## Hugging Face Deployment
+
+```bash
+huggingface-cli login
+huggingface-cli repo create sentinel-env --type space --space-sdk docker --private false
+git remote add hf https://huggingface.co/spaces/XcodeAddy/sentinel-env
+git push hf main
+```
+
+After the Space builds:
+
+```bash
+curl https://xcodeaddy-sentinel-env.hf.space/health
+curl https://xcodeaddy-sentinel-env.hf.space/
+curl -X POST https://xcodeaddy-sentinel-env.hf.space/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_type":"task3","seed":42}'
+openenv validate . --json
+```
+
+## Mini-Blog Draft
+
+Title: `SENTINEL: Training AI to Trust Wisely in Multi-Agent Systems`
+
+SENTINEL is an OpenEnv RL environment for one failure mode: multi-agent systems delegate blindly. One orchestrator must complete long tasks by routing work across five specialist agents whose reliability profiles are hidden and reshuffled every episode. The orchestrator only sees behavior, confidence, stakes, and history, so it must learn skepticism, verification, recovery, and calibrated trust.
+
+The specialists are deterministic FSMs on purpose: they give stable reward signals while the orchestrator remains the trainable target. Random routing scores `0.7144`, trust-weighted routing scores `0.8162`, and oracle-lite reaches `0.8718`, showing the environment has a meaningful learning signal before onsite GRPO training.
 
 ## Hackathon Alignment
 
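The curl smoke test in the README can also be scripted end to end. A minimal stdlib-only sketch, assuming the Space's `/reset` response mirrors the `{"observation": ...}` payload that the local `SentinelEnv.reset()` returns in `training/evaluate.py`:

```python
import json
import urllib.request

BASE = "https://xcodeaddy-sentinel-env.hf.space"

# Liveness check, equivalent to `curl .../health`.
with urllib.request.urlopen(f"{BASE}/health") as resp:
    print(resp.status, resp.read().decode())

# Start a deterministic episode, same body as the curl example above.
req = urllib.request.Request(
    f"{BASE}/reset",
    data=json.dumps({"task_type": "task3", "seed": 42}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    payload = json.loads(resp.read())

# Assumption: the HTTP payload carries the same "observation" dict that
# SentinelEnv.reset() returns locally; print the top-level keys to inspect.
print(sorted(payload))
```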
openenv.yaml CHANGED

@@ -23,7 +23,7 @@ description: >
   transferable skill, not memorized identities.
 
 api:
-  base_url:
+  base_url: https://xcodeaddy-sentinel-env.hf.space
   endpoints:
     health:
       method: GET
@@ -140,9 +140,10 @@ baseline:
   script: inference.py
   required_env_vars: [API_BASE_URL, MODEL_NAME, HF_TOKEN]
   optional_env_vars: [ENV_URL]
-  latest_local_score: 0.
-  latest_local_episodes:
+  latest_local_score: 0.8162
+  latest_local_episodes: 60
+  comparison_artifact: outputs/baseline_comparison.png
 reproducibility:
   inference_temperature: 0.0
   agent: heuristic-trust-weighted
-  dataset_order: fixed SCN-TASK*-001 through SCN-TASK*-
+  dataset_order: fixed SCN-TASK*-001 through SCN-TASK*-020 per task
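As a quick consistency check, the updated metadata can be read back with PyYAML. A sketch, assuming PyYAML is installed and the `latest_local_*` keys sit under the `baseline:` block named in the hunk header:

```python
import yaml  # PyYAML; not a project dependency, assumed available for this check

with open("openenv.yaml") as fh:
    meta = yaml.safe_load(fh)

print(meta["api"]["base_url"])                    # https://xcodeaddy-sentinel-env.hf.space
print(meta["baseline"]["latest_local_score"])     # 0.8162, the heuristic overall average
print(meta["baseline"]["latest_local_episodes"])  # 60 (20 episodes x 3 tasks)
```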
outputs/baseline_comparison.png ADDED (binary image; no text diff)
outputs/baseline_scores.json ADDED

@@ -0,0 +1,531 @@
+{
+  "model": "heuristic-baseline",
+  "total_episodes": 30,
+  "avg_score": 0.7942,
+  "by_task": {
+    "task1": {
+      "episodes": 10,
+      "avg_score": 0.8706
+    },
+    "task2": {
+      "episodes": 10,
+      "avg_score": 0.7475
+    },
+    "task3": {
+      "episodes": 10,
+      "avg_score": 0.7646
+    }
+  },
+  "episodes": [
+    {
+      "scenario_id": "SCN-TASK1-001",
+      "task_type": "task1",
+      "steps": 13,
+      "score": 0.765,
+      "total_reward": 10.71,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.473,
+        "S1": 0.743,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-002",
+      "task_type": "task1",
+      "steps": 12,
+      "score": 0.7962,
+      "total_reward": 10.35,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.473,
+        "S1": 0.888,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-003",
+      "task_type": "task1",
+      "steps": 11,
+      "score": 0.885,
+      "total_reward": 10.62,
+      "completion_rate": 0.9,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.296,
+        "S1": 0.296,
+        "S2": 0.94,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-004",
+      "task_type": "task1",
+      "steps": 8,
+      "score": 0.99,
+      "total_reward": 8.91,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.931,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-005",
+      "task_type": "task1",
+      "steps": 11,
+      "score": 0.9375,
+      "total_reward": 11.25,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.86,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-006",
+      "task_type": "task1",
+      "steps": 8,
+      "score": 0.85,
+      "total_reward": 7.65,
+      "completion_rate": 0.6,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.71,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-007",
+      "task_type": "task1",
+      "steps": 10,
+      "score": 0.99,
+      "total_reward": 10.89,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.943,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-008",
+      "task_type": "task1",
+      "steps": 11,
+      "score": 0.8325,
+      "total_reward": 9.99,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.482,
+        "S1": 0.9,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-009",
+      "task_type": "task1",
+      "steps": 9,
+      "score": 0.864,
+      "total_reward": 8.64,
+      "completion_rate": 0.7,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.492,
+        "S1": 0.801,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-010",
+      "task_type": "task1",
+      "steps": 12,
+      "score": 0.7962,
+      "total_reward": 10.35,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.494,
+        "S1": 0.885,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-001",
+      "task_type": "task2",
+      "steps": 19,
+      "score": 0.6054,
+      "total_reward": 12.1087,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.476,
+        "S1": 0.26,
+        "S2": 0.717,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-002",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.7762,
+      "total_reward": 13.9711,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.478,
+        "S1": 0.958,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-003",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.7377,
+      "total_reward": 13.2781,
+      "completion_rate": 0.867,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.289,
+        "S1": 0.289,
+        "S2": 0.818,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-004",
+      "task_type": "task2",
+      "steps": 15,
+      "score": 0.7783,
+      "total_reward": 12.4521,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.9,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-005",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.8174,
+      "total_reward": 14.7129,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.849,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-006",
+      "task_type": "task2",
+      "steps": 15,
+      "score": 0.6476,
+      "total_reward": 10.3617,
+      "completion_rate": 0.733,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.708,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-007",
+      "task_type": "task2",
+      "steps": 15,
+      "score": 0.8967,
+      "total_reward": 14.3478,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.967,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-008",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.7442,
+      "total_reward": 13.3953,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.49,
+        "S1": 0.959,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-009",
+      "task_type": "task2",
+      "steps": 16,
+      "score": 0.7525,
+      "total_reward": 12.792,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.492,
+        "S1": 0.906,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-010",
+      "task_type": "task2",
+      "steps": 18,
+      "score": 0.7191,
+      "total_reward": 13.6622,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.474,
+        "S1": 0.955,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-001",
+      "task_type": "task3",
+      "steps": 25,
+      "score": 0.7354,
+      "total_reward": 19.1204,
+      "completion_rate": 0.85,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.456,
+        "S1": 0.258,
+        "S2": 0.76,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-002",
+      "task_type": "task3",
+      "steps": 25,
+      "score": 0.7054,
+      "total_reward": 18.341,
+      "completion_rate": 0.85,
+      "adversarial_detections": 3,
+      "adversarial_poisonings": 5,
+      "final_trust": {
+        "S0": 0.458,
+        "S1": 0.473,
+        "S2": 0.868,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-003",
+      "task_type": "task3",
+      "steps": 19,
+      "score": 0.6438,
+      "total_reward": 12.8767,
+      "completion_rate": 0.6,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 5,
+      "final_trust": {
+        "S0": 0.299,
+        "S1": 0.299,
+        "S2": 0.633,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-004",
+      "task_type": "task3",
+      "steps": 21,
+      "score": 0.8954,
+      "total_reward": 19.6992,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.93,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-005",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.7134,
+      "total_reward": 17.8339,
+      "completion_rate": 0.85,
+      "adversarial_detections": 3,
+      "adversarial_poisonings": 6,
+      "final_trust": {
+        "S0": 0.491,
+        "S1": 0.797,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-006",
+      "task_type": "task3",
+      "steps": 23,
+      "score": 0.7857,
+      "total_reward": 18.8578,
+      "completion_rate": 0.9,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.774,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-007",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.7045,
+      "total_reward": 17.6133,
+      "completion_rate": 0.85,
+      "adversarial_detections": 3,
+      "adversarial_poisonings": 7,
+      "final_trust": {
+        "S0": 0.498,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-008",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.8057,
+      "total_reward": 20.1435,
+      "completion_rate": 0.95,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.479,
+        "S1": 0.856,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-009",
+      "task_type": "task3",
+      "steps": 23,
+      "score": 0.8456,
+      "total_reward": 20.2932,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.488,
+        "S1": 0.891,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-010",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.8106,
+      "total_reward": 20.2645,
+      "completion_rate": 0.95,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.473,
+        "S1": 0.91,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    }
+  ]
+}
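The summary fields in this artifact are derivable from the episode rows, so the file can be sanity-checked without rerunning the evaluator. A short sketch, assuming it is read from the repo root:

```python
import json
from collections import defaultdict
from pathlib import Path

data = json.loads(Path("outputs/baseline_scores.json").read_text())

# Recompute the overall average from the 30 per-episode scores.
scores = [ep["score"] for ep in data["episodes"]]
assert round(sum(scores) / len(scores), 4) == data["avg_score"]  # 0.7942

# Recompute the per-task averages reported under "by_task".
by_task = defaultdict(list)
for ep in data["episodes"]:
    by_task[ep["task_type"]].append(ep["score"])
for task, vals in sorted(by_task.items()):
    assert round(sum(vals) / len(vals), 4) == data["by_task"][task]["avg_score"]
```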
outputs/evaluation_results.json ADDED

The diff for this file is too large to render. See raw diff.
training/evaluate.py CHANGED

@@ -3,7 +3,9 @@ from __future__ import annotations
 import argparse
 import json
 import random
+import struct
 import sys
+import zlib
 from pathlib import Path
 from typing import Callable
 
@@ -16,6 +18,8 @@ from environment import SentinelEnv, _GROUND_TRUTH_RELIABILITY
 
 Policy = Callable[[SentinelEnv, dict, random.Random], dict]
 
+POLICIES: dict[str, Policy] = {}
+
 
 def random_policy(env: SentinelEnv, obs: dict, rng: random.Random) -> dict:
     specialist = rng.choice(obs["available_specialists"])
@@ -117,11 +121,162 @@ def _avg(rows: list[dict], key: str) -> float:
     return round(sum(float(row.get(key, 0.0)) for row in rows) / max(1, len(rows)), 4)
 
 
+def summarize_by_task(rows: list[dict]) -> dict:
+    grouped: dict[str, list[dict]] = {}
+    for row in rows:
+        grouped.setdefault(row["task_type"], []).append(row)
+    return {task: summarize(task_rows) for task, task_rows in sorted(grouped.items())}
+
+
+FONT_5X7 = {
+    " ": ["00000", "00000", "00000", "00000", "00000", "00000", "00000"],
+    "-": ["00000", "00000", "00000", "11111", "00000", "00000", "00000"],
+    ".": ["00000", "00000", "00000", "00000", "00000", "01100", "01100"],
+    ":": ["00000", "01100", "01100", "00000", "01100", "01100", "00000"],
+    "0": ["01110", "10001", "10011", "10101", "11001", "10001", "01110"],
+    "1": ["00100", "01100", "00100", "00100", "00100", "00100", "01110"],
+    "2": ["01110", "10001", "00001", "00010", "00100", "01000", "11111"],
+    "3": ["11110", "00001", "00001", "01110", "00001", "00001", "11110"],
+    "4": ["00010", "00110", "01010", "10010", "11111", "00010", "00010"],
+    "5": ["11111", "10000", "10000", "11110", "00001", "00001", "11110"],
+    "6": ["01110", "10000", "10000", "11110", "10001", "10001", "01110"],
+    "7": ["11111", "00001", "00010", "00100", "01000", "01000", "01000"],
+    "8": ["01110", "10001", "10001", "01110", "10001", "10001", "01110"],
+    "9": ["01110", "10001", "10001", "01111", "00001", "00001", "01110"],
+    "A": ["01110", "10001", "10001", "11111", "10001", "10001", "10001"],
+    "B": ["11110", "10001", "10001", "11110", "10001", "10001", "11110"],
+    "C": ["01110", "10001", "10000", "10000", "10000", "10001", "01110"],
+    "D": ["11110", "10001", "10001", "10001", "10001", "10001", "11110"],
+    "E": ["11111", "10000", "10000", "11110", "10000", "10000", "11111"],
+    "F": ["11111", "10000", "10000", "11110", "10000", "10000", "10000"],
+    "G": ["01110", "10001", "10000", "10111", "10001", "10001", "01110"],
+    "H": ["10001", "10001", "10001", "11111", "10001", "10001", "10001"],
+    "I": ["01110", "00100", "00100", "00100", "00100", "00100", "01110"],
+    "J": ["00001", "00001", "00001", "00001", "10001", "10001", "01110"],
+    "K": ["10001", "10010", "10100", "11000", "10100", "10010", "10001"],
+    "L": ["10000", "10000", "10000", "10000", "10000", "10000", "11111"],
+    "M": ["10001", "11011", "10101", "10101", "10001", "10001", "10001"],
+    "N": ["10001", "11001", "10101", "10011", "10001", "10001", "10001"],
+    "O": ["01110", "10001", "10001", "10001", "10001", "10001", "01110"],
+    "P": ["11110", "10001", "10001", "11110", "10000", "10000", "10000"],
+    "Q": ["01110", "10001", "10001", "10001", "10101", "10010", "01101"],
+    "R": ["11110", "10001", "10001", "11110", "10100", "10010", "10001"],
+    "S": ["01111", "10000", "10000", "01110", "00001", "00001", "11110"],
+    "T": ["11111", "00100", "00100", "00100", "00100", "00100", "00100"],
+    "U": ["10001", "10001", "10001", "10001", "10001", "10001", "01110"],
+    "V": ["10001", "10001", "10001", "10001", "10001", "01010", "00100"],
+    "W": ["10001", "10001", "10001", "10101", "10101", "10101", "01010"],
+    "X": ["10001", "10001", "01010", "00100", "01010", "10001", "10001"],
+    "Y": ["10001", "10001", "01010", "00100", "00100", "00100", "00100"],
+    "Z": ["11111", "00001", "00010", "00100", "01000", "10000", "11111"],
+}
+
+
+def write_baseline_chart(payload: dict, path: Path) -> None:
+    """Write a dependency-free PNG chart for README and onsite demos."""
+    by_task = payload["by_task"]
+    tasks = list(by_task.keys())
+    policies = [name for name in ("random", "heuristic", "oracle_lite") if any(name in by_task[t] for t in tasks)]
+    colors = {
+        "random": (239, 68, 68),
+        "heuristic": (59, 130, 246),
+        "oracle_lite": (16, 185, 129),
+    }
+    labels = {"random": "RANDOM", "heuristic": "HEURISTIC", "oracle_lite": "ORACLE LITE"}
+
+    width, height = 1200, 720
+    canvas = bytearray([255, 255, 255] * width * height)
+
+    def rect(x0: int, y0: int, x1: int, y1: int, color: tuple[int, int, int]) -> None:
+        x0, y0 = max(0, x0), max(0, y0)
+        x1, y1 = min(width, x1), min(height, y1)
+        for y in range(y0, y1):
+            row = y * width * 3
+            for x in range(x0, x1):
+                idx = row + x * 3
+                canvas[idx : idx + 3] = bytes(color)
+
+    def text(x: int, y: int, value: str, color: tuple[int, int, int] = (20, 20, 20), scale: int = 2) -> None:
+        cursor = x
+        for ch in value.upper():
+            glyph = FONT_5X7.get(ch, FONT_5X7[" "])
+            for gy, line in enumerate(glyph):
+                for gx, bit in enumerate(line):
+                    if bit == "1":
+                        rect(cursor + gx * scale, y + gy * scale, cursor + (gx + 1) * scale, y + (gy + 1) * scale, color)
+            cursor += 6 * scale
+
+    def line_h(y: int, x0: int, x1: int, color: tuple[int, int, int]) -> None:
+        rect(x0, y, x1, y + 1, color)
+
+    def line_v(x: int, y0: int, y1: int, color: tuple[int, int, int]) -> None:
+        rect(x, y0, x + 1, y1, color)
+
+    margin_left, margin_top, margin_right, margin_bottom = 100, 115, 40, 115
+    plot_x0, plot_y0 = margin_left, margin_top
+    plot_x1, plot_y1 = width - margin_right, height - margin_bottom
+    plot_w, plot_h = plot_x1 - plot_x0, plot_y1 - plot_y0
+
+    text(50, 28, "SENTINEL BASELINE COMPARISON", (17, 24, 39), 3)
+    text(52, 70, "EPISODE SCORE 0.0 TO 1.0 - RANDOM VS TRUST WEIGHTED VS ORACLE LITE", (75, 85, 99), 2)
+
+    for tick in (0.0, 0.25, 0.5, 0.75, 1.0):
+        y = int(plot_y1 - tick * plot_h)
+        line_h(y, plot_x0, plot_x1, (226, 232, 240))
+        text(32, y - 7, f"{tick:.2f}", (100, 116, 139), 2)
+    line_v(plot_x0, plot_y0, plot_y1, (148, 163, 184))
+    line_h(plot_y1, plot_x0, plot_x1, (148, 163, 184))
+
+    group_w = plot_w / max(1, len(tasks))
+    bar_w = max(34, min(76, int((group_w - 80) / max(1, len(policies)))))
+    for task_idx, task in enumerate(tasks):
+        group_center = int(plot_x0 + group_w * task_idx + group_w / 2)
+        start_x = group_center - int((len(policies) * bar_w + (len(policies) - 1) * 18) / 2)
+        for policy_idx, policy in enumerate(policies):
+            value = float(by_task[task].get(policy, {}).get("avg_score", 0.0))
+            x0 = start_x + policy_idx * (bar_w + 18)
+            y0 = int(plot_y1 - value * plot_h)
+            rect(x0 + 3, y0 + 3, x0 + bar_w + 3, plot_y1 + 3, (203, 213, 225))
+            rect(x0, y0, x0 + bar_w, plot_y1, colors[policy])
+            text(x0 - 4, max(plot_y0 - 2, y0 - 24), f"{value:.2f}", (15, 23, 42), 2)
+        text(group_center - 36, plot_y1 + 30, task.upper(), (15, 23, 42), 2)
+
+    legend_x, legend_y = 780, 32
+    for idx, policy in enumerate(policies):
+        x = legend_x
+        y = legend_y + idx * 24
+        rect(x, y, x + 16, y + 16, colors[policy])
+        text(x + 24, y + 1, labels[policy], (51, 65, 85), 2)
+
+    path.parent.mkdir(parents=True, exist_ok=True)
+    _write_png(path, width, height, canvas)
+
+
+def _write_png(path: Path, width: int, height: int, rgb: bytearray) -> None:
+    def chunk(tag: bytes, data: bytes) -> bytes:
+        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", zlib.crc32(tag + data) & 0xFFFFFFFF)
+
+    rows = []
+    stride = width * 3
+    for y in range(height):
+        rows.append(b"\x00" + bytes(rgb[y * stride : (y + 1) * stride]))
+    raw = b"".join(rows)
+    png = (
+        b"\x89PNG\r\n\x1a\n"
+        + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))
+        + chunk(b"IDAT", zlib.compress(raw, 9))
+        + chunk(b"IEND", b"")
+    )
+    path.write_bytes(png)
+
+
 def main() -> None:
     parser = argparse.ArgumentParser(description="Evaluate SENTINEL policies.")
     parser.add_argument("--episodes", type=int, default=20, help="Episodes per policy.")
-    parser.add_argument("--task", default="task3", choices=["task1", "task2", "task3"])
+    parser.add_argument("--task", default="task3", choices=["task1", "task2", "task3", "all"])
     parser.add_argument("--out", default="outputs/evaluation_results.json")
+    parser.add_argument("--plot", default="outputs/baseline_comparison.png")
+    parser.add_argument("--no-plot", action="store_true")
     args = parser.parse_args()
 
     policies: dict[str, Policy] = {
@@ -130,23 +285,32 @@ def main() -> None:
         "oracle_lite": oracle_lite_policy,
     }
 
+    tasks = ["task1", "task2", "task3"] if args.task == "all" else [args.task]
     rows = []
-    for policy_name, policy in policies.items():
-        for seed in range(args.episodes):
-            rows.append(run_episode(policy_name, policy, args.task, seed))
+    for task_type in tasks:
+        for policy_name, policy in policies.items():
+            for seed in range(args.episodes):
+                rows.append(run_episode(policy_name, policy, task_type, seed))
 
     payload = {
         "task": args.task,
+        "tasks": tasks,
         "episodes_per_policy": args.episodes,
         "summary": summarize(rows),
+        "by_task": summarize_by_task(rows),
         "episodes": rows,
     }
 
     out_path = ROOT / args.out
     out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(payload, indent=2) + "\n")
+    if not args.no_plot:
+        chart_path = ROOT / args.plot
+        write_baseline_chart(payload, chart_path)
+        payload["chart"] = str(chart_path.relative_to(ROOT))
+        out_path.write_text(json.dumps(payload, indent=2) + "\n")
 
-    print(json.dumps(payload["summary"], indent=2))
+    print(json.dumps({"summary": payload["summary"], "by_task": payload["by_task"], "chart": payload.get("chart")}, indent=2))
 
 
 if __name__ == "__main__":
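Hand-rolling the PNG with `struct` and `zlib` keeps the evaluator free of matplotlib or Pillow, matching the docstring's dependency-free goal for the slim Space image. The chart can also be regenerated from a saved results file without rerunning episodes; a minimal sketch, assuming the repo root is the working directory, `training/` is importable as a package, and the results were produced with `--task all` so `"by_task"` is present:

```python
import json
from pathlib import Path

from training.evaluate import write_baseline_chart  # assumes training/ is importable

# evaluation_results.json stores the "by_task" summary the chart reads.
payload = json.loads(Path("outputs/evaluation_results.json").read_text())
write_baseline_chart(payload, Path("outputs/baseline_comparison.png"))
```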
training/train.py CHANGED

@@ -1,11 +1,11 @@
 from __future__ import annotations
 
 """
-
+Onsite training entrypoint.
 
 This file is intentionally import-light so it can run locally without GPU
 packages. On the finale machine, install the training extras from pyproject and
-
+run without --dry-run to train a small orchestrator policy with GRPO.
 """
 
 import argparse
@@ -37,6 +37,24 @@ def build_prompt(observation: dict) -> str:
     )
 
 
+def build_dataset_records(episodes: int, task_type: str, seed: int) -> list[dict]:
+    records = []
+    task_choices = ["task1", "task2", "task3"] if task_type == "all" else [task_type]
+    for idx in range(episodes):
+        selected_task = task_choices[idx % len(task_choices)]
+        env = SentinelEnv()
+        result = env.reset(task_type=selected_task, seed=seed + idx)
+        obs = result["observation"]
+        records.append(
+            {
+                "prompt": build_prompt(obs),
+                "task_type": selected_task,
+                "seed": seed + idx,
+            }
+        )
+    return records
+
+
 def parse_action(text: str, observation: dict) -> dict:
     match = ACTION_RE.search(text or "")
     payload = {}
@@ -66,6 +84,44 @@ def parse_action(text: str, observation: dict) -> dict:
     }
 
 
+def score_completion(completion: str, task_type: str, seed: int) -> float:
+    env = SentinelEnv()
+    result = env.reset(task_type=task_type, seed=seed)
+    obs = result["observation"]
+    action = parse_action(completion, obs)
+    result = env.step(action)
+    return float(result["reward"]["value"])
+
+
+def sentinel_reward(completions, prompts=None, task_type=None, seed=None, **kwargs):
+    rewards = []
+    task_values = task_type or kwargs.get("task_type") or ["task3"] * len(completions)
+    seed_values = seed or kwargs.get("seed") or list(range(len(completions)))
+    for idx, completion in enumerate(completions):
+        text = _completion_text(completion)
+        try:
+            rewards.append(score_completion(text, str(task_values[idx]), int(seed_values[idx])))
+        except Exception:
+            rewards.append(0.01)
+    return rewards
+
+
+def _completion_text(completion) -> str:
+    if isinstance(completion, str):
+        return completion
+    if isinstance(completion, list):
+        parts = []
+        for item in completion:
+            if isinstance(item, dict):
+                parts.append(str(item.get("content", "")))
+            else:
+                parts.append(str(item))
+        return "\n".join(parts)
+    if isinstance(completion, dict):
+        return str(completion.get("content", completion))
+    return str(completion)
+
+
 def dry_run_rollouts(episodes: int, seed: int) -> dict:
     rng = random.Random(seed)
     scores = []
@@ -88,30 +144,81 @@ def dry_run_rollouts(episodes: int, seed: int) -> dict:
     return {"episodes": episodes, "avg_score": round(sum(scores) / max(1, len(scores)), 4)}
 
 
+def run_grpo(args) -> None:
+    try:
+        from datasets import Dataset
+        from trl import GRPOConfig, GRPOTrainer
+        from unsloth import FastLanguageModel
+    except ImportError:
+        print("Training dependencies are not installed locally.")
+        print("Local check passed. For onsite GPU training run:")
+        print("  pip install '.[training]'")
+        print("  python training/train.py --episodes 300 --task all")
+        return
+
+    records = build_dataset_records(args.episodes, args.task, args.seed)
+    dataset = Dataset.from_list(records)
+
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_name=args.model,
+        max_seq_length=args.max_seq_length,
+        load_in_4bit=True,
+    )
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r=args.lora_rank,
+        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+        lora_alpha=args.lora_rank,
+    )
+
+    config = GRPOConfig(
+        output_dir=args.output_dir,
+        learning_rate=args.learning_rate,
+        num_train_epochs=args.epochs,
+        per_device_train_batch_size=args.batch_size,
+        logging_steps=10,
+        save_steps=50,
+        max_prompt_length=args.max_seq_length,
+        max_completion_length=192,
+    )
+
+    trainer_kwargs = {
+        "model": model,
+        "reward_funcs": [sentinel_reward],
+        "args": config,
+        "train_dataset": dataset,
+    }
+    try:
+        trainer = GRPOTrainer(processing_class=tokenizer, **trainer_kwargs)
+    except TypeError:
+        trainer = GRPOTrainer(tokenizer=tokenizer, **trainer_kwargs)
+
+    trainer.train()
+    model.save_pretrained(args.output_dir)
+    tokenizer.save_pretrained(args.output_dir)
+    print(f"Training complete. Saved LoRA adapter to {args.output_dir}")
+
+
 def main() -> None:
     parser = argparse.ArgumentParser(description="SENTINEL GRPO training harness.")
     parser.add_argument("--dry-run", action="store_true", help="Run local rollouts without GPU dependencies.")
     parser.add_argument("--episodes", type=int, default=5)
     parser.add_argument("--seed", type=int, default=0)
+    parser.add_argument("--task", default="task3", choices=["task1", "task2", "task3", "all"])
+    parser.add_argument("--model", default="unsloth/Qwen2.5-1.5B-Instruct")
+    parser.add_argument("--output-dir", default="training/sentinel_model")
+    parser.add_argument("--epochs", type=int, default=1)
+    parser.add_argument("--batch-size", type=int, default=2)
+    parser.add_argument("--learning-rate", type=float, default=5e-6)
+    parser.add_argument("--max-seq-length", type=int, default=1024)
+    parser.add_argument("--lora-rank", type=int, default=16)
     args = parser.parse_args()
 
     if args.dry_run:
         print(json.dumps(dry_run_rollouts(args.episodes, args.seed), indent=2))
         return
 
-    try:
-        import trl  # noqa: F401
-        import unsloth  # noqa: F401
-    except ImportError as exc:
-        raise SystemExit(
-            "Training dependencies are not installed. Run with --dry-run locally, "
-            "or install the pyproject training extras on the finale GPU machine."
-        ) from exc
-
-    raise SystemExit(
-        "GPU training hook is ready. Wire GRPOTrainer here using build_prompt(), "
-        "parse_action(), and SentinelEnv.step() as the reward source."
-    )
+    run_grpo(args)
 
 
 if __name__ == "__main__":