XcodeAddy committed
Commit 1f43fa9 · 1 Parent(s): d82d9e8

Prepare SENTINEL onsite deployment proof
.gitignore CHANGED
@@ -5,7 +5,10 @@ __pycache__/
 .mypy_cache/
 .ruff_cache/
 .venv/
-outputs/
+outputs/*
+!outputs/baseline_comparison.png
+!outputs/baseline_scores.json
+!outputs/evaluation_results.json
 .env
 .env.*
 !.env.example
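The switch from `outputs/` to `outputs/*` matters: git cannot re-include a file whose parent directory is itself excluded, so the `!outputs/...` negations only take effect with the glob form. A scratch-repo sketch of that behavior (assumes `git` is on PATH; the temp directory is arbitrary):

```python
import subprocess
import tempfile
from pathlib import Path

# Reproduce the allowlist behavior of the hunk above in a throwaway repo.
repo = Path(tempfile.mkdtemp())
subprocess.run(["git", "init", "-q", str(repo)], check=True)
(repo / ".gitignore").write_text("outputs/*\n!outputs/baseline_comparison.png\n")
(repo / "outputs").mkdir()
(repo / "outputs" / "scratch.log").touch()
(repo / "outputs" / "baseline_comparison.png").touch()

def ignored(path: str) -> bool:
    # check-ignore exits 0 when the path is ignored, 1 when it is not.
    return subprocess.run(["git", "-C", str(repo), "check-ignore", "-q", path]).returncode == 0

print(ignored("outputs/scratch.log"), ignored("outputs/baseline_comparison.png"))
```

With the old `outputs/` rule, both paths would be ignored and the committed artifacts could never be tracked.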
README.md CHANGED
@@ -24,6 +24,12 @@ SENTINEL turns that failure mode into a trainable environment. The model only se
 - Rewards: per-step reward plus terminal score, normalized to `0.0-1.0`
 - Dataset: 120 abstract multi-agent scenarios

+## Live Submission Targets
+
+- GitHub: `https://github.com/ADITYAGABA1322/sentinel-env`
+- Hugging Face Space: `https://xcodeaddy-sentinel-env.hf.space`
+- OpenEnv base URL: `https://xcodeaddy-sentinel-env.hf.space`
+
 ## Specialist Behaviors

 | Public Slot | Hidden Behavior |
@@ -133,10 +139,11 @@ pip install pytest
 Run checks:

 ```bash
-python -m py_compile app.py environment.py models.py graders.py specialists.py trust_ledger.py task_graph.py scenarios.py inference.py
+python -m py_compile app.py server/app.py environment.py models.py graders.py specialists.py trust_ledger.py task_graph.py scenarios.py inference.py comms_bus.py training/evaluate.py training/train.py
 python -m pytest -q
 python inference.py
-python training/evaluate.py --episodes 20 --task task3
+python training/evaluate.py --episodes 20 --task all --plot outputs/baseline_comparison.png
+python training/train.py --dry-run --episodes 5
 ```

 Run the server:
@@ -175,7 +182,47 @@ docker run -p 7860:7860 sentinel-env
 - `heuristic`
 - `oracle_lite`

-The evaluator writes `outputs/evaluation_results.json` for demo charts.
+The evaluator writes `outputs/evaluation_results.json` and `outputs/baseline_comparison.png`.
+
+![Baseline Comparison](outputs/baseline_comparison.png)
+
+Latest local comparison, 20 episodes per task and policy:
+
+| Policy | Overall | Task 1 | Task 2 | Task 3 |
+| --- | ---: | ---: | ---: | ---: |
+| Random | 0.7144 | 0.7948 | 0.6493 | 0.6990 |
+| Heuristic trust-weighted | 0.8162 | 0.8911 | 0.7736 | 0.7838 |
+| Oracle-lite upper bound | 0.8718 | 0.9445 | 0.7760 | 0.8950 |
+
+The demo story is the score gap: the reward function distinguishes blind delegation from trust-aware routing, and the oracle-lite upper bound shows room for onsite RL training.
+
+## Hugging Face Deployment
+
+```bash
+huggingface-cli login
+huggingface-cli repo create sentinel-env --type space --space-sdk docker --private false
+git remote add hf https://huggingface.co/spaces/XcodeAddy/sentinel-env
+git push hf main
+```
+
+After the Space builds:
+
+```bash
+curl https://xcodeaddy-sentinel-env.hf.space/health
+curl https://xcodeaddy-sentinel-env.hf.space/
+curl -X POST https://xcodeaddy-sentinel-env.hf.space/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_type":"task3","seed":42}'
+openenv validate . --json
+```
+
+## Mini-Blog Draft
+
+Title: `SENTINEL: Training AI to Trust Wisely in Multi-Agent Systems`
+
+SENTINEL is an OpenEnv RL environment for one failure mode: multi-agent systems delegate blindly. One orchestrator must complete long tasks by routing work across five specialist agents whose reliability profiles are hidden and reshuffled every episode. The orchestrator only sees behavior, confidence, stakes, and history, so it must learn skepticism, verification, recovery, and calibrated trust.
+
+The specialists are deterministic FSMs on purpose: they give stable reward signals while the orchestrator remains the trainable target. Random routing scores `0.7144`, trust-weighted routing scores `0.8162`, and oracle-lite reaches `0.8718`, showing the environment has a meaningful learning signal before onsite GRPO training.

 ## Hackathon Alignment

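One way to read the README's comparison table: because every (policy, task) pair ran the same 20 episodes, each Overall figure is simply the mean of that policy's three task scores. A quick sketch with the published numbers:

```python
# Each policy's Overall score equals the mean of its per-task averages,
# since every (policy, task) pair ran the same 20 episodes.
table = {
    "Random": (0.7948, 0.6493, 0.6990),
    "Heuristic trust-weighted": (0.8911, 0.7736, 0.7838),
    "Oracle-lite upper bound": (0.9445, 0.7760, 0.8950),
}
overall = {name: round(sum(scores) / len(scores), 4) for name, scores in table.items()}
print(overall)
```

The computed values match the Overall column (0.7144, 0.8162, 0.8718), which is a handy consistency check before quoting the numbers in a demo.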
openenv.yaml CHANGED
@@ -23,7 +23,7 @@ description: >
   transferable skill, not memorized identities.

 api:
-  base_url: http://0.0.0.0:7860
+  base_url: https://xcodeaddy-sentinel-env.hf.space
   endpoints:
     health:
       method: GET
@@ -140,9 +140,10 @@ baseline:
   script: inference.py
   required_env_vars: [API_BASE_URL, MODEL_NAME, HF_TOKEN]
   optional_env_vars: [ENV_URL]
-  latest_local_score: 0.7942
-  latest_local_episodes: 30
+  latest_local_score: 0.8162
+  latest_local_episodes: 60
+  comparison_artifact: outputs/baseline_comparison.png
 reproducibility:
   inference_temperature: 0.0
   agent: heuristic-trust-weighted
-  dataset_order: fixed SCN-TASK*-001 through SCN-TASK*-010 per task
+  dataset_order: fixed SCN-TASK*-001 through SCN-TASK*-020 per task
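A demo script that needs the deployed endpoint can pull the new `api.base_url` straight out of `openenv.yaml`. A rough dependency-free sketch (real code would prefer PyYAML; the inline string here stands in for reading the file):

```python
# Rough sketch: extract api.base_url from openenv.yaml without a YAML parser.
# The inline string stands in for Path("openenv.yaml").read_text().
text = "api:\n  base_url: https://xcodeaddy-sentinel-env.hf.space\n"
base_url = next(
    line.split(":", 1)[1].strip()
    for line in text.splitlines()
    if line.strip().startswith("base_url:")
)
print(base_url)
```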
outputs/baseline_comparison.png ADDED
outputs/baseline_scores.json ADDED
@@ -0,0 +1,531 @@
+{
+  "model": "heuristic-baseline",
+  "total_episodes": 30,
+  "avg_score": 0.7942,
+  "by_task": {
+    "task1": {
+      "episodes": 10,
+      "avg_score": 0.8706
+    },
+    "task2": {
+      "episodes": 10,
+      "avg_score": 0.7475
+    },
+    "task3": {
+      "episodes": 10,
+      "avg_score": 0.7646
+    }
+  },
+  "episodes": [
+    {
+      "scenario_id": "SCN-TASK1-001",
+      "task_type": "task1",
+      "steps": 13,
+      "score": 0.765,
+      "total_reward": 10.71,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.473,
+        "S1": 0.743,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-002",
+      "task_type": "task1",
+      "steps": 12,
+      "score": 0.7962,
+      "total_reward": 10.35,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.473,
+        "S1": 0.888,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-003",
+      "task_type": "task1",
+      "steps": 11,
+      "score": 0.885,
+      "total_reward": 10.62,
+      "completion_rate": 0.9,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.296,
+        "S1": 0.296,
+        "S2": 0.94,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-004",
+      "task_type": "task1",
+      "steps": 8,
+      "score": 0.99,
+      "total_reward": 8.91,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.931,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-005",
+      "task_type": "task1",
+      "steps": 11,
+      "score": 0.9375,
+      "total_reward": 11.25,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.86,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-006",
+      "task_type": "task1",
+      "steps": 8,
+      "score": 0.85,
+      "total_reward": 7.65,
+      "completion_rate": 0.6,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.71,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-007",
+      "task_type": "task1",
+      "steps": 10,
+      "score": 0.99,
+      "total_reward": 10.89,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.943,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-008",
+      "task_type": "task1",
+      "steps": 11,
+      "score": 0.8325,
+      "total_reward": 9.99,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.482,
+        "S1": 0.9,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-009",
+      "task_type": "task1",
+      "steps": 9,
+      "score": 0.864,
+      "total_reward": 8.64,
+      "completion_rate": 0.7,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.492,
+        "S1": 0.801,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK1-010",
+      "task_type": "task1",
+      "steps": 12,
+      "score": 0.7962,
+      "total_reward": 10.35,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.494,
+        "S1": 0.885,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-001",
+      "task_type": "task2",
+      "steps": 19,
+      "score": 0.6054,
+      "total_reward": 12.1087,
+      "completion_rate": 0.8,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.476,
+        "S1": 0.26,
+        "S2": 0.717,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-002",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.7762,
+      "total_reward": 13.9711,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.478,
+        "S1": 0.958,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-003",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.7377,
+      "total_reward": 13.2781,
+      "completion_rate": 0.867,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.289,
+        "S1": 0.289,
+        "S2": 0.818,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-004",
+      "task_type": "task2",
+      "steps": 15,
+      "score": 0.7783,
+      "total_reward": 12.4521,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.9,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-005",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.8174,
+      "total_reward": 14.7129,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.849,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-006",
+      "task_type": "task2",
+      "steps": 15,
+      "score": 0.6476,
+      "total_reward": 10.3617,
+      "completion_rate": 0.733,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.708,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-007",
+      "task_type": "task2",
+      "steps": 15,
+      "score": 0.8967,
+      "total_reward": 14.3478,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.967,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-008",
+      "task_type": "task2",
+      "steps": 17,
+      "score": 0.7442,
+      "total_reward": 13.3953,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.49,
+        "S1": 0.959,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-009",
+      "task_type": "task2",
+      "steps": 16,
+      "score": 0.7525,
+      "total_reward": 12.792,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.492,
+        "S1": 0.906,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK2-010",
+      "task_type": "task2",
+      "steps": 18,
+      "score": 0.7191,
+      "total_reward": 13.6622,
+      "completion_rate": 0.933,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.474,
+        "S1": 0.955,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-001",
+      "task_type": "task3",
+      "steps": 25,
+      "score": 0.7354,
+      "total_reward": 19.1204,
+      "completion_rate": 0.85,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.456,
+        "S1": 0.258,
+        "S2": 0.76,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-002",
+      "task_type": "task3",
+      "steps": 25,
+      "score": 0.7054,
+      "total_reward": 18.341,
+      "completion_rate": 0.85,
+      "adversarial_detections": 3,
+      "adversarial_poisonings": 5,
+      "final_trust": {
+        "S0": 0.458,
+        "S1": 0.473,
+        "S2": 0.868,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-003",
+      "task_type": "task3",
+      "steps": 19,
+      "score": 0.6438,
+      "total_reward": 12.8767,
+      "completion_rate": 0.6,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 5,
+      "final_trust": {
+        "S0": 0.299,
+        "S1": 0.299,
+        "S2": 0.633,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-004",
+      "task_type": "task3",
+      "steps": 21,
+      "score": 0.8954,
+      "total_reward": 19.6992,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.93,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-005",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.7134,
+      "total_reward": 17.8339,
+      "completion_rate": 0.85,
+      "adversarial_detections": 3,
+      "adversarial_poisonings": 6,
+      "final_trust": {
+        "S0": 0.491,
+        "S1": 0.797,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-006",
+      "task_type": "task3",
+      "steps": 23,
+      "score": 0.7857,
+      "total_reward": 18.8578,
+      "completion_rate": 0.9,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.774,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-007",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.7045,
+      "total_reward": 17.6133,
+      "completion_rate": 0.85,
+      "adversarial_detections": 3,
+      "adversarial_poisonings": 7,
+      "final_trust": {
+        "S0": 0.498,
+        "S1": 0.5,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-008",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.8057,
+      "total_reward": 20.1435,
+      "completion_rate": 0.95,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.479,
+        "S1": 0.856,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-009",
+      "task_type": "task3",
+      "steps": 23,
+      "score": 0.8456,
+      "total_reward": 20.2932,
+      "completion_rate": 1.0,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.488,
+        "S1": 0.891,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    },
+    {
+      "scenario_id": "SCN-TASK3-010",
+      "task_type": "task3",
+      "steps": 24,
+      "score": 0.8106,
+      "total_reward": 20.2645,
+      "completion_rate": 0.95,
+      "adversarial_detections": 0,
+      "adversarial_poisonings": 0,
+      "final_trust": {
+        "S0": 0.473,
+        "S1": 0.91,
+        "S2": 0.5,
+        "S3": 0.5,
+        "S4": 0.5
+      }
+    }
+  ]
+}
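The `by_task` averages in this file can be reproduced from the episode rows with the same round-to-4 convention the evaluator's `_avg` helper uses. A small sketch using the ten task1 scores listed above:

```python
# Recompute the task1 average from the ten task1 episode scores above,
# using the evaluator's round(sum / len, 4) convention.
task1_scores = [0.765, 0.7962, 0.885, 0.99, 0.9375, 0.85, 0.99, 0.8325, 0.864, 0.7962]
avg_task1 = round(sum(task1_scores) / max(1, len(task1_scores)), 4)
print(avg_task1)
```

The result matches the file's `"task1": {"avg_score": 0.8706}` entry.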
outputs/evaluation_results.json ADDED
The diff for this file is too large to render. See raw diff
 
training/evaluate.py CHANGED
@@ -3,7 +3,9 @@ from __future__ import annotations
 import argparse
 import json
 import random
+import struct
 import sys
+import zlib
 from pathlib import Path
 from typing import Callable

@@ -16,6 +18,8 @@ from environment import SentinelEnv, _GROUND_TRUTH_RELIABILITY

 Policy = Callable[[SentinelEnv, dict, random.Random], dict]

+POLICIES: dict[str, Policy] = {}
+

 def random_policy(env: SentinelEnv, obs: dict, rng: random.Random) -> dict:
     specialist = rng.choice(obs["available_specialists"])
@@ -117,11 +121,162 @@ def _avg(rows: list[dict], key: str) -> float:
     return round(sum(float(row.get(key, 0.0)) for row in rows) / max(1, len(rows)), 4)


+def summarize_by_task(rows: list[dict]) -> dict:
+    grouped: dict[str, list[dict]] = {}
+    for row in rows:
+        grouped.setdefault(row["task_type"], []).append(row)
+    return {task: summarize(task_rows) for task, task_rows in sorted(grouped.items())}
+
+
+FONT_5X7 = {
+    " ": ["00000", "00000", "00000", "00000", "00000", "00000", "00000"],
+    "-": ["00000", "00000", "00000", "11111", "00000", "00000", "00000"],
+    ".": ["00000", "00000", "00000", "00000", "00000", "01100", "01100"],
+    ":": ["00000", "01100", "01100", "00000", "01100", "01100", "00000"],
+    "0": ["01110", "10001", "10011", "10101", "11001", "10001", "01110"],
+    "1": ["00100", "01100", "00100", "00100", "00100", "00100", "01110"],
+    "2": ["01110", "10001", "00001", "00010", "00100", "01000", "11111"],
+    "3": ["11110", "00001", "00001", "01110", "00001", "00001", "11110"],
+    "4": ["00010", "00110", "01010", "10010", "11111", "00010", "00010"],
+    "5": ["11111", "10000", "10000", "11110", "00001", "00001", "11110"],
+    "6": ["01110", "10000", "10000", "11110", "10001", "10001", "01110"],
+    "7": ["11111", "00001", "00010", "00100", "01000", "01000", "01000"],
+    "8": ["01110", "10001", "10001", "01110", "10001", "10001", "01110"],
+    "9": ["01110", "10001", "10001", "01111", "00001", "00001", "01110"],
+    "A": ["01110", "10001", "10001", "11111", "10001", "10001", "10001"],
+    "B": ["11110", "10001", "10001", "11110", "10001", "10001", "11110"],
+    "C": ["01110", "10001", "10000", "10000", "10000", "10001", "01110"],
+    "D": ["11110", "10001", "10001", "10001", "10001", "10001", "11110"],
+    "E": ["11111", "10000", "10000", "11110", "10000", "10000", "11111"],
+    "F": ["11111", "10000", "10000", "11110", "10000", "10000", "10000"],
+    "G": ["01110", "10001", "10000", "10111", "10001", "10001", "01110"],
+    "H": ["10001", "10001", "10001", "11111", "10001", "10001", "10001"],
+    "I": ["01110", "00100", "00100", "00100", "00100", "00100", "01110"],
+    "J": ["00001", "00001", "00001", "00001", "10001", "10001", "01110"],
+    "K": ["10001", "10010", "10100", "11000", "10100", "10010", "10001"],
+    "L": ["10000", "10000", "10000", "10000", "10000", "10000", "11111"],
+    "M": ["10001", "11011", "10101", "10101", "10001", "10001", "10001"],
+    "N": ["10001", "11001", "10101", "10011", "10001", "10001", "10001"],
+    "O": ["01110", "10001", "10001", "10001", "10001", "10001", "01110"],
+    "P": ["11110", "10001", "10001", "11110", "10000", "10000", "10000"],
+    "Q": ["01110", "10001", "10001", "10001", "10101", "10010", "01101"],
+    "R": ["11110", "10001", "10001", "11110", "10100", "10010", "10001"],
+    "S": ["01111", "10000", "10000", "01110", "00001", "00001", "11110"],
+    "T": ["11111", "00100", "00100", "00100", "00100", "00100", "00100"],
+    "U": ["10001", "10001", "10001", "10001", "10001", "10001", "01110"],
+    "V": ["10001", "10001", "10001", "10001", "10001", "01010", "00100"],
+    "W": ["10001", "10001", "10001", "10101", "10101", "10101", "01010"],
+    "X": ["10001", "10001", "01010", "00100", "01010", "10001", "10001"],
+    "Y": ["10001", "10001", "01010", "00100", "00100", "00100", "00100"],
+    "Z": ["11111", "00001", "00010", "00100", "01000", "10000", "11111"],
+}
+
+
+def write_baseline_chart(payload: dict, path: Path) -> None:
+    """Write a dependency-free PNG chart for README and onsite demos."""
+    by_task = payload["by_task"]
+    tasks = list(by_task.keys())
+    policies = [name for name in ("random", "heuristic", "oracle_lite") if any(name in by_task[t] for t in tasks)]
+    colors = {
+        "random": (239, 68, 68),
+        "heuristic": (59, 130, 246),
+        "oracle_lite": (16, 185, 129),
+    }
+    labels = {"random": "RANDOM", "heuristic": "HEURISTIC", "oracle_lite": "ORACLE LITE"}
+
+    width, height = 1200, 720
+    canvas = bytearray([255, 255, 255] * width * height)
+
+    def rect(x0: int, y0: int, x1: int, y1: int, color: tuple[int, int, int]) -> None:
+        x0, y0 = max(0, x0), max(0, y0)
+        x1, y1 = min(width, x1), min(height, y1)
+        for y in range(y0, y1):
+            row = y * width * 3
+            for x in range(x0, x1):
+                idx = row + x * 3
+                canvas[idx : idx + 3] = bytes(color)
+
+    def text(x: int, y: int, value: str, color: tuple[int, int, int] = (20, 20, 20), scale: int = 2) -> None:
+        cursor = x
+        for ch in value.upper():
+            glyph = FONT_5X7.get(ch, FONT_5X7[" "])
+            for gy, line in enumerate(glyph):
+                for gx, bit in enumerate(line):
+                    if bit == "1":
+                        rect(cursor + gx * scale, y + gy * scale, cursor + (gx + 1) * scale, y + (gy + 1) * scale, color)
+            cursor += 6 * scale
+
+    def line_h(y: int, x0: int, x1: int, color: tuple[int, int, int]) -> None:
+        rect(x0, y, x1, y + 1, color)
+
+    def line_v(x: int, y0: int, y1: int, color: tuple[int, int, int]) -> None:
+        rect(x, y0, x + 1, y1, color)
+
+    margin_left, margin_top, margin_right, margin_bottom = 100, 115, 40, 115
+    plot_x0, plot_y0 = margin_left, margin_top
+    plot_x1, plot_y1 = width - margin_right, height - margin_bottom
+    plot_w, plot_h = plot_x1 - plot_x0, plot_y1 - plot_y0
+
+    text(50, 28, "SENTINEL BASELINE COMPARISON", (17, 24, 39), 3)
+    text(52, 70, "EPISODE SCORE 0.0 TO 1.0 - RANDOM VS TRUST WEIGHTED VS ORACLE LITE", (75, 85, 99), 2)
+
+    for tick in (0.0, 0.25, 0.5, 0.75, 1.0):
+        y = int(plot_y1 - tick * plot_h)
+        line_h(y, plot_x0, plot_x1, (226, 232, 240))
+        text(32, y - 7, f"{tick:.2f}", (100, 116, 139), 2)
+    line_v(plot_x0, plot_y0, plot_y1, (148, 163, 184))
+    line_h(plot_y1, plot_x0, plot_x1, (148, 163, 184))
+
+    group_w = plot_w / max(1, len(tasks))
+    bar_w = max(34, min(76, int((group_w - 80) / max(1, len(policies)))))
+    for task_idx, task in enumerate(tasks):
+        group_center = int(plot_x0 + group_w * task_idx + group_w / 2)
+        start_x = group_center - int((len(policies) * bar_w + (len(policies) - 1) * 18) / 2)
+        for policy_idx, policy in enumerate(policies):
+            value = float(by_task[task].get(policy, {}).get("avg_score", 0.0))
+            x0 = start_x + policy_idx * (bar_w + 18)
+            y0 = int(plot_y1 - value * plot_h)
+            rect(x0 + 3, y0 + 3, x0 + bar_w + 3, plot_y1 + 3, (203, 213, 225))
+            rect(x0, y0, x0 + bar_w, plot_y1, colors[policy])
+            text(x0 - 4, max(plot_y0 - 2, y0 - 24), f"{value:.2f}", (15, 23, 42), 2)
+        text(group_center - 36, plot_y1 + 30, task.upper(), (15, 23, 42), 2)
+
+    legend_x, legend_y = 780, 32
+    for idx, policy in enumerate(policies):
+        x = legend_x
+        y = legend_y + idx * 24
+        rect(x, y, x + 16, y + 16, colors[policy])
+        text(x + 24, y + 1, labels[policy], (51, 65, 85), 2)
+
+    path.parent.mkdir(parents=True, exist_ok=True)
+    _write_png(path, width, height, canvas)
+
+
+def _write_png(path: Path, width: int, height: int, rgb: bytearray) -> None:
+    def chunk(tag: bytes, data: bytes) -> bytes:
+        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", zlib.crc32(tag + data) & 0xFFFFFFFF)
+
+    rows = []
+    stride = width * 3
+    for y in range(height):
+        rows.append(b"\x00" + bytes(rgb[y * stride : (y + 1) * stride]))
+    raw = b"".join(rows)
+    png = (
+        b"\x89PNG\r\n\x1a\n"
+        + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))
+        + chunk(b"IDAT", zlib.compress(raw, 9))
+        + chunk(b"IEND", b"")
+    )
+    path.write_bytes(png)
+
+
 def main() -> None:
     parser = argparse.ArgumentParser(description="Evaluate SENTINEL policies.")
     parser.add_argument("--episodes", type=int, default=20, help="Episodes per policy.")
-    parser.add_argument("--task", default="task3", choices=["task1", "task2", "task3"])
+    parser.add_argument("--task", default="task3", choices=["task1", "task2", "task3", "all"])
     parser.add_argument("--out", default="outputs/evaluation_results.json")
+    parser.add_argument("--plot", default="outputs/baseline_comparison.png")
+    parser.add_argument("--no-plot", action="store_true")
    args = parser.parse_args()

     policies: dict[str, Policy] = {
@@ -130,23 +285,32 @@ def main() -> None:
         "oracle_lite": oracle_lite_policy,
     }

+    tasks = ["task1", "task2", "task3"] if args.task == "all" else [args.task]
     rows = []
-    for policy_name, policy in policies.items():
-        for seed in range(args.episodes):
-            rows.append(run_episode(policy_name, policy, args.task, seed))
+    for task_type in tasks:
+        for policy_name, policy in policies.items():
+            for seed in range(args.episodes):
+                rows.append(run_episode(policy_name, policy, task_type, seed))

     payload = {
         "task": args.task,
+        "tasks": tasks,
         "episodes_per_policy": args.episodes,
         "summary": summarize(rows),
+        "by_task": summarize_by_task(rows),
         "episodes": rows,
     }

     out_path = ROOT / args.out
     out_path.parent.mkdir(parents=True, exist_ok=True)
     out_path.write_text(json.dumps(payload, indent=2) + "\n")
+    if not args.no_plot:
+        chart_path = ROOT / args.plot
+        write_baseline_chart(payload, chart_path)
+        payload["chart"] = str(chart_path.relative_to(ROOT))
+        out_path.write_text(json.dumps(payload, indent=2) + "\n")

-    print(json.dumps(payload["summary"], indent=2))
+    print(json.dumps({"summary": payload["summary"], "by_task": payload["by_task"], "chart": payload.get("chart")}, indent=2))


 if __name__ == "__main__":
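The `_write_png` helper added above leans only on `struct` and `zlib`: every scanline gets filter byte 0 prepended, the rows are zlib-compressed into a single IDAT chunk, and each chunk carries a CRC32 over its tag plus data. A self-contained sketch of the same encoding, down to a two-pixel image:

```python
import struct
import zlib

def tiny_png(width: int, height: int, rgb: bytes) -> bytes:
    # Chunk layout: 4-byte big-endian length, tag, data, CRC32(tag + data).
    def chunk(tag: bytes, data: bytes) -> bytes:
        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", zlib.crc32(tag + data) & 0xFFFFFFFF)

    stride = width * 3
    # Filter byte 0 (None) in front of every scanline, then one zlib stream.
    raw = b"".join(b"\x00" + rgb[y * stride : (y + 1) * stride] for y in range(height))
    return (
        b"\x89PNG\r\n\x1a\n"  # PNG signature
        + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))  # 8-bit truecolor RGB
        + chunk(b"IDAT", zlib.compress(raw, 9))
        + chunk(b"IEND", b"")
    )

png = tiny_png(2, 1, bytes([255, 0, 0, 0, 0, 255]))  # one red, one blue pixel
print(len(png))
```

Any image viewer will open the result, which is all the chart writer needs without pulling in matplotlib or Pillow.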
training/train.py CHANGED
@@ -1,11 +1,11 @@
 from __future__ import annotations

 """
-Minimal onsite training entrypoint.
+Onsite training entrypoint.

 This file is intentionally import-light so it can run locally without GPU
 packages. On the finale machine, install the training extras from pyproject and
-use this script as the GRPO wiring point.
+run without --dry-run to train a small orchestrator policy with GRPO.
 """

 import argparse
@@ -37,6 +37,24 @@ def build_prompt(observation: dict) -> str:
     )


+def build_dataset_records(episodes: int, task_type: str, seed: int) -> list[dict]:
+    records = []
+    task_choices = ["task1", "task2", "task3"] if task_type == "all" else [task_type]
+    for idx in range(episodes):
+        selected_task = task_choices[idx % len(task_choices)]
+        env = SentinelEnv()
+        result = env.reset(task_type=selected_task, seed=seed + idx)
+        obs = result["observation"]
+        records.append(
+            {
+                "prompt": build_prompt(obs),
+                "task_type": selected_task,
+                "seed": seed + idx,
+            }
+        )
+    return records
+
+
 def parse_action(text: str, observation: dict) -> dict:
     match = ACTION_RE.search(text or "")
     payload = {}
@@ -66,6 +84,44 @@ def parse_action(text: str, observation: dict) -> dict:
     }

@@ -88,30 +144,81 @@ def dry_run_rollouts(episodes: int, seed: int) -> dict:
     return {"episodes": episodes, "avg_score": round(sum(scores) / max(1, len(scores)), 4)}


 def main() -> None:
     parser = argparse.ArgumentParser(description="SENTINEL GRPO training harness.")
     parser.add_argument("--dry-run", action="store_true", help="Run local rollouts without GPU dependencies.")
     parser.add_argument("--episodes", type=int, default=5)
     parser.add_argument("--seed", type=int, default=0)
     args = parser.parse_args()

     if args.dry_run:
         print(json.dumps(dry_run_rollouts(args.episodes, args.seed), indent=2))
         return

-    try:
-        import trl  # noqa: F401
-        import unsloth  # noqa: F401
-    except ImportError as exc:
-        raise SystemExit(
-            "Training dependencies are not installed. Run with --dry-run locally, "
-            "or install the pyproject training extras on the finale GPU machine."
-        ) from exc
-
-    raise SystemExit(
-        "GPU training hook is ready. Wire GRPOTrainer here using build_prompt(), "
-        "parse_action(), and SentinelEnv.step() as the reward source."
-    )


 if __name__ == "__main__":
 
84
  }
85
 
86
 
87
+ def score_completion(completion: str, task_type: str, seed: int) -> float:
88
+ env = SentinelEnv()
89
+ result = env.reset(task_type=task_type, seed=seed)
90
+ obs = result["observation"]
91
+ action = parse_action(completion, obs)
92
+ result = env.step(action)
93
+ return float(result["reward"]["value"])
94
+
95
+
96
+ def sentinel_reward(completions, prompts=None, task_type=None, seed=None, **kwargs):
97
+ rewards = []
98
+ task_values = task_type or kwargs.get("task_type") or ["task3"] * len(completions)
99
+ seed_values = seed or kwargs.get("seed") or list(range(len(completions)))
100
+ for idx, completion in enumerate(completions):
101
+ text = _completion_text(completion)
102
+ try:
103
+ rewards.append(score_completion(text, str(task_values[idx]), int(seed_values[idx])))
104
+ except Exception:
105
+ rewards.append(0.01)
106
+ return rewards
107
+
108
+
109
+ def _completion_text(completion) -> str:
110
+ if isinstance(completion, str):
111
+ return completion
112
+ if isinstance(completion, list):
113
+ parts = []
114
+ for item in completion:
115
+ if isinstance(item, dict):
116
+ parts.append(str(item.get("content", "")))
117
+ else:
118
+ parts.append(str(item))
119
+ return "\n".join(parts)
120
+ if isinstance(completion, dict):
121
+ return str(completion.get("content", completion))
122
+ return str(completion)
123
+
124
+
125
  def dry_run_rollouts(episodes: int, seed: int) -> dict:
126
  rng = random.Random(seed)
127
  scores = []
 
144
  return {"episodes": episodes, "avg_score": round(sum(scores) / max(1, len(scores)), 4)}
145
 
146
 
147
+ def run_grpo(args) -> None:
148
+ try:
149
+ from datasets import Dataset
150
+ from trl import GRPOConfig, GRPOTrainer
151
+ from unsloth import FastLanguageModel
152
+ except ImportError:
153
+ print("Training dependencies are not installed locally.")
154
+ print("Local check passed. For onsite GPU training run:")
155
+ print(" pip install '.[training]'")
156
+ print(" python training/train.py --episodes 300 --task all")
157
+ return
158
+
159
+ records = build_dataset_records(args.episodes, args.task, args.seed)
160
+ dataset = Dataset.from_list(records)
161
+
162
+ model, tokenizer = FastLanguageModel.from_pretrained(
163
+ model_name=args.model,
164
+ max_seq_length=args.max_seq_length,
165
+ load_in_4bit=True,
166
+ )
167
+ model = FastLanguageModel.get_peft_model(
168
+ model,
169
+ r=args.lora_rank,
170
+ target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
171
+ lora_alpha=args.lora_rank,
172
+ )
173
+
174
+ config = GRPOConfig(
175
+ output_dir=args.output_dir,
176
+ learning_rate=args.learning_rate,
177
+ num_train_epochs=args.epochs,
178
+ per_device_train_batch_size=args.batch_size,
179
+ logging_steps=10,
180
+ save_steps=50,
181
+ max_prompt_length=args.max_seq_length,
182
+ max_completion_length=192,
183
+ )
184
+
185
+ trainer_kwargs = {
186
+ "model": model,
187
+ "reward_funcs": [sentinel_reward],
188
+ "args": config,
189
+ "train_dataset": dataset,
190
+ }
191
+ try:
192
+ trainer = GRPOTrainer(processing_class=tokenizer, **trainer_kwargs)
193
+ except TypeError:
194
+ trainer = GRPOTrainer(tokenizer=tokenizer, **trainer_kwargs)
195
+
196
+ trainer.train()
197
+ model.save_pretrained(args.output_dir)
198
+ tokenizer.save_pretrained(args.output_dir)
199
+ print(f"Training complete. Saved LoRA adapter to {args.output_dir}")
200
+
201
+
202
  def main() -> None:
203
  parser = argparse.ArgumentParser(description="SENTINEL GRPO training harness.")
204
  parser.add_argument("--dry-run", action="store_true", help="Run local rollouts without GPU dependencies.")
205
  parser.add_argument("--episodes", type=int, default=5)
206
  parser.add_argument("--seed", type=int, default=0)
207
+ parser.add_argument("--task", default="task3", choices=["task1", "task2", "task3", "all"])
208
+ parser.add_argument("--model", default="unsloth/Qwen2.5-1.5B-Instruct")
209
+ parser.add_argument("--output-dir", default="training/sentinel_model")
210
+ parser.add_argument("--epochs", type=int, default=1)
211
+ parser.add_argument("--batch-size", type=int, default=2)
212
+ parser.add_argument("--learning-rate", type=float, default=5e-6)
213
+ parser.add_argument("--max-seq-length", type=int, default=1024)
214
+ parser.add_argument("--lora-rank", type=int, default=16)
215
  args = parser.parse_args()
216
 
217
  if args.dry_run:
218
  print(json.dumps(dry_run_rollouts(args.episodes, args.seed), indent=2))
219
  return
220
 
221
+ run_grpo(args)
 
 
 
 
 
 
 
 
 
 
 
 
222
 
223
 
224
  if __name__ == "__main__":
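A TRL reward function can receive completions in several shapes depending on the dataset format: a plain string, a chat-style list of message dicts, or a single dict. The `_completion_text` helper in this commit normalizes all three before scoring; a self-contained replica of that logic, runnable without the rest of the repo:

```python
def completion_text(completion) -> str:
    """Collapse a completion (str, chat-message list, or dict) into plain text."""
    if isinstance(completion, str):
        return completion
    if isinstance(completion, list):
        parts = []
        for item in completion:
            if isinstance(item, dict):
                # Chat-style message: keep only the content field.
                parts.append(str(item.get("content", "")))
            else:
                parts.append(str(item))
        return "\n".join(parts)
    if isinstance(completion, dict):
        return str(completion.get("content", completion))
    # Last resort: stringify anything else rather than raise.
    return str(completion)


print(completion_text("ACTION: delegate"))                                      # plain string
print(completion_text([{"role": "assistant", "content": "step one"}, "two"]))   # chat list
print(completion_text({"content": "single message"}))                           # single dict
```

Pairing this normalization with the `try/except` in `sentinel_reward`, which falls back to a small positive reward (`0.01`) on any scoring failure, keeps a single malformed completion from crashing a whole rollout batch.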