# SENTINEL Rollout This file is the execution spine for the project. The rule is simple: 1. Finish one phase. 2. Verify it. 3. Only then move to the next phase. SENTINEL wins if the repo, Space, README, UI, and pitch all tell the same story: > Train an orchestrator to decide who to trust, when to verify, and how to recover in long multi-agent tasks when specialists are unreliable or adversarial. ## Current Status | Area | Status | Notes | | --- | --- | --- | | Environment core | Strong | `reset()`, `step()`, `state()`, reward v2, task graph, specialists, trust ledger | | OpenEnv / deploy | Strong | Space live, Docker passing, validation passing | | UI clarity | Improving | Trust Mission Control is live, but still needs full judge-demo mode | | Presentation assets | Partial | Story exists, but diagrams and finale pack need stronger structure | | Training evidence | Partial | Baselines are refreshed under Reward Engine v2; final onsite GRPO curve still missing | | Submission completeness | Partial | Mini-blog/video and final finale package still needed | ## What We Borrow From MiroFish We borrow **presentation discipline**, not product scope. Use these MiroFish-style strengths: - one sharp promise at the top - visible workflow - screenshot and diagram density - live demo-first presentation - clean quick-start and deployment instructions Do **not** copy these patterns into SENTINEL: - giant "predict anything" scope - too many use cases - vague platform framing - vision language that is larger than the actual judged artifact ## Phase Rules - Phase 1 must lock the narrative. - Phase 2 must lock the diagram system. - Phase 3 must make the UI explain the backend and the story. - Phase 4 must make learning evidence obvious. - Phase 5 must make the submission complete and reproducible. - Phase 6 must make the final pitch unforgettable. Do not skip a verification gate just because the feature "looks done." --- ## Phase 1 - Narrative Lock **Goal** Create one judge-safe project story and use it everywhere. **Outputs** - [Narrative Lock](./presentation/NARRATIVE_LOCK.md) - final one-line thesis - final hook - final problem framing - final before/after claim - final "what not to say" guardrails **Done means** - README, UI, demo script, and pitch all use the same project sentence - no outdated numbers or mismatched claims remain in primary docs - the problem statement is clearly software-first, RL-first, and OpenEnv-first **Verification** - README top section matches the narrative lock - UI top section uses the same thesis - team can explain SENTINEL in 20 seconds and 2 minutes without changing the core message **Status** `In progress` --- ## Phase 2 - Visual System Pack **Goal** Turn scattered diagrams into one visual language. **Outputs** - [Visual System](./diagrams/VISUAL_SYSTEM.md) - architecture diagram - episode lifecycle diagram - trust / reward dataflow diagram - before / after failure chain - theme fit diagram - training loop diagram **Done means** - every diagram uses the same naming and system boundaries - no diagram contradicts the actual code - diagrams can be embedded in README, blog, pitch, and UI **Verification** - `app.py`, `environment.py`, `specialists.py`, `trust_ledger.py`, `graders.py`, `task_graph.py`, and `inference.py` are all represented correctly - before/after flow uses real baseline numbers, not aspirational placeholders **Status** `In progress` --- ## Phase 3 - Productized Demo UI **Goal** Make the frontend explain the backend to judges and first-time users. **Outputs** - `Overview` mode - `Playground` mode - `Judge Demo` mode - raw request/response visibility - guided walkthrough of one episode - profile swap demo path **Done means** - a first-time viewer can answer: - what is SENTINEL? - what does the agent observe? - what action did the UI send? - what did the backend return? - why does trust change? - why is this hard? **Verification** - local `/`, `/reset`, `/step`, `/state`, and `/assets/baseline_comparison.png` all behave correctly - live Space reflects the same experience - no section feels like internal tooling only **Status** `Pending` --- ## Phase 4 - Learning Evidence **Goal** Make reward improvement impossible to miss. **Outputs** - random vs heuristic vs oracle-lite comparison - visible completion, detection, calibration, efficiency metrics - onsite GRPO / Unsloth reward curve - trained vs untrained comparison block **Done means** - judges can see measurable improvement in one screen and one README section - there is a visible path from baseline -> better policy -> trained model **Verification** - `training/evaluate.py` outputs are committed and linked - onsite curve is committed once available - numbers shown in UI and README match evaluation artifacts **Status** `Pending` --- ## Phase 5 - Submission Pack **Goal** Make the project submission-complete. **Outputs** - final README with all links - HF Space link - Colab / training notebook link - blog or video link - screenshots and diagram links - reproduction commands **Done means** - a judge can clone, run, inspect, and understand the project without asking for missing context **Verification** - README links are live - Space is live - `openenv validate . --json` passes - Docker build passes **Status** `Pending` --- ## Phase 6 - Finale Pack **Goal** Package the repo for the room, not just for the validator. **Outputs** - 3-minute script - 5 likely judge questions + answers - backup screenshots - fallback demo sequence - one-click "killer moment" path **Done means** - the pitch works even if the live environment is slow - the trained-vs-baseline story is memorable - the profile swap moment is rehearsed **Verification** - demo path can be run without improvising architecture details - every claim can be grounded in repo assets **Status** `Pending` --- ## Execution Order ```text Phase 1 -> Phase 2 -> Phase 3 -> Phase 4 -> Phase 5 -> Phase 6 ``` ## Next Immediate Build Target Phase 1 and Phase 2 are the current active work. Once both are fully stable in-repo, Phase 3 starts on top of them.