# SENTINEL Rollout
This file is the execution spine for the project. The rule is simple:
1. Finish one phase.
2. Verify it.
3. Only then move to the next phase.
SENTINEL wins if the repo, Space, README, UI, and pitch all tell the same story:
> Train an orchestrator to decide who to trust, when to verify, and how to recover in long multi-agent tasks when specialists are unreliable or adversarial.
## Current Status
| Area | Status | Notes |
| --- | --- | --- |
| Environment core | Strong | `reset()`, `step()`, `state()`, reward v2, task graph, specialists, trust ledger |
| OpenEnv / deploy | Strong | Space live, Docker passing, validation passing |
| UI clarity | Improving | Trust Mission Control is live, but still needs full judge-demo mode |
| Presentation assets | Partial | Story exists, but diagrams and finale pack need stronger structure |
| Training evidence | Partial | Baselines are refreshed under Reward Engine v2; final onsite GRPO curve still missing |
| Submission completeness | Partial | Mini-blog/video and final finale package still needed |
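The environment core row above covers the `reset()` / `step()` / `state()` surface. A minimal sketch of that surface and how a client drives one episode — the observation fields, action strings, and reward values here are illustrative placeholders, not the real SENTINEL schema:

```python
from dataclasses import dataclass, field

@dataclass
class ToyEnv:
    """Illustrative stand-in for the SENTINEL environment surface.

    Field names and the fixed 3-step episode are placeholders; the real
    environment.py builds observations from the task graph.
    """
    steps: int = 0
    done: bool = False
    log: list = field(default_factory=list)

    def reset(self):
        """Start a new episode and return the first observation."""
        self.steps, self.done, self.log = 0, False, []
        return {"observation": "task_graph_root", "done": False}

    def step(self, action):
        """Apply one orchestrator action; this toy episode ends after 3 steps."""
        self.steps += 1
        self.log.append(action)
        self.done = self.steps >= 3
        return {"observation": f"node_{self.steps}", "reward": 0.1, "done": self.done}

    def state(self):
        """Expose the full episode state for inspection (what the UI renders)."""
        return {"steps": self.steps, "done": self.done, "actions": list(self.log)}
```

A client loops `reset()`, then `step()` until `done`, and can call `state()` at any point — the same three calls the Space exposes over HTTP.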
## What We Borrow From MiroFish
We borrow **presentation discipline**, not product scope.
Use these MiroFish-style strengths:
- one sharp promise at the top
- visible workflow
- screenshot and diagram density
- live demo-first presentation
- clean quick-start and deployment instructions
Do **not** copy these patterns into SENTINEL:
- giant "predict anything" scope
- too many use cases
- vague platform framing
- vision language that is larger than the actual judged artifact
## Phase Rules
- Phase 1 must lock the narrative.
- Phase 2 must lock the diagram system.
- Phase 3 must make the UI explain the backend and the story.
- Phase 4 must make learning evidence obvious.
- Phase 5 must make the submission complete and reproducible.
- Phase 6 must make the final pitch unforgettable.
Do not skip a verification gate just because the feature "looks done."
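The gate rule can be made mechanical rather than a judgment call. A sketch — the phase names are shorthand for this file's six phases, and the `verified` flags are whatever the team actually tracks:

```python
# Phase order from the Execution Order section; a phase may start only
# when every earlier phase has passed its verification gate.
PHASES = ["narrative", "visual", "ui", "evidence", "submission", "finale"]

def next_allowed_phase(verified):
    """Return the first phase whose gate has not been passed yet.

    `verified` maps phase name -> bool ("looks done" does not count as True).
    Returns None once every gate has been passed.
    """
    for phase in PHASES:
        if not verified.get(phase, False):
            return phase
    return None
```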
---
## Phase 1 - Narrative Lock
**Goal**
Create one judge-safe project story and use it everywhere.
**Outputs**
- [Narrative Lock](./presentation/NARRATIVE_LOCK.md)
- final one-line thesis
- final hook
- final problem framing
- final before/after claim
- final "what not to say" guardrails
**Done means**
- README, UI, demo script, and pitch all use the same project sentence
- no outdated numbers or mismatched claims remain in primary docs
- the problem statement is clearly software-first, RL-first, and OpenEnv-first
**Verification**
- README top section matches the narrative lock
- UI top section uses the same thesis
- the team can explain SENTINEL in 20 seconds and in 2 minutes without changing the core message
**Status**
`In progress`
---
## Phase 2 - Visual System Pack
**Goal**
Turn scattered diagrams into one visual language.
**Outputs**
- [Visual System](./diagrams/VISUAL_SYSTEM.md)
- architecture diagram
- episode lifecycle diagram
- trust / reward dataflow diagram
- before / after failure chain
- theme fit diagram
- training loop diagram
**Done means**
- every diagram uses the same naming and system boundaries
- no diagram contradicts the actual code
- diagrams can be embedded in README, blog, pitch, and UI
**Verification**
- `app.py`, `environment.py`, `specialists.py`, `trust_ledger.py`, `graders.py`, `task_graph.py`, and `inference.py` are all represented correctly
- before/after flow uses real baseline numbers, not aspirational placeholders
**Status**
`In progress`
---
## Phase 3 - Productized Demo UI
**Goal**
Make the frontend explain the backend to judges and first-time users.
**Outputs**
- `Overview` mode
- `Playground` mode
- `Judge Demo` mode
- raw request/response visibility
- guided walkthrough of one episode
- profile swap demo path
**Done means**
- a first-time viewer can answer:
- what is SENTINEL?
- what does the agent observe?
- what action did the UI send?
- what did the backend return?
- why does trust change?
- why is this hard?
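"Why does trust change?" has a one-screen answer if the update rule is shown next to the episode. A minimal sketch, assuming an exponential moving average over verification outcomes — the actual rule in `trust_ledger.py` may weight by task difficulty or differ entirely:

```python
class TrustLedgerSketch:
    """Toy trust ledger: per-specialist EMA of verification outcomes.

    alpha controls how fast trust reacts to new evidence; both the update
    rule and the defaults here are assumptions, not the real implementation.
    """
    def __init__(self, alpha=0.3, initial=0.5):
        self.alpha = alpha
        self.initial = initial
        self.scores = {}

    def update(self, specialist, verified_ok):
        """Move trust toward 1.0 on a verified success, toward 0.0 on a failure."""
        old = self.scores.get(specialist, self.initial)
        new = (1 - self.alpha) * old + self.alpha * (1.0 if verified_ok else 0.0)
        self.scores[specialist] = new
        return new

    def trust(self, specialist):
        return self.scores.get(specialist, self.initial)
```

Rendering `scores` before and after each verification is exactly the "why does trust change?" panel a first-time viewer needs.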
**Verification**
- local `/`, `/reset`, `/step`, `/state`, and `/assets/baseline_comparison.png` all behave correctly
- live Space reflects the same experience
- no section feels like internal tooling only
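The endpoint checks above can be scripted so the same gate runs against localhost and the live Space. A sketch with the fetcher injected (which keeps it testable offline); the endpoint list comes from this checklist, everything else is hypothetical:

```python
# Endpoints from the Phase 3 verification gate.
ENDPOINTS = ["/", "/reset", "/step", "/state", "/assets/baseline_comparison.png"]

def smoke_test(base_url, fetch):
    """Return the endpoints that did NOT respond with HTTP 200.

    `fetch` takes a URL and returns a status code, e.g. a thin wrapper
    around urllib or requests pointed at localhost or the live Space.
    """
    failures = []
    for path in ENDPOINTS:
        status = fetch(base_url.rstrip("/") + path)
        if status != 200:
            failures.append(path)
    return failures
```

An empty return means the gate passes; a nonempty one names exactly which routes broke.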
**Status**
`Pending`
---
## Phase 4 - Learning Evidence
**Goal**
Make reward improvement impossible to miss.
**Outputs**
- random vs heuristic vs oracle-lite comparison
- visible completion, detection, calibration, efficiency metrics
- onsite GRPO / Unsloth reward curve
- trained vs untrained comparison block
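The random vs heuristic vs oracle-lite comparison reduces to a per-policy mean over episode returns. A hedged sketch of that aggregation — the policy names are from this section, and in the real repo the episode returns would come from `training/evaluate.py`:

```python
def compare_policies(returns_by_policy):
    """Mean episode return per policy, sorted worst to best.

    `returns_by_policy` maps policy name -> list of episode returns.
    The sorted output makes the baseline -> better-policy ordering
    readable at a glance in a README table or UI panel.
    """
    means = {
        name: sum(returns) / len(returns)
        for name, returns in returns_by_policy.items()
    }
    return sorted(means.items(), key=lambda kv: kv[1])
```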
**Done means**
- judges can see measurable improvement in one screen and one README section
- there is a visible path from baseline -> better policy -> trained model
**Verification**
- `training/evaluate.py` outputs are committed and linked
- onsite curve is committed once available
- numbers shown in UI and README match evaluation artifacts
**Status**
`Pending`
---
## Phase 5 - Submission Pack
**Goal**
Make the project submission-complete.
**Outputs**
- final README with all links
- HF Space link
- Colab / training notebook link
- blog or video link
- screenshots and diagram links
- reproduction commands
**Done means**
- a judge can clone, run, inspect, and understand the project without asking for missing context
**Verification**
- README links are live
- Space is live
- `openenv validate . --json` passes
- Docker build passes
**Status**
`Pending`
---
## Phase 6 - Finale Pack
**Goal**
Package the repo for the room, not just for the validator.
**Outputs**
- 3-minute script
- 5 likely judge questions + answers
- backup screenshots
- fallback demo sequence
- one-click "killer moment" path
**Done means**
- the pitch works even if the live environment is slow
- the trained-vs-baseline story is memorable
- the profile swap moment is rehearsed
**Verification**
- demo path can be run without improvising architecture details
- every claim can be grounded in repo assets
**Status**
`Pending`
---
## Execution Order
```text
Phase 1 -> Phase 2 -> Phase 3 -> Phase 4 -> Phase 5 -> Phase 6
```
## Next Immediate Build Target
Phase 1 and Phase 2 are the current active work.
Once both are fully stable in-repo, Phase 3 starts on top of them.