# SENTINEL Rollout
This file is the execution spine for the project. The rule is simple:
- Finish one phase.
- Verify it.
- Only then move to the next phase.
SENTINEL wins if the repo, Space, README, UI, and pitch all tell the same story:
> Train an orchestrator to decide who to trust, when to verify, and how to recover in long multi-agent tasks when specialists are unreliable or adversarial.
## Current Status
| Area | Status | Notes |
|---|---|---|
| Environment core | Strong | `reset()`, `step()`, `state()`, reward v2, task graph, specialists, trust ledger |
| OpenEnv / deploy | Strong | Space live, Docker passing, validation passing |
| UI clarity | Improving | Trust Mission Control is live, but still needs full judge-demo mode |
| Presentation assets | Partial | Story exists, but diagrams and finale pack need stronger structure |
| Training evidence | Partial | Baselines are refreshed under Reward Engine v2; final onsite GRPO curve still missing |
| Submission completeness | Partial | Mini-blog/video and final finale package still needed |
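For orientation, the loop below shows how the environment core is exercised end to end. This is a minimal sketch assuming a Gym-style Python interface; the `SentinelEnv` class name, the `profile` argument, and the action fields are illustrative assumptions, not the real API.

```python
# Minimal sketch of one episode against the environment core.
# Gym-style interface inferred from the table above; the class name,
# profile argument, and action fields are illustrative assumptions.
from environment import SentinelEnv  # hypothetical class in environment.py

env = SentinelEnv(profile="adversarial")  # specialist reliability profile (assumed)
obs = env.reset()                         # new task graph, fresh trust ledger

done = False
while not done:
    # The orchestrator decides who to trust, when to verify, how to recover.
    action = {"specialist": "researcher", "verify": True}  # illustrative action
    obs, reward, done, info = env.step(action)

print(env.state())  # trust ledger, task-graph progress, reward v2 breakdown
```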
## What We Borrow From MiroFish
We borrow presentation discipline, not product scope.
Use these MiroFish-style strengths:
- one sharp promise at the top
- visible workflow
- screenshot and diagram density
- live demo-first presentation
- clean quick-start and deployment instructions
Do not copy these patterns into SENTINEL:
- giant "predict anything" scope
- too many use cases
- vague platform framing
- vision language that is larger than the actual judged artifact
## Phase Rules
- Phase 1 must lock the narrative.
- Phase 2 must lock the diagram system.
- Phase 3 must make the UI explain the backend and the story.
- Phase 4 must make learning evidence obvious.
- Phase 5 must make the submission complete and reproducible.
- Phase 6 must make the final pitch unforgettable.
Do not skip a verification gate just because the feature "looks done."
## Phase 1 - Narrative Lock

### Goal
Create one judge-safe project story and use it everywhere.
### Outputs

- Narrative Lock
  - final one-line thesis
  - final hook
  - final problem framing
  - final before/after claim
  - final "what not to say" guardrails
### Done means
- README, UI, demo script, and pitch all use the same project sentence
- no outdated numbers or mismatched claims remain in primary docs
- the problem statement is clearly software-first, RL-first, and OpenEnv-first
### Verification
- README top section matches the narrative lock
- UI top section uses the same thesis
- team can explain SENTINEL in 20 seconds and 2 minutes without changing the core message
**Status:** In progress
## Phase 2 - Visual System Pack

### Goal
Turn scattered diagrams into one visual language.
### Outputs

- Visual System
  - architecture diagram
  - episode lifecycle diagram
  - trust / reward dataflow diagram
  - before / after failure chain
  - theme fit diagram
  - training loop diagram
### Done means
- every diagram uses the same naming and system boundaries
- no diagram contradicts the actual code
- diagrams can be embedded in README, blog, pitch, and UI
### Verification

- `app.py`, `environment.py`, `specialists.py`, `trust_ledger.py`, `graders.py`, `task_graph.py`, and `inference.py` are all represented correctly
- before/after flow uses real baseline numbers, not aspirational placeholders

**Status:** In progress
## Phase 3 - Productized Demo UI

### Goal
Make the frontend explain the backend to judges and first-time users.
### Outputs

- `Overview` mode
- `Playground` mode
- `Judge Demo` mode
- raw request/response visibility
- guided walkthrough of one episode
- profile swap demo path
### Done means

- a first-time viewer can answer:
  - what is SENTINEL?
  - what does the agent observe?
  - what action did the UI send?
  - what did the backend return?
  - why does trust change?
  - why is this hard?
### Verification

- local `/`, `/reset`, `/step`, `/state`, and `/assets/baseline_comparison.png` all behave correctly (see the smoke-test sketch after this list)
- live Space reflects the same experience
- no section feels like internal tooling only
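The local checklist above can be scripted as a smoke test. A minimal sketch assuming the app serves JSON on `localhost:8000` and that `/reset` and `/step` accept POST; the port, payload shapes, and action fields are assumptions.

```python
# Smoke test for the local endpoints listed above.
# Assumes JSON over localhost:8000; port and payload shapes are assumptions.
import requests

BASE = "http://localhost:8000"

assert requests.get(f"{BASE}/").status_code == 200  # UI shell loads

obs = requests.post(f"{BASE}/reset").json()
print("reset ->", sorted(obs))

step = requests.post(
    f"{BASE}/step",
    json={"action": {"specialist": "researcher", "verify": True}},  # illustrative action
).json()
print("step  ->", sorted(step))

state = requests.get(f"{BASE}/state").json()
print("state ->", sorted(state))

# The baseline chart should be served as a static asset, not a broken link.
img = requests.get(f"{BASE}/assets/baseline_comparison.png")
assert img.status_code == 200
assert img.headers["Content-Type"].startswith("image/")
```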
**Status:** Pending
## Phase 4 - Learning Evidence

### Goal
Make reward improvement impossible to miss.
### Outputs
- random vs heuristic vs oracle-lite comparison
- visible completion, detection, calibration, efficiency metrics
- onsite GRPO / Unsloth reward curve
- trained vs untrained comparison block
### Done means
- judges can see measurable improvement in one screen and one README section
- there is a visible path from baseline -> better policy -> trained model
### Verification

- `training/evaluate.py` outputs are committed and linked
- onsite curve is committed once available
- numbers shown in UI and README match evaluation artifacts (cross-checked by the sketch below)
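One way to keep the UI and README numbers honest is to print them straight from a committed artifact. A sketch assuming `training/evaluate.py` writes a JSON summary at a path like `training/eval_summary.json`; the file name, policy keys, and metric keys are all assumptions.

```python
# Cross-check README/UI numbers against committed evaluation artifacts.
# Assumes training/evaluate.py writes a JSON summary; path and keys are assumptions.
import json

with open("training/eval_summary.json") as f:  # hypothetical artifact path
    results = json.load(f)

# One row per baseline policy: random vs heuristic vs oracle-lite.
for policy in ("random", "heuristic", "oracle_lite"):
    m = results[policy]
    print(f"{policy:12s} "
          f"completion={m['completion']:.2f} "
          f"detection={m['detection']:.2f} "
          f"calibration={m['calibration']:.2f} "
          f"efficiency={m['efficiency']:.2f}")
```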
**Status:** Pending
## Phase 5 - Submission Pack

### Goal
Make the project submission-complete.
### Outputs

- final README with all links
  - HF Space link
  - Colab / training notebook link
  - blog or video link
  - screenshots and diagram links
  - reproduction commands
### Done means
- a judge can clone, run, inspect, and understand the project without asking for missing context
### Verification

- README links are live
- Space is live
- `openenv validate . --json` passes
- Docker build passes (both gates scripted in the sketch below)
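The last two gates can be run as one repeatable script so they are checked the same way every time. A minimal sketch using the commands named above; only exit codes are asserted, and the `-t sentinel` image tag is an assumption.

```python
# Repeatable submission gates: validator and Docker build must both pass.
# Commands come from the checklist above; the image tag is an assumption.
import subprocess

GATES = [
    ["openenv", "validate", ".", "--json"],
    ["docker", "build", "-t", "sentinel", "."],
]

for cmd in GATES:
    result = subprocess.run(cmd)
    assert result.returncode == 0, f"gate failed: {' '.join(cmd)}"

print("all submission gates passed")
```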
**Status:** Pending
## Phase 6 - Finale Pack

### Goal
Package the repo for the room, not just for the validator.
### Outputs
- 3-minute script
- 5 likely judge questions + answers
- backup screenshots
- fallback demo sequence
- one-click "killer moment" path
### Done means
- the pitch works even if the live environment is slow
- the trained-vs-baseline story is memorable
- the profile swap moment is rehearsed
### Verification
- demo path can be run without improvising architecture details
- every claim can be grounded in repo assets
**Status:** Pending
## Execution Order
Phase 1 -> Phase 2 -> Phase 3 -> Phase 4 -> Phase 5 -> Phase 6
## Next Immediate Build Target
Phase 1 and Phase 2 are the current active work.
Once both are fully stable in-repo, Phase 3 starts on top of them.