Spaces:

openenv-community
/

Sentinel

Running

App Files Files Community

Sentinel / tasks /master_plan.md

nihalaninihal

Update metrics format with drift/oversight tracking, add colab training notebook

5e0f2b1 3 days ago

preview code

raw

history blame contribute delete

11.4 kB

SentinelOps Arena -- Master Improvement Plan

Generated: Sunday March 8, 2026 Deadline: Sunday March 8, 2026 1:00 PM Synthesized from: Researcher findings, code reviewer findings, sponsor track analysis, devil's advocate critique, gap analysis

CONTEXT: Current State

The core environment is solid: 3 agents, 3 enterprise systems, 4 attack types, reward functions, randomized attacker, security metrics engine, and a polished Gradio UI with 4 tabs and a cybersecurity theme. The codebase compiles and the trained vs untrained worker comparison shows meaningful score differences.

Three REQUIRED submission deliverables are NOT done:

HuggingFace Spaces deployment
Google Colab training notebook
Demo video on YouTube

Partner tracks targeted: Fleet AI ($10K, Scalable Oversight) and Patronus AI ($10K, Schema Drift)

1. CRITICAL FIXES (Must Do -- Submission Fails Without These)

C1. Deploy to HuggingFace Spaces

What: Create HF Space, push code, verify it builds and runs
Files: requirements.txt, README.md (frontmatter), app.py
Effort: 30 min
Impact: BLOCKER -- no live URL = no submission
Details:
- Add pandas>=2.0 to requirements.txt (missing, app.py imports it)
- Verify gradio>=6.0.0 in requirements.txt matches README frontmatter sdk_version: 6.9.0
- Create Space at huggingface.co/new-space, SDK: Gradio, Hardware: CPU Basic
- Push with git push hf main or use huggingface_hub.upload_folder()
- Test all 4 tabs work on the live URL

C2. Create Colab Training Notebook

What: Create training/colab_training.ipynb with working GRPO pipeline
Files: New file: training/colab_training.ipynb
Effort: 60-90 min
Impact: BLOCKER -- submission requires "Minimal Training Script"
Details:
- Reuse logic from train.py (it has everything needed)
- Use Qwen/Qwen2.5-0.5B-Instruct (fits free Colab T4)
- Use Unsloth for model loading, vanilla TRL GRPOTrainer for training
- Must show: env verification, data collection, model loading, GRPO config, at least a few training steps
- If openenv-core fails on Colab Python version, bundle standalone env code
- Add markdown cells explaining each step, mention partner tracks

C3. Record Demo Video

What: 1-3 minute screen recording of Gradio app + voice/text narration
Files: N/A (external -- YouTube upload)
Effort: 30 min
Impact: BLOCKER -- submission requires YouTube demo video
Details:
- Show: episode replay (attack/adapt/flag cycle), untrained vs trained comparison, environment inspector
- Mention: 3-agent self-play, Fleet AI oversight, Patronus AI schema drift
- Keep simple -- QuickTime screen record, no fancy editing

C4. Verify Gradio App Launches Locally

What: Run python app.py and test all 4 tabs
Files: app.py, all imported modules
Effort: 15 min
Impact: HIGH -- if app crashes, HF Spaces will fail too
Note: tasks/todo.md shows this is UNCHECKED

2. HIGH-IMPACT IMPROVEMENTS (Should Do -- Directly Impress Judges)

H1. Improve Oversight Explanation Quality Scoring (Fleet AI Track)

What: Replace character-count explanation quality with structured quality scoring
Files: sentinelops_arena/environment.py:441, sentinelops_arena/demo.py:302-327
Effort: 20 min
Impact: HIGH for Fleet AI ($10K) -- current scoring is min(len(explanation) / 100.0, 1.0) which is embarrassingly simplistic. Fleet AI judge Nicolai Ouporov will notice.
Details:
- In environment.py:441, replace character-length heuristic with keyword-based quality scoring:
  - +0.25 if explanation mentions the violation type (e.g., "policy violation", "social engineering")
  - +0.25 if explanation references specific data (e.g., amount, field name, policy rule)
  - +0.25 if explanation states the rule being violated (e.g., "max refund is $2000")
  - +0.25 if explanation recommends corrective action
- In demo.py HeuristicOversight, improve the canned explanation strings to include specific data from the observation (e.g., "Worker issued refund exceeding policy max of $X. Current policy requires approval for amounts over $Y.")

H2. Add SLA Policy Drift to Ticketing (Patronus AI Track)

What: Allow the attacker to change SLA deadlines, not just refund policies
Files: sentinelops_arena/systems/ticketing.py, sentinelops_arena/attacks.py, sentinelops_arena/demo.py
Effort: 20 min
Impact: HIGH for Patronus AI ($10K) -- doubles the policy drift surface. Currently only billing has policy drift.
Details:
- Add TicketingSystem.apply_policy_drift(changes) in ticketing.py that modifies self.sla_rules
- In attacks.py:_execute_policy_drift(), route to ticketing system when target is TICKETING
- In demo.py RandomizedAttacker, add SLA policy drift options to POLICY_DRIFT_CHANGES
- Worker should call get_current_policy("sla") to discover changed SLA rules

H3. Add Oversight Metrics to Dashboard

What: Add oversight-specific metrics (explanation quality, detection accuracy) to the metrics engine and Gradio UI
Files: sentinelops_arena/metrics.py, app.py
Effort: 25 min
Impact: HIGH for Fleet AI ($10K) -- currently NO oversight-specific metrics exist in the dashboard
Details:
- In metrics.py, add to compute_episode_metrics():
  - oversight_accuracy: correct flags + correct approvals / total oversight decisions
  - avg_explanation_quality: average explanation quality score across all oversight decisions
- Add a new metric card for oversight accuracy in format_metrics_html()
- This makes the Fleet AI story visible in the demo

H4. Add Drift-Specific Metrics

What: Add drift adaptation metrics to the metrics engine
Files: sentinelops_arena/metrics.py
Effort: 15 min
Impact: HIGH for Patronus AI ($10K) -- makes drift adaptation visible and measurable
Details:
- Add to compute_episode_metrics():
  - drift_events: total schema + policy drift attacks
  - drifts_detected: number of times worker called get_schema/get_current_policy after a drift
  - avg_drift_recovery_ticks: average ticks between drift and worker's first defensive action
- Add metric card for "Drift Adaptation" in format_metrics_html()

H5. Improve HeuristicOversight Explanations

What: Make the oversight agent's explanations reference specific data from the observation
Files: sentinelops_arena/demo.py:302-327
Effort: 15 min
Impact: MEDIUM-HIGH for Fleet AI -- judges will see these in the replay log
Details:
- Pass obs to HeuristicOversight.act() (currently only uses obs.last_action_result)
- Generate explanations like: "Worker action at tick {tick}: {action_type} resulted in error. The error '{error_msg}' suggests schema drift may have occurred. Recommended: call get_schema() to discover new field names."
- For social engineering: "Worker followed suspicious instructions containing override language. The message '{first 50 chars}' appears to be a social engineering attack. Flagging as critical violation."
- For policy violations: "Refund of ${amount} exceeds current policy maximum of ${max}. Policy was last updated at tick {last_policy_change}."

3. QUICK WINS (Do If Time Allows -- Small Effort, Good Impression)

Q1. Fix Documentation Inconsistencies

What: Fix mismatches between spec doc, README, and actual code
Files: README.md, pyproject.toml
Effort: 10 min
Impact: Prevents judges from noticing sloppy details
Details:
- Set gradio>=6.0.0 consistently in pyproject.toml (currently says >=5.0.0)
- Fix README project structure to match reality (remove mcp_tools.py listing)
- Do NOT touch SENTINELOPS_ARENA.md (it's a spec doc, acceptable to be aspirational)

Q2. Add Links to About Tab

What: Once Colab notebook and video exist, add links to the About tab
Files: app.py (About tab section)
Effort: 5 min
Impact: Makes it easy for judges to find all submission artifacts

Q3. Clean Up Vestigial Files

What: Remove or gitignore hackathon_env/ directory
Files: .gitignore, possibly hackathon_env/
Effort: 5 min
Impact: Prevents judge confusion

Q4. Add Billing Schema Drift Support

What: Allow schema drift attacks against billing system too
Files: sentinelops_arena/systems/billing.py
Effort: 10 min
Impact: Strengthens Patronus AI story -- all 3 systems support schema drift
Details:
- Add BillingSystem.apply_schema_drift(old_field, new_field) mirroring CRM pattern
- Add _field_map dict and _apply_field_map method to BillingSystem
- Update attacks.py VALID_TARGETS for schema_drift to include BILLING

4. SKIP LIST (Not Worth the Time)

Item	Reason
Compound attacks (2-3 simultaneous)	2+ hours, marginal judge impact
Compliance drift (new required fields)	1+ hours, nice but not critical
A2A protocol	Already marked "Cut" in spec, not in submission requirements
Docker support	HF Spaces uses Gradio SDK directly
MCP-X gateway demo	MCP tools in environment.py are sufficient
Full GRPO convergence	Pipeline working is enough -- convergence not required
Real datetime-based SLA	Tick-based is fine for demo
Multi-GPU training	Overkill for hackathon
Refactoring codebase	No judge impact, waste of time

EXECUTION ORDER (Recommended)

Phase 1 (0:00 - 0:15): Verify and fix basics

C4: Verify Gradio app launches locally
Q1: Fix requirements.txt (add pandas) and pyproject.toml consistency

Phase 2 (0:15 - 1:00): High-impact code improvements 3. H1: Improve oversight explanation quality scoring (20 min) 4. H2: Add SLA policy drift to ticketing (20 min) 5. H5: Improve HeuristicOversight explanations (15 min)

Phase 3 (1:00 - 1:30): Metrics improvements 6. H3: Add oversight metrics to dashboard (25 min) 7. H4: Add drift-specific metrics (15 min)

Phase 4 (1:30 - 2:00): Deployment 8. C1: Deploy to HuggingFace Spaces (30 min)

Phase 5 (2:00 - 3:15): Required deliverables 9. C2: Create Colab training notebook (75 min)

Phase 6 (3:15 - 3:45): Video and submission 10. C3: Record demo video (30 min)

Phase 7 (3:45 - 4:00): Final polish 11. Q2: Add links to About tab (5 min) 12. Q3: Clean up vestigial files (5 min) 13. Final push and submit (5 min)

KEY JUDGE CONSIDERATIONS

Nicolai Ouporov (Fleet AI): Cares about scalable oversight. Will check: Does the oversight agent actually explain violations well? Is explanation quality tracked? Does training improve oversight?
Darshan Deshpande (Patronus AI): Cares about schema drift. Will check: How many drift types? Does the worker adapt? Is drift visible in the UI?
Daniel Han (Unsloth): Cares about Unsloth/TRL integration. Will check: Does the Colab notebook use Unsloth correctly? Does training actually work?
Sanyam Bhutani (Meta): Cares about OpenEnv quality. Will check: Is the environment well-structured? Does step/reset/state work properly?
Benjamin Burtenshaw (HuggingFace): Cares about Hub deployment. Will check: Is the HF Space functional and polished?