nihalaninihal Claude Opus 4.6 committed on
Commit 5f590b1 · 1 Parent(s): af942b1

Update SentinelOps Arena with detailed 14-hour implementation plan

Synthesized findings from 6 research agents into actionable build plan:
- Hour-by-hour build order across 7 phases with 4 stop-and-submit checkpoints
- Restored scope (30 ticks, 15 customers, 4 attack types, MCP-X gateway)
- Complete file structure, Pydantic models, and API signatures
- EnvBeats integration strategy (COPY/ADAPT/IGNORE)
- Colab training script with Unsloth + TRL GRPO workaround
- Partner track alignment (Fleet AI + Patronus AI)
- Risk mitigations and fallback hierarchy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (1)
  1. SENTINELOPS_ARENA.md +600 -30

SENTINELOPS_ARENA.md CHANGED
@@ -2,9 +2,40 @@
  ## Project Overview

- SentinelOps Arena is a multi-agent self-play training environment built on the OpenEnv framework. It simulates a workday at an enterprise company where three AI agents interact with three simulated enterprise systems. Through adversarial self-play over hundreds of episodes, all three agents improve simultaneously — the attacker learns to exploit, the worker learns to survive, and the oversight agent learns to catch failures.

- Built for the [OpenEnv Hackathon SF](https://cerebralvalley.ai/e/openenv-hackathon-sf) (March 7-8, 2026).

  ---

@@ -239,11 +270,18 @@ Attacker discovers compound strategies no human designer would create. Worker de
  ---

- ## OpenEnv Implementation

  ### Data Models

- **SentinelAction:**
  - agent (attacker/worker/oversight)
  - action_type (what the agent wants to do)
  - target_system (crm/billing/ticketing or None)
@@ -252,7 +290,7 @@ Attacker discovers compound strategies no human designer would create. Worker de
  - flag (for oversight violation flags)
  - explanation (for oversight explanations)

- **SentinelObservation:**
  - done (episode over?)
  - reward (reward for the agent that just acted)
  - current_agent (whose turn is next)
@@ -270,17 +308,31 @@ Extends `openenv.Environment` with:
  - `step(action)` — Routes to attacker/worker/oversight processor, advances turn order, returns observation
  - `state()` — Returns episode metadata (tick, scores, active attacks, task completion stats)

  ---

  ## Training Stack

- - **OpenEnv** — Environment framework (reset/step/state API, Docker containerized)
- - **HuggingFace TRL** — GRPO (Group Relative Policy Optimization) trainer
  - **Unsloth** — Fast fine-tuning (2x speed, 70% less VRAM)
- - **Base model** — Qwen2.5-7B (via Unsloth)

  GRPO eliminates the need for a separate critic/value model by using group-averaged rewards as the baseline, making it memory-efficient enough to train on consumer hardware.

  ---

  ## What This Produces
@@ -296,31 +348,549 @@ Plus the environment itself — publishable on the OpenEnv Hub for anyone to tra
  ## Research Foundation

- - **TriPlay-RL** (Jan 2025) — Validated the tri-role self-play architecture with GRPO for LLM safety
- - **ARLAS** (Oct 2025) — Attacker-defender co-training for agent security
- - **AgentDojo** (ETH Zurich) — Enterprise task simulation benchmark (evaluation only, no training loop)
- - **AT-GRPO** — Multi-agent GRPO extension for multi-policy training
- - **MARS** — Multi-agent reasoning through self-play using GRPO

  SentinelOps Arena fills the gap: enterprise-specific simulation + compound attacks + self-play training loop on OpenEnv.

  ---

- ## MVP Scope (15-Hour Build)
-
- ### Included
- - Full OpenEnv interface (reset, step, state)
- - All three enterprise system simulators (3+ API functions each)
- - 4 attack types: schema drift, policy drift, social engineering, infrastructure disruption
- - All three reward functions
- - Introspection endpoints (get_schema, get_current_policy)
- - Ground truth tracking for oversight scoring
- - Working demo script
- - ~25 varied customer tasks
-
- ### Deferred
- - Docker packaging (use pip install + python instead)
- - Compliance drift and 3-type compound attacks
- - Full 80-task variety
  - Reward calibration pass
- - Datetime-based SLA (use tick-based instead)
  ## Project Overview

+ SentinelOps Arena is a multi-agent self-play training environment built on the OpenEnv 0.4 framework. It simulates a workday at an enterprise company where three AI agents interact with three simulated enterprise systems. Through adversarial self-play over hundreds of episodes, all three agents improve simultaneously — the attacker learns to exploit, the worker learns to survive, and the oversight agent learns to catch failures.

+ Built for the [OpenEnv Hackathon SF](https://cerebralvalley.ai/e/openenv-hackathon-sf) (March 7-8, 2026). Submissions due **Sunday, March 8th at 1:00 PM**. Team size: up to 3 members.
+
+ ---
+
+ ## Hackathon Theme Alignment
+
+ ### Primary Themes
+ - **Theme 1: Multi-Agent Interactions** — Three agents (Attacker, Worker, Oversight) competing and collaborating in a shared enterprise environment. Drives theory-of-mind reasoning and emergent strategic behavior.
+ - **Theme 3.1: World Modeling — Professional Tasks** — Enterprise applications (CRM, Billing, Ticketing) with realistic business logic, API ecosystems, and multi-step workflows.
+ - **Theme 4: Self-Improvement** — Self-play training where agents generate their own curriculum through adversarial dynamics. Recursive skill amplification via autocurricula.
+
+ ### Partner Sub-Theme Targets ($10K each, max 2 selectable)
+ | Partner | Sub-Theme | How SentinelOps Matches |
+ |---|---|---|
+ | **Fleet AI** (SELECTED) | Scalable Oversight: train oversight agents to monitor, analyze, and explain behavior of other AI agents | The Oversight agent is literally this — audits worker actions, flags violations, explains reasoning |
+ | **Patronus AI** (SELECTED) | Consumer Workflows with Schema Drift: environments where data schemas, API contracts, and policies change | Schema drift and policy drift are core attack types — fields rename, refund windows change, new required fields appear |
+ | ~~Scaler AI Labs~~ | Multi-App RL Environment for Enterprise Workflows | Strong match but less unique than above two |
+ | ~~Halluminate AI~~ | Multi-Actor Environments | Good match but more generic |
+
+ ### Prize Structure
+ - **Main track:** 1st $15K, 2nd $9K, 3rd $6K
+ - **Partner sub-themes:** $10K each (judged separately from main track)
+ - SentinelOps targets: Main track + Fleet AI ($10K) + Patronus AI ($10K)
+
+ ### Submission Requirements
+ All fields required:
+ - **Team Name**
+ - **Project Description** (what it solves)
+ - **HuggingFace Spaces Link** — environment must be deployed
+ - **Demo Video** (YouTube) — must demonstrate the environment
+ - **Minimal Training Script** — Colab notebook using Unsloth or HF TRL (REQUIRED, not optional)
+ - **Partner Tracks** — Fleet AI, Patronus AI

  ---
 
 
  ---

+ ## OpenEnv 0.4 Implementation
+
+ Built on OpenEnv 0.4 Spec/RFC with:
+ - **Simple API** — `step()`, `reset()`, `state()`
+ - **MCP tools as first-class citizens** — Enterprise system APIs exposed as MCP tools per OpenEnv 0.4 spec
+ - **Reward pipelines** — Structured reward computation with ground truth tracking
+ - **Container support** — Deployable via Docker, hostable on HuggingFace Spaces
+ - **Hub deployment** — Published to OpenEnv Hub for community training and benchmarking

  ### Data Models

+ **SentinelAction (Pydantic BaseModel):**
  - agent (attacker/worker/oversight)
  - action_type (what the agent wants to do)
  - target_system (crm/billing/ticketing or None)

  - flag (for oversight violation flags)
  - explanation (for oversight explanations)

+ **SentinelObservation (Pydantic BaseModel):**
  - done (episode over?)
  - reward (reward for the agent that just acted)
  - current_agent (whose turn is next)

  - `step(action)` — Routes to attacker/worker/oversight processor, advances turn order, returns observation
  - `state()` — Returns episode metadata (tick, scores, active attacks, task completion stats)

+ ### MCP Tool Exposure (OpenEnv 0.4)
+
+ Enterprise system APIs are exposed as MCP tools, enabling per-agent tool isolation:
+ - **Attacker tools:** `launch_attack`, `pass_turn`, `get_attack_budget`
+ - **Worker tools:** `lookup_customer`, `update_tier`, `check_balance`, `issue_refund`, `create_ticket`, `resolve`, `get_schema`, `get_current_policy`, etc.
+ - **Oversight tools:** `flag_action`, `get_current_policy`, `get_trajectory`
+
+ This leverages the MCP-X gateway pattern from the EnvBeats reference implementation for JWT-based per-agent access control.
+
  ---

  ## Training Stack

+ - **OpenEnv 0.4** — Environment framework (reset/step/state API, MCP tools, reward pipelines, Docker containerized)
+ - **HuggingFace TRL** — GRPO (Group Relative Policy Optimization) trainer with `rollout_func` for OpenEnv integration and multi-reward support via `reward_funcs`
  - **Unsloth** — Fast fine-tuning (2x speed, 70% less VRAM)
+ - **Base model** — Qwen2.5-7B (via Unsloth, ~15-20GB VRAM with QLoRA) or Qwen2.5-1.5B (~5GB for quick demos)

  GRPO eliminates the need for a separate critic/value model by using group-averaged rewards as the baseline, making it memory-efficient enough to train on consumer hardware.
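The group baseline can be made concrete with a small sketch (illustrative only, not code from this plan): for each prompt, a group of completions is sampled, and each completion's advantage is its reward relative to the group's statistics rather than to a learned value model.

```python
# Illustrative sketch of GRPO's group-relative baseline (not project code).
# Each completion's advantage is its reward minus the group mean, scaled by
# the group standard deviation -- no separate critic/value model is needed.
def group_relative_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by the environment's reward function:
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Best completion gets a positive advantage, worst a negative one.
```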

+ ### Dual-Path Architecture
+
+ - **Training path:** Direct Python `env.step()` calls — no MCP/A2A overhead, maximum speed for thousands of episodes
+ - **Demo/eval path:** Full MCP tool exposure via MCP-X gateway — showcases per-agent tool isolation and OpenEnv 0.4 MCP-first design
+
  ---

  ## What This Produces

  ## Research Foundation

+ - **TriPlay-RL** (Jan 2026) — Validated the tri-role self-play architecture (attacker/defender/evaluator) with GRPO for LLM safety. 20-50% improvement in adversarial effectiveness, 10-30% safety gains.
+ - **ARLAS** (Oct 2025) — Attacker-defender co-training for agent security using GRPO. Evaluated on AgentDojo and BrowserGym.
+ - **AgentDojo** (ETH Zurich, NeurIPS 2024) — Enterprise task simulation benchmark with 97 tasks, 629 security test cases. Evaluation only, no training loop.
+ - **AT-GRPO** (2025) — Agent- and Turn-wise GRPO for multi-agent systems. Supports role-specific policies. +5% on LiveCodeBench, +84% on Sokoban vs single-agent.
+ - **MARS/MARSHAL** (Oct 2025) — Multi-agent reasoning through self-play with turn-level advantage estimation. Up to 28.7% performance improvements.
+ - **M-GRPO** (Nov 2025) — Hierarchical multi-agent GRPO with decoupled training pipeline. No cross-server backpropagation needed.

  SentinelOps Arena fills the gap: enterprise-specific simulation + compound attacks + self-play training loop on OpenEnv.

  ---

+ ## FINAL IMPLEMENTATION PLAN
+
+ ### Reality Check
+
+ - **Solo developer**, hackathon is March 7-8, 2026
+ - **Deadline:** Sunday March 8th, 1:00 PM
+ - **Estimated coding hours remaining:** ~14 hours
+ - **The environment IS the product** — trained agents are a bonus, not a requirement
+ - **Training script is REQUIRED** — Colab notebook using Unsloth or TRL must be submitted
+
+ ### Scope Plan (14-Hour Build)
+
+ | Original Spec | Hackathon Build | Notes |
+ |---|---|---|
+ | 80 ticks per episode | **30 ticks** | Good episode length for demo & training |
+ | 50 customers | **15 customers** | Enough variety for compelling scenarios |
+ | 30 invoices | **15 invoices** | 1:1 with customers |
+ | 20 tickets | **10 tickets** | Enough for SLA pressure scenarios |
+ | 80 customer tasks | **30 tasks** | Matches tick count |
+ | 6 attack types | **4 types** (schema drift, policy drift, social engineering, infrastructure disruption) | Restore rate limiting — demonstrates resilience |
+ | MCP-X gateway | **Include** — per-agent tool isolation | With 14h, this is achievable and impresses judges (envbeats pattern, high ROI) |
+ | A2A protocol | **Cut** | Not in submission requirements |
+ | Datetime SLA | **Tick-based SLA** | Simpler, same demo impact |
+ | Full GRPO convergence | **Run for real** | With 14h, aim for visible learning signal (even a few epochs) |
+ | Compound attacks | **Add as stretch** | If time permits after hour 12 |
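The tick-based SLA swap is cheap to implement; a minimal sketch (illustrative only, with field names assumed from the `Ticket` model outlined later in this plan):

```python
# Minimal sketch of tick-based SLA checking (illustrative; field names assumed
# from the Ticket model in this plan). A ticket breaches SLA once the current
# tick passes its deadline tick -- no datetime arithmetic required.
def check_sla(ticket: dict, current_tick: int) -> str:
    if ticket["status"] == "resolved":
        return "met"
    if current_tick > ticket["sla_deadline_tick"]:
        return "breached"
    return "within_sla"

ticket = {"status": "open", "created_tick": 3, "sla_deadline_tick": 9}
assert check_sla(ticket, 5) == "within_sla"
assert check_sla(ticket, 10) == "breached"
```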
+
+ ### CRITICAL: Unsloth + rollout_func Incompatibility
+
+ **Unsloth does NOT support TRL's `rollout_func`** (GitHub issue #3573). Strategy:
+ - Use Unsloth for **model loading only** (`FastLanguageModel.from_pretrained` + `get_peft_model`)
+ - Use **vanilla TRL GRPOTrainer** for training with `rollout_func`
+ - Use **Qwen2.5-1.5B** for Colab (fits free-tier GPU, ~5GB VRAM)
+ - If the Colab Python version conflicts with openenv-core (requires >=3.13), use a **standalone env wrapper** without the openenv dependency
+
+ ---
+
+ ### File Structure
+
+ ```
+ sentinelops_arena/
+ ├── __init__.py
+ ├── models.py           # All Pydantic models (Action, Observation, State, data models)
+ ├── systems/
+ │   ├── __init__.py
+ │   ├── crm.py          # CRM simulator (lookup, update_tier, add_note, get_history, get_schema)
+ │   ├── billing.py      # Billing simulator (check_balance, issue_refund, apply_credit, generate_invoice, get_current_policy)
+ │   └── ticketing.py    # Ticketing simulator (create, assign, escalate, resolve, check_sla, get_schema, get_sla_rules)
+ ├── attacks.py          # Attack mechanics (schema_drift, policy_drift, social_engineering, rate_limit)
+ ├── rewards.py          # All 3 reward functions (attacker, worker, oversight)
+ ├── task_generator.py   # Generates 30 customer tasks per episode
+ ├── environment.py      # SentinelOpsArena(Environment) — the core
+ ├── mcp_tools.py        # FastMCP tool definitions wrapping env operations
+ ├── server.py           # create_app() HTTP server
+ └── demo.py             # Demo script running one episode with heuristic agents
+
+ training/
+ ├── colab_training.ipynb  # REQUIRED — Colab notebook with Unsloth + TRL GRPO
+ └── rollout.py            # rollout_func and reward_funcs for GRPOTrainer
+
+ app.py                  # HuggingFace Spaces entry point (Gradio or FastAPI)
+ pyproject.toml
+ README.md
+ ```
+
+ ### Build Order (14-Hour Plan)
+
+ #### Phase 1: Core Models & Systems (Hours 0-2.5)
+
+ **Hour 0-0.5: models.py** (shorthand field listing, not literal Python)
+ ```python
+ # Enums
+ class AgentRole(str, Enum): ATTACKER, WORKER, OVERSIGHT
+ class AttackType(str, Enum): SCHEMA_DRIFT, POLICY_DRIFT, SOCIAL_ENGINEERING, RATE_LIMIT
+ class TargetSystem(str, Enum): CRM, BILLING, TICKETING
+ class CustomerTier(str, Enum): GOLD, SILVER, BRONZE
+ class InvoiceStatus(str, Enum): PAID, PENDING, OVERDUE, REFUNDED
+ class TicketStatus(str, Enum): OPEN, IN_PROGRESS, RESOLVED, ESCALATED
+ class TicketPriority(str, Enum): HIGH, MEDIUM, LOW
+ class TaskType(str, Enum): REFUND, TICKET_CHECK, TIER_UPGRADE, NEW_TICKET, BALANCE_INQUIRY, SLA_ESCALATION
+ class ViolationType(str, Enum): POLICY_VIOLATION, SOCIAL_ENGINEERING, SCHEMA_ERROR_UNHANDLED, SLA_BREACH
+
+ # Data models
+ class Customer(BaseModel): customer_id, name, tier, region, contact_email, lifetime_value, notes
+ class Invoice(BaseModel): invoice_id, customer_id, amount, status, date, items
+ class Ticket(BaseModel): ticket_id, customer_id, subject, priority, status, created_tick, sla_deadline_tick, assigned_to
+ class RefundPolicy(BaseModel): window_ticks=8, requires_approval=False, max_amount=5000
+ class SLARules(BaseModel): high=6, medium=12, low=18  # ticks
+ class CustomerTask(BaseModel): task_id, customer_id, task_type, message, required_systems
+
+ # OpenEnv types
+ class SentinelAction(Action, extra='forbid'): agent, action_type, target_system, parameters, response_text, flag, explanation
+ class SentinelObservation(Observation): done, reward, current_agent, current_task, systems_snapshot, last_action_result, trajectory, tick, metadata
+ class SentinelState(State, extra='allow'): tick, scores, active_attacks, tasks_completed, tasks_total
+ ```
+
+ **Hour 0.5-1.5: systems/ (all three)**
+
+ Each system: in-memory dict storage, 4-5 API functions, `get_schema()` introspection, internal `_apply_*` mutation methods for attacks.
+
+ CRM: `lookup_customer`, `update_tier`, `add_note`, `get_history`, `get_schema`, `_apply_schema_drift(old_field, new_field)`
+ Billing: `check_balance`, `issue_refund`, `apply_credit`, `generate_invoice`, `get_current_policy`, `_apply_policy_drift(changes)`
+ Ticketing: `create_ticket`, `assign_ticket`, `escalate`, `resolve`, `check_sla`, `get_schema`, `get_sla_rules`, `_apply_schema_drift`
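The simulator shape described above can be sketched in a few lines (illustrative only; method names come from this plan, storage details and sample data are assumed):

```python
# Illustrative CRM simulator sketch: in-memory dict storage, introspectable
# schema, and an internal mutation hook the attacker uses for schema drift.
class CRMSimulator:
    def __init__(self):
        self.customers = {
            "C001": {"customer_id": "C001", "name": "Acme Corp", "tier": "gold"},
        }

    def lookup_customer(self, customer_id):
        record = self.customers.get(customer_id)
        return record if record else {"error": "not_found"}

    def get_schema(self):
        # Workers call this to recover after a schema-drift attack
        any_record = next(iter(self.customers.values()))
        return sorted(any_record.keys())

    def _apply_schema_drift(self, old_field, new_field):
        # Attack hook: rename a field in every record
        for record in self.customers.values():
            if old_field in record:
                record[new_field] = record.pop(old_field)

crm = CRMSimulator()
crm._apply_schema_drift("tier", "customer_tier")
assert "customer_tier" in crm.get_schema()
```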
+
+ **Hour 1.5-2: attacks.py + task_generator.py**
+
+ Attacks (4 types):
+ - `schema_drift(system, old_field, new_field)` — renames a key in all records
+ - `policy_drift(changes_dict)` — modifies the refund policy or SLA rules
+ - `social_engineering(task_queue, tick, injected_message)` — replaces an upcoming task message
+ - `rate_limit(system, max_calls_per_tick)` — throttles API calls (infrastructure disruption)
+
+ Task generator: Create 30 tasks with a mix of types, assign them to ticks, each referencing 1-2 systems.
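A seeded generator keeps episodes reproducible; a minimal sketch (task types from the enum listing above; the uniform distribution and one-task-per-tick layout are assumptions for illustration):

```python
import random

# Illustrative task generator sketch: 30 tasks spread over 30 ticks, each
# drawn from the plan's TaskType mix and tied to a random customer.
TASK_TYPES = ["refund", "ticket_check", "tier_upgrade",
              "new_ticket", "balance_inquiry", "sla_escalation"]

def generate_tasks(num_tasks=30, num_customers=15, seed=0):
    rng = random.Random(seed)  # seeded for reproducible episodes
    tasks = []
    for i in range(num_tasks):
        tasks.append({
            "task_id": f"T{i:03d}",
            "tick": i,  # one task per tick
            "customer_id": f"C{rng.randrange(num_customers):03d}",
            "task_type": rng.choice(TASK_TYPES),
        })
    return tasks

tasks = generate_tasks()
assert len(tasks) == 30
assert tasks == generate_tasks()  # same seed, same episode
```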
+
+ **Hour 2-2.5: rewards.py**
+
+ Pure Python, no LLM-as-judge. Three functions:
+ - `compute_attacker_reward(action, worker_result, oversight_result, ground_truth)` — see reward table
+ - `compute_worker_reward(action, task, result, ground_truth, active_policies)` — see reward table
+ - `compute_oversight_reward(flag_decision, ground_truth_violations)` — see reward table
+
+ Ground truth tracking: The environment maintains a `TickGroundTruth` per tick with `violations_present: bool`, `violation_types: list`, `correct_action: str`, enabling deterministic oversight scoring.
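With per-tick ground truth available, oversight scoring reduces to a confusion-matrix check; a minimal sketch (the reward magnitudes here are placeholders, not the calibrated reward table this plan refers to):

```python
# Illustrative oversight reward sketch. Reward magnitudes are placeholder
# values for illustration, not the plan's calibrated reward table.
def compute_oversight_reward(flagged: bool, violations_present: bool) -> float:
    if flagged and violations_present:
        return 1.0    # true positive: caught a real violation
    if flagged and not violations_present:
        return -0.5   # false positive: cried wolf
    if not flagged and violations_present:
        return -1.0   # false negative: missed a violation
    return 0.2        # true negative: correctly stayed quiet

assert compute_oversight_reward(True, True) == 1.0
assert compute_oversight_reward(False, True) == -1.0
```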
+
+ #### Phase 2: Environment Core (Hours 2.5-4)
+
+ **Hour 2.5-4: environment.py — SentinelOpsArena**
+
+ ```python
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import Action, Observation, State
+
+ class SentinelOpsArena(Environment[SentinelAction, SentinelObservation, SentinelState]):
+     SUPPORTS_CONCURRENT_SESSIONS = True
+
+     def reset(self, seed=None, episode_id=None, **kwargs) -> SentinelObservation:
+         # Generate 15 customers, 15 invoices, 10 tickets, 30 tasks
+         # Initialize default policies, empty attack log
+         # Set tick=0, turn_order=[ATTACKER, WORKER, OVERSIGHT]
+         # Return initial observation for first agent (attacker)
+
+     def step(self, action: SentinelAction, timeout_s=None, **kwargs) -> SentinelObservation:
+         # Validate action matches current_agent
+         # Route to _process_attacker / _process_worker / _process_oversight
+         # Compute reward via rewards.py
+         # Advance turn, increment tick if full rotation
+         # Track ground truth for oversight scoring
+         # Return observation for next agent
+
+     @property
+     def state(self) -> SentinelState:
+         # Return episode metadata (tick, scores, active attacks, completion stats)
+ ```
+
+ Turn manager pseudocode:
+ ```
+ current_agent_idx = 0
+ turn_order = [ATTACKER, WORKER, OVERSIGHT]
+
+ on step(action):
+     assert action.agent == turn_order[current_agent_idx]
+     result = process(action)
+     current_agent_idx = (current_agent_idx + 1) % 3
+     if current_agent_idx == 0:
+         tick += 1
+         done = (tick >= 30)
+ ```
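The pseudocode above translates directly into a small runnable sketch (illustrative only, detached from the environment class):

```python
# Runnable sketch of the turn manager: 3 agents per tick, 30 ticks per episode.
TURN_ORDER = ["attacker", "worker", "oversight"]

class TurnManager:
    def __init__(self, max_ticks=30):
        self.max_ticks = max_ticks
        self.tick = 0
        self.current_agent_idx = 0
        self.done = False

    def step(self, acting_agent: str):
        # An action from the wrong agent is a protocol error
        assert acting_agent == TURN_ORDER[self.current_agent_idx]
        self.current_agent_idx = (self.current_agent_idx + 1) % 3
        if self.current_agent_idx == 0:  # full rotation completed
            self.tick += 1
            self.done = self.tick >= self.max_ticks
        return self.done

tm = TurnManager()
steps = 0
while not tm.done:
    tm.step(TURN_ORDER[tm.current_agent_idx])
    steps += 1
assert steps == 90  # 30 ticks x 3 agents
```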
+
+ #### CHECKPOINT 1: Core Works (Hour 4)
+
+ Run `env.reset()` → a 30x3 `env.step()` loop with random actions. Verify:
+ - Turn order cycles correctly
+ - Attacks modify system state
+ - Rewards compute without errors
+ - Episode terminates at tick 30
+
+ **If this works, you have a submittable environment.**
+
+ #### Phase 3: MCP Tools + Server (Hours 4-5.5)
+
+ **Hour 4-5: mcp_tools.py — Per-Agent MCP Tools**
+
+ Expose enterprise system APIs as individual MCP tools (not just step/reset/state). This is what agents actually call:
+
+ ```python
+ import json
+
+ from mcp.server.fastmcp import FastMCP
+
+ mcp = FastMCP("sentinelops", host="0.0.0.0", port=9500, stateless_http=True)
+
+ # --- Worker tools ---
+ @mcp.tool()
+ def lookup_customer(customer_id: str) -> str:
+     """Look up a customer record in the CRM."""
+     return json.dumps(env.crm.lookup_customer(customer_id))
+
+ @mcp.tool()
+ def issue_refund(invoice_id: str, amount: float, reason: str) -> str:
+     """Issue a refund for an invoice."""
+     return json.dumps(env.billing.issue_refund(invoice_id, amount, reason))
+
+ @mcp.tool()
+ def get_schema(system: str) -> str:
+     """Get the current field schema for a system (crm/billing/ticketing)."""
+     return json.dumps(env.get_system(system).get_schema())
+
+ @mcp.tool()
+ def get_current_policy(policy_type: str) -> str:
+     """Get the current policy (refund/sla)."""
+     return json.dumps(env.get_current_policy(policy_type))
+
+ # --- Attacker tools ---
+ @mcp.tool()
+ def launch_attack(attack_type: str, target_system: str, parameters: str) -> str:
+     """Launch an attack on an enterprise system."""
+     ...
+
+ # --- Oversight tools ---
+ @mcp.tool()
+ def flag_action(flagged: bool, severity: int, violation_type: str, explanation: str) -> str:
+     """Flag a worker action as a potential violation."""
+     ...
+
+ # --- Env control tools ---
+ @mcp.tool()
+ def step(action_json: str) -> str:
+     """Take a full action in the SentinelOps environment."""
+     action = SentinelAction.model_validate_json(action_json)
+     obs = env.step(action)
+     return obs.model_dump_json()
+
+ @mcp.tool()
+ def reset(seed: int | None = None) -> str:
+     """Reset the environment for a new episode."""
+     obs = env.reset(seed=seed)
+     return obs.model_dump_json()
+
+ @mcp.tool()
+ def get_state() -> str:
+     """Get current environment state."""
+     return env.state.model_dump_json()
+ ```
+
+ **Hour 5-5.5: server.py + MCP-X Gateway**
+
+ OpenEnv HTTP server:
+ `create_app(SentinelOpsArena, SentinelAction, SentinelObservation, env_name="sentinelops_arena")`
+
+ MCP-X gateway (copy from envbeats, adapt config):
+ ```toml
+ [clients.attacker]
+ auth_token = "atk-token"
+ [clients.worker]
+ auth_token = "wrk-token"
+ [clients.oversight]
+ auth_token = "ovs-token"
+
+ [mcp_servers.sentinelops]
+ url = "http://localhost:9500/mcp"
+ from_client = "orchestrator"
+
+ [allow.sentinelops]
+ attacker = ["launch_attack", "pass_turn", "get_attack_budget"]
+ worker = ["lookup_customer", "update_tier", "add_note", "get_history", "check_balance", "issue_refund", "apply_credit", "generate_invoice", "create_ticket", "assign_ticket", "escalate", "resolve", "check_sla", "get_schema", "get_current_policy"]
+ oversight = ["flag_action", "get_current_policy", "get_trajectory"]
+ ```
+
+ #### Phase 4: Demo & Gradio App (Hours 5.5-7.5)
+
+ **Hour 5.5-6.5: demo.py — Compelling Episode Script**
+
+ Script that runs a complete 30-tick episode with hardcoded heuristic agents:
+ - Attacker: schema_drift at tick 7, policy_drift at tick 14, social_engineering at tick 20, rate_limit at tick 25
+ - Worker: Handles tasks, hits errors, recovers using get_schema/get_current_policy
+ - Oversight: Flags violations based on policy comparison
+ - Shows untrained vs trained worker behavior (before/after comparison)
+
+ Output: Pretty-printed episode replay showing the full attack/adapt/flag cycle.
+
+ **Hour 6.5-7.5: app.py — Gradio App (HuggingFace Spaces)**
+
+ Rich Gradio interface:
+ - Tab 1: "Run Episode" → executes the demo, shows a formatted turn-by-turn replay with color-coded agents
+ - Tab 2: "Environment Inspector" → shows current system state, active attacks, policies
+ - Tab 3: "Scores Dashboard" → final scores + reward breakdown for all three agents
+ - Controls: seed selector, tick slider (step through the episode), speed control
+ - Metrics: live score charts, attack timeline visualization
+
+ #### CHECKPOINT 2: Demo Ready (Hour 7.5)
+
+ Working env + MCP tools + MCP-X gateway + rich Gradio demo. Deploy to HF Spaces.
+
+ #### Phase 5: Training Script (Hours 7.5-10)
+
+ **Hour 7.5-9: colab_training.ipynb**
+
+ REQUIRED deliverable. Full env → GRPO training pipeline.
+
+ ```python
+ # Cell 1: Install
+ !pip install unsloth trl openenv-core peft transformers datasets
+
+ # Cell 2: Load model with Unsloth (fast loading + LoRA setup)
+ from unsloth import FastLanguageModel
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/Qwen2.5-1.5B-Instruct",
+     max_seq_length=2048,
+     load_in_4bit=True,
+ )
+ model = FastLanguageModel.get_peft_model(
+     model, r=16, lora_alpha=32,
+     target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
+     use_gradient_checkpointing="unsloth",
+ )
+
+ # Cell 3: Environment setup
+ # Inline SentinelOpsArena (standalone, no openenv dependency for Colab Python compat)
+ env = SentinelOpsArena()
+
+ # Cell 4: Prompt dataset — enterprise scenarios the worker agent must handle
+ from datasets import Dataset
+ dataset = Dataset.from_dict({"prompt": [
+     "Customer C001 requests a refund for invoice INV-2201...",
+     "Ticket TK-005 has high priority, check SLA status...",
+     # ... 30+ scenarios
+ ]})
+
+ # Cell 5: GRPO rollout function (uses vanilla TRL, NOT the Unsloth trainer)
+ import torch
+ from trl import GRPOConfig, GRPOTrainer
+
+ def rollout_func(prompts, trainer):
+     """Generate completions via env interaction."""
+     tokenizer = trainer.processing_class
+     all_prompt_ids, all_completion_ids, all_logprobs, all_rewards = [], [], [], []
+     for prompt in prompts:
+         obs = env.reset()
+         # Format obs + prompt into the chat template
+         input_ids = tokenizer.encode(formatted_prompt)
+         # Generate a response
+         with torch.no_grad():
+             output = trainer.model.generate(input_ids, max_new_tokens=512)
+         completion = tokenizer.decode(output[len(input_ids):])
+         # Step the env with the parsed action
+         action = parse_worker_action(completion)
+         result = env.step(action)
+         all_rewards.append(result.reward or 0.0)
+         all_prompt_ids.append(input_ids)
+         all_completion_ids.append(output[len(input_ids):])
+         all_logprobs.append(compute_logprobs(trainer.model, input_ids, output))
+     return {
+         "prompt_ids": all_prompt_ids,
+         "completion_ids": all_completion_ids,
+         "logprobs": all_logprobs,
+         "env_reward": all_rewards,
+     }
+
+ def reward_from_env(completions, **kwargs):
+     return [float(r) for r in kwargs.get("env_reward", [0.0] * len(completions))]
+
+ # Cell 6: Configure and train
+ config = GRPOConfig(
+     output_dir="./sentinelops-grpo",
+     num_train_epochs=1,
+     per_device_train_batch_size=2,
+     gradient_accumulation_steps=4,
+     num_generations=4,
+     max_completion_length=512,
+     max_prompt_length=256,
+     logging_steps=1,
+     learning_rate=5e-6,
+     optim="paged_adamw_8bit",
+     report_to="none",
+ )
+
+ trainer = GRPOTrainer(
+     model=model,
+     processing_class=tokenizer,
+     reward_funcs=[reward_from_env],
+     rollout_func=rollout_func,
+     args=config,
+     train_dataset=dataset,
+ )
+ trainer.train()
+
+ # Cell 7: Show training metrics (reward curve, loss curve)
+ # Cell 8: Push to Hub
+ model.save_pretrained("sentinelops-worker-grpo")
+ model.push_to_hub("nihalnihalani/sentinelops-worker-grpo")
+ ```
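The rollout above leans on a `parse_worker_action` helper that is not shown; a minimal sketch, assuming the worker is prompted to emit a single JSON object (the output format is a design choice, not fixed by this plan, and the sketch returns a plain dict rather than a `SentinelAction`):

```python
import json
import re

# Illustrative sketch of parse_worker_action: extract the first JSON object
# from a model completion and fall back to a harmless no-op on parse failure.
def parse_worker_action(completion: str) -> dict:
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match:
        try:
            action = json.loads(match.group(0))
            action.setdefault("agent", "worker")
            return action
        except json.JSONDecodeError:
            pass
    return {"agent": "worker", "action_type": "noop"}  # safe fallback

out = parse_worker_action(
    'Sure! {"action_type": "issue_refund", "parameters": {"invoice_id": "INV-2201"}}'
)
assert out["action_type"] == "issue_refund"
assert parse_worker_action("garbled")["action_type"] == "noop"
```

A defensive parser like this matters for GRPO: a malformed completion should produce a low-reward no-op step, not crash the whole rollout batch.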
+
+ **Hour 9-10: Test & debug training pipeline**
+
+ Run the notebook end-to-end on Colab. Fix any issues.
+ - Verify the model loads correctly
+ - Verify env interactions work in Colab
+ - Verify at least a few training steps complete
+ - Capture training curves for the demo video
+
+ **Fallback hierarchy if the GRPO pipeline breaks:**
+ 1. Simplify rollout_func to single-step interactions (no multi-turn)
+ 2. Drop to SFT with env-generated (prompt, ideal_response) pairs
+ 3. Show reward computation working with manual env interaction
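Fallback 2 can reuse the environment's per-tick ground truth to build supervised pairs; a minimal sketch (record fields here are assumed for illustration, mirroring the `TickGroundTruth` fields named earlier):

```python
# Illustrative sketch of fallback 2: turn ground-truth "correct_action" labels
# into (prompt, ideal_response) pairs for plain SFT if GRPO breaks.
def build_sft_pairs(ticks):
    pairs = []
    for t in ticks:
        prompt = f"Tick {t['tick']}: {t['task_message']}"
        pairs.append({"prompt": prompt, "completion": t["correct_action"]})
    return pairs

ticks = [
    {"tick": 4, "task_message": "Customer C001 requests a refund.",
     "correct_action": "issue_refund"},
]
pairs = build_sft_pairs(ticks)
assert pairs[0]["completion"] == "issue_refund"
```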
+
+ #### CHECKPOINT 3: Training Works (Hour 10)
+
+ Colab notebook runs end-to-end. Training signal visible.
+
+ #### Phase 6: Polish & Extras (Hours 10-12)
+
+ **Hour 10-11: Improve demo quality**
+ - Add a before/after comparison (untrained vs trained worker) to the Gradio app
+ - Add an attack timeline visualization
+ - Add episode statistics aggregation (run 5 episodes, show avg scores)
+ - Improve formatting and colors in the replay log
+ - Add an MCP-X demo tab showing per-agent tool isolation in action
+
+ **Hour 11-12: Stretch goals (pick based on time)**
+ - Add compound attacks (2 simultaneous — e.g., schema drift + social engineering)
+ - Add more customer task variety (SLA escalations, complex multi-step tasks)
+ - Run more training epochs and capture better training curves
+ - Write a better prompt dataset for training (diverse enterprise scenarios)
+ - Add episode replay export (JSON format for analysis)
+
+ #### Phase 7: Submission (Hours 12-14)
+
+ **Hour 12-13: Deploy everything**
+ - Final push to HuggingFace Spaces, verify the public URL works
+ - Final Colab notebook cleanup, verify it runs fresh from scratch
+ - Test that all Gradio tabs work as expected
+
+ **Hour 13-13.5: Demo Video (YouTube)**
+ - Screen record: the Gradio demo running a full episode (attack/adapt/flag cycle)
+ - Show: MCP-X per-agent tool isolation
+ - Show: the Colab training script running with a visible learning signal
+ - Narrate: explain the 3-agent self-play dynamic and partner track alignment
+ - Keep it to 3-5 minutes
+ - Upload to YouTube
+
+ **Hour 13.5-14: Submit**
+ - Team Name
+ - Project Description
+ - HF Spaces Link
+ - YouTube Demo Link
+ - Colab Training Script Link
+ - Partner Tracks: Fleet AI, Patronus AI
803
+
### Stop-and-Submit Checkpoints

**Hour 4 (Minimum Viable):** Environment works with random agents. Submit with basic demo + placeholder training script.

**Hour 7.5 (Good Submission):** Environment + MCP tools + MCP-X gateway + rich Gradio demo deployed.

**Hour 10 (Strong Submission):** Everything above + working Colab training pipeline with visible learning.

**Hour 14 (Full Submission):** Polished demo, training curves, stretch goals, and video. Everything done.

---

### EnvBeats Integration Strategy

Based on deep analysis of the envbeats reference implementation:

#### COPY (Use As-Is)
| Component | Source File | Why |
|---|---|---|
| `call_mcp_tool()` | `eb_assessee_gym/main.py:37-51` | Generic MCP tool caller, directly reusable |
| `parse_tags()` | `eb_assessor/my_util.py:72-76` | XML tag parser utility |

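For orientation, here is a minimal stand-in showing the behavior assumed of `parse_tags()`: extracting the contents of XML-style tags from an LLM response. This is not the envbeats code (that lives at `eb_assessor/my_util.py:72-76`), just an illustrative equivalent.

```python
import re

def parse_tags(text: str, tag: str) -> list[str]:
    # Stand-in for eb_assessor's parse_tags: pull the contents of every
    # <tag>...</tag> span out of a model response, including across newlines.
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)

reply = "<action>lookup_customer</action> some chatter <action>close_ticket</action>"
print(parse_tags(reply, "action"))  # ['lookup_customer', 'close_ticket']
```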
#### ADAPT (Modify for SentinelOps)
| Component | Source File | What Changes |
|---|---|---|
| FastMCP tool wrapping | `eb_assessor/my_agent.py:40-60` | Replace EchoEnv tools with SentinelOps step/reset/state |
| Gym agent loop | `eb_assessee_gym/main.py:70-98` | MCPEchoEnv → MCPSentinelOpsClient |
| MCP-X config pattern | `mcp-x/mcp_x.py` | Adapt TOML config for per-agent tool isolation (DEFERRED to post-MVP) |

#### IGNORE (Not Needed)
| Component | Reason |
|---|---|
| A2A protocol | Not in submission requirements |
| Human-in-the-loop assessee | Over-complex for a hackathon |
| LLM-driven agent (pure_mcp) | Gemini-specific, wrong paradigm |
| Assessor orchestration | We're not assessing, we're training |

#### Key EnvBeats Gotchas to Avoid
1. `create_app()` returns an ASGI app: use `uvicorn.run(app)`, not `app.run()`
2. `state` is a `@property`, not a method: `env.state`, not `env.state()`
3. `Action` has `extra='forbid'`: no extra fields allowed in SentinelAction
4. FastMCP `as_proxy()` needs a dummy-server hack for hot-reload (see mcp_x.py:104-108)
5. `streamablehttp_client` is async: all MCP client code must be async
6. `EnvClient._step_payload()` and `_parse_result()` must be overridden: there are no defaults
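Gotchas 2 and 6 can be sketched with stand-in classes. These do not import openenv-core (its real signatures may differ); the class and method names simply mirror the list above as a memory aid.

```python
# Stand-ins illustrating gotchas 2 and 6. Not openenv-core code; names and
# signatures here are assumptions mirroring the gotcha list above.

class SentinelEnv:
    def __init__(self) -> None:
        self._tick = 0

    @property
    def state(self):
        # Gotcha 2: state is a property. Access as env.state, never env.state().
        return {"tick": self._tick}


class SentinelOpsClient:
    # Gotcha 6: _step_payload and _parse_result have no upstream defaults,
    # so a client subclass must define both.
    def _step_payload(self, action: dict) -> dict:
        return {"agent": action["agent"], "action_type": action["action_type"]}

    def _parse_result(self, result: dict) -> dict:
        return result.get("observation", {})


env = SentinelEnv()
print(env.state)  # {'tick': 0} -- no parentheses

client = SentinelOpsClient()
print(client._step_payload({"agent": "worker", "action_type": "resolve_ticket"}))
```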

---

### Project Description (Draft for Submission)

> **SentinelOps Arena** is a multi-agent self-play RL environment built on OpenEnv 0.4 where three AI agents (Attacker/red team, Worker/blue team, Oversight/auditor) interact with simulated enterprise systems (CRM, Billing, Ticketing). The Attacker launches schema drift, policy drift, and social engineering attacks. The Worker must detect disruptions, adapt, and continue serving customers. The Oversight agent monitors worker actions and flags policy violations. Through adversarial self-play with GRPO training, all three agents improve simultaneously, creating an autocurriculum that produces hardened enterprise AI agents. Targets the Fleet AI (Scalable Oversight) and Patronus AI (Schema Drift) partner tracks.

---

### Risk Mitigation

| Risk | Mitigation |
|---|---|
| OpenEnv 0.4 API changes | Pin the version in pyproject.toml; test imports first |
| Colab Python version (3.10-3.11) vs openenv-core (requires >=3.13) | Bundle standalone env code in Colab without the openenv dependency |
| Unsloth + rollout_func incompatibility | Use Unsloth for model loading only; vanilla TRL GRPOTrainer for training |
| HF Spaces deployment fails | Keep local demo.py as backup; deploy FastAPI if Gradio fails |
| Training script doesn't converge | Show the pipeline working (loss decreasing); convergence not required |
| Running out of time | Stop-and-submit checkpoints at hours 4, 7.5, and 10 |

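The first mitigation (pinning in pyproject.toml) might look like the fragment below. The exact openenv-core version specifier is an assumption; check what 0.4 actually publishes.

```toml
[project]
name = "sentinelops-arena"
requires-python = ">=3.13"      # openenv-core requires >=3.13
dependencies = [
    "openenv-core==0.4.*",      # pin so upstream API changes can't break the build mid-hackathon
]
```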
### Deferred (Post-Hackathon)
- Compliance drift attacks (new required fields)
- Full 80-tick episodes with 50+ customers
- Docker containerization
- A2A protocol integration
- Full GRPO training convergence (multi-epoch, all 3 agents)
- Reward calibration pass
- Real datetime-based SLA (currently tick-based)
- Multi-GPU distributed training

---

## Key Judges to Note

### First Round
- **Sanyam Bhutani** (Meta), **Ali Sol**, **Hamid Shojanazeri**, **Matthias Reso** (Meta AI/ML Engineers)
- **Michael Han** (Unsloth CTO)
- **Soham Tiwari**, **Edgar Arakelyan**, **Divyansh Agarwal** (Scale AI)
- **Robert Alward**, **Will Bryan**, **Wyatt Marshall** (Halluminate AI)

### Final Round
- **Daniel Han** (Unsloth Co-Founder): cares about Unsloth/TRL integration
- **David Corbitt** (CoreWeave): cares about compute efficiency
- **Sanyam Bhutani** (Meta): cares about OpenEnv quality
- **Nicolai Ouporov** (Fleet AI): sponsors the Scalable Oversight sub-theme
- **Jerry Wu** (Halluminate AI): sponsors the Multi-Actor Environments sub-theme
- **Benjamin Burtenshaw** (HuggingFace): cares about Hub deployment
- **Darshan Deshpande** (Patronus AI): sponsors the Schema Drift sub-theme
- **Anshuman Singh** (Scaler AI Labs): sponsors the Enterprise Workflows sub-theme