nihalaninihal Claude Opus 4.6 committed on
Commit 5f590b1 · 1 Parent(s): af942b1

Update SentinelOps Arena with detailed 14-hour implementation plan

Synthesized findings from 6 research agents into actionable build plan:
- Hour-by-hour build order across 7 phases with 4 stop-and-submit checkpoints
- Restored scope (30 ticks, 15 customers, 4 attack types, MCP-X gateway)
- Complete file structure, Pydantic models, and API signatures
- EnvBeats integration strategy (COPY/ADAPT/IGNORE)
- Colab training script with Unsloth + TRL GRPO workaround
- Partner track alignment (Fleet AI + Patronus AI)
- Risk mitigations and fallback hierarchy

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (1)
  1. SENTINELOPS_ARENA.md +600 -30

SENTINELOPS_ARENA.md CHANGED
@@ -2,9 +2,40 @@
  ## Project Overview

- SentinelOps Arena is a multi-agent self-play training environment built on the OpenEnv framework. It simulates a workday at an enterprise company where three AI agents interact with three simulated enterprise systems. Through adversarial self-play over hundreds of episodes, all three agents improve simultaneously — the attacker learns to exploit, the worker learns to survive, and the oversight agent learns to catch failures.

- Built for the [OpenEnv Hackathon SF](https://cerebralvalley.ai/e/openenv-hackathon-sf) (March 7-8, 2026).

  ---

@@ -239,11 +270,18 @@ Attacker discovers compound strategies no human designer would create. Worker de
  ---

- ## OpenEnv Implementation

  ### Data Models

- **SentinelAction:**
  - agent (attacker/worker/oversight)
  - action_type (what the agent wants to do)
  - target_system (crm/billing/ticketing or None)
@@ -252,7 +290,7 @@ Attacker discovers compound strategies no human designer would create. Worker de
  - flag (for oversight violation flags)
  - explanation (for oversight explanations)

- **SentinelObservation:**
  - done (episode over?)
  - reward (reward for the agent that just acted)
  - current_agent (whose turn is next)
@@ -270,17 +308,31 @@ Extends `openenv.Environment` with:
  - `step(action)` — Routes to attacker/worker/oversight processor, advances turn order, returns observation
  - `state()` — Returns episode metadata (tick, scores, active attacks, task completion stats)

  ---

  ## Training Stack

- - **OpenEnv** — Environment framework (reset/step/state API, Docker containerized)
- - **HuggingFace TRL** — GRPO (Group Relative Policy Optimization) trainer
  - **Unsloth** — Fast fine-tuning (2x speed, 70% less VRAM)
- - **Base model** — Qwen2.5-7B (via Unsloth)

  GRPO eliminates the need for a separate critic/value model by using group-averaged rewards as the baseline, making it memory-efficient enough to train on consumer hardware.

  ---

  ## What This Produces
@@ -296,31 +348,549 @@ Plus the environment itself — publishable on the OpenEnv Hub for anyone to tra
  ## Research Foundation

- - **TriPlay-RL** (Jan 2025) — Validated the tri-role self-play architecture with GRPO for LLM safety
- - **ARLAS** (Oct 2025) — Attacker-defender co-training for agent security
- - **AgentDojo** (ETH Zurich) — Enterprise task simulation benchmark (evaluation only, no training loop)
- - **AT-GRPO** — Multi-agent GRPO extension for multi-policy training
- - **MARS** — Multi-agent reasoning through self-play using GRPO

  SentinelOps Arena fills the gap: enterprise-specific simulation + compound attacks + self-play training loop on OpenEnv.

  ---

- ## MVP Scope (15-Hour Build)
-
- ### Included
- - Full OpenEnv interface (reset, step, state)
- - All three enterprise system simulators (3+ API functions each)
- - 4 attack types: schema drift, policy drift, social engineering, infrastructure disruption
- - All three reward functions
- - Introspection endpoints (get_schema, get_current_policy)
- - Ground truth tracking for oversight scoring
- - Working demo script
- - ~25 varied customer tasks
-
- ### Deferred
- - Docker packaging (use pip install + python instead)
- - Compliance drift and 3-type compound attacks
- - Full 80-task variety
  - Reward calibration pass
- - Datetime-based SLA (use tick-based instead)
  ## Project Overview

+ SentinelOps Arena is a multi-agent self-play training environment built on the OpenEnv 0.4 framework. It simulates a workday at an enterprise company where three AI agents interact with three simulated enterprise systems. Through adversarial self-play over hundreds of episodes, all three agents improve simultaneously — the attacker learns to exploit, the worker learns to survive, and the oversight agent learns to catch failures.

+ Built for the [OpenEnv Hackathon SF](https://cerebralvalley.ai/e/openenv-hackathon-sf) (March 7-8, 2026). Submissions due **Sunday, March 8th at 1:00 PM**. Team size: up to 3 members.
+
+ ---
+
+ ## Hackathon Theme Alignment
+
+ ### Primary Themes
+ - **Theme 1: Multi-Agent Interactions** — Three agents (Attacker, Worker, Oversight) competing and collaborating in a shared enterprise environment. Drives theory-of-mind reasoning and emergent strategic behavior.
+ - **Theme 3.1: World Modeling — Professional Tasks** — Enterprise applications (CRM, Billing, Ticketing) with realistic business logic, API ecosystems, and multi-step workflows.
+ - **Theme 4: Self-Improvement** — Self-play training where agents generate their own curriculum through adversarial dynamics. Recursive skill amplification via autocurricula.
+
+ ### Partner Sub-Theme Targets ($10K each, max 2 selectable)
+ | Partner | Sub-Theme | How SentinelOps Matches |
+ |---|---|---|
+ | **Fleet AI** (SELECTED) | Scalable Oversight: train oversight agents to monitor, analyze, and explain behavior of other AI agents | The Oversight agent is literally this — audits worker actions, flags violations, explains reasoning |
+ | **Patronus AI** (SELECTED) | Consumer Workflows with Schema Drift: environments where data schemas, API contracts, and policies change | Schema drift and policy drift are core attack types — fields rename, refund windows change, new required fields appear |
+ | ~~Scaler AI Labs~~ | Multi-App RL Environment for Enterprise Workflows | Strong match but less unique than above two |
+ | ~~Halluminate AI~~ | Multi-Actor Environments | Good match but more generic |
+
+ ### Prize Structure
+ - **Main track:** 1st $15K, 2nd $9K, 3rd $6K
+ - **Partner sub-themes:** $10K each (judged separately from main track)
+ - SentinelOps targets: Main track + Fleet AI ($10K) + Patronus AI ($10K)
+
+ ### Submission Requirements
+ All fields required:
+ - **Team Name**
+ - **Project Description** (what it solves)
+ - **HuggingFace Spaces Link** — environment must be deployed
+ - **Demo Video** (YouTube) — must demonstrate the environment
+ - **Minimal Training Script** — Colab notebook using Unsloth or HF TRL (REQUIRED, not optional)
+ - **Partner Tracks** — Fleet AI, Patronus AI

  ---
 
 
  ---

+ ## OpenEnv 0.4 Implementation
+
+ Built on OpenEnv 0.4 Spec/RFC with:
+ - **Simple API** — `step()`, `reset()`, `state()`
+ - **MCP tools as first-class citizens** — Enterprise system APIs exposed as MCP tools per OpenEnv 0.4 spec
+ - **Reward pipelines** — Structured reward computation with ground truth tracking
+ - **Container support** — Deployable via Docker, hostable on HuggingFace Spaces
+ - **Hub deployment** — Published to OpenEnv Hub for community training and benchmarking

  ### Data Models

+ **SentinelAction (Pydantic BaseModel):**
  - agent (attacker/worker/oversight)
  - action_type (what the agent wants to do)
  - target_system (crm/billing/ticketing or None)

  - flag (for oversight violation flags)
  - explanation (for oversight explanations)

+ **SentinelObservation (Pydantic BaseModel):**
  - done (episode over?)
  - reward (reward for the agent that just acted)
  - current_agent (whose turn is next)

  - `step(action)` — Routes to attacker/worker/oversight processor, advances turn order, returns observation
  - `state()` — Returns episode metadata (tick, scores, active attacks, task completion stats)

+ ### MCP Tool Exposure (OpenEnv 0.4)
+
+ Enterprise system APIs are exposed as MCP tools, enabling per-agent tool isolation:
+ - **Attacker tools:** `launch_attack`, `pass_turn`, `get_attack_budget`
+ - **Worker tools:** `lookup_customer`, `update_tier`, `check_balance`, `issue_refund`, `create_ticket`, `resolve`, `get_schema`, `get_current_policy`, etc.
+ - **Oversight tools:** `flag_action`, `get_current_policy`, `get_trajectory`
+
+ This leverages the MCP-X gateway pattern from the EnvBeats reference implementation for JWT-based per-agent access control.
+
  ---

  ## Training Stack

+ - **OpenEnv 0.4** — Environment framework (reset/step/state API, MCP tools, reward pipelines, Docker containerized)
+ - **HuggingFace TRL** — GRPO (Group Relative Policy Optimization) trainer with `rollout_func` for OpenEnv integration and multi-reward support via `reward_funcs`
  - **Unsloth** — Fast fine-tuning (2x speed, 70% less VRAM)
+ - **Base model** — Qwen2.5-7B (via Unsloth, ~15-20GB VRAM with QLoRA) or Qwen2.5-1.5B (~5GB for quick demos)

  GRPO eliminates the need for a separate critic/value model by using group-averaged rewards as the baseline, making it memory-efficient enough to train on consumer hardware.
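The group baseline can be made concrete with a small sketch (illustrative only, not code from this plan): for each prompt, a group of completions is sampled, and each completion's advantage is its reward relative to the group's statistics rather than to a learned value model.

```python
# Illustrative sketch of GRPO's group-relative baseline (not project code).
# Each completion's advantage is its reward minus the group mean, scaled by
# the group standard deviation -- no separate critic/value model is needed.
def group_relative_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by the environment's reward function:
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Best completion gets a positive advantage, worst a negative one.
```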

+ ### Dual-Path Architecture
+
+ - **Training path:** Direct Python `env.step()` calls — no MCP/A2A overhead, maximum speed for thousands of episodes
+ - **Demo/eval path:** Full MCP tool exposure via MCP-X gateway — showcases per-agent tool isolation and OpenEnv 0.4 MCP-first design
+
  ---

  ## What This Produces

  ## Research Foundation

+ - **TriPlay-RL** (Jan 2026) — Validated the tri-role self-play architecture (attacker/defender/evaluator) with GRPO for LLM safety. 20-50% improvement in adversarial effectiveness, 10-30% safety gains.
+ - **ARLAS** (Oct 2025) — Attacker-defender co-training for agent security using GRPO. Evaluated on AgentDojo and BrowserGym.
+ - **AgentDojo** (ETH Zurich, NeurIPS 2024) — Enterprise task simulation benchmark with 97 tasks, 629 security test cases. Evaluation only, no training loop.
+ - **AT-GRPO** (2025) — Agent- and Turn-wise GRPO for multi-agent systems. Supports role-specific policies. +5% on LiveCodeBench, +84% on Sokoban vs single-agent.
+ - **MARS/MARSHAL** (Oct 2025) — Multi-agent reasoning through self-play with turn-level advantage estimation. Up to 28.7% performance improvements.
+ - **M-GRPO** (Nov 2025) — Hierarchical multi-agent GRPO with decoupled training pipeline. No cross-server backpropagation needed.

  SentinelOps Arena fills the gap: enterprise-specific simulation + compound attacks + self-play training loop on OpenEnv.

  ---

+ ## FINAL IMPLEMENTATION PLAN
+
+ ### Reality Check
+
+ - **Solo developer**, hackathon is March 7-8, 2026
+ - **Deadline:** Sunday March 8th, 1:00 PM
+ - **Estimated coding hours remaining:** ~14 hours
+ - **The environment IS the product** — trained agents are a bonus, not a requirement
+ - **Training script is REQUIRED** — Colab notebook using Unsloth or TRL must be submitted
+
+ ### Scope Plan (14-Hour Build)
+
+ | Original Spec | Hackathon Build | Notes |
+ |---|---|---|
+ | 80 ticks per episode | **30 ticks** | Good episode length for demo & training |
+ | 50 customers | **15 customers** | Enough variety for compelling scenarios |
+ | 30 invoices | **15 invoices** | 1:1 with customers |
+ | 20 tickets | **10 tickets** | Enough for SLA pressure scenarios |
+ | 80 customer tasks | **30 tasks** | Matches tick count |
+ | 6 attack types | **4 types** (schema drift, policy drift, social engineering, infrastructure disruption) | Restore rate limiting — demonstrates resilience |
+ | MCP-X gateway | **Include** — per-agent tool isolation | With 14h, this is achievable and impresses judges (envbeats pattern, high ROI) |
+ | A2A protocol | **Cut** | Not in submission requirements |
+ | Datetime SLA | **Tick-based SLA** | Simpler, same demo impact |
+ | Full GRPO convergence | **Run for real** | With 14h, aim for visible learning signal (even a few epochs) |
+ | Compound attacks | **Add as stretch** | If time permits after hour 12 |
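The tick-based SLA swap is cheap to implement; a minimal sketch (illustrative only, with field names assumed from the `Ticket` model outlined later in this plan):

```python
# Minimal sketch of tick-based SLA checking (illustrative; field names assumed
# from the Ticket model in this plan). A ticket breaches SLA once the current
# tick passes its deadline tick -- no datetime arithmetic required.
def check_sla(ticket: dict, current_tick: int) -> str:
    if ticket["status"] == "resolved":
        return "met"
    if current_tick > ticket["sla_deadline_tick"]:
        return "breached"
    return "within_sla"

ticket = {"status": "open", "created_tick": 3, "sla_deadline_tick": 9}
assert check_sla(ticket, 5) == "within_sla"
assert check_sla(ticket, 10) == "breached"
```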
+
+ ### CRITICAL: Unsloth + rollout_func Incompatibility
+
+ **Unsloth does NOT support TRL's `rollout_func`** (GitHub issue #3573). Strategy:
+ - Use Unsloth for **model loading only** (`FastLanguageModel.from_pretrained` + `get_peft_model`)
+ - Use **vanilla TRL GRPOTrainer** for training with `rollout_func`
+ - Use **Qwen2.5-1.5B** for Colab (fits free-tier GPU, ~5GB VRAM)
+ - If the Colab Python version conflicts with openenv-core (requires >=3.13), use a **standalone env wrapper** without the openenv dependency
+
+ ---
+
+ ### File Structure
+
+ ```
+ sentinelops_arena/
+ ├── __init__.py
+ ├── models.py           # All Pydantic models (Action, Observation, State, data models)
+ ├── systems/
+ │   ├── __init__.py
+ │   ├── crm.py          # CRM simulator (lookup, update_tier, add_note, get_history, get_schema)
+ │   ├── billing.py      # Billing simulator (check_balance, issue_refund, apply_credit, generate_invoice, get_current_policy)
+ │   └── ticketing.py    # Ticketing simulator (create, assign, escalate, resolve, check_sla, get_schema, get_sla_rules)
+ ├── attacks.py          # Attack mechanics (schema_drift, policy_drift, social_engineering, rate_limit)
+ ├── rewards.py          # All 3 reward functions (attacker, worker, oversight)
+ ├── task_generator.py   # Generates 30 customer tasks per episode
+ ├── environment.py      # SentinelOpsArena(Environment) — the core
+ ├── mcp_tools.py        # FastMCP tool definitions wrapping env operations
+ ├── server.py           # create_app() HTTP server
+ └── demo.py             # Demo script running one episode with heuristic agents
+
+ training/
+ ├── colab_training.ipynb  # REQUIRED — Colab notebook with Unsloth + TRL GRPO
+ └── rollout.py            # rollout_func and reward_funcs for GRPOTrainer
+
+ app.py                  # HuggingFace Spaces entry point (Gradio or FastAPI)
+ pyproject.toml
+ README.md
+ ```
+
+ ### Build Order (14-Hour Plan)
+
+ #### Phase 1: Core Models & Systems (Hours 0-2.5)
+
+ **Hour 0-0.5: models.py** (shorthand field listing, not literal Python)
+ ```python
+ # Enums
+ class AgentRole(str, Enum): ATTACKER, WORKER, OVERSIGHT
+ class AttackType(str, Enum): SCHEMA_DRIFT, POLICY_DRIFT, SOCIAL_ENGINEERING, RATE_LIMIT
+ class TargetSystem(str, Enum): CRM, BILLING, TICKETING
+ class CustomerTier(str, Enum): GOLD, SILVER, BRONZE
+ class InvoiceStatus(str, Enum): PAID, PENDING, OVERDUE, REFUNDED
+ class TicketStatus(str, Enum): OPEN, IN_PROGRESS, RESOLVED, ESCALATED
+ class TicketPriority(str, Enum): HIGH, MEDIUM, LOW
+ class TaskType(str, Enum): REFUND, TICKET_CHECK, TIER_UPGRADE, NEW_TICKET, BALANCE_INQUIRY, SLA_ESCALATION
+ class ViolationType(str, Enum): POLICY_VIOLATION, SOCIAL_ENGINEERING, SCHEMA_ERROR_UNHANDLED, SLA_BREACH
+
+ # Data models
+ class Customer(BaseModel): customer_id, name, tier, region, contact_email, lifetime_value, notes
+ class Invoice(BaseModel): invoice_id, customer_id, amount, status, date, items
+ class Ticket(BaseModel): ticket_id, customer_id, subject, priority, status, created_tick, sla_deadline_tick, assigned_to
+ class RefundPolicy(BaseModel): window_ticks=8, requires_approval=False, max_amount=5000
+ class SLARules(BaseModel): high=6, medium=12, low=18  # ticks
+ class CustomerTask(BaseModel): task_id, customer_id, task_type, message, required_systems
+
+ # OpenEnv types
+ class SentinelAction(Action, extra='forbid'): agent, action_type, target_system, parameters, response_text, flag, explanation
+ class SentinelObservation(Observation): done, reward, current_agent, current_task, systems_snapshot, last_action_result, trajectory, tick, metadata
+ class SentinelState(State, extra='allow'): tick, scores, active_attacks, tasks_completed, tasks_total
+ ```
+
+ **Hour 0.5-1.5: systems/ (all three)**
+
+ Each system: in-memory dict storage, 4-5 API functions, `get_schema()` introspection, internal `_apply_*` mutation methods for attacks.
+
+ CRM: `lookup_customer`, `update_tier`, `add_note`, `get_history`, `get_schema`, `_apply_schema_drift(old_field, new_field)`
+ Billing: `check_balance`, `issue_refund`, `apply_credit`, `generate_invoice`, `get_current_policy`, `_apply_policy_drift(changes)`
+ Ticketing: `create_ticket`, `assign_ticket`, `escalate`, `resolve`, `check_sla`, `get_schema`, `get_sla_rules`, `_apply_schema_drift`
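The simulator shape described above can be sketched in a few lines (illustrative only; method names come from this plan, storage details and sample data are assumed):

```python
# Illustrative CRM simulator sketch: in-memory dict storage, introspectable
# schema, and an internal mutation hook the attacker uses for schema drift.
class CRMSimulator:
    def __init__(self):
        self.customers = {
            "C001": {"customer_id": "C001", "name": "Acme Corp", "tier": "gold"},
        }

    def lookup_customer(self, customer_id):
        record = self.customers.get(customer_id)
        return record if record else {"error": "not_found"}

    def get_schema(self):
        # Workers call this to recover after a schema-drift attack
        any_record = next(iter(self.customers.values()))
        return sorted(any_record.keys())

    def _apply_schema_drift(self, old_field, new_field):
        # Attack hook: rename a field in every record
        for record in self.customers.values():
            if old_field in record:
                record[new_field] = record.pop(old_field)

crm = CRMSimulator()
crm._apply_schema_drift("tier", "customer_tier")
assert "customer_tier" in crm.get_schema()
```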
+
+ **Hour 1.5-2: attacks.py + task_generator.py**
+
+ Attacks (4 types):
+ - `schema_drift(system, old_field, new_field)` — renames a key in all records
+ - `policy_drift(changes_dict)` — modifies the refund policy or SLA rules
+ - `social_engineering(task_queue, tick, injected_message)` — replaces an upcoming task message
+ - `rate_limit(system, max_calls_per_tick)` — throttles API calls (infrastructure disruption)
+
+ Task generator: Create 30 tasks with a mix of types, assign them to ticks, each referencing 1-2 systems.
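A seeded generator keeps episodes reproducible; a minimal sketch (task types from the enum listing above; the uniform distribution and one-task-per-tick layout are assumptions for illustration):

```python
import random

# Illustrative task generator sketch: 30 tasks spread over 30 ticks, each
# drawn from the plan's TaskType mix and tied to a random customer.
TASK_TYPES = ["refund", "ticket_check", "tier_upgrade",
              "new_ticket", "balance_inquiry", "sla_escalation"]

def generate_tasks(num_tasks=30, num_customers=15, seed=0):
    rng = random.Random(seed)  # seeded for reproducible episodes
    tasks = []
    for i in range(num_tasks):
        tasks.append({
            "task_id": f"T{i:03d}",
            "tick": i,  # one task per tick
            "customer_id": f"C{rng.randrange(num_customers):03d}",
            "task_type": rng.choice(TASK_TYPES),
        })
    return tasks

tasks = generate_tasks()
assert len(tasks) == 30
assert tasks == generate_tasks()  # same seed, same episode
```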
+
+ **Hour 2-2.5: rewards.py**
+
+ Pure Python, no LLM-as-judge. Three functions:
+ - `compute_attacker_reward(action, worker_result, oversight_result, ground_truth)` — see reward table
+ - `compute_worker_reward(action, task, result, ground_truth, active_policies)` — see reward table
+ - `compute_oversight_reward(flag_decision, ground_truth_violations)` — see reward table
+
+ Ground truth tracking: The environment maintains a `TickGroundTruth` per tick with `violations_present: bool`, `violation_types: list`, `correct_action: str`, enabling deterministic oversight scoring.
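With per-tick ground truth available, oversight scoring reduces to a confusion-matrix check; a minimal sketch (the reward magnitudes here are placeholders, not the calibrated reward table this plan refers to):

```python
# Illustrative oversight reward sketch. Reward magnitudes are placeholder
# values for illustration, not the plan's calibrated reward table.
def compute_oversight_reward(flagged: bool, violations_present: bool) -> float:
    if flagged and violations_present:
        return 1.0    # true positive: caught a real violation
    if flagged and not violations_present:
        return -0.5   # false positive: cried wolf
    if not flagged and violations_present:
        return -1.0   # false negative: missed a violation
    return 0.2        # true negative: correctly stayed quiet

assert compute_oversight_reward(True, True) == 1.0
assert compute_oversight_reward(False, True) == -1.0
```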
+
+ #### Phase 2: Environment Core (Hours 2.5-4)
+
+ **Hour 2.5-4: environment.py — SentinelOpsArena**
+
+ ```python
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import Action, Observation, State
+
+ class SentinelOpsArena(Environment[SentinelAction, SentinelObservation, SentinelState]):
+     SUPPORTS_CONCURRENT_SESSIONS = True
+
+     def reset(self, seed=None, episode_id=None, **kwargs) -> SentinelObservation:
+         # Generate 15 customers, 15 invoices, 10 tickets, 30 tasks
+         # Initialize default policies, empty attack log
+         # Set tick=0, turn_order=[ATTACKER, WORKER, OVERSIGHT]
+         # Return initial observation for first agent (attacker)
+
+     def step(self, action: SentinelAction, timeout_s=None, **kwargs) -> SentinelObservation:
+         # Validate action matches current_agent
+         # Route to _process_attacker / _process_worker / _process_oversight
+         # Compute reward via rewards.py
+         # Advance turn, increment tick if full rotation
+         # Track ground truth for oversight scoring
+         # Return observation for next agent
+
+     @property
+     def state(self) -> SentinelState:
+         # Return episode metadata (tick, scores, active attacks, completion stats)
+ ```
+
+ Turn manager pseudocode:
+ ```
+ current_agent_idx = 0
+ turn_order = [ATTACKER, WORKER, OVERSIGHT]
+
+ on step(action):
+     assert action.agent == turn_order[current_agent_idx]
+     result = process(action)
+     current_agent_idx = (current_agent_idx + 1) % 3
+     if current_agent_idx == 0:
+         tick += 1
+         done = (tick >= 30)
+ ```
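The pseudocode above translates directly into a small runnable sketch (illustrative only, detached from the environment class):

```python
# Runnable sketch of the turn manager: 3 agents per tick, 30 ticks per episode.
TURN_ORDER = ["attacker", "worker", "oversight"]

class TurnManager:
    def __init__(self, max_ticks=30):
        self.max_ticks = max_ticks
        self.tick = 0
        self.current_agent_idx = 0
        self.done = False

    def step(self, acting_agent: str):
        # An action from the wrong agent is a protocol error
        assert acting_agent == TURN_ORDER[self.current_agent_idx]
        self.current_agent_idx = (self.current_agent_idx + 1) % 3
        if self.current_agent_idx == 0:  # full rotation completed
            self.tick += 1
            self.done = self.tick >= self.max_ticks
        return self.done

tm = TurnManager()
steps = 0
while not tm.done:
    tm.step(TURN_ORDER[tm.current_agent_idx])
    steps += 1
assert steps == 90  # 30 ticks x 3 agents
```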
+
+ #### CHECKPOINT 1: Core Works (Hour 4)
+
+ Run `env.reset()` → a 30x3 `env.step()` loop with random actions. Verify:
+ - Turn order cycles correctly
+ - Attacks modify system state
+ - Rewards compute without errors
+ - Episode terminates at tick 30
+
+ **If this works, you have a submittable environment.**
+
+ #### Phase 3: MCP Tools + Server (Hours 4-5.5)
+
+ **Hour 4-5: mcp_tools.py — Per-Agent MCP Tools**
+
+ Expose enterprise system APIs as individual MCP tools (not just step/reset/state). This is what agents actually call:
+
+ ```python
+ import json
+
+ from mcp.server.fastmcp import FastMCP
+
+ mcp = FastMCP("sentinelops", host="0.0.0.0", port=9500, stateless_http=True)
+
+ # --- Worker tools ---
+ @mcp.tool()
+ def lookup_customer(customer_id: str) -> str:
+     """Look up a customer record in the CRM."""
+     return json.dumps(env.crm.lookup_customer(customer_id))
+
+ @mcp.tool()
+ def issue_refund(invoice_id: str, amount: float, reason: str) -> str:
+     """Issue a refund for an invoice."""
+     return json.dumps(env.billing.issue_refund(invoice_id, amount, reason))
+
+ @mcp.tool()
+ def get_schema(system: str) -> str:
+     """Get the current field schema for a system (crm/billing/ticketing)."""
+     return json.dumps(env.get_system(system).get_schema())
+
+ @mcp.tool()
+ def get_current_policy(policy_type: str) -> str:
+     """Get the current policy (refund/sla)."""
+     return json.dumps(env.get_current_policy(policy_type))
+
+ # --- Attacker tools ---
+ @mcp.tool()
+ def launch_attack(attack_type: str, target_system: str, parameters: str) -> str:
+     """Launch an attack on an enterprise system."""
+     ...
+
+ # --- Oversight tools ---
+ @mcp.tool()
+ def flag_action(flagged: bool, severity: int, violation_type: str, explanation: str) -> str:
+     """Flag a worker action as a potential violation."""
+     ...
+
+ # --- Env control tools ---
+ @mcp.tool()
+ def step(action_json: str) -> str:
+     """Take a full action in the SentinelOps environment."""
+     action = SentinelAction.model_validate_json(action_json)
+     obs = env.step(action)
+     return obs.model_dump_json()
+
+ @mcp.tool()
+ def reset(seed: int | None = None) -> str:
+     """Reset the environment for a new episode."""
+     obs = env.reset(seed=seed)
+     return obs.model_dump_json()
+
+ @mcp.tool()
+ def get_state() -> str:
+     """Get current environment state."""
+     return env.state.model_dump_json()
+ ```
+
+ **Hour 5-5.5: server.py + MCP-X Gateway**
+
+ OpenEnv HTTP server:
+ `create_app(SentinelOpsArena, SentinelAction, SentinelObservation, env_name="sentinelops_arena")`
+
+ MCP-X gateway (copy from envbeats, adapt config):
+ ```toml
+ [clients.attacker]
+ auth_token = "atk-token"
+ [clients.worker]
+ auth_token = "wrk-token"
+ [clients.oversight]
+ auth_token = "ovs-token"
+
+ [mcp_servers.sentinelops]
+ url = "http://localhost:9500/mcp"
+ from_client = "orchestrator"
+
+ [allow.sentinelops]
+ attacker = ["launch_attack", "pass_turn", "get_attack_budget"]
+ worker = ["lookup_customer", "update_tier", "add_note", "get_history", "check_balance", "issue_refund", "apply_credit", "generate_invoice", "create_ticket", "assign_ticket", "escalate", "resolve", "check_sla", "get_schema", "get_current_policy"]
+ oversight = ["flag_action", "get_current_policy", "get_trajectory"]
+ ```
+
+ #### Phase 4: Demo & Gradio App (Hours 5.5-7.5)
+
+ **Hour 5.5-6.5: demo.py — Compelling Episode Script**
+
+ Script that runs a complete 30-tick episode with hardcoded heuristic agents:
+ - Attacker: schema_drift at tick 7, policy_drift at tick 14, social_engineering at tick 20, rate_limit at tick 25
+ - Worker: Handles tasks, hits errors, recovers using get_schema/get_current_policy
+ - Oversight: Flags violations based on policy comparison
+ - Shows untrained vs trained worker behavior (before/after comparison)
+
+ Output: Pretty-printed episode replay showing the full attack/adapt/flag cycle.
+
+ **Hour 6.5-7.5: app.py — Gradio App (HuggingFace Spaces)**
+
+ Rich Gradio interface:
+ - Tab 1: "Run Episode" → executes the demo, shows a formatted turn-by-turn replay with color-coded agents
+ - Tab 2: "Environment Inspector" → shows current system state, active attacks, policies
+ - Tab 3: "Scores Dashboard" → final scores + reward breakdown for all three agents
+ - Controls: seed selector, tick slider (step through the episode), speed control
+ - Metrics: live score charts, attack timeline visualization
+
+ #### CHECKPOINT 2: Demo Ready (Hour 7.5)
+
+ Working env + MCP tools + MCP-X gateway + rich Gradio demo. Deploy to HF Spaces.
+
+ #### Phase 5: Training Script (Hours 7.5-10)
+
+ **Hour 7.5-9: colab_training.ipynb**
+
+ REQUIRED deliverable. Full env → GRPO training pipeline.
+
+ ```python
+ # Cell 1: Install
+ !pip install unsloth trl openenv-core peft transformers datasets
+
+ # Cell 2: Load model with Unsloth (fast loading + LoRA setup)
+ from unsloth import FastLanguageModel
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/Qwen2.5-1.5B-Instruct",
+     max_seq_length=2048,
+     load_in_4bit=True,
+ )
+ model = FastLanguageModel.get_peft_model(
+     model, r=16, lora_alpha=32,
+     target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
+     use_gradient_checkpointing="unsloth",
+ )
+
+ # Cell 3: Environment setup
+ # Inline SentinelOpsArena (standalone, no openenv dependency for Colab Python compat)
+ env = SentinelOpsArena()
+
+ # Cell 4: Prompt dataset — enterprise scenarios the worker agent must handle
+ from datasets import Dataset
+ dataset = Dataset.from_dict({"prompt": [
+     "Customer C001 requests a refund for invoice INV-2201...",
+     "Ticket TK-005 has high priority, check SLA status...",
+     # ... 30+ scenarios
+ ]})
+
+ # Cell 5: GRPO rollout function (uses vanilla TRL, NOT the Unsloth trainer)
+ import torch
+ from trl import GRPOConfig, GRPOTrainer
+
+ def rollout_func(prompts, trainer):
+     """Generate completions via env interaction."""
+     tokenizer = trainer.processing_class
+     all_prompt_ids, all_completion_ids, all_logprobs, all_rewards = [], [], [], []
+     for prompt in prompts:
+         obs = env.reset()
+         # Format obs + prompt into the chat template
+         input_ids = tokenizer.encode(formatted_prompt)
+         # Generate a response
+         with torch.no_grad():
+             output = trainer.model.generate(input_ids, max_new_tokens=512)
+         completion = tokenizer.decode(output[len(input_ids):])
+         # Step the env with the parsed action
+         action = parse_worker_action(completion)
+         result = env.step(action)
+         all_rewards.append(result.reward or 0.0)
+         all_prompt_ids.append(input_ids)
+         all_completion_ids.append(output[len(input_ids):])
+         all_logprobs.append(compute_logprobs(trainer.model, input_ids, output))
+     return {
+         "prompt_ids": all_prompt_ids,
+         "completion_ids": all_completion_ids,
+         "logprobs": all_logprobs,
+         "env_reward": all_rewards,
+     }
+
+ def reward_from_env(completions, **kwargs):
+     return [float(r) for r in kwargs.get("env_reward", [0.0] * len(completions))]
+
+ # Cell 6: Configure and train
+ config = GRPOConfig(
+     output_dir="./sentinelops-grpo",
+     num_train_epochs=1,
+     per_device_train_batch_size=2,
+     gradient_accumulation_steps=4,
+     num_generations=4,
+     max_completion_length=512,
+     max_prompt_length=256,
+     logging_steps=1,
+     learning_rate=5e-6,
+     optim="paged_adamw_8bit",
+     report_to="none",
+ )
+
+ trainer = GRPOTrainer(
+     model=model,
+     processing_class=tokenizer,
+     reward_funcs=[reward_from_env],
+     rollout_func=rollout_func,
+     args=config,
+     train_dataset=dataset,
+ )
+ trainer.train()
+
+ # Cell 7: Show training metrics (reward curve, loss curve)
+ # Cell 8: Push to Hub
+ model.save_pretrained("sentinelops-worker-grpo")
+ model.push_to_hub("nihalnihalani/sentinelops-worker-grpo")
+ ```
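The rollout above leans on a `parse_worker_action` helper that is not shown; a minimal sketch, assuming the worker is prompted to emit a single JSON object (the output format is a design choice, not fixed by this plan, and the sketch returns a plain dict rather than a `SentinelAction`):

```python
import json
import re

# Illustrative sketch of parse_worker_action: extract the first JSON object
# from a model completion and fall back to a harmless no-op on parse failure.
def parse_worker_action(completion: str) -> dict:
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match:
        try:
            action = json.loads(match.group(0))
            action.setdefault("agent", "worker")
            return action
        except json.JSONDecodeError:
            pass
    return {"agent": "worker", "action_type": "noop"}  # safe fallback

out = parse_worker_action(
    'Sure! {"action_type": "issue_refund", "parameters": {"invoice_id": "INV-2201"}}'
)
assert out["action_type"] == "issue_refund"
assert parse_worker_action("garbled")["action_type"] == "noop"
```

A defensive parser like this matters for GRPO: a malformed completion should produce a low-reward no-op step, not crash the whole rollout batch.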
+
+ **Hour 9-10: Test & debug training pipeline**
+
+ Run the notebook end-to-end on Colab. Fix any issues.
+ - Verify the model loads correctly
+ - Verify env interactions work in Colab
+ - Verify at least a few training steps complete
+ - Capture training curves for the demo video
+
+ **Fallback hierarchy if the GRPO pipeline breaks:**
+ 1. Simplify rollout_func to single-step interactions (no multi-turn)
+ 2. Drop to SFT with env-generated (prompt, ideal_response) pairs
+ 3. Show reward computation working with manual env interaction
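Fallback 2 can reuse the environment's per-tick ground truth to build supervised pairs; a minimal sketch (record fields here are assumed for illustration, mirroring the `TickGroundTruth` fields named earlier):

```python
# Illustrative sketch of fallback 2: turn ground-truth "correct_action" labels
# into (prompt, ideal_response) pairs for plain SFT if GRPO breaks.
def build_sft_pairs(ticks):
    pairs = []
    for t in ticks:
        prompt = f"Tick {t['tick']}: {t['task_message']}"
        pairs.append({"prompt": prompt, "completion": t["correct_action"]})
    return pairs

ticks = [
    {"tick": 4, "task_message": "Customer C001 requests a refund.",
     "correct_action": "issue_refund"},
]
pairs = build_sft_pairs(ticks)
assert pairs[0]["completion"] == "issue_refund"
```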
+
+ #### CHECKPOINT 3: Training Works (Hour 10)
+
+ Colab notebook runs end-to-end. Training signal visible.
+
+ #### Phase 6: Polish & Extras (Hours 10-12)
+
+ **Hour 10-11: Improve demo quality**
+ - Add a before/after comparison (untrained vs trained worker) to the Gradio app
+ - Add an attack timeline visualization
+ - Add episode statistics aggregation (run 5 episodes, show avg scores)
+ - Improve formatting and colors in the replay log
+ - Add an MCP-X demo tab showing per-agent tool isolation in action
+
+ **Hour 11-12: Stretch goals (pick based on time)**
+ - Add compound attacks (2 simultaneous — e.g., schema drift + social engineering)
+ - Add more customer task variety (SLA escalations, complex multi-step tasks)
+ - Run more training epochs and capture better training curves
+ - Write a better prompt dataset for training (diverse enterprise scenarios)
+ - Add episode replay export (JSON format for analysis)
+
+ #### Phase 7: Submission (Hours 12-14)
+
+ **Hour 12-13: Deploy everything**
+ - Final push to HuggingFace Spaces, verify the public URL works
+ - Final Colab notebook cleanup, verify it runs fresh from scratch
+ - Test that all Gradio tabs work as expected
+
+ **Hour 13-13.5: Demo Video (YouTube)**
+ - Screen record: the Gradio demo running a full episode (attack/adapt/flag cycle)
+ - Show: MCP-X per-agent tool isolation
+ - Show: the Colab training script running with a visible learning signal
+ - Narrate: explain the 3-agent self-play dynamic and partner track alignment
+ - Keep it to 3-5 minutes
+ - Upload to YouTube
+
+ **Hour 13.5-14: Submit**
+ - Team Name
+ - Project Description
+ - HF Spaces Link
+ - YouTube Demo Link
+ - Colab Training Script Link
+ - Partner Tracks: Fleet AI, Patronus AI
803
+
### Stop-and-Submit Checkpoints

**Hour 4 (Minimum Viable):** Environment works with random agents. Submit with basic demo + placeholder training script.

**Hour 7.5 (Good Submission):** Environment + MCP tools + MCP-X gateway + rich Gradio demo deployed.

**Hour 10 (Strong Submission):** Everything above + working Colab training pipeline with visible learning.

**Hour 14 (Full Submission):** Polished demo, training curves, stretch goals, and video. Everything done.

---

### EnvBeats Integration Strategy

Based on deep analysis of the envbeats reference implementation:

#### COPY (Use As-Is)
| Component | Source File | Why |
|---|---|---|
| `call_mcp_tool()` | `eb_assessee_gym/main.py:37-51` | Generic MCP tool caller, directly reusable |
| `parse_tags()` | `eb_assessor/my_util.py:72-76` | XML tag parser utility |

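For orientation, here is a minimal stand-in showing the behavior assumed of `parse_tags()`: extracting the contents of XML-style tags from an LLM response. This is not the envbeats code (that lives at `eb_assessor/my_util.py:72-76`), just an illustrative equivalent.

```python
import re

def parse_tags(text: str, tag: str) -> list[str]:
    # Stand-in for eb_assessor's parse_tags: pull the contents of every
    # <tag>...</tag> span out of a model response, including across newlines.
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)

reply = "<action>lookup_customer</action> some chatter <action>close_ticket</action>"
print(parse_tags(reply, "action"))  # ['lookup_customer', 'close_ticket']
```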
#### ADAPT (Modify for SentinelOps)
| Component | Source File | What Changes |
|---|---|---|
| FastMCP tool wrapping | `eb_assessor/my_agent.py:40-60` | Replace EchoEnv tools with SentinelOps step/reset/state |
| Gym agent loop | `eb_assessee_gym/main.py:70-98` | MCPEchoEnv → MCPSentinelOpsClient |
| MCP-X config pattern | `mcp-x/mcp_x.py` | Adapt TOML config for per-agent tool isolation (DEFERRED to post-MVP) |

#### IGNORE (Not Needed)
| Component | Reason |
|---|---|
| A2A protocol | Not in submission requirements |
| Human-in-the-loop assessee | Over-complex for a hackathon |
| LLM-driven agent (pure_mcp) | Gemini-specific, wrong paradigm |
| Assessor orchestration | We're not assessing, we're training |

#### Key EnvBeats Gotchas to Avoid
1. `create_app()` returns an ASGI app: use `uvicorn.run(app)`, not `app.run()`
2. `state` is a `@property`, not a method: `env.state`, not `env.state()`
3. `Action` has `extra='forbid'`: no extra fields allowed in SentinelAction
4. FastMCP `as_proxy()` needs a dummy-server hack for hot-reload (see mcp_x.py:104-108)
5. `streamablehttp_client` is async: all MCP client code must be async
6. `EnvClient._step_payload()` and `_parse_result()` must be overridden: there are no defaults
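Gotchas 2 and 6 can be sketched with stand-in classes. These do not import openenv-core (its real signatures may differ); the class and method names simply mirror the list above as a memory aid.

```python
# Stand-ins illustrating gotchas 2 and 6. Not openenv-core code; names and
# signatures here are assumptions mirroring the gotcha list above.

class SentinelEnv:
    def __init__(self) -> None:
        self._tick = 0

    @property
    def state(self):
        # Gotcha 2: state is a property. Access as env.state, never env.state().
        return {"tick": self._tick}


class SentinelOpsClient:
    # Gotcha 6: _step_payload and _parse_result have no upstream defaults,
    # so a client subclass must define both.
    def _step_payload(self, action: dict) -> dict:
        return {"agent": action["agent"], "action_type": action["action_type"]}

    def _parse_result(self, result: dict) -> dict:
        return result.get("observation", {})


env = SentinelEnv()
print(env.state)  # {'tick': 0} -- no parentheses

client = SentinelOpsClient()
print(client._step_payload({"agent": "worker", "action_type": "resolve_ticket"}))
```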

---

### Project Description (Draft for Submission)

> **SentinelOps Arena** is a multi-agent self-play RL environment built on OpenEnv 0.4 where three AI agents (Attacker/red team, Worker/blue team, Oversight/auditor) interact with simulated enterprise systems (CRM, Billing, Ticketing). The Attacker launches schema drift, policy drift, and social engineering attacks. The Worker must detect disruptions, adapt, and continue serving customers. The Oversight agent monitors worker actions and flags policy violations. Through adversarial self-play with GRPO training, all three agents improve simultaneously, creating an autocurriculum that produces hardened enterprise AI agents. Targets the Fleet AI (Scalable Oversight) and Patronus AI (Schema Drift) partner tracks.

---

### Risk Mitigation

| Risk | Mitigation |
|---|---|
| OpenEnv 0.4 API changes | Pin the version in pyproject.toml; test imports first |
| Colab Python version (3.10-3.11) vs openenv-core (requires >=3.13) | Bundle standalone env code in Colab without the openenv dependency |
| Unsloth + rollout_func incompatibility | Use Unsloth for model loading only; vanilla TRL GRPOTrainer for training |
| HF Spaces deployment fails | Keep local demo.py as backup; deploy FastAPI if Gradio fails |
| Training script doesn't converge | Show the pipeline working (loss decreasing); convergence not required |
| Running out of time | Stop-and-submit checkpoints at hours 4, 7.5, and 10 |

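The first mitigation (pinning in pyproject.toml) might look like the fragment below. The exact openenv-core version specifier is an assumption; check what 0.4 actually publishes.

```toml
[project]
name = "sentinelops-arena"
requires-python = ">=3.13"      # openenv-core requires >=3.13
dependencies = [
    "openenv-core==0.4.*",      # pin so upstream API changes can't break the build mid-hackathon
]
```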
### Deferred (Post-Hackathon)
- Compliance drift attacks (new required fields)
- Full 80-tick episodes with 50+ customers
- Docker containerization
- A2A protocol integration
- Full GRPO training convergence (multi-epoch, all 3 agents)
- Reward calibration pass
- Real datetime-based SLA (currently tick-based)
- Multi-GPU distributed training

---

## Key Judges to Note

### First Round
- **Sanyam Bhutani** (Meta), **Ali Sol**, **Hamid Shojanazeri**, **Matthias Reso** (Meta AI/ML Engineers)
- **Michael Han** (Unsloth CTO)
- **Soham Tiwari**, **Edgar Arakelyan**, **Divyansh Agarwal** (Scale AI)
- **Robert Alward**, **Will Bryan**, **Wyatt Marshall** (Halluminate AI)

### Final Round
- **Daniel Han** (Unsloth Co-Founder): cares about Unsloth/TRL integration
- **David Corbitt** (CoreWeave): cares about compute efficiency
- **Sanyam Bhutani** (Meta): cares about OpenEnv quality
- **Nicolai Ouporov** (Fleet AI): sponsors the Scalable Oversight sub-theme
- **Jerry Wu** (Halluminate AI): sponsors the Multi-Actor Environments sub-theme
- **Benjamin Burtenshaw** (HuggingFace): cares about Hub deployment
- **Darshan Deshpande** (Patronus AI): sponsors the Schema Drift sub-theme
- **Anshuman Singh** (Scaler AI Labs): sponsors the Enterprise Workflows sub-theme