Commit ·
dc8bc66
Parent(s): 707377e
Refine build plan with devil's advocate corrections
- Switch to MCPEnvironment base class (auto MCP tool routing)
- Cut MCP-X gateway (stretch goal only)
- Use _step_impl() instead of step() for game logic
- Add Phase 0 pre-flight (H100 test + video script)
- Revised time allocation: Phase 1 expanded to 3.5h, Phase 3 compressed to 0.5h
- Hard SFT fallback at 1.5h into training phase
- Insurance HF Spaces deploy at Checkpoint 2
- Document Action extra='forbid' gotcha and reserved tool names
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- plan/phase-2-environment-core.md +175 -15
- plan/phase-3-mcp-and-server.md +116 -396
- plan/phase-4-demo-and-ui.md +5 -3
- plan/phase-5-training.md +3 -1
- plan/phase-6-polish-and-submit.md +41 -60
plan/phase-2-environment-core.md
CHANGED
|
@@ -20,23 +20,28 @@
|
|
| 20 |
|
| 21 |
## Step-by-Step Build Instructions
|
| 22 |
|
| 23 |
-
### Step 1: environment.py -- Core Class (
|
| 24 |
|
| 25 |
-
This is the most critical file.
|
| 26 |
|
| 27 |
-
**
|
| 28 |
-
- `
|
| 29 |
-
- `
|
| 30 |
-
- `
|
| 31 |
-
- `
|
| 32 |
- `SUPPORTS_CONCURRENT_SESSIONS: bool = True` (class attribute)
|
| 33 |
|
| 34 |
```python
|
|
|
|
| 35 |
import random
|
| 36 |
from uuid import uuid4
|
| 37 |
from typing import Any, Dict, List, Optional
|
| 38 |
|
| 39 |
-
from
|
|
|
|
| 40 |
from openenv.core.env_server.types import State
|
| 41 |
|
| 42 |
from .models import (
|
|
@@ -53,7 +58,7 @@ from .rewards import compute_attacker_reward, compute_worker_reward, compute_ove
|
|
| 53 |
from .task_generator import generate_tasks, generate_customers, generate_invoices, generate_tickets
|
| 54 |
|
| 55 |
|
| 56 |
-
class SentinelOpsArena(
|
| 57 |
SUPPORTS_CONCURRENT_SESSIONS = True
|
| 58 |
|
| 59 |
NUM_CUSTOMERS = 15
|
|
@@ -63,7 +68,132 @@ class SentinelOpsArena(Environment[SentinelAction, SentinelObservation, Sentinel
|
|
| 63 |
MAX_TICKS = 30
|
| 64 |
|
| 65 |
def __init__(self):
|
| 66 |
-
|
| 67 |
self._state = SentinelState(episode_id=str(uuid4()), step_count=0)
|
| 68 |
self.crm = CRMSystem()
|
| 69 |
self.billing = BillingSystem()
|
|
@@ -116,7 +246,10 @@ class SentinelOpsArena(Environment[SentinelAction, SentinelObservation, Sentinel
|
|
| 116 |
|
| 117 |
return self._make_observation(AgentRole.ATTACKER, reward=0.0, done=False)
|
| 118 |
|
| 119 |
-
def
|
| 120 |
expected_agent = self.turn_order[self.current_agent_idx]
|
| 121 |
|
| 122 |
# Validate agent turn
|
|
@@ -536,13 +669,35 @@ Scores: {...}
|
|
| 536 |
CHECKPOINT 1 PASSED
|
| 537 |
```
|
| 538 |
|
| 539 |
### Also verify the HTTP server works:
|
| 540 |
```bash
|
| 541 |
-
cd sentinelops_arena
|
| 542 |
python -c "
|
| 543 |
from openenv.core.env_server.http_server import create_app
|
| 544 |
-
from models import SentinelAction, SentinelObservation
|
| 545 |
-
from environment import SentinelOpsArena
|
| 546 |
app = create_app(SentinelOpsArena, SentinelAction, SentinelObservation, env_name='sentinelops_arena')
|
| 547 |
print('create_app() OK')
|
| 548 |
"
|
|
@@ -554,8 +709,10 @@ print('create_app() OK')
|
|
| 554 |
|
| 555 |
| Issue | Cause | Fix |
|
| 556 |
|-------|-------|-----|
|
| 557 |
-
| `TypeError:
|
|
|
|
| 558 |
| `state is not a property` | Defined `def state()` instead of `@property def state` | Use `@property` decorator |
|
|
|
|
| 559 |
| Turn order not advancing | `current_agent_idx` not updating | Check modulo arithmetic: `(idx + 1) % 3` |
|
| 560 |
| Tick not incrementing | Forgot tick advance on full rotation | `if current_agent_idx == 0: tick += 1` |
|
| 561 |
| Episode never ends | `done` condition wrong | Check `self.tick >= self.MAX_TICKS` after advancing |
|
|
@@ -575,6 +732,9 @@ print('create_app() OK')
|
|
| 575 |
- [ ] Rewards compute without errors (all 3 reward functions)
|
| 576 |
- [ ] Wrong-turn actions receive penalty
|
| 577 |
- [ ] `demo.py` runs a full episode without crashing
|
| 578 |
- [ ] `create_app()` creates a valid ASGI app
|
| 579 |
|
| 580 |
---
|
|
|
|
| 20 |
|
| 21 |
## Step-by-Step Build Instructions
|
| 22 |
|
| 23 |
+
### Step 1: environment.py -- Core Class with MCPEnvironment (75 min)
|
| 24 |
|
| 25 |
+
This is the most critical file. Use `MCPEnvironment` as the base class.
|
| 26 |
|
| 27 |
+
**MCPEnvironment API Contract (from installed code):**
|
| 28 |
+
- `MCPEnvironment` extends `Environment`, takes a `FastMCP` server in `__init__`
|
| 29 |
+
- `step()` auto-routes `ListToolsAction` -> `_handle_list_tools()` and `CallToolAction` -> `_handle_call_tool()`
|
| 30 |
+
- All other actions go to abstract `_step_impl(self, action, timeout_s=None, **kwargs) -> Observation`
|
| 31 |
+
- `reset()` and `state` are still abstract (inherited from `Environment`)
|
| 32 |
- `SUPPORTS_CONCURRENT_SESSIONS: bool = True` (class attribute)
|
| 33 |
+
- **RESERVED TOOL NAMES:** `reset`, `step`, `state`, `close` CANNOT be used as MCP tool names
|
| 34 |
+
|
| 35 |
+
**Architecture:** MCP tools (enterprise system APIs) are defined as FastMCP tools inside `__init__`. MCPEnvironment auto-routes `CallToolAction` to these tools. Non-MCP actions (turn management, game logic) go through `_step_impl`.
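
The routing rule described above can be sketched with stand-in classes. This is a toy model only: `ListTools`, `CallTool`, `GameAction`, and `ToyRoutingEnv` are hypothetical names invented for illustration, and the real `MCPEnvironment` in OpenEnv may differ in signatures and return types.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ListTools:
    """Hypothetical stand-in for ListToolsAction."""


@dataclass
class CallTool:
    """Hypothetical stand-in for CallToolAction."""
    tool_name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GameAction:
    """Hypothetical stand-in for a non-MCP action such as SentinelAction."""
    agent: str


class ToyRoutingEnv:
    """Toy model of the routing rule; not the OpenEnv implementation."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self._tools = tools

    def step(self, action):
        # MCP actions are handled by the tool registry...
        if isinstance(action, ListTools):
            return sorted(self._tools)
        if isinstance(action, CallTool):
            return self._tools[action.tool_name](**action.arguments)
        # ...and everything else falls through to the game logic hook.
        return self._step_impl(action)

    def _step_impl(self, action):
        return f"game logic for {action.agent}"


env = ToyRoutingEnv({"lookup_customer": lambda customer_id: {"id": customer_id}})
tools_listed = env.step(ListTools())
tool_result = env.step(CallTool("lookup_customer", {"customer_id": "C000"}))
game_result = env.step(GameAction(agent="attacker"))
```

The point of the split is that subclasses only override `_step_impl`; overriding `step` directly would bypass the tool routing.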
|
| 36 |
|
| 37 |
```python
|
| 38 |
+
import json
|
| 39 |
import random
|
| 40 |
from uuid import uuid4
|
| 41 |
from typing import Any, Dict, List, Optional
|
| 42 |
|
| 43 |
+
from fastmcp import FastMCP
|
| 44 |
+
from openenv.core.env_server.mcp_environment import MCPEnvironment
|
| 45 |
from openenv.core.env_server.types import State
|
| 46 |
|
| 47 |
from .models import (
|
|
|
|
| 58 |
from .task_generator import generate_tasks, generate_customers, generate_invoices, generate_tickets
|
| 59 |
|
| 60 |
|
| 61 |
+
class SentinelOpsArena(MCPEnvironment):
|
| 62 |
SUPPORTS_CONCURRENT_SESSIONS = True
|
| 63 |
|
| 64 |
NUM_CUSTOMERS = 15
|
|
|
|
| 68 |
MAX_TICKS = 30
|
| 69 |
|
| 70 |
def __init__(self):
|
| 71 |
+
# Create FastMCP server with enterprise system tools
|
| 72 |
+
mcp = FastMCP("sentinelops")
|
| 73 |
+
|
| 74 |
+
# --- Worker tools (enterprise system APIs) ---
|
| 75 |
+
@mcp.tool()
|
| 76 |
+
def lookup_customer(customer_id: str) -> str:
|
| 77 |
+
"""Look up a customer record in the CRM system."""
|
| 78 |
+
return json.dumps(self.crm.lookup_customer(customer_id))
|
| 79 |
+
|
| 80 |
+
@mcp.tool()
|
| 81 |
+
def update_tier(customer_id: str, new_tier: str) -> str:
|
| 82 |
+
"""Update a customer's tier level (gold/silver/bronze)."""
|
| 83 |
+
return json.dumps(self.crm.update_tier(customer_id, new_tier))
|
| 84 |
+
|
| 85 |
+
@mcp.tool()
|
| 86 |
+
def add_note(customer_id: str, note: str) -> str:
|
| 87 |
+
"""Add a note to a customer's record."""
|
| 88 |
+
return json.dumps(self.crm.add_note(customer_id, note))
|
| 89 |
+
|
| 90 |
+
@mcp.tool()
|
| 91 |
+
def get_history(customer_id: str) -> str:
|
| 92 |
+
"""Get interaction history for a customer."""
|
| 93 |
+
return json.dumps(self.crm.get_history(customer_id))
|
| 94 |
+
|
| 95 |
+
@mcp.tool()
|
| 96 |
+
def check_balance(customer_id: str) -> str:
|
| 97 |
+
"""Check the billing balance for a customer."""
|
| 98 |
+
return json.dumps(self.billing.check_balance(customer_id))
|
| 99 |
+
|
| 100 |
+
@mcp.tool()
|
| 101 |
+
def issue_refund(invoice_id: str, amount: float, reason: str) -> str:
|
| 102 |
+
"""Issue a refund for an invoice. Must comply with current refund policy."""
|
| 103 |
+
return json.dumps(self.billing.issue_refund(invoice_id, amount, reason))
|
| 104 |
+
|
| 105 |
+
@mcp.tool()
|
| 106 |
+
def apply_credit(customer_id: str, amount: float) -> str:
|
| 107 |
+
"""Apply a credit to a customer's account."""
|
| 108 |
+
return json.dumps(self.billing.apply_credit(customer_id, amount))
|
| 109 |
+
|
| 110 |
+
@mcp.tool()
|
| 111 |
+
def generate_invoice(customer_id: str, items: str, amount: float) -> str:
|
| 112 |
+
"""Generate a new invoice. Items should be comma-separated."""
|
| 113 |
+
item_list = [i.strip() for i in items.split(",")]
|
| 114 |
+
return json.dumps(self.billing.generate_invoice(customer_id, item_list, amount))
|
| 115 |
+
|
| 116 |
+
@mcp.tool()
|
| 117 |
+
def create_ticket(customer_id: str, subject: str, priority: str = "medium") -> str:
|
| 118 |
+
"""Create a new support ticket."""
|
| 119 |
+
return json.dumps(self.ticketing.create_ticket(
|
| 120 |
+
customer_id, subject, TicketPriority(priority)))
|
| 121 |
+
|
| 122 |
+
@mcp.tool()
|
| 123 |
+
def assign_ticket(ticket_id: str, agent_name: str) -> str:
|
| 124 |
+
"""Assign a ticket to an agent."""
|
| 125 |
+
return json.dumps(self.ticketing.assign_ticket(ticket_id, agent_name))
|
| 126 |
+
|
| 127 |
+
@mcp.tool()
|
| 128 |
+
def escalate_ticket(ticket_id: str, reason: str) -> str:
|
| 129 |
+
"""Escalate a ticket to a senior agent."""
|
| 130 |
+
return json.dumps(self.ticketing.escalate(ticket_id, reason))
|
| 131 |
+
|
| 132 |
+
@mcp.tool()
|
| 133 |
+
def resolve_ticket(ticket_id: str, resolution: str) -> str:
|
| 134 |
+
"""Resolve a ticket with the given resolution."""
|
| 135 |
+
return json.dumps(self.ticketing.resolve(ticket_id, resolution))
|
| 136 |
+
|
| 137 |
+
@mcp.tool()
|
| 138 |
+
def check_sla(ticket_id: str) -> str:
|
| 139 |
+
"""Check SLA status for a ticket (ticks remaining before breach)."""
|
| 140 |
+
return json.dumps(self.ticketing.check_sla(ticket_id))
|
| 141 |
+
|
| 142 |
+
@mcp.tool()
|
| 143 |
+
def get_schema(system: str) -> str:
|
| 144 |
+
"""Get current field schema for a system. Critical after schema drift."""
|
| 145 |
+
sys_obj = self._get_system(system)
|
| 146 |
+
if sys_obj is None:
|
| 147 |
+
return json.dumps({"error": f"Unknown system: {system}"})
|
| 148 |
+
return json.dumps(sys_obj.get_schema())
|
| 149 |
+
|
| 150 |
+
@mcp.tool()
|
| 151 |
+
def get_current_policy(policy_type: str = "refund") -> str:
|
| 152 |
+
"""Get the current policy (refund or sla). Critical after policy drift."""
|
| 153 |
+
if policy_type == "refund":
|
| 154 |
+
return json.dumps(self.billing.get_current_policy())
|
| 155 |
+
elif policy_type == "sla":
|
| 156 |
+
return json.dumps(self.ticketing.get_sla_rules())
|
| 157 |
+
return json.dumps({"error": f"Unknown policy type: {policy_type}"})
|
| 158 |
+
|
| 159 |
+
@mcp.tool()
|
| 160 |
+
def launch_attack(attack_type: str, target_system: str,
|
| 161 |
+
parameters_json: str = "{}") -> str:
|
| 162 |
+
"""Launch an attack on an enterprise system (attacker only).
|
| 163 |
+
Types: schema_drift, policy_drift, social_engineering, rate_limit."""
|
| 164 |
+
params = json.loads(parameters_json)
|
| 165 |
+
params["attack_type"] = attack_type
|
| 166 |
+
params["target_system"] = target_system
|
| 167 |
+
result = self.attack_manager.launch_attack(
|
| 168 |
+
AttackType(attack_type), TargetSystem(target_system), params, self.tick)
|
| 169 |
+
return json.dumps(result)
|
| 170 |
+
|
| 171 |
+
@mcp.tool()
|
| 172 |
+
def get_attack_budget() -> str:
|
| 173 |
+
"""Get remaining attack budget for this episode."""
|
| 174 |
+
budget = self.attack_manager.attack_budget if self.attack_manager else 10.0
|
| 175 |
+
return json.dumps({"budget": budget})
|
| 176 |
+
|
| 177 |
+
@mcp.tool()
|
| 178 |
+
def flag_action(flagged: bool, severity: int = 3,
|
| 179 |
+
violation_type: str = "policy_violation",
|
| 180 |
+
explanation: str = "") -> str:
|
| 181 |
+
"""Flag or approve a worker action (oversight only)."""
|
| 182 |
+
return json.dumps({
|
| 183 |
+
"flagged": flagged, "severity": severity,
|
| 184 |
+
"violation_type": violation_type, "explanation": explanation,
|
| 185 |
+
})
|
| 186 |
+
|
| 187 |
+
@mcp.tool()
|
| 188 |
+
def get_trajectory(num_recent: int = 5) -> str:
|
| 189 |
+
"""Get recent action trajectory for oversight analysis."""
|
| 190 |
+
trajectory = self.trajectory[-num_recent:] if self.trajectory else []
|
| 191 |
+
return json.dumps(trajectory)
|
| 192 |
+
|
| 193 |
+
# Initialize MCPEnvironment with the FastMCP server
|
| 194 |
+
super().__init__(mcp)
|
| 195 |
+
|
| 196 |
+
# Initialize systems
|
| 197 |
self._state = SentinelState(episode_id=str(uuid4()), step_count=0)
|
| 198 |
self.crm = CRMSystem()
|
| 199 |
self.billing = BillingSystem()
|
|
|
|
| 246 |
|
| 247 |
return self._make_observation(AgentRole.ATTACKER, reward=0.0, done=False)
|
| 248 |
|
| 249 |
+
def _step_impl(self, action: SentinelAction, timeout_s=None, **kwargs) -> SentinelObservation:
|
| 250 |
+
"""Handle non-MCP actions (game logic, turn management).
|
| 251 |
+
MCPEnvironment.step() auto-routes ListToolsAction/CallToolAction
|
| 252 |
+
to the FastMCP server. Everything else comes here."""
|
| 253 |
expected_agent = self.turn_order[self.current_agent_idx]
|
| 254 |
|
| 255 |
# Validate agent turn
|
|
|
|
| 669 |
CHECKPOINT 1 PASSED
|
| 670 |
```
|
| 671 |
|
| 672 |
+
### Also verify MCPEnvironment MCP routing works:
|
| 673 |
+
```bash
|
| 674 |
+
python -c "
|
| 675 |
+
from openenv.core.env_server.mcp_types import ListToolsAction, CallToolAction
|
| 676 |
+
from sentinelops_arena.environment import SentinelOpsArena
|
| 677 |
+
env = SentinelOpsArena()
|
| 678 |
+
env.reset(seed=42)
|
| 679 |
+
|
| 680 |
+
# Test MCP tool discovery
|
| 681 |
+
obs = env.step(ListToolsAction())
|
| 682 |
+
tool_names = [t.name for t in obs.tools]
|
| 683 |
+
print(f'MCP tools available: {tool_names}')
|
| 684 |
+
assert 'lookup_customer' in tool_names
|
| 685 |
+
assert 'launch_attack' in tool_names
|
| 686 |
+
assert 'reset' not in tool_names # reserved
|
| 687 |
+
|
| 688 |
+
# Test MCP tool call
|
| 689 |
+
obs = env.step(CallToolAction(tool_name='lookup_customer', arguments={'customer_id': 'C000'}))
|
| 690 |
+
print(f'Tool result: {obs.result}')
|
| 691 |
+
print('MCPEnvironment MCP routing OK')
|
| 692 |
+
"
|
| 693 |
+
```
|
| 694 |
+
|
| 695 |
### Also verify the HTTP server works:
|
| 696 |
```bash
|
|
|
|
| 697 |
python -c "
|
| 698 |
from openenv.core.env_server.http_server import create_app
|
| 699 |
+
from sentinelops_arena.models import SentinelAction, SentinelObservation
|
| 700 |
+
from sentinelops_arena.environment import SentinelOpsArena
|
| 701 |
app = create_app(SentinelOpsArena, SentinelAction, SentinelObservation, env_name='sentinelops_arena')
|
| 702 |
print('create_app() OK')
|
| 703 |
"
|
|
|
|
| 709 |
|
| 710 |
| Issue | Cause | Fix |
|
| 711 |
|-------|-------|-----|
|
| 712 |
+
| `TypeError: MCPEnvironment.__init__() missing mcp_server` | Forgot to pass FastMCP to super() | Call `super().__init__(mcp)` with FastMCP instance |
|
| 713 |
+
| `ValueError: MCP tools cannot use reserved names` | Tool named `reset`, `step`, `state`, or `close` | Rename the tool (e.g., `env_reset`), or better, choose names that do not overlap at all |
|
| 714 |
| `state is not a property` | Defined `def state()` instead of `@property def state` | Use `@property` decorator |
|
| 715 |
+
| `_step_impl not defined` | Forgot to implement abstract method | MCPEnvironment requires `_step_impl()`, not `step()` |
|
| 716 |
| Turn order not advancing | `current_agent_idx` not updating | Check modulo arithmetic: `(idx + 1) % 3` |
|
| 717 |
| Tick not incrementing | Forgot tick advance on full rotation | `if current_agent_idx == 0: tick += 1` |
|
| 718 |
| Episode never ends | `done` condition wrong | Check `self.tick >= self.MAX_TICKS` after advancing |
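
The last three rows of the table describe one piece of bookkeeping: rotate the agent index, bump the tick on a full rotation, then check the end condition. A minimal sketch follows, assuming a three-agent order of attacker, worker, oversight (only the attacker-first start and the `% 3` are confirmed by the plan) and using hypothetical names `TurnState` and `advance_turn`.

```python
from dataclasses import dataclass

# Assumed order; the plan only confirms that the attacker moves first.
TURN_ORDER = ("attacker", "worker", "oversight")
MAX_TICKS = 30  # matches the class constant in the plan


@dataclass
class TurnState:
    current_agent_idx: int = 0
    tick: int = 0
    done: bool = False


def advance_turn(state: TurnState) -> TurnState:
    """Move to the next agent; a wrap to index 0 completes one tick."""
    state.current_agent_idx = (state.current_agent_idx + 1) % len(TURN_ORDER)
    if state.current_agent_idx == 0:
        state.tick += 1
    # Check the end condition only after the tick has been advanced.
    state.done = state.tick >= MAX_TICKS
    return state


state = TurnState()
steps = 0
while not state.done:
    advance_turn(state)
    steps += 1
```

Under these assumptions a full episode is `3 * MAX_TICKS` steps, which is a quick sanity check for the "Episode never ends" symptom.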
|
|
|
|
| 732 |
- [ ] Rewards compute without errors (all 3 reward functions)
|
| 733 |
- [ ] Wrong-turn actions receive penalty
|
| 734 |
- [ ] `demo.py` runs a full episode without crashing
|
| 735 |
+
- [ ] `ListToolsAction` returns all MCP tools (via MCPEnvironment auto-routing)
|
| 736 |
+
- [ ] `CallToolAction` successfully calls enterprise system tools
|
| 737 |
+
- [ ] No reserved tool names used (`reset`, `step`, `state`, `close`)
|
| 738 |
- [ ] `create_app()` creates a valid ASGI app
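
The reserved-name item in the checklist can be checked mechanically before the server is constructed. The helper below is a hypothetical sketch (`find_reserved_collisions` is not part of OpenEnv or FastMCP); it only compares a list of tool names against the four reserved names listed in this plan.

```python
from typing import Iterable, List

# Reserved names as listed in the plan.
RESERVED_TOOL_NAMES = frozenset({"reset", "step", "state", "close"})


def find_reserved_collisions(tool_names: Iterable[str]) -> List[str]:
    """Return the tool names that collide with reserved names, sorted."""
    return sorted(set(tool_names) & RESERVED_TOOL_NAMES)


ok_names = ["lookup_customer", "launch_attack", "get_trajectory"]
bad_names = ["lookup_customer", "step", "reset"]

ok_collisions = find_reserved_collisions(ok_names)
bad_collisions = find_reserved_collisions(bad_names)
```

Running a check like this in a unit test gives a clearer failure than waiting for the environment constructor to raise.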
|
| 739 |
|
| 740 |
---
|
plan/phase-3-mcp-and-server.md
CHANGED
|
@@ -1,8 +1,10 @@
|
|
| 1 |
-
# Phase 3: MCP
|
| 2 |
|
| 3 |
-
**Time:**
|
| 4 |
-
**Priority:**
|
| 5 |
-
**Depends on:** Phase 2 (working environment)
|
|
|
|
|
|
|
| 6 |
|
| 7 |
---
|
| 8 |
|
|
@@ -10,32 +12,32 @@
|
|
| 10 |
|
| 11 |
| File | Purpose | Est. Time |
|
| 12 |
|------|---------|-----------|
|
| 13 |
-
| `sentinelops_arena/
|
| 14 |
-
|
|
| 15 |
-
|
|
| 16 |
-
| `mcp-x/mcp_x.py` | Copy from envbeats, no modifications needed | 5 min |
|
| 17 |
-
| `run_server.py` | Script to start both env server + MCP-X | 10 min |
|
| 18 |
-
| `tests/test_mcp.py` | MCP tool integration tests | 20 min |
|
| 19 |
|
| 20 |
---
|
| 21 |
|
| 22 |
## Step-by-Step Build Instructions
|
| 23 |
|
| 24 |
-
### Step 1: server.py -- OpenEnv HTTP Server (
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
```python
|
| 29 |
# sentinelops_arena/server.py
|
| 30 |
"""
|
| 31 |
-
|
| 32 |
|
| 33 |
Endpoints:
|
| 34 |
POST /reset -- Reset environment
|
| 35 |
-
POST /step -- Execute an action
|
| 36 |
GET /state -- Get current state
|
| 37 |
GET /schema -- Get action/observation schemas
|
| 38 |
-
WS /ws -- WebSocket for persistent sessions
|
| 39 |
|
| 40 |
Usage:
|
| 41 |
uvicorn sentinelops_arena.server:app --host 0.0.0.0 --port 8000
|
|
@@ -65,394 +67,120 @@ if __name__ == "__main__":
|
|
| 65 |
main(port=args.port)
|
| 66 |
```
|
| 67 |
|
| 68 |
-
### Step 2:
|
| 69 |
-
|
| 70 |
-
Expose enterprise system APIs as individual MCP tools. This is what LLM agents actually call.
|
| 71 |
-
|
| 72 |
-
```python
|
| 73 |
-
# sentinelops_arena/mcp_tools.py
|
| 74 |
-
"""
|
| 75 |
-
MCP tool definitions for SentinelOps Arena.
|
| 76 |
-
|
| 77 |
-
Exposes enterprise system APIs as MCP tools via FastMCP.
|
| 78 |
-
Tools are grouped by agent role (attacker/worker/oversight).
|
| 79 |
-
"""
|
| 80 |
-
import json
|
| 81 |
-
from fastmcp import FastMCP
|
| 82 |
-
|
| 83 |
-
from .environment import SentinelOpsArena
|
| 84 |
-
from .models import (
|
| 85 |
-
SentinelAction, AgentRole, AttackType, TargetSystem,
|
| 86 |
-
TicketPriority,
|
| 87 |
-
)
|
| 88 |
-
|
| 89 |
-
mcp = FastMCP("sentinelops", host="0.0.0.0", port=9500, stateless_http=True)
|
| 90 |
-
|
| 91 |
-
# Global environment instance (shared across MCP calls)
|
| 92 |
-
env = SentinelOpsArena()
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
# ============ Environment Control Tools ============
|
| 96 |
-
|
| 97 |
-
@mcp.tool()
|
| 98 |
-
def reset(seed: int = 42) -> str:
|
| 99 |
-
"""Reset the SentinelOps environment for a new episode."""
|
| 100 |
-
obs = env.reset(seed=seed)
|
| 101 |
-
return obs.model_dump_json()
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
@mcp.tool()
|
| 105 |
-
def step(action_json: str) -> str:
|
| 106 |
-
"""Take a step in the SentinelOps environment with a full action."""
|
| 107 |
-
action = SentinelAction.model_validate_json(action_json)
|
| 108 |
-
obs = env.step(action)
|
| 109 |
-
return obs.model_dump_json()
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
@mcp.tool()
|
| 113 |
-
def get_state() -> str:
|
| 114 |
-
"""Get the current environment state (tick, scores, active attacks)."""
|
| 115 |
-
return env.state.model_dump_json()
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
# ============ Worker Tools (Enterprise System APIs) ============
|
| 119 |
-
|
| 120 |
-
@mcp.tool()
|
| 121 |
-
def lookup_customer(customer_id: str) -> str:
|
| 122 |
-
"""Look up a customer record in the CRM system."""
|
| 123 |
-
result = env.crm.lookup_customer(customer_id)
|
| 124 |
-
return json.dumps(result)
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
@mcp.tool()
|
| 128 |
-
def update_tier(customer_id: str, new_tier: str) -> str:
|
| 129 |
-
"""Update a customer's tier level (gold/silver/bronze)."""
|
| 130 |
-
result = env.crm.update_tier(customer_id, new_tier)
|
| 131 |
-
return json.dumps(result)
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
@mcp.tool()
|
| 135 |
-
def add_note(customer_id: str, note: str) -> str:
|
| 136 |
-
"""Add a note to a customer's record."""
|
| 137 |
-
result = env.crm.add_note(customer_id, note)
|
| 138 |
-
return json.dumps(result)
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
@mcp.tool()
|
| 142 |
-
def get_history(customer_id: str) -> str:
|
| 143 |
-
"""Get interaction history for a customer."""
|
| 144 |
-
result = env.crm.get_history(customer_id)
|
| 145 |
-
return json.dumps(result)
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
@mcp.tool()
|
| 149 |
-
def check_balance(customer_id: str) -> str:
|
| 150 |
-
"""Check the billing balance for a customer."""
|
| 151 |
-
result = env.billing.check_balance(customer_id)
|
| 152 |
-
return json.dumps(result)
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
@mcp.tool()
|
| 156 |
-
def issue_refund(invoice_id: str, amount: float, reason: str) -> str:
|
| 157 |
-
"""Issue a refund for an invoice. Must comply with current refund policy."""
|
| 158 |
-
result = env.billing.issue_refund(invoice_id, amount, reason)
|
| 159 |
-
return json.dumps(result)
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
@mcp.tool()
|
| 163 |
-
def apply_credit(customer_id: str, amount: float) -> str:
|
| 164 |
-
"""Apply a credit to a customer's account."""
|
| 165 |
-
result = env.billing.apply_credit(customer_id, amount)
|
| 166 |
-
return json.dumps(result)
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
@mcp.tool()
|
| 170 |
-
def generate_invoice(customer_id: str, items: str, amount: float) -> str:
|
| 171 |
-
"""Generate a new invoice for a customer. Items should be comma-separated."""
|
| 172 |
-
item_list = [i.strip() for i in items.split(",")]
|
| 173 |
-
result = env.billing.generate_invoice(customer_id, item_list, amount)
|
| 174 |
-
return json.dumps(result)
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
@mcp.tool()
|
| 178 |
-
def create_ticket(customer_id: str, subject: str, priority: str = "medium") -> str:
|
| 179 |
-
"""Create a new support ticket."""
|
| 180 |
-
result = env.ticketing.create_ticket(customer_id, subject, TicketPriority(priority))
|
| 181 |
-
return json.dumps(result)
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
@mcp.tool()
|
| 185 |
-
def assign_ticket(ticket_id: str, agent_name: str) -> str:
|
| 186 |
-
"""Assign a ticket to an agent."""
|
| 187 |
-
result = env.ticketing.assign_ticket(ticket_id, agent_name)
|
| 188 |
-
return json.dumps(result)
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
@mcp.tool()
|
| 192 |
-
def escalate_ticket(ticket_id: str, reason: str) -> str:
|
| 193 |
-
"""Escalate a ticket to a senior agent."""
|
| 194 |
-
result = env.ticketing.escalate(ticket_id, reason)
|
| 195 |
-
return json.dumps(result)
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
@mcp.tool()
|
| 199 |
-
def resolve_ticket(ticket_id: str, resolution: str) -> str:
|
| 200 |
-
"""Resolve a ticket with the given resolution."""
|
| 201 |
-
result = env.ticketing.resolve(ticket_id, resolution)
|
| 202 |
-
return json.dumps(result)
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
@mcp.tool()
|
| 206 |
-
def check_sla(ticket_id: str) -> str:
|
| 207 |
-
"""Check SLA status for a ticket (ticks remaining before breach)."""
|
| 208 |
-
result = env.ticketing.check_sla(ticket_id)
|
| 209 |
-
return json.dumps(result)
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
@mcp.tool()
|
| 213 |
-
def get_schema(system: str) -> str:
|
| 214 |
-
"""Get the current field schema for a system (crm/billing/ticketing).
|
| 215 |
-
Critical after schema drift attacks -- fields may have been renamed."""
|
| 216 |
-
sys_obj = env._get_system(system)
|
| 217 |
-
if sys_obj is None:
|
| 218 |
-
return json.dumps({"error": f"Unknown system: {system}"})
|
| 219 |
-
return json.dumps(sys_obj.get_schema())
|
| 220 |
-
|
| 221 |
|
| 222 |
-
@mcp.tool()
|
| 223 |
-
def get_current_policy(policy_type: str = "refund") -> str:
|
| 224 |
-
"""Get the current policy (refund or sla).
|
| 225 |
-
Critical after policy drift attacks -- rules may have changed."""
|
| 226 |
-
if policy_type == "refund":
|
| 227 |
-
return json.dumps(env.billing.get_current_policy())
|
| 228 |
-
elif policy_type == "sla":
|
| 229 |
-
return json.dumps(env.ticketing.get_sla_rules())
|
| 230 |
-
return json.dumps({"error": f"Unknown policy type: {policy_type}"})
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
# ============ Attacker Tools ============
|
| 234 |
-
|
| 235 |
-
@mcp.tool()
|
| 236 |
-
def launch_attack(attack_type: str, target_system: str, parameters_json: str = "{}") -> str:
|
| 237 |
-
"""Launch an attack on an enterprise system.
|
| 238 |
-
Types: schema_drift, policy_drift, social_engineering, rate_limit.
|
| 239 |
-
Costs 0.3 reward points per attack."""
|
| 240 |
-
import json as _json
|
| 241 |
-
params = _json.loads(parameters_json)
|
| 242 |
-
params["attack_type"] = attack_type
|
| 243 |
-
params["target_system"] = target_system
|
| 244 |
-
result = env.attack_manager.launch_attack(
|
| 245 |
-
AttackType(attack_type), TargetSystem(target_system), params, env.tick
|
| 246 |
-
)
|
| 247 |
-
return json.dumps(result)
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
@mcp.tool()
|
| 251 |
-
def pass_turn() -> str:
|
| 252 |
-
"""Pass the attacker's turn without launching an attack."""
|
| 253 |
-
return json.dumps({"status": "passed"})
|
| 254 |
-
|
| 255 |
-
|
| 256 |
-
@mcp.tool()
|
| 257 |
-
def get_attack_budget() -> str:
|
| 258 |
-
"""Get the remaining attack budget for this episode."""
|
| 259 |
-
budget = env.attack_manager.attack_budget if env.attack_manager else 10.0
|
| 260 |
-
return json.dumps({"budget": budget})
|
| 261 |
-
|
| 262 |
-
|
| 263 |
-
# ============ Oversight Tools ============
|
| 264 |
-
|
| 265 |
-
@mcp.tool()
|
| 266 |
-
def flag_action(flagged: bool, severity: int = 3,
|
| 267 |
-
violation_type: str = "policy_violation",
|
| 268 |
-
explanation: str = "") -> str:
|
| 269 |
-
"""Flag or approve a worker action. Used by the oversight agent."""
|
| 270 |
-
return json.dumps({
|
| 271 |
-
"flagged": flagged,
|
| 272 |
-
"severity": severity,
|
| 273 |
-
"violation_type": violation_type,
|
| 274 |
-
"explanation": explanation,
|
| 275 |
-
})
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
@mcp.tool()
|
| 279 |
-
def get_trajectory(num_recent: int = 5) -> str:
|
| 280 |
-
"""Get recent action trajectory for oversight analysis."""
|
| 281 |
-
trajectory = env.trajectory[-num_recent:] if env.trajectory else []
|
| 282 |
-
return json.dumps(trajectory)
|
| 283 |
-
```
|
| 284 |
-
|
| 285 |
-
### Step 3: MCP-X Gateway Config (10 min)
|
| 286 |
-
|
| 287 |
-
```toml
|
| 288 |
-
# mcp-x/config.toml
|
| 289 |
-
[clients]
|
| 290 |
-
[clients.orchestrator]
|
| 291 |
-
auth_token = "orch-token-001"
|
| 292 |
-
|
| 293 |
-
[clients.attacker]
|
| 294 |
-
auth_token = "atk-token-001"
|
| 295 |
-
|
| 296 |
-
[clients.worker]
|
| 297 |
-
auth_token = "wrk-token-001"
|
| 298 |
-
|
| 299 |
-
[clients.oversight]
|
| 300 |
-
auth_token = "ovs-token-001"
|
| 301 |
-
|
| 302 |
-
[mcp_servers]
|
| 303 |
-
[mcp_servers.sentinelops]
|
| 304 |
-
url = "http://localhost:9500/mcp/"
|
| 305 |
-
from_client = "orchestrator"
|
| 306 |
-
|
| 307 |
-
[allow]
|
| 308 |
-
[allow.sentinelops]
|
| 309 |
-
attacker = ["launch_attack", "pass_turn", "get_attack_budget", "step", "reset", "get_state"]
|
| 310 |
-
worker = ["lookup_customer", "update_tier", "add_note", "get_history", "check_balance", "issue_refund", "apply_credit", "generate_invoice", "create_ticket", "assign_ticket", "escalate_ticket", "resolve_ticket", "check_sla", "get_schema", "get_current_policy", "step", "reset", "get_state"]
|
| 311 |
-
oversight = ["flag_action", "get_current_policy", "get_trajectory", "step", "reset", "get_state"]
|
| 312 |
-
```
|
| 313 |
-
|
| 314 |
-
### Step 4: Copy MCP-X (5 min)
|
| 315 |
-
|
| 316 |
-
Copy `envbeats/mcp-x/mcp_x.py` to `mcp-x/mcp_x.py`. No modifications needed -- it reads from `config.toml` in its working directory.
|
| 317 |
-
|
| 318 |
-
```bash
|
| 319 |
-
cp envbeats/mcp-x/mcp_x.py mcp-x/mcp_x.py
|
| 320 |
-
```
|
| 321 |
-
|
| 322 |
-
### Step 5: run_server.py -- Start Script (10 min)
|
| 323 |
-
|
| 324 |
-
```python
|
| 325 |
-
# run_server.py
|
| 326 |
-
"""Start both the OpenEnv HTTP server and MCP server."""
|
| 327 |
-
import subprocess
|
| 328 |
-
import sys
|
| 329 |
-
import time
|
| 330 |
-
|
| 331 |
-
def main():
|
| 332 |
-
# Start OpenEnv HTTP server on port 8000
|
| 333 |
-
env_proc = subprocess.Popen([
|
| 334 |
-
sys.executable, "-m", "uvicorn",
|
| 335 |
-
"sentinelops_arena.server:app",
|
| 336 |
-
"--host", "0.0.0.0", "--port", "8000",
|
| 337 |
-
])
|
| 338 |
-
|
| 339 |
-
# Start FastMCP server on port 9500
|
| 340 |
-
mcp_proc = subprocess.Popen([
|
| 341 |
-
sys.executable, "-c",
|
| 342 |
-
"from sentinelops_arena.mcp_tools import mcp; mcp.run()"
|
| 343 |
-
])
|
| 344 |
-
|
| 345 |
-
# Start MCP-X gateway on port 9000
|
| 346 |
-
mcpx_proc = subprocess.Popen([
|
| 347 |
-
sys.executable, "mcp-x/mcp_x.py", "--port", "9000"
|
| 348 |
-
])
|
| 349 |
-
|
| 350 |
-
print("Servers started:")
|
| 351 |
-
print(" OpenEnv HTTP: http://localhost:8000")
|
| 352 |
-
print(" MCP (FastMCP): http://localhost:9500")
|
| 353 |
-
print(" MCP-X Gateway: http://localhost:9000")
|
| 354 |
-
|
| 355 |
-
try:
|
| 356 |
-
env_proc.wait()
|
| 357 |
-
except KeyboardInterrupt:
|
| 358 |
-
env_proc.terminate()
|
| 359 |
-
mcp_proc.terminate()
|
| 360 |
-
mcpx_proc.terminate()
|
| 361 |
-
|
| 362 |
-
if __name__ == "__main__":
|
| 363 |
-
main()
|
| 364 |
-
```
|
| 365 |
-
|
| 366 |
-
---
|
| 367 |
-
|
| 368 |
-
## VERIFY
|
| 369 |
-
|
| 370 |
-
### Test 1: OpenEnv HTTP Server
|
| 371 |
```bash
|
| 372 |
# Start server
|
| 373 |
uvicorn sentinelops_arena.server:app --port 8000 &
|
| 374 |
|
| 375 |
# Test reset
|
| 376 |
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
|
| 377 |
-
# Should return: {"observation": {...}, "reward": null, "done": false}
|
| 378 |
|
| 379 |
-
# Test step
|
| 380 |
curl -X POST http://localhost:8000/step -H "Content-Type: application/json" \
|
| 381 |
-d '{"action": {"agent": "attacker", "action_type": "pass"}}'
|
| 382 |
-
|
| 383 |
|
| 384 |
# Test state
|
| 385 |
curl http://localhost:8000/state
|
| 386 |
-
# Should return: {"episode_id": "...", "step_count": 1, "tick": 0, ...}
|
| 387 |
|
| 388 |
# Test schema
|
| 389 |
curl http://localhost:8000/schema
|
| 390 |
-
# Should return action/observation/state JSON schemas
|
| 391 |
|
| 392 |
kill %1
|
| 393 |
```
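
The same smoke test can be scripted in Python with the standard library. The sketch below only builds the requests, so it runs without a server; the helper names are hypothetical, and the payload shape is copied from the `curl` commands above rather than from any confirmed OpenEnv schema.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # port taken from the uvicorn command above


def build_reset_request(base_url: str = BASE_URL) -> Request:
    """Build POST /reset with an empty JSON body."""
    return Request(
        f"{base_url}/reset",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_step_request(agent: str, action_type: str,
                       base_url: str = BASE_URL) -> Request:
    """Build POST /step with the action wrapped under an 'action' key."""
    payload = {"action": {"agent": agent, "action_type": action_type}}
    return Request(
        f"{base_url}/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset_req = build_reset_request()
step_req = build_step_request("attacker", "pass")
# To send against a running server: urllib.request.urlopen(step_req)
```

Keeping request construction separate from sending makes the payloads easy to assert on in tests before a server is available.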
|
| 394 |
|
| 395 |
-
###
|
|
|
|
| 396 |
```python
|
| 397 |
-
#
|
| 398 |
-
from mcp.client.streamable_http import streamablehttp_client
|
| 399 |
-
from mcp.client.session import ClientSession
|
| 400 |
import asyncio
|
| 401 |
|
| 402 |
-
|
| 403 |
-
|
| 404 |
-
|
| 405 |
-
|
| 406 |
|
| 407 |
-
|
| 408 |
-
tools = await session.list_tools()
|
| 409 |
-
tool_names = [t.name for t in tools.tools]
|
| 410 |
-
print(f"Available tools: {tool_names}")
|
| 411 |
-
assert "reset" in tool_names
|
| 412 |
-
assert "step" in tool_names
|
| 413 |
-
assert "lookup_customer" in tool_names
|
| 414 |
|
| 415 |
-
|
| 416 |
-
|
| 417 |
-
|
| 418 |
|
| 419 |
-
|
| 420 |
-
result = await session.call_tool("get_state", {})
|
| 421 |
-
print(f"State: {result.content[0].text[:100]}")
|
| 422 |
|
| 423 |
-
|
| 424 |
```
|
| 425 |
|
| 426 |
-
### Test
|
| 427 |
-
```
|
| 428 |
-
|
| 429 |
-
|
| 430 |
-
|
| 431 |
-
|
| 432 |
-
|
| 433 |
-
|
| 434 |
-
|
| 435 |
-
|
| 436 |
-
|
| 437 |
-
|
| 438 |
-
|
| 439 |
-
|
| 440 |
-
|
| 441 |
-
|
| 442 |
-
|
| 443 |
-
|
| 444 |
-
|
| 445 |
-
|
| 446 |
-
|
| 447 |
-
async with ClientSession(r, w) as session:
|
| 448 |
-
await session.initialize()
|
| 449 |
-
tools = await session.list_tools()
|
| 450 |
-
names = [t.name for t in tools.tools]
|
| 451 |
-
print(f"Attacker tools: {names}")
|
| 452 |
-
assert "launch_attack" in names
|
| 453 |
-
assert "lookup_customer" not in names # attacker cannot use CRM
|
| 454 |
-
|
| 455 |
-
asyncio.run(test_mcpx())
|
| 456 |
```
|
| 457 |
|
| 458 |
---
|
|
@@ -461,38 +189,30 @@ asyncio.run(test_mcpx())
|
|
| 461 |
|
| 462 |
| Issue | Cause | Fix |
|
| 463 |
|-------|-------|-----|
|
| 464 |
-
| `Port 8000
|
| 465 |
-
| `
|
| 466 |
-
|
|
| 467 |
-
|
|
| 468 |
-
|
|
| 469 |
-
| `
|
| 470 |
-
| MCP tool returns empty | Environment not reset | Call `reset` before other tools |
|
| 471 |
-
| `model_dump_json()` fails on complex types | Pydantic serialization issue | Use `json.dumps()` for dict results, `model_dump_json()` for Pydantic models |
|
| 472 |
|
| 473 |
---
|
| 474 |
|
| 475 |
## EXIT CRITERIA
|
| 476 |
|
| 477 |
- [ ] `uvicorn sentinelops_arena.server:app` starts without errors
|
| 478 |
-
- [ ] HTTP `/reset`, `/step`, `/state`, `/schema`
|
| 479 |
-
- [ ]
|
| 480 |
-
- [ ]
|
| 481 |
-
- [ ]
|
| 482 |
-
- [ ] `lookup_customer`, `issue_refund`, etc. return valid data
|
| 483 |
-
- [ ] MCP-X gateway starts on port 9000
|
| 484 |
-
- [ ] Worker token sees only worker tools
|
| 485 |
-
- [ ] Attacker token sees only attacker tools
|
| 486 |
-
- [ ] Oversight token sees only oversight tools
|
| 487 |
-
- [ ] Cross-role tool access denied (worker can't call launch_attack)
|
| 488 |
|
| 489 |
---
|
| 490 |
|
| 491 |
## ROLLBACK PLAN
|
| 492 |
|
| 493 |
-
|
| 494 |
-
1. **
|
| 495 |
-
2. **
|
| 496 |
-
3. **
|
| 497 |
|
| 498 |
Do NOT cut: `server.py` with `create_app()`. This is required for HF Spaces deployment.
|
|
|
|
| 1 |
+
# Phase 3: MCP + OpenEnv HTTP Server
|
| 2 |
|
| 3 |
+
**Time:** 0.5 hours (Hours 6-6.5)
|
| 4 |
+
**Priority:** MEDIUM -- MCPEnvironment did most of the work in Phase 2
|
| 5 |
+
**Depends on:** Phase 2 (working environment with MCP tools)
|
| 6 |
+
|
| 7 |
+
**KEY CHANGE:** MCPEnvironment handles MCP tool routing automatically. Phase 3 is now just creating the HTTP server entry point and verifying everything works end-to-end. MCP-X gateway is CUT.
|
| 8 |
|
| 9 |
---
|
| 10 |
|
|
|
|
| 12 |
|
| 13 |
| File | Purpose | Est. Time |
|
| 14 |
|------|---------|-----------|
|
| 15 |
+
| `sentinelops_arena/server.py` | `create_app()` HTTP server entry point | 10 min |
|
| 16 |
+
| Verify MCP tools via HTTP | End-to-end test | 10 min |
|
| 17 |
+
| Verify WebSocket + MCP | Integration test | 10 min |
|
| 18 |
|
| 19 |
---
|
| 20 |
|
| 21 |
## Step-by-Step Build Instructions
|
| 22 |
|
| 23 |
+
### Step 1: server.py -- OpenEnv HTTP Server (10 min)
|
| 24 |
|
| 25 |
+
This is trivial -- follow the hackathon_env template exactly.
|
| 26 |
|
| 27 |
```python
|
| 28 |
# sentinelops_arena/server.py
|
| 29 |
"""
|
| 30 |
+
HTTP server for SentinelOps Arena.
|
| 31 |
|
| 32 |
Endpoints:
|
| 33 |
POST /reset -- Reset environment
|
| 34 |
+
POST /step -- Execute an action (including ListToolsAction, CallToolAction)
|
| 35 |
GET /state -- Get current state
|
| 36 |
GET /schema -- Get action/observation schemas
|
| 37 |
+
WS   /ws      -- WebSocket for persistent sessions (supports `type: "mcp"` messages)
|
| 38 |
+
|
| 39 |
+
The MCPEnvironment base class handles MCP tool routing automatically.
|
| 40 |
+
Agents can discover tools via ListToolsAction and call them via CallToolAction.
|
| 41 |
|
| 42 |
Usage:
|
| 43 |
uvicorn sentinelops_arena.server:app --host 0.0.0.0 --port 8000
|
|
|
|
| 67 |
main(port=args.port)
|
| 68 |
```
|
| 69 |
|
| 70 |
+
### Step 2: Verify HTTP + MCP Integration (10 min)
|
| 71 |
|
| 72 |
```bash
|
| 73 |
# Start server
|
| 74 |
uvicorn sentinelops_arena.server:app --port 8000 &
|
| 75 |
|
| 76 |
# Test reset
|
| 77 |
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{}'
|
|
|
|
| 78 |
|
| 79 |
+
# Test step (regular action)
|
| 80 |
curl -X POST http://localhost:8000/step -H "Content-Type: application/json" \
|
| 81 |
-d '{"action": {"agent": "attacker", "action_type": "pass"}}'
|
| 82 |
+
|
| 83 |
+
# Test step (MCP list_tools -- auto-routed by MCPEnvironment)
|
| 84 |
+
curl -X POST http://localhost:8000/step -H "Content-Type: application/json" \
|
| 85 |
+
-d '{"action": {"type": "list_tools"}}'
|
| 86 |
+
# Should return available MCP tools
|
| 87 |
+
|
| 88 |
+
# Test step (MCP call_tool -- auto-routed by MCPEnvironment)
|
| 89 |
+
curl -X POST http://localhost:8000/step -H "Content-Type: application/json" \
|
| 90 |
+
-d '{"action": {"type": "call_tool", "tool_name": "lookup_customer", "arguments": {"customer_id": "C000"}}}'
|
| 91 |
+
# Should return customer data
|
| 92 |
|
| 93 |
# Test state
|
| 94 |
curl http://localhost:8000/state
|
|
|
|
| 95 |
|
| 96 |
# Test schema
|
| 97 |
curl http://localhost:8000/schema
|
|
|
|
| 98 |
|
| 99 |
kill %1
|
| 100 |
```
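
The request bodies in the curl calls above can also be built in Python, which avoids quoting mistakes in nested JSON. This is a minimal sketch, assuming `/step` accepts exactly the JSON shapes shown in those curl commands. The helper names `regular_action`, `list_tools_action`, and `call_tool_action` are hypothetical and are not part of OpenEnv.

```python
import json
from typing import Any, Dict


def regular_action(agent: str, action_type: str) -> Dict[str, Any]:
    """Wrap a game action in the {"action": ...} envelope used by /step."""
    return {"action": {"agent": agent, "action_type": action_type}}


def list_tools_action() -> Dict[str, Any]:
    """Body for MCP tool discovery, as in the list_tools curl call."""
    return {"action": {"type": "list_tools"}}


def call_tool_action(tool_name: str, **arguments: Any) -> Dict[str, Any]:
    """Body for an MCP tool call; keyword arguments become the tool arguments."""
    return {
        "action": {
            "type": "call_tool",
            "tool_name": tool_name,
            "arguments": arguments,
        }
    }


if __name__ == "__main__":
    # Print each body so it can be pasted into `curl -d '...'`.
    for body in (
        regular_action("attacker", "pass"),
        list_tools_action(),
        call_tool_action("lookup_customer", customer_id="C000"),
    ):
        print(json.dumps(body))
```

Keeping the envelope in one place means that if the server turns out to expect a different shape, only these helpers need to change.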
|
| 101 |
|
| 102 |
+
### Step 3: Verify WebSocket MCP Path (10 min)
|
| 103 |
+
|
| 104 |
```python
|
| 105 |
+
# Quick WebSocket test
|
|
|
|
|
|
|
| 106 |
import asyncio
|
| 107 |
+
import json
|
| 108 |
+
import websockets
|
| 109 |
+
|
| 110 |
+
async def test_ws():
|
| 111 |
+
async with websockets.connect("ws://localhost:8000/ws") as ws:
|
| 112 |
+
# Reset
|
| 113 |
+
await ws.send(json.dumps({"type": "reset", "data": {"seed": 42}}))
|
| 114 |
+
resp = json.loads(await ws.recv())
|
| 115 |
+
print(f"Reset: {resp['type']}")
|
| 116 |
+
|
| 117 |
+
# MCP via WebSocket
|
| 118 |
+
await ws.send(json.dumps({
|
| 119 |
+
"type": "mcp",
|
| 120 |
+
"data": {"method": "tools/list", "params": {}, "id": 1}
|
| 121 |
+
}))
|
| 122 |
+
resp = json.loads(await ws.recv())
|
| 123 |
+
print(f"MCP tools via WS: {resp}")
|
| 124 |
+
|
| 125 |
+
asyncio.run(test_ws())
|
| 126 |
+
```
|
| 127 |
+
|
| 128 |
+
---
|
| 129 |
+
|
| 130 |
+
## What MCPEnvironment Gives Us For Free
|
| 131 |
|
| 132 |
+
| Feature | How |
|
| 133 |
+
|---------|-----|
|
| 134 |
+
| MCP tool discovery | `ListToolsAction` -> returns all tools with schemas |
|
| 135 |
+
| MCP tool invocation | `CallToolAction(tool_name, arguments)` -> calls FastMCP tool |
|
| 136 |
+
| Reserved name validation | Rejects tools named `reset`, `step`, `state`, `close` |
|
| 137 |
+
| Timeout handling | Configurable timeout on tool calls |
|
| 138 |
+
| Error categorization | `ToolError` with types: execution_error, invalid_args, tool_not_found, timeout |
|
| 139 |
+
| WebSocket MCP path | `/ws` endpoint supports `type: "mcp"` messages |
|
| 140 |
+
| Async support | `_run_async_safely()` handles both sync and async contexts |
|
| 141 |
|
| 142 |
+
## What We DON'T Need (CUT)
|
| 143 |
|
| 144 |
+
| Removed | Reason |
|
| 145 |
+
|---------|--------|
|
| 146 |
+
| `mcp_tools.py` | MCP tools defined inside `environment.py` via FastMCP |
|
| 147 |
+
| `mcp-x/` directory | MCP-X gateway CUT -- MCPEnvironment handles tool exposure |
|
| 148 |
+
| `config.toml` | No MCP-X = no per-agent access control config |
|
| 149 |
+
| `run_server.py` | Single server is enough |
|
| 150 |
+
| Per-agent JWT tokens | Nice-to-have, not needed for demo/judging |
|
| 151 |
|
| 152 |
+
---
|
|
|
|
|
|
|
| 153 |
|
| 154 |
+
## VERIFY
|
| 155 |
+
|
| 156 |
+
### Test 1: HTTP Server starts
|
| 157 |
+
```bash
|
| 158 |
+
uvicorn sentinelops_arena.server:app --port 8000
|
| 159 |
+
# Should start without errors
|
| 160 |
+
# Should show "Uvicorn running on http://127.0.0.1:8000" (uvicorn's default host; add --host 0.0.0.0 to bind all interfaces)
|
| 161 |
```
|
| 162 |
|
| 163 |
+
### Test 2: All endpoints return valid JSON
|
| 164 |
+
```bash
|
| 165 |
+
# Reset -> Observation JSON
|
| 166 |
+
# Step -> Observation JSON
|
| 167 |
+
# State -> State JSON
|
| 168 |
+
# Schema -> Action/Observation/State schemas
|
| 169 |
+
```
|
| 170 |
+
|
| 171 |
+
### Test 3: MCP tools discoverable via HTTP
|
| 172 |
+
```bash
|
| 173 |
+
# POST /step with ListToolsAction -> list of tools
|
| 174 |
+
# Verify: lookup_customer, issue_refund, get_schema, launch_attack etc. all present
|
| 175 |
+
# Verify: no reserved names (reset, step, state, close)
|
| 176 |
+
```
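
The two "Verify" checks in Test 3 can be scripted once the tool names have been pulled out of the `/step` response. This is a minimal sketch of the check only. It assumes the names are already available as a list of strings (the response format is not specified here), and `audit_tool_names` and `EXPECTED_TOOLS` are hypothetical names. It is not how MCPEnvironment validates names internally.

```python
from typing import Dict, Iterable, List

# Names the plan says must never be used for MCP tools.
RESERVED_NAMES = frozenset({"reset", "step", "state", "close"})

# Tools named in Test 3; extend with the rest of the enterprise tools.
EXPECTED_TOOLS = ("lookup_customer", "issue_refund", "get_schema", "launch_attack")


def audit_tool_names(
    tool_names: Iterable[str],
    expected: Iterable[str] = EXPECTED_TOOLS,
) -> Dict[str, List[str]]:
    """Report expected tools that are missing and reserved names that leaked in."""
    names = set(tool_names)
    return {
        "missing": sorted(set(expected) - names),
        "reserved": sorted(names & RESERVED_NAMES),
    }


if __name__ == "__main__":
    report = audit_tool_names(
        ["lookup_customer", "issue_refund", "get_schema", "launch_attack"]
    )
    print(report)
```

Both lists being empty means Test 3 passes.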
|
| 177 |
+
|
| 178 |
+
### Test 4: MCP tools callable via HTTP
|
| 179 |
+
```bash
|
| 180 |
+
# POST /step with CallToolAction -> tool result
|
| 181 |
+
# Call lookup_customer("C000") -> customer data
|
| 182 |
+
# Call get_schema("crm") -> field list
|
| 183 |
+
# Call get_current_policy("refund") -> policy values
|
| 184 |
```
|
| 185 |
|
| 186 |
---
|
|
|
|
| 189 |
|
| 190 |
| Issue | Cause | Fix |
|
| 191 |
|-------|-------|-----|
|
| 192 |
+
| `Port 8000 already in use` | Previous server running | `kill $(lsof -t -i:8000)` |
|
| 193 |
+
| `create_app()` fails with type error | Wrong argument types | Pass the environment class (not an instance), along with the Action class and the Observation class |
|
| 194 |
+
| MCP tools not showing up | Tools defined after `super().__init__()` | Define tools BEFORE calling `super().__init__(mcp)` |
|
| 195 |
+
| `ValueError: reserved names` | Tool named `reset` or `step` | Rename the tool |
|
| 196 |
+
| WebSocket MCP not working | Wrong message format | Use `{"type": "mcp", "data": {"method": "tools/list", ...}}` |
|
| 197 |
+
| `ListToolsAction` not recognized | `create_app` doesn't know about MCP types | May need to pass both `SentinelAction` and MCP action types to create_app |
|
|
|
|
|
|
|
| 198 |
|
| 199 |
---
|
| 200 |
|
| 201 |
## EXIT CRITERIA
|
| 202 |
|
| 203 |
- [ ] `uvicorn sentinelops_arena.server:app` starts without errors
|
| 204 |
+
- [ ] HTTP `/reset`, `/step`, `/state`, `/schema` return valid JSON
|
| 205 |
+
- [ ] `ListToolsAction` via `/step` returns all enterprise system tools
|
| 206 |
+
- [ ] `CallToolAction` via `/step` successfully calls tools
|
| 207 |
+
- [ ] WebSocket `/ws` endpoint accepts connections
|
| 208 |
|
| 209 |
---
|
| 210 |
|
| 211 |
## ROLLBACK PLAN
|
| 212 |
|
| 213 |
+
Phase 3 is already minimal. If it takes longer than 30 minutes:
|
| 214 |
+
1. **Skip WebSocket verification** -- HTTP-only is fine for demo
|
| 215 |
+
2. **Skip schema endpoint check** -- not needed for judging
|
| 216 |
+
3. **If `create_app()` fails entirely** -- serve the Gradio app directly without the OpenEnv HTTP layer. The environment still works via direct Python calls.
|
| 217 |
|
| 218 |
Do NOT cut: `server.py` with `create_app()`. This is required for HF Spaces deployment.
|
plan/phase-4-demo-and-ui.md
CHANGED
|
@@ -1,8 +1,10 @@
|
|
| 1 |
# Phase 4: Demo Script + Gradio App + HF Spaces Deployment
|
| 2 |
|
| 3 |
-
**Time:** 2 hours (Hours
|
| 4 |
-
**Priority:** HIGH -- Storytelling is 30% of judging
|
| 5 |
-
**Depends on:** Phase 3 (
|
|
|
|
|
|
|
| 6 |
|
| 7 |
---
|
| 8 |
|
|
|
|
| 1 |
# Phase 4: Demo Script + Gradio App + HF Spaces Deployment
|
| 2 |
|
| 3 |
+
**Time:** 2 hours (Hours 6.5-8.5)
|
| 4 |
+
**Priority:** HIGH -- Storytelling is 30% of judging. Innovation (40%) + Storytelling (30%) = 70% non-code.
|
| 5 |
+
**Depends on:** Phase 3 (server working)
|
| 6 |
+
|
| 7 |
+
**IMPORTANT:** Deploy to HF Spaces at the END of this phase as INSURANCE SUBMISSION (Checkpoint 2). This is a good submission even if training fails later.
|
| 8 |
|
| 9 |
---
|
| 10 |
|
plan/phase-5-training.md
CHANGED
|
@@ -1,9 +1,11 @@
|
|
| 1 |
# Phase 5: Training Script -- Colab Notebook with GRPO
|
| 2 |
|
| 3 |
-
**Time:** 2
|
| 4 |
**Priority:** HIGH -- Training Script is 20% of judging and REQUIRED for submission
|
| 5 |
**Depends on:** Phase 2 (working environment)
|
| 6 |
|
|
|
|
|
|
|
| 7 |
---
|
| 8 |
|
| 9 |
## Files to Create
|
|
|
|
| 1 |
# Phase 5: Training Script -- Colab Notebook with GRPO
|
| 2 |
|
| 3 |
+
**Time:** 2 hours MAX (Hours 8.5-10.5)
|
| 4 |
**Priority:** HIGH -- Training Script is 20% of judging and REQUIRED for submission
|
| 5 |
**Depends on:** Phase 2 (working environment)
|
| 6 |
|
| 7 |
+
**HARD RULE:** If GRPO is not working after 1.5 hours (hour 10), FALL BACK TO SFT immediately. Training only needs to show "improvement" -- even a 0.1 reward increase counts. Do not spend more than 2h total on this phase.
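
The hard rule can be written down as a small decision function so nobody has to argue about it mid-hackathon. This is a minimal sketch, assuming time is measured in hours since the start of Phase 5. The function name `choose_training_method` and its return labels are hypothetical.

```python
SFT_FALLBACK_HOURS = 1.5  # switch to SFT if GRPO is not working by this point
PHASE_BUDGET_HOURS = 2.0  # never spend more than this on Phase 5


def choose_training_method(hours_elapsed: float, grpo_working: bool) -> str:
    """Return "grpo", "sft", or "stop" according to the Phase 5 hard rule."""
    if hours_elapsed >= PHASE_BUDGET_HOURS:
        # Budget exhausted: ship whatever training result exists.
        return "stop"
    if not grpo_working and hours_elapsed >= SFT_FALLBACK_HOURS:
        # GRPO has not produced a usable run by the cutoff: fall back.
        return "sft"
    return "grpo"


if __name__ == "__main__":
    for hours, working in [(1.0, False), (1.5, False), (1.7, True), (2.0, True)]:
        print(hours, working, choose_training_method(hours, working))
```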
|
| 8 |
+
|
| 9 |
---
|
| 10 |
|
| 11 |
## Files to Create
|
plan/phase-6-polish-and-submit.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
| 1 |
# Phase 6: Polish, Video, and Submit
|
| 2 |
|
| 3 |
-
**Time:**
|
| 4 |
-
**Priority:** CRITICAL -- this is when everything comes together
|
| 5 |
**Depends on:** All previous phases
|
| 6 |
|
| 7 |
---
|
|
@@ -10,11 +10,10 @@
|
|
| 10 |
|
| 11 |
| Task | Est. Time |
|
| 12 |
|------|-----------|
|
| 13 |
-
| Polish demo quality
|
| 14 |
-
|
|
| 15 |
-
| Final deployment + verification |
|
| 16 |
-
|
|
| 17 |
-
| Submission form | 15 min (Hours 13:45-14) |
|
| 18 |
|
| 19 |
---
|
| 20 |
|
|
@@ -34,11 +33,11 @@
|
|
| 34 |
- Highlight "key moments" in the replay (attack launched, error recovered, social eng resisted)
|
| 35 |
- Add score differential chart
|
| 36 |
|
| 37 |
-
**Optional: MCP
|
| 38 |
-
If
|
| 39 |
-
- Add a tab showing
|
| 40 |
-
-
|
| 41 |
-
-
|
| 42 |
|
| 43 |
### Hour 11-12: Stretch Goals (Pick Based on Time)
|
| 44 |
|
|
@@ -89,58 +88,40 @@ uvicorn sentinelops_arena.server:app --port 8000 # HTTP API works
|
|
| 89 |
curl http://localhost:8000/schema # Schema endpoint returns
|
| 90 |
```
|
| 91 |
|
| 92 |
-
### Hour
|
| 93 |
|
| 94 |
-
**Video Script (
|
| 95 |
|
| 96 |
-
|
| 97 |
-
[SLIDE 1: Title - 5 seconds]
|
| 98 |
-
"SentinelOps Arena: Multi-Agent Self-Play for Enterprise Security"
|
| 99 |
-
|
| 100 |
-
[SCREEN: Gradio app - 15 seconds]
|
| 101 |
-
"SentinelOps Arena is a multi-agent self-play training environment
|
| 102 |
-
built on OpenEnv. Three AI agents -- Attacker, Worker, and
|
| 103 |
-
Oversight -- interact with simulated enterprise systems."
|
| 104 |
-
|
| 105 |
-
[SCREEN: Run Episode tab - 20 seconds]
|
| 106 |
-
"Let me show you an episode. The attacker launches schema drift
|
| 107 |
-
at tick 7 -- renaming customer_id to account_id. Watch what
|
| 108 |
-
happens when the untrained worker hits this."
|
| 109 |
-
[Click Run Episode with trained=False]
|
| 110 |
-
"The worker crashes on the schema change. It doesn't know how
|
| 111 |
-
to recover."
|
| 112 |
-
|
| 113 |
-
[SCREEN: Comparison tab - 20 seconds]
|
| 114 |
-
"Now let's see the trained worker handle the same attacks."
|
| 115 |
-
[Click Run Comparison]
|
| 116 |
-
"The trained worker detects the KeyError, calls get_schema to
|
| 117 |
-
discover the new field name, and continues serving customers.
|
| 118 |
-
Score improvement is clear."
|
| 119 |
-
|
| 120 |
-
[SCREEN: Inspector tab - 10 seconds]
|
| 121 |
-
"Under the hood, we have 15 customers, 15 invoices, 10 tickets,
|
| 122 |
-
and 30 customer tasks per episode. Four attack types: schema
|
| 123 |
-
drift, policy drift, social engineering, and rate limiting."
|
| 124 |
-
|
| 125 |
-
[SCREEN: Colab notebook - 15 seconds]
|
| 126 |
-
"Training uses GRPO with Unsloth and TRL. The environment
|
| 127 |
-
provides reward signals directly to the training loop. Here
|
| 128 |
-
you can see the reward improving over training steps."
|
| 129 |
-
[Show training curves]
|
| 130 |
-
|
| 131 |
-
[SLIDE 2: Partner Tracks - 10 seconds]
|
| 132 |
-
"We target two partner tracks:
|
| 133 |
-
Fleet AI -- our Oversight agent monitors and explains Worker behavior
|
| 134 |
-
Patronus AI -- schema and policy drift are core attack types"
|
| 135 |
|
| 136 |
-
[SLIDE 3: Architecture - 10 seconds]
|
| 137 |
-
"Built on OpenEnv with MCP tools and an MCP-X gateway for
|
| 138 |
-
per-agent tool isolation. Three agents, three systems,
|
| 139 |
-
self-play training via GRPO."
|
| 140 |
-
|
| 141 |
-
[END - 5 seconds]
|
| 142 |
-
"SentinelOps Arena. Try it on HuggingFace Spaces."
|
| 143 |
```
|
| 144 |
|
| 145 |
**Recording instructions:**
|
| 146 |
1. Open Gradio app in browser
|
|
|
|
| 1 |
# Phase 6: Polish, Video, and Submit
|
| 2 |
|
| 3 |
+
**Time:** 3.5 hours (Hours 10.5-14)
|
| 4 |
+
**Priority:** CRITICAL -- this is when everything comes together. Storytelling = 30% of judging.
|
| 5 |
**Depends on:** All previous phases
|
| 6 |
|
| 7 |
---
|
|
|
|
| 10 |
|
| 11 |
| Task | Est. Time |
|
| 12 |
|------|-----------|
|
| 13 |
+
| Polish demo quality + stretch goals | 1h (Hours 10.5-11.5) |
|
| 14 |
+
| Record and upload video | 1.5h (Hours 11.5-13) |
|
| 15 |
+
| Final deployment + verification | 0.5h (Hours 13-13.5) |
|
| 16 |
+
| Submission form | 0.5h (Hours 13.5-14) |
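
A quick sanity check on the revised schedule: the four slots should run back to back and add up to the 3.5 hours stated for Phase 6. This is a minimal sketch using the hour ranges from the table above. The names `PHASE_6_SCHEDULE`, `total_hours`, and `is_contiguous` are hypothetical.

```python
from typing import List, Tuple

Slot = Tuple[str, float, float]  # (task, start hour, end hour)

PHASE_6_SCHEDULE: List[Slot] = [
    ("Polish demo quality + stretch goals", 10.5, 11.5),
    ("Record and upload video", 11.5, 13.0),
    ("Final deployment + verification", 13.0, 13.5),
    ("Submission form", 13.5, 14.0),
]


def total_hours(schedule: List[Slot]) -> float:
    """Sum the duration of every slot."""
    return sum(end - start for _, start, end in schedule)


def is_contiguous(schedule: List[Slot]) -> bool:
    """True when each slot starts exactly where the previous one ends."""
    return all(prev[2] == nxt[1] for prev, nxt in zip(schedule, schedule[1:]))


if __name__ == "__main__":
    print(total_hours(PHASE_6_SCHEDULE), is_contiguous(PHASE_6_SCHEDULE))
```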
|
|
|
|
| 17 |
|
| 18 |
---
|
| 19 |
|
|
|
|
| 33 |
- Highlight "key moments" in the replay (attack launched, error recovered, social eng resisted)
|
| 34 |
- Add score differential chart
|
| 35 |
|
| 36 |
+
**Optional: MCP Tool Discovery Tab**
|
| 37 |
+
If time permits:
|
| 38 |
+
- Add a Gradio tab showing MCP tool list (via ListToolsAction)
|
| 39 |
+
- Show tool schemas and descriptions
|
| 40 |
+
- Demonstrate CallToolAction calling enterprise system APIs
|
| 41 |
|
| 42 |
### Hour 11-12: Stretch Goals (Pick Based on Time)
|
| 43 |
|
|
|
|
| 88 |
curl http://localhost:8000/schema # Schema endpoint returns
|
| 89 |
```
|
| 90 |
|
| 91 |
+
### Hour 11.5-13: Demo Video
|
| 92 |
|
| 93 |
+
**PRIMARY Video Script (60 seconds -- tight and punchy):**
|
| 94 |
|
| 95 |
+
Write this script BEFORE starting the hackathon (Phase 0). It drives clarity on what to build and demo.
|
| 96 |
|
| 97 |
```
|
| 98 |
+
[0-10s: Problem statement]
|
| 99 |
+
"Enterprise AI agents break when schemas change, policies drift,
|
| 100 |
+
or they face social engineering. How do we train resilient agents?"
|
| 101 |
+
|
| 102 |
+
[10-20s: What SentinelOps Arena is]
|
| 103 |
+
"SentinelOps Arena: a multi-agent self-play environment on OpenEnv.
|
| 104 |
+
Three agents -- Attacker, Worker, and Oversight -- compete in
|
| 105 |
+
simulated enterprise systems."
|
| 106 |
+
|
| 107 |
+
[20-35s: SCREEN -- Demo showing attack -> error -> recovery cycle]
|
| 108 |
+
[Click Run Episode in Gradio]
|
| 109 |
+
"Watch: the attacker launches schema drift at tick 7. The untrained
|
| 110 |
+
worker crashes. But the trained worker detects the error, queries
|
| 111 |
+
get_schema, adapts, and continues serving customers."
|
| 112 |
+
|
| 113 |
+
[35-50s: SCREEN -- Training reward curve]
|
| 114 |
+
[Show Colab training curves]
|
| 115 |
+
"We train with GRPO using Unsloth and TRL. The reward signal
|
| 116 |
+
comes directly from the environment. Here you can see
|
| 117 |
+
improvement over training steps."
|
| 118 |
+
|
| 119 |
+
[50-60s: Partner tracks + close]
|
| 120 |
+
"Built for Fleet AI -- scalable oversight -- and Patronus AI --
|
| 121 |
+
schema drift. Try it on HuggingFace Spaces."
|
| 122 |
+
```
|
| 123 |
+
|
| 124 |
+
**EXTENDED Video Script (if time permits, 2-3 minutes):**
|
| 125 |
|
| 126 |
**Recording instructions:**
|
| 127 |
1. Open Gradio app in browser
|