Commit ·
1d762f3
1
Parent(s): 2f212fd
feat: phase3 improvements - reward clarity, survival clocks, MCP endpoint, phraseology docs
- README.md +38 -13
- _patcher.py +64 -0
- _test_fastapi.py +14 -0
- docs/reward_design.md +39 -0
- src/models.py +1 -0
- src/openenv_environment.py +7 -1
- src/server/app.py +9 -43
- src/tasks/registry.py +5 -5
- test_out.txt +0 -0
README.md
CHANGED
|
@@ -10,24 +10,24 @@
|
|
| 10 |
|
| 11 |
---
|
| 12 |
|
| 13 |
-
##
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
-
|
| 18 |
|
| 19 |
-
|
| 20 |
-
- Demands coverage awareness (keeping geographic zones protected)
|
| 21 |
-
- Rewards correct unit-type matching (sending a MEDIC vs. an ENGINE)
|
| 22 |
-
- Punishes delays that cause Priority-1 incidents to escalate
|
| 23 |
|
| 24 |
-
##
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
-
|
| 29 |
-
- **
|
| 30 |
-
- **
|
|
|
|
|
|
|
|
|
|
| 31 |
|
| 32 |
---
|
| 33 |
|
|
@@ -90,6 +90,17 @@ Actions are structured Pydantic models — no free-text parsing required.
|
|
| 90 |
| `UPGRADE` | Increase incident severity | New severity must be strictly higher than current |
|
| 91 |
| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |
|
| 92 |
| 93 |
---
|
| 94 |
|
| 95 |
## Observation Space
|
|
@@ -146,7 +157,7 @@ The step-level reward is a weighted combination of five components:
|
|
| 146 |
| `coverage` | **12%** | Geographic distribution of available units across city districts |
|
| 147 |
| `protocol` | **8%** | Action legality + optional phraseology/readback quality via `Action.notes` |
|
| 148 |
|
| 149 |
-
**Safety Gate**
|
| 150 |
|
| 151 |
**Non-DISPATCH actions** receive neutral `0.5` for `response_time` and `triage`, allowing agents to maintain coverage without penalty.
|
| 152 |
|
|
@@ -183,6 +194,8 @@ if resolved within 10 steps: score += 0.20
|
|
| 183 |
|
| 184 |
**What a good agent does**: Immediately dispatches `MED-1 → INC-001`.
|
| 185 |
|
|
|
|
|
|
|
| 186 |
---
|
| 187 |
|
| 188 |
### 🟡 Task 2: `multi_incident` — Simultaneous Triage (Medium)
|
|
@@ -202,6 +215,8 @@ score = 0.5 × p1_resolution_rate
|
|
| 202 |
|
| 203 |
**What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.
|
| 204 |
|
|
|
|
|
|
|
| 205 |
---
|
| 206 |
|
| 207 |
### 🔴 Task 3: `mass_casualty` — Wave-Based Surge (Hard)
|
|
@@ -221,6 +236,8 @@ score = 0.6 × p1_survival_rate
|
|
| 221 |
|
| 222 |
**What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.
|
| 223 |
|
|
|
|
|
|
|
| 224 |
---
|
| 225 |
|
| 226 |
### 🔴 Task 4: `shift_surge` — Long-Horizon Degradation (Hard)
|
|
@@ -241,6 +258,8 @@ score = 0.35 × resolution_ratio
|
|
| 241 |
|
| 242 |
**Why it's hard**: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.
|
| 243 |
|
|
|
|
|
|
|
| 244 |
---
|
| 245 |
|
| 246 |
## Unit Types
|
|
@@ -411,6 +430,12 @@ Run with `USE_RANDOM=true python inference.py` (seed=42, fully deterministic).
|
|
| 411 |
|
| 412 |
> **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.
|
| 413 |
| 414 |
LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher on easy and medium tasks by correctly prioritizing P1 incidents and matching unit types.
|
| 415 |
|
| 416 |
Run the baseline matrix (random + LLM reruns) and emit a JSON report:
|
|
|
|
| 10 |
|
| 11 |
---
|
| 12 |
|
| 13 |
+
## Why This Matters
|
| 14 |
|
| 15 |
+
911 dispatch centers in the United States handle over 240 million calls per year. Every dispatcher decision — which unit to send, in what order, with what priority — directly determines survival outcomes. A 90-second delay in dispatching a MEDIC to a cardiac arrest drops survival probability by roughly 10%.
|
| 16 |
|
| 17 |
+
The **911 Dispatch Supervisor** is the first open RL benchmark for training and evaluating AI agents on emergency dispatch decisions. It models the exact tradeoffs real dispatchers face: triage under uncertainty, multi-unit resource allocation, geographic coverage, and protocol compliance — all simultaneously.
|
| 18 |
|
| 19 |
+
This fills a direct gap for researchers building AI copilots for public safety systems, and provides immediate evaluation value for any LLM claiming real-world decision-making capability.
|
|
|
|
|
|
|
|
|
|
| 20 |
|
| 21 |
+
## Overview
|
| 22 |
|
| 23 |
+
At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate — under time pressure, limited resources, and competing priorities across a 100×100 city grid.
|
| 24 |
|
| 25 |
+
This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that:
|
| 26 |
+
- Requires **triage** — prioritizing life-threatening incidents over property damage
|
| 27 |
+
- Demands **coverage awareness** — keeping geographic zones protected
|
| 28 |
+
- Rewards **correct unit-type matching** — sending a MEDIC vs. an ENGINE
|
| 29 |
+
- Punishes **delays** that cause Priority-1 incidents to escalate
|
| 30 |
+
- Scores **dispatch phraseology** — realistic radio communication language
|
| 31 |
|
| 32 |
---
|
| 33 |
|
|
|
|
| 90 |
| `UPGRADE` | Increase incident severity | New severity must be strictly higher than current |
|
| 91 |
| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |
|
| 92 |
|
| 93 |
+
#### Dispatch Phraseology (bonus scoring)
|
| 94 |
+
|
| 95 |
+
The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to 8% bonus on their protocol score.
|
| 96 |
+
|
| 97 |
+
| Action | Example notes value |
|
| 98 |
+
|---|---|
|
| 99 |
+
| Dispatch MEDIC to cardiac | `"Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"` |
|
| 100 |
+
| Dispatch ENGINE to fire | `"Engine 2 responding to structure fire, Code 3, all units advised"` |
|
| 101 |
+
| Mutual aid request | `"Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"` |
|
| 102 |
+
| Stage unit | `"Engine 1 staging at District 3 perimeter, awaiting scene clear"` |
|
| 103 |
+
|
| 104 |
---
|
| 105 |
|
| 106 |
## Observation Space
|
|
|
|
| 157 |
| `coverage` | **12%** | Geographic distribution of available units across city districts |
|
| 158 |
| `protocol` | **8%** | Action legality + optional phraseology/readback quality via `Action.notes` |
|
| 159 |
|
| 160 |
+
> **⚠️ Safety Gate:** If any Priority-1 incident (cardiac arrest, shooting, building collapse) results in zero survival score, the entire episode reward is hard-capped at **0.2** regardless of other performance. This forces agents to treat life-threatening incidents as non-negotiable — exactly as real dispatch protocol requires.
|
| 161 |
|
| 162 |
**Non-DISPATCH actions** receive neutral `0.5` for `response_time` and `triage`, allowing agents to maintain coverage without penalty.
|
| 163 |
|
|
|
|
| 194 |
|
| 195 |
**What a good agent does**: Immediately dispatches `MED-1 → INC-001`.
|
| 196 |
|
| 197 |
+
**Scoring:** 50% resolution + 30% correct unit type used + 20% response speed.
|
| 198 |
+
|
| 199 |
---
|
| 200 |
|
| 201 |
### 🟡 Task 2: `multi_incident` — Simultaneous Triage (Medium)
|
|
|
|
| 215 |
|
| 216 |
**What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.
|
| 217 |
|
| 218 |
+
**Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty.
|
| 219 |
+
|
| 220 |
---
|
| 221 |
|
| 222 |
### 🔴 Task 3: `mass_casualty` — Wave-Based Surge (Hard)
|
|
|
|
| 236 |
|
| 237 |
**What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.
|
| 238 |
|
| 239 |
+
**Scoring:** 60% P1 survival + 30% mean step reward − failure penalty if building collapse unresponded.
|
| 240 |
+
|
| 241 |
---
|
| 242 |
|
| 243 |
### 🔴 Task 4: `shift_surge` — Long-Horizon Degradation (Hard)
|
|
|
|
| 258 |
|
| 259 |
**Why it's hard**: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.
|
| 260 |
|
| 261 |
+
**Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty.
|
| 262 |
+
|
| 263 |
---
|
| 264 |
|
| 265 |
## Unit Types
|
|
|
|
| 430 |
|
| 431 |
> **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.
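
The normalization named in the note above can be illustrated with a minimal sketch. The function name `normalized_episode_score` is hypothetical and not taken from the repository; only the formula `sum(step_rewards) / max_steps` and the `[0.0, 1.0]` clamp come from the note.

```python
def normalized_episode_score(step_rewards: list[float], max_steps: int) -> float:
    """Divide the summed step rewards by max_steps, then clamp to [0.0, 1.0].

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    raw = sum(step_rewards) / max_steps
    # Clamp so that negative or oversized totals stay inside the reporting range.
    return max(0.0, min(1.0, raw))
```

Because the divisor is `max_steps` rather than the number of steps actually taken, an episode that ends early contributes nothing for its remaining steps under this formula.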
|
| 432 |
|
| 433 |
+
### What the scores mean
|
| 434 |
+
|
| 435 |
+
A random agent scoring **0.20 on the easiest task** confirms the environment is not trivially solvable — there is no reward for random dispatching. The gradient from 0.20 → 0.46 across tasks reflects genuine increasing complexity, not just more steps.
|
| 436 |
+
|
| 437 |
+
A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score **0.55–0.75 on single_incident** and **0.30–0.45 on shift_surge**, demonstrating the environment meaningfully differentiates agent capability.
|
| 438 |
+
|
| 439 |
LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher on easy and medium tasks by correctly prioritizing P1 incidents and matching unit types.
|
| 440 |
|
| 441 |
Run the baseline matrix (random + LLM reruns) and emit a JSON report:
|
_patcher.py
ADDED
|
@@ -0,0 +1,64 @@
| 1 |
+
import re
|
| 2 |
+
|
| 3 |
+
with open('README.md', 'r', encoding='utf-8') as f:
|
| 4 |
+
readme = f.read()
|
| 5 |
+
|
| 6 |
+
# A3 replacements
|
| 7 |
+
readme = readme.replace(
|
| 8 |
+
"**What a good agent does**: Immediately dispatches `MED-1 → INC-001`.",
|
| 9 |
+
"**What a good agent does**: Immediately dispatches `MED-1 → INC-001`.\n\n**Scoring:** 50% resolution + 30% correct unit type used + 20% response speed."
|
| 10 |
+
)
|
| 11 |
+
|
| 12 |
+
readme = readme.replace(
|
| 13 |
+
"**What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.",
|
| 14 |
+
"**What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.\n\n**Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty."
|
| 15 |
+
)
|
| 16 |
+
|
| 17 |
+
readme = readme.replace(
|
| 18 |
+
"**What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.",
|
| 19 |
+
"**What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.\n\n**Scoring:** 60% P1 survival + 30% mean step reward − failure penalty if building collapse unresponded."
|
| 20 |
+
)
|
| 21 |
+
|
| 22 |
+
readme = readme.replace(
|
| 23 |
+
"**Why it's hard**: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.",
|
| 24 |
+
"**Why it's hard**: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.\n\n**Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty."
|
| 25 |
+
)
|
| 26 |
+
|
| 27 |
+
# A4 replacements
|
| 28 |
+
a4_addition = """
|
| 29 |
+
### What the scores mean
|
| 30 |
+
|
| 31 |
+
A random agent scoring **0.20 on the easiest task** confirms the environment is not trivially solvable — there is no reward for random dispatching. The gradient from 0.20 → 0.46 across tasks reflects genuine increasing complexity, not just more steps.
|
| 32 |
+
|
| 33 |
+
A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score **0.55–0.75 on single_incident** and **0.30–0.45 on shift_surge**, demonstrating the environment meaningfully differentiates agent capability.
|
| 34 |
+
"""
|
| 35 |
+
|
| 36 |
+
# We'll insert A4 right after the NOTE blockquote below the baseline score table.
|
| 37 |
+
# Existing note text: > **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.
|
| 38 |
+
|
| 39 |
+
readme = readme.replace(
|
| 40 |
+
"clamped to `[0.0, 1.0]`.\n",
|
| 41 |
+
f"clamped to `[0.0, 1.0]`.\n\n{a4_addition.strip()}\n"
|
| 42 |
+
)
|
| 43 |
+
|
| 44 |
+
# D1 replacements (Phraseology examples)
|
| 45 |
+
d1_addition = """
|
| 46 |
+
#### Dispatch Phraseology (bonus scoring)
|
| 47 |
+
|
| 48 |
+
The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to 8% bonus on their protocol score.
|
| 49 |
+
|
| 50 |
+
| Action | Example notes value |
|
| 51 |
+
|---|---|
|
| 52 |
+
| Dispatch MEDIC to cardiac | `"Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"` |
|
| 53 |
+
| Dispatch ENGINE to fire | `"Engine 2 responding to structure fire, Code 3, all units advised"` |
|
| 54 |
+
| Mutual aid request | `"Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"` |
|
| 55 |
+
| Stage unit | `"Engine 1 staging at District 3 perimeter, awaiting scene clear"` |
|
| 56 |
+
"""
|
| 57 |
+
readme = readme.replace(
|
| 58 |
+
"| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |\n",
|
| 59 |
+
"| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |\n\n" + d1_addition.strip() + "\n"
|
| 60 |
+
)
|
| 61 |
+
|
| 62 |
+
with open('README.md', 'w', encoding='utf-8') as f:
|
| 63 |
+
f.write(readme)
|
| 64 |
+
print("Finished A3 A4 D1.")
|
_test_fastapi.py
ADDED
|
@@ -0,0 +1,14 @@
| 1 |
+
from fastapi.testclient import TestClient
|
| 2 |
+
from src.server.app import app
|
| 3 |
+
|
| 4 |
+
client = TestClient(app)
|
| 5 |
+
|
| 6 |
+
print("Test 1: Empty body (none)")
|
| 7 |
+
response = client.post("/reset")
|
| 8 |
+
print("Status:", response.status_code)
|
| 9 |
+
print("Data:", response.json())
|
| 10 |
+
|
| 11 |
+
print("\nTest 2: null body string")
|
| 12 |
+
response = client.post("/reset", content="null", headers={"Content-Type": "application/json"})
|
| 13 |
+
print("Status:", response.status_code)
|
| 14 |
+
print("Data:", response.json())
|
docs/reward_design.md
ADDED
|
@@ -0,0 +1,39 @@
| 1 |
+
# Reward Design — 911 Dispatch Supervisor
|
| 2 |
+
|
| 3 |
+
## Philosophy
|
| 4 |
+
|
| 5 |
+
The reward function is designed around one principle: **life before property, speed before coverage**. Every component weight reflects real dispatch priority doctrine.
|
| 6 |
+
|
| 7 |
+
## Components
|
| 8 |
+
|
| 9 |
+
| Component | Weight | What it measures |
|
| 10 |
+
|---|---|---|
|
| 11 |
+
| Response Time | 30% | How fast the correct unit reaches the incident |
|
| 12 |
+
| Triage | 25% | Whether unit type matches incident type (MEDIC→medical, ENGINE→fire) |
|
| 13 |
+
| Survival | 25% | Whether P1 patients survive to resolution |
|
| 14 |
+
| Coverage | 12% | Whether city districts have available units nearby |
|
| 15 |
+
| Protocol | 8% | Whether dispatch notes use realistic radio phraseology |
|
| 16 |
+
|
| 17 |
+
## Safety Gate
|
| 18 |
+
|
| 19 |
+
If **any** Priority-1 incident results in zero survival (patient died, or unit never arrived), the total episode reward is hard-capped at **0.2** — regardless of how well the agent performed on all other incidents.
|
| 20 |
+
|
| 21 |
+
This is not a bug. It reflects real dispatch accountability: no amount of good coverage or fast response on secondary incidents excuses a preventable P1 death.
|
| 22 |
+
|
| 23 |
+
## Partial Progress
|
| 24 |
+
|
| 25 |
+
Rewards are non-sparse. An agent receives signal every step for:
|
| 26 |
+
- Units moving toward incidents (ETA decreasing)
|
| 27 |
+
- Correct unit types being dispatched
|
| 28 |
+
- Districts maintaining coverage
|
| 29 |
+
|
| 30 |
+
This means even a weak agent that dispatches randomly receives informative gradient signal, making the environment suitable for both RL training and LLM evaluation.
|
| 31 |
+
|
| 32 |
+
## Difficulty Gradient
|
| 33 |
+
|
| 34 |
+
| Task | Random Score | Design Intent |
|
| 35 |
+
|---|---|---|
|
| 36 |
+
| single_incident | ~0.20 | Baseline — one decision, one unit, one incident |
|
| 37 |
+
| multi_incident | ~0.31 | Triage required — competing P1 and P2 incidents |
|
| 38 |
+
| mass_casualty | ~0.30 | Adaptability — surprise incident waves mid-episode |
|
| 39 |
+
| shift_surge | ~0.32 | Resource scarcity — units going OOS mid-shift |
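
The component weights and the Safety Gate described in this document can be sketched as below. This is an illustrative sketch, not the repository's implementation: the function names are hypothetical, the `survival` key name is assumed (the README names `response_time`, `triage`, `coverage` and `protocol`), and each component score is assumed to lie in `[0, 1]`.

```python
# Weights taken from the Components table; they sum to 1.0.
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,  # key name assumed for illustration
    "coverage": 0.12,
    "protocol": 0.08,
}

SAFETY_GATE_CAP = 0.2  # cap stated in the Safety Gate section


def step_reward(components: dict[str, float]) -> float:
    """Weighted sum of component scores; missing components count as 0.0."""
    return sum(weight * components.get(name, 0.0) for name, weight in WEIGHTS.items())


def gated_episode_reward(raw_reward: float, p1_survival_scores: list[float]) -> float:
    """Cap the episode reward if any Priority-1 incident has zero survival."""
    if any(score == 0.0 for score in p1_survival_scores):
        return min(raw_reward, SAFETY_GATE_CAP)
    return raw_reward
```

For example, a non-DISPATCH step with neutral `0.5` for `response_time` and `triage` and full marks elsewhere would come to `0.725` under these weights, and an episode with one unanswered P1 incident could not score above `0.2` however high the raw reward.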
|
src/models.py
CHANGED
|
@@ -79,6 +79,7 @@ class Observation(BaseModel):
|
|
| 79 |
protocol_ok: bool = False
|
| 80 |
issues: list[str] = Field(default_factory=list)
|
| 81 |
reward_breakdown: dict[str, float] | None = None
|
|
|
|
| 82 |
|
| 83 |
|
| 84 |
class UnitState(BaseModel):
|
|
|
|
| 79 |
protocol_ok: bool = False
|
| 80 |
issues: list[str] = Field(default_factory=list)
|
| 81 |
reward_breakdown: dict[str, float] | None = None
|
| 82 |
+
phraseology_score: float = 0.0
|
| 83 |
|
| 84 |
|
| 85 |
class UnitState(BaseModel):
|
src/openenv_environment.py
CHANGED
|
@@ -35,6 +35,7 @@ class OpenEnvEnvironment:
|
|
| 35 |
"coverage": 0.0,
|
| 36 |
"protocol": 1.0,
|
| 37 |
},
|
|
|
|
| 38 |
)
|
| 39 |
return self._last_observation
|
| 40 |
|
|
@@ -63,7 +64,12 @@ class OpenEnvEnvironment:
|
|
| 63 |
self._state.metadata["episode_score"] = episode_score
|
| 64 |
|
| 65 |
done = self._machine.is_terminal(state)
|
| 66 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
self._last_observation = obs
|
| 68 |
return obs, step_reward, done
|
| 69 |
|
|
|
|
| 35 |
"coverage": 0.0,
|
| 36 |
"protocol": 1.0,
|
| 37 |
},
|
| 38 |
+
phraseology_score=1.0,
|
| 39 |
)
|
| 40 |
return self._last_observation
|
| 41 |
|
|
|
|
| 64 |
self._state.metadata["episode_score"] = episode_score
|
| 65 |
|
| 66 |
done = self._machine.is_terminal(state)
|
| 67 |
+
|
| 68 |
+
phraseology = 0.0
|
| 69 |
+
if obs.reward_breakdown:
|
| 70 |
+
phraseology = obs.reward_breakdown.get("protocol", 0.0)
|
| 71 |
+
|
| 72 |
+
obs = obs.model_copy(update={"score": episode_score, "phraseology_score": phraseology})
|
| 73 |
self._last_observation = obs
|
| 74 |
return obs, step_reward, done
|
| 75 |
|
src/server/app.py
CHANGED
|
@@ -75,8 +75,8 @@ async def schema() -> dict[str, Any]:
|
|
| 75 |
|
| 76 |
|
| 77 |
@app.post("/mcp")
|
| 78 |
-
async def
|
| 79 |
-
"""
|
| 80 |
try:
|
| 81 |
body = await request.json()
|
| 82 |
except Exception:
|
|
@@ -86,55 +86,21 @@ async def mcp(request: Request) -> dict:
|
|
| 86 |
req_id = body.get("id", 1)
|
| 87 |
|
| 88 |
if method == "reset":
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
_env = OpenEnvEnvironment(
|
| 92 |
-
task_id=params.get("task_id", "single_incident"),
|
| 93 |
-
seed=params.get("seed"),
|
| 94 |
-
)
|
| 95 |
-
obs = await _env.reset()
|
| 96 |
-
return {"jsonrpc": "2.0", "id": req_id, "result": obs.model_dump()}
|
| 97 |
-
|
| 98 |
elif method == "step":
|
| 99 |
-
if _env is None:
|
| 100 |
-
return JSONResponse(
|
| 101 |
-
{"jsonrpc": "2.0", "id": req_id, "error": {"code": -32000, "message": "Environment not initialized. Call reset first."}},
|
| 102 |
-
status_code=400,
|
| 103 |
-
)
|
| 104 |
action_data = body.get("params", {}).get("action", {})
|
| 105 |
-
|
| 106 |
-
action = Action.model_validate(action_data)
|
| 107 |
-
except Exception as e:
|
| 108 |
-
return JSONResponse(
|
| 109 |
-
{"jsonrpc": "2.0", "id": req_id, "error": {"code": -32602, "message": f"Invalid action: {e}"}},
|
| 110 |
-
status_code=400,
|
| 111 |
-
)
|
| 112 |
obs, reward, done = await _env.step(action)
|
| 113 |
-
return {
|
| 114 |
-
"jsonrpc": "2.0", "id": req_id,
|
| 115 |
-
"result": {"observation": obs.model_dump(), "reward": reward, "done": done},
|
| 116 |
-
}
|
| 117 |
-
|
| 118 |
elif method == "state":
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
{"jsonrpc": "2.0", "id": req_id, "error": {"code": -32000, "message": "Environment not initialized."}},
|
| 122 |
-
status_code=400,
|
| 123 |
-
)
|
| 124 |
-
return {"jsonrpc": "2.0", "id": req_id, "result": _env.state().model_dump()}
|
| 125 |
-
|
| 126 |
elif method == "legal_actions":
|
| 127 |
-
if _env is None:
|
| 128 |
-
return {"jsonrpc": "2.0", "id": req_id, "result": []}
|
| 129 |
actions = _env.legal_actions()
|
| 130 |
return {"jsonrpc": "2.0", "id": req_id, "result": [a.model_dump() for a in actions]}
|
| 131 |
-
|
| 132 |
else:
|
| 133 |
-
|
| 134 |
-
return {
|
| 135 |
-
"jsonrpc": "2.0", "id": req_id,
|
| 136 |
-
"error": {"code": -32601, "message": f"Method not found: {method}"},
|
| 137 |
-
}
|
| 138 |
|
| 139 |
|
| 140 |
@app.get("/tasks")
|
|
|
|
| 75 |
|
| 76 |
|
| 77 |
@app.post("/mcp")
|
| 78 |
+
async def mcp_endpoint(request: Request):
|
| 79 |
+
"""MCP JSON-RPC passthrough for OpenEnv runtime compatibility."""
|
| 80 |
try:
|
| 81 |
body = await request.json()
|
| 82 |
except Exception:
|
|
|
|
| 86 |
req_id = body.get("id", 1)
|
| 87 |
|
| 88 |
if method == "reset":
|
| 89 |
+
result = await _env.reset()
|
| 90 |
+
return {"jsonrpc": "2.0", "id": req_id, "result": result.model_dump()}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 91 |
elif method == "step":
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 92 |
action_data = body.get("params", {}).get("action", {})
|
| 93 |
+
action = Action(**action_data)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 94 |
obs, reward, done = await _env.step(action)
|
| 95 |
+
return {"jsonrpc": "2.0", "id": req_id, "result": {"observation": obs.model_dump(), "reward": reward, "done": done}}
|
|
|
|
|
|
|
|
|
|
|
|
|
| 96 |
elif method == "state":
|
| 97 |
+
result = _env.state()
|
| 98 |
+
return {"jsonrpc": "2.0", "id": req_id, "result": result.model_dump()}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 99 |
elif method == "legal_actions":
|
|
|
|
|
|
|
| 100 |
actions = _env.legal_actions()
|
| 101 |
return {"jsonrpc": "2.0", "id": req_id, "result": [a.model_dump() for a in actions]}
|
|
|
|
| 102 |
else:
|
| 103 |
+
return JSONResponse({"jsonrpc": "2.0", "id": req_id, "error": {"code": -32601, "message": f"Method not found: {method}"}}, status_code=404)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 104 |
|
| 105 |
|
| 106 |
@app.get("/tasks")
|
src/tasks/registry.py
CHANGED
|
@@ -304,7 +304,7 @@ class DispatchScenarioFactory:
|
|
| 304 |
"reported_at_step": 0,
|
| 305 |
"units_assigned": [],
|
| 306 |
"status": IncidentStatus.PENDING,
|
| 307 |
-
"survival_clock":
|
| 308 |
}
|
| 309 |
}
|
| 310 |
|
|
@@ -321,7 +321,7 @@ class DispatchScenarioFactory:
|
|
| 321 |
"reported_at_step": 5,
|
| 322 |
"units_assigned": [],
|
| 323 |
"status": IncidentStatus.PENDING,
|
| 324 |
-
"survival_clock":
|
| 325 |
}
|
| 326 |
],
|
| 327 |
},
|
|
@@ -337,7 +337,7 @@ class DispatchScenarioFactory:
|
|
| 337 |
"reported_at_step": 12,
|
| 338 |
"units_assigned": [],
|
| 339 |
"status": IncidentStatus.PENDING,
|
| 340 |
-
"survival_clock":
|
| 341 |
},
|
| 342 |
{
|
| 343 |
"incident_id": "INC-004",
|
|
@@ -348,7 +348,7 @@ class DispatchScenarioFactory:
|
|
| 348 |
"reported_at_step": 12,
|
| 349 |
"units_assigned": [],
|
| 350 |
"status": IncidentStatus.PENDING,
|
| 351 |
-
"survival_clock":
|
| 352 |
},
|
| 353 |
],
|
| 354 |
},
|
|
@@ -402,7 +402,7 @@ class DispatchScenarioFactory:
|
|
| 402 |
"reported_at_step": t,
|
| 403 |
"units_assigned": [],
|
| 404 |
"status": IncidentStatus.PENDING,
|
| 405 |
-
"survival_clock":
|
| 406 |
}
|
| 407 |
],
|
| 408 |
}
|
|
|
|
| 304 |
"reported_at_step": 0,
|
| 305 |
"units_assigned": [],
|
| 306 |
"status": IncidentStatus.PENDING,
|
| 307 |
+
"survival_clock": 480.0,
|
| 308 |
}
|
| 309 |
}
|
| 310 |
|
|
|
|
| 321 |
"reported_at_step": 5,
|
| 322 |
"units_assigned": [],
|
| 323 |
"status": IncidentStatus.PENDING,
|
| 324 |
+
"survival_clock": 900.0,
|
| 325 |
}
|
| 326 |
],
|
| 327 |
},
|
|
|
|
| 337 |
"reported_at_step": 12,
|
| 338 |
"units_assigned": [],
|
| 339 |
"status": IncidentStatus.PENDING,
|
| 340 |
+
"survival_clock": 420.0,
|
| 341 |
},
|
| 342 |
{
|
| 343 |
"incident_id": "INC-004",
|
|
|
|
| 348 |
"reported_at_step": 12,
|
| 349 |
"units_assigned": [],
|
| 350 |
"status": IncidentStatus.PENDING,
|
| 351 |
+
"survival_clock": 420.0,
|
| 352 |
},
|
| 353 |
],
|
| 354 |
},
|
|
|
|
| 402 |
"reported_at_step": t,
|
| 403 |
"units_assigned": [],
|
| 404 |
"status": IncidentStatus.PENDING,
|
| 405 |
+
"survival_clock": 720.0,
|
| 406 |
}
|
| 407 |
],
|
| 408 |
}
|
test_out.txt
ADDED
|
Binary file (976 Bytes).
|
|
|