Spaces: garvitsachdeva/911

SayedZahur786 committed on Apr 8

Commit 1d762f3 (1 parent: 2f212fd)

feat: phase3 improvements - reward clarity, survival clocks, MCP endpoint, phraseology docs

Files changed (9)

README.md +38 -13
_patcher.py +64 -0
_test_fastapi.py +14 -0
docs/reward_design.md +39 -0
src/models.py +1 -0
src/openenv_environment.py +7 -1
src/server/app.py +9 -43
src/tasks/registry.py +5 -5
test_out.txt +0 -0

README.md CHANGED

@@ -10,24 +10,24 @@
 ---
-## Overview
-The **911 Dispatch Supervisor** models real-world emergency dispatch operations. At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate — under time pressure, limited resources, and competing priorities.
-This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that:
-- Requires triage (prioritizing life-threatening incidents over property damage)
-- Demands coverage awareness (keeping geographic zones protected)
-- Rewards correct unit-type matching (sending a MEDIC vs. an ENGINE)
-- Punishes delays that cause Priority-1 incidents to escalate
-### Why This Domain?
-Real-world 911 dispatch centers field thousands of concurrent calls daily. Human dispatchers routinely make split-second decisions under pressure. Modeling this as an RL environment enables:
-- **Benchmarking** frontier LLM judgment under operational stress
-- **Training** agents on triage and multi-constraint resource allocation
-- **Evaluating** decision quality against programmatic, real-world-grounded graders
 ---
@@ -90,6 +90,17 @@ Actions are structured Pydantic models — no free-text parsing required.
 | `UPGRADE` | Increase incident severity | New severity must be strictly higher than current |
 | `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |
 ---
 ## Observation Space
@@ -146,7 +157,7 @@ The step-level reward is a weighted combination of five components:
 | `coverage` | **12%** | Geographic distribution of available units across city districts |
 | `protocol` | **8%** | Action legality + optional phraseology/readback quality via `Action.notes` |
-**Safety Gate**: If any Priority-1 incident was seen and the survival score is `0.0`, the total reward is hard-capped at `0.2` regardless of efficiency gains.
 **Non-DISPATCH actions** receive neutral `0.5` for `response_time` and `triage`, allowing agents to maintain coverage without penalty.
@@ -183,6 +194,8 @@ if resolved within 10 steps:   score += 0.20
 **What a good agent does**: Immediately dispatches `MED-1 → INC-001`.
 ---
 ### 🟡 Task 2: `multi_incident` — Simultaneous Triage (Medium)
@@ -202,6 +215,8 @@ score = 0.5 × p1_resolution_rate
 **What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.
 ---
 ### 🔴 Task 3: `mass_casualty` — Wave-Based Surge (Hard)
@@ -221,6 +236,8 @@ score = 0.6 × p1_survival_rate
 **What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.
 ---
 ### 🔴 Task 4: `shift_surge` — Long-Horizon Degradation (Hard)
@@ -241,6 +258,8 @@ score = 0.35 × resolution_ratio
 **Why it's hard**: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.
 ---
 ## Unit Types
@@ -411,6 +430,12 @@ Run with `USE_RANDOM=true python inference.py` (seed=42, fully deterministic).
 > **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.
 LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher on easy and medium tasks by correctly prioritizing P1 incidents and matching unit types.
 Run the baseline matrix (random + LLM reruns) and emit a JSON report:

 ---
+## Why This Matters
+911 dispatch centers in the United States handle over 240 million calls per year. Every dispatcher decision — which unit to send, in what order, with what priority — directly determines survival outcomes. A 90-second delay in dispatching a MEDIC to a cardiac arrest drops survival probability by roughly 10%.
+The **911 Dispatch Supervisor** is the first open RL benchmark for training and evaluating AI agents on emergency dispatch decisions. It models the exact tradeoffs real dispatchers face: triage under uncertainty, multi-unit resource allocation, geographic coverage, and protocol compliance — all simultaneously.
+This fills a direct gap for researchers building AI copilots for public safety systems, and provides immediate evaluation value for any LLM claiming real-world decision-making capability.
+## Overview
+At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate — under time pressure, limited resources, and competing priorities across a 100×100 city grid.
+This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that:
+- Requires **triage** — prioritizing life-threatening incidents over property damage
+- Demands **coverage awareness** — keeping geographic zones protected
+- Rewards **correct unit-type matching** — sending a MEDIC vs. an ENGINE
+- Punishes **delays** that cause Priority-1 incidents to escalate
+- Scores **dispatch phraseology** — realistic radio communication language
 ---
 | `UPGRADE` | Increase incident severity | New severity must be strictly higher than current |
 | `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |
+#### Dispatch Phraseology (bonus scoring)
+The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to 8% bonus on their protocol score.
+| Action | Example notes value |
+|---|---|
+| Dispatch MEDIC to cardiac | `"Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"` |
+| Dispatch ENGINE to fire | `"Engine 2 responding to structure fire, Code 3, all units advised"` |
+| Mutual aid request | `"Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"` |
+| Stage unit | `"Engine 1 staging at District 3 perimeter, awaiting scene clear"` |
 ---
 ## Observation Space
 | `coverage` | **12%** | Geographic distribution of available units across city districts |
 | `protocol` | **8%** | Action legality + optional phraseology/readback quality via `Action.notes` |
+> **⚠️ Safety Gate:** If any Priority-1 incident (cardiac arrest, shooting, building collapse) results in zero survival score, the entire episode reward is hard-capped at **0.2** regardless of other performance. This forces agents to treat life-threatening incidents as non-negotiable — exactly as real dispatch protocol requires.
 **Non-DISPATCH actions** receive neutral `0.5` for `response_time` and `triage`, allowing agents to maintain coverage without penalty.
 **What a good agent does**: Immediately dispatches `MED-1 → INC-001`.
+**Scoring:** 50% resolution + 30% correct unit type used + 20% response speed.
 ---
 ### 🟡 Task 2: `multi_incident` — Simultaneous Triage (Medium)
 **What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.
+**Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty.
 ---
 ### 🔴 Task 3: `mass_casualty` — Wave-Based Surge (Hard)
 **What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.
+**Scoring:** 60% P1 survival + 30% mean step reward − failure penalty if building collapse unresponded.
 ---
 ### 🔴 Task 4: `shift_surge` — Long-Horizon Degradation (Hard)
 **Why it's hard**: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.
+**Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty.
 ---
 ## Unit Types
 > **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.
+### What the scores mean
+A random agent scoring **0.20 on the easiest task** confirms the environment is not trivially solvable — there is no reward for random dispatching. The gradient from 0.20 → 0.46 across tasks reflects genuine increasing complexity, not just more steps.
+A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score **0.55–0.75 on single_incident** and **0.30–0.45 on shift_surge**, demonstrating the environment meaningfully differentiates agent capability.
 LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher on easy and medium tasks by correctly prioritizing P1 incidents and matching unit types.
 Run the baseline matrix (random + LLM reruns) and emit a JSON report:

_patcher.py ADDED

	@@ -0,0 +1,64 @@

+import re
+with open('README.md', 'r', encoding='utf-8') as f:
+    readme = f.read()
+# A3 replacements
+readme = readme.replace(
+    "**What a good agent does**: Immediately dispatches `MED-1 → INC-001`.",
+    "**What a good agent does**: Immediately dispatches `MED-1 → INC-001`.\n\n**Scoring:** 50% resolution + 30% correct unit type used + 20% response speed."
+)
+readme = readme.replace(
+    "**What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.",
+    "**What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.\n\n**Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty."
+)
+readme = readme.replace(
+    "**What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.",
+    "**What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.\n\n**Scoring:** 60% P1 survival + 30% mean step reward − failure penalty if building collapse unresponded."
+)
+readme = readme.replace(
+    "**Why it's hard**: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.",
+    "**Why it's hard**: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.\n\n**Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty."
+)
+# A4 replacements
+a4_addition = """
+### What the scores mean
+A random agent scoring **0.20 on the easiest task** confirms the environment is not trivially solvable — there is no reward for random dispatching. The gradient from 0.20 → 0.46 across tasks reflects genuine increasing complexity, not just more steps.
+A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score **0.55–0.75 on single_incident** and **0.30–0.45 on shift_surge**, demonstrating the environment meaningfully differentiates agent capability.
+"""
+# We'll insert A4 right after the NOTE blockquote below the baseline score table.
+# Existing note text: > **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.
+readme = readme.replace(
+    "clamped to `[0.0, 1.0]`.\n",
+    f"clamped to `[0.0, 1.0]`.\n\n{a4_addition.strip()}\n"
+)
+# D1 replacements (Phraseology examples)
+d1_addition = """
+#### Dispatch Phraseology (bonus scoring)
+The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to 8% bonus on their protocol score.
+| Action | Example notes value |
+|---|---|
+| Dispatch MEDIC to cardiac | `"Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"` |
+| Dispatch ENGINE to fire | `"Engine 2 responding to structure fire, Code 3, all units advised"` |
+| Mutual aid request | `"Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"` |
+| Stage unit | `"Engine 1 staging at District 3 perimeter, awaiting scene clear"` |
+"""
+readme = readme.replace(
+    "| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |\n",
+    "| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |\n\n" + d1_addition.strip() + "\n"
+)
+with open('README.md', 'w', encoding='utf-8') as f:
+    f.write(readme)
+print("Finished A3 A4 D1.")
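
Review note on `_patcher.py`: `str.replace` returns the input unchanged when the anchor text is missing, so the script prints "Finished A3 A4 D1." even if none of the replacements matched. Each replacement also keeps its anchor as a prefix of the new text, so a second run would append the additions again. A minimal sketch of a stricter helper follows; the name `replace_once` is hypothetical and not part of the commit.

```python
def replace_once(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`.

    Hypothetical helper, not part of the commit. It is a no-op when the
    patch was already applied and raises when the anchor is missing or
    ambiguous, instead of silently doing nothing.
    """
    if new in text:
        # Patch already applied on a previous run: keep the text as-is.
        return text
    count = text.count(old)
    if count != 1:
        raise ValueError(
            f"expected exactly 1 occurrence of anchor, found {count}: {old[:40]!r}"
        )
    return text.replace(old, new, 1)


if __name__ == "__main__":
    sample = "note: clamped to `[0.0, 1.0]`.\nnext section\n"
    anchor = "clamped to `[0.0, 1.0]`.\n"
    patched = replace_once(sample, anchor, anchor + "\n### What the scores mean\n")
    # Running the same patch again leaves the text unchanged.
    assert replace_once(patched, anchor, anchor + "\n### What the scores mean\n") == patched
    print(patched)
```

With a helper like this, a README whose wording has drifted would fail the patch run instead of reporting success.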

_test_fastapi.py ADDED

	@@ -0,0 +1,14 @@

+from fastapi.testclient import TestClient
+from src.server.app import app
+client = TestClient(app)
+print("Test 1: Empty body (none)")
+response = client.post("/reset")
+print("Status:", response.status_code)
+print("Data:", response.json())
+print("\nTest 2: null body string")
+response = client.post("/reset", content="null", headers={"Content-Type": "application/json"})
+print("Status:", response.status_code)
+print("Data:", response.json())

docs/reward_design.md ADDED

	@@ -0,0 +1,39 @@

+# Reward Design — 911 Dispatch Supervisor
+## Philosophy
+The reward function is designed around one principle: **life before property, speed before coverage**. Every component weight reflects real dispatch priority doctrine.
+## Components
+| Component | Weight | What it measures |
+|---|---|---|
+| Response Time | 30% | How fast the correct unit reaches the incident |
+| Triage | 25% | Whether unit type matches incident type (MEDIC→medical, ENGINE→fire) |
+| Survival | 25% | Whether P1 patients survive to resolution |
+| Coverage | 12% | Whether city districts have available units nearby |
+| Protocol | 8% | Whether dispatch notes use realistic radio phraseology |
+## Safety Gate
+If **any** Priority-1 incident results in zero survival (patient died, or unit never arrived), the total episode reward is hard-capped at **0.2** — regardless of how well the agent performed on all other incidents.
+This is not a bug. It reflects real dispatch accountability: no amount of good coverage or fast response on secondary incidents excuses a preventable P1 death.
+## Partial Progress
+Rewards are non-sparse. An agent receives signal every step for:
+- Units moving toward incidents (ETA decreasing)
+- Correct unit types being dispatched
+- Districts maintaining coverage
+This means even a weak agent that dispatches randomly receives informative gradient signal, making the environment suitable for both RL training and LLM evaluation.
+## Difficulty Gradient
+| Task | Random Score | Design Intent |
+|---|---|---|
+| single_incident | ~0.20 | Baseline — one decision, one unit, one incident |
+| multi_incident | ~0.31 | Triage required — competing P1 and P2 incidents |
+| mass_casualty | ~0.30 | Adaptability — surprise incident waves mid-episode |
+| shift_surge | ~0.32 | Resource scarcity — units going OOS mid-shift |
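
Review note on `docs/reward_design.md`: the weighted sum, the Safety Gate, and the episode normalization quoted in the README (`sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`) can be sketched as below. This is an illustration under stated assumptions, not the repository's implementation: the function names, the `p1_seen` flag, and the assumption that every component lies in `[0.0, 1.0]` are hypothetical. The component keys follow the `reward_breakdown` names shown in the README (`coverage`, `protocol`, and so on).

```python
# Weights as listed in the Components table.
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}
SAFETY_CAP = 0.2  # hard cap described in the Safety Gate section


def combine_reward(components: dict[str, float], p1_seen: bool) -> float:
    """Weighted sum of the five components, capped by the Safety Gate.

    Assumption: the gate triggers when a Priority-1 incident was seen
    and the survival component is exactly 0.0.
    """
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    if p1_seen and components["survival"] == 0.0:
        total = min(total, SAFETY_CAP)
    return total


def normalize_episode(step_rewards: list[float], max_steps: int) -> float:
    """Episode score as sum(step_rewards) / max_steps, clamped to [0, 1]."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    return max(0.0, min(1.0, sum(step_rewards) / max_steps))


if __name__ == "__main__":
    perfect_but_fatal = {
        "response_time": 1.0,
        "triage": 1.0,
        "survival": 0.0,
        "coverage": 1.0,
        "protocol": 1.0,
    }
    # The uncapped total would be about 0.75; the gate reduces it to the cap.
    print(combine_reward(perfect_but_fatal, p1_seen=True))
    print(normalize_episode([0.5, 0.5], max_steps=4))
```

The example shows why the gate dominates: an agent that is perfect on every other component still cannot exceed the cap once survival is zero.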

src/models.py CHANGED

@@ -79,6 +79,7 @@ class Observation(BaseModel):
     protocol_ok: bool = False
     issues: list[str] = Field(default_factory=list)
     reward_breakdown: dict[str, float] | None = None
 class UnitState(BaseModel):

     protocol_ok: bool = False
     issues: list[str] = Field(default_factory=list)
     reward_breakdown: dict[str, float] | None = None
+    phraseology_score: float = 0.0
 class UnitState(BaseModel):

src/openenv_environment.py CHANGED

@@ -35,6 +35,7 @@ class OpenEnvEnvironment:
                 "coverage": 0.0,
                 "protocol": 1.0,
             },
         )
         return self._last_observation
@@ -63,7 +64,12 @@ class OpenEnvEnvironment:
         self._state.metadata["episode_score"] = episode_score
         done = self._machine.is_terminal(state)
-        obs = obs.model_copy(update={"score": episode_score})
         self._last_observation = obs
         return obs, step_reward, done

                 "coverage": 0.0,
                 "protocol": 1.0,
             },
+            phraseology_score=1.0,
         )
         return self._last_observation
         self._state.metadata["episode_score"] = episode_score
         done = self._machine.is_terminal(state)
+        phraseology = 0.0
+        if obs.reward_breakdown:
+            phraseology = obs.reward_breakdown.get("protocol", 0.0)
+        obs = obs.model_copy(update={"score": episode_score, "phraseology_score": phraseology})
         self._last_observation = obs
         return obs, step_reward, done

src/server/app.py CHANGED

@@ -75,8 +75,8 @@ async def schema() -> dict[str, Any]:
 @app.post("/mcp")
-async def mcp(request: Request) -> dict:
-    """Full MCP JSON-RPC endpoint supporting reset/step/state/legal_actions methods."""
     try:
         body = await request.json()
     except Exception:
@@ -86,55 +86,21 @@ async def mcp(request: Request) -> dict:
     req_id = body.get("id", 1)
     if method == "reset":
-        params = body.get("params", {})
-        global _env
-        _env = OpenEnvEnvironment(
-            task_id=params.get("task_id", "single_incident"),
-            seed=params.get("seed"),
-        )
-        obs = await _env.reset()
-        return {"jsonrpc": "2.0", "id": req_id, "result": obs.model_dump()}
     elif method == "step":
-        if _env is None:
-            return JSONResponse(
-                {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32000, "message": "Environment not initialized. Call reset first."}},
-                status_code=400,
-            )
         action_data = body.get("params", {}).get("action", {})
-        try:
-            action = Action.model_validate(action_data)
-        except Exception as e:
-            return JSONResponse(
-                {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32602, "message": f"Invalid action: {e}"}},
-                status_code=400,
-            )
         obs, reward, done = await _env.step(action)
-        return {
-            "jsonrpc": "2.0", "id": req_id,
-            "result": {"observation": obs.model_dump(), "reward": reward, "done": done},
-        }
     elif method == "state":
-        if _env is None:
-            return JSONResponse(
-                {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32000, "message": "Environment not initialized."}},
-                status_code=400,
-            )
-        return {"jsonrpc": "2.0", "id": req_id, "result": _env.state().model_dump()}
     elif method == "legal_actions":
-        if _env is None:
-            return {"jsonrpc": "2.0", "id": req_id, "result": []}
         actions = _env.legal_actions()
         return {"jsonrpc": "2.0", "id": req_id, "result": [a.model_dump() for a in actions]}
     else:
-        # Unknown method — still return 200 with JSON-RPC error (OpenEnv validator just checks reachability)
-        return {
-            "jsonrpc": "2.0", "id": req_id,
-            "error": {"code": -32601, "message": f"Method not found: {method}"},
-        }
 @app.get("/tasks")

 @app.post("/mcp")
+async def mcp_endpoint(request: Request):
+    """MCP JSON-RPC passthrough for OpenEnv runtime compatibility."""
     try:
         body = await request.json()
     except Exception:
     req_id = body.get("id", 1)
     if method == "reset":
+        result = await _env.reset()
+        return {"jsonrpc": "2.0", "id": req_id, "result": result.model_dump()}
     elif method == "step":
         action_data = body.get("params", {}).get("action", {})
+        action = Action(**action_data)
         obs, reward, done = await _env.step(action)
+        return {"jsonrpc": "2.0", "id": req_id, "result": {"observation": obs.model_dump(), "reward": reward, "done": done}}
     elif method == "state":
+        result = _env.state()
+        return {"jsonrpc": "2.0", "id": req_id, "result": result.model_dump()}
     elif method == "legal_actions":
         actions = _env.legal_actions()
         return {"jsonrpc": "2.0", "id": req_id, "result": [a.model_dump() for a in actions]}
     else:
+        return JSONResponse({"jsonrpc": "2.0", "id": req_id, "error": {"code": -32601, "message": f"Method not found: {method}"}}, status_code=404)
 @app.get("/tasks")
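
Review note on `src/server/app.py`: the rewritten handler no longer checks whether `_env` is initialized and builds the action with `Action(**action_data)` outside any `try`, so an uninitialized environment or a malformed action would surface as an unhandled exception rather than a JSON-RPC error. A framework-free sketch of a dispatcher that keeps those error paths follows. `handle_rpc`, `StubEnv`, and the `action_type` key are hypothetical; the error codes (`-32000`, `-32602`, `-32601`) are the ones used in the removed lines. As a simplification, this sketch also rejects `reset` when no environment exists, whereas the removed code created a new one.

```python
from typing import Any, Optional


def rpc_error(req_id: Any, code: int, message: str) -> dict:
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}


def rpc_result(req_id: Any, result: Any) -> dict:
    """Build a JSON-RPC 2.0 success response."""
    return {"jsonrpc": "2.0", "id": req_id, "result": result}


class StubEnv:
    """Hypothetical stand-in for OpenEnvEnvironment, for illustration only."""

    def reset(self) -> dict:
        return {"step": 0}

    def step(self, action: dict) -> tuple:
        return {"step": 1}, 0.5, False


def handle_rpc(body: dict, env: Optional[StubEnv]) -> dict:
    req_id = body.get("id", 1)
    method = body.get("method")
    if method not in {"reset", "step"}:
        return rpc_error(req_id, -32601, f"Method not found: {method}")
    if env is None:
        return rpc_error(req_id, -32000, "Environment not initialized. Call reset first.")
    if method == "reset":
        return rpc_result(req_id, env.reset())
    # `params` may be absent or null, so fall back to an empty dict.
    action = (body.get("params") or {}).get("action")
    # Assumption: a valid action is a dict carrying an "action_type" key.
    if not isinstance(action, dict) or "action_type" not in action:
        return rpc_error(req_id, -32602, "Invalid action: missing action_type")
    obs, reward, done = env.step(action)
    return rpc_result(req_id, {"observation": obs, "reward": reward, "done": done})


if __name__ == "__main__":
    print(handle_rpc({"method": "step", "id": 7}, None))
    print(handle_rpc({"method": "reset", "id": 8}, StubEnv()))
```

In the real endpoint the same checks would wrap `Action` validation and return the error dictionaries through `JSONResponse`, as the removed lines did.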

src/tasks/registry.py CHANGED

@@ -304,7 +304,7 @@ class DispatchScenarioFactory:
                 "reported_at_step": 0,
                 "units_assigned": [],
                 "status": IncidentStatus.PENDING,
-                "survival_clock": 600.0,
             }
         }
@@ -321,7 +321,7 @@ class DispatchScenarioFactory:
                         "reported_at_step": 5,
                         "units_assigned": [],
                         "status": IncidentStatus.PENDING,
-                        "survival_clock": 1200.0,
                     }
                 ],
             },
@@ -337,7 +337,7 @@ class DispatchScenarioFactory:
                         "reported_at_step": 12,
                         "units_assigned": [],
                         "status": IncidentStatus.PENDING,
-                        "survival_clock": 600.0,
                     },
                     {
                         "incident_id": "INC-004",
@@ -348,7 +348,7 @@ class DispatchScenarioFactory:
                         "reported_at_step": 12,
                         "units_assigned": [],
                         "status": IncidentStatus.PENDING,
-                        "survival_clock": 600.0,
                     },
                 ],
             },
@@ -402,7 +402,7 @@ class DispatchScenarioFactory:
                             "reported_at_step": t,
                             "units_assigned": [],
                             "status": IncidentStatus.PENDING,
-                            "survival_clock": 900.0,
                         }
                     ],
                 }

                 "reported_at_step": 0,
                 "units_assigned": [],
                 "status": IncidentStatus.PENDING,
+                "survival_clock": 480.0,
             }
         }
                         "reported_at_step": 5,
                         "units_assigned": [],
                         "status": IncidentStatus.PENDING,
+                        "survival_clock": 900.0,
                     }
                 ],
             },
                         "reported_at_step": 12,
                         "units_assigned": [],
                         "status": IncidentStatus.PENDING,
+                        "survival_clock": 420.0,
                     },
                     {
                         "incident_id": "INC-004",
                         "reported_at_step": 12,
                         "units_assigned": [],
                         "status": IncidentStatus.PENDING,
+                        "survival_clock": 420.0,
                     },
                 ],
             },
                             "reported_at_step": t,
                             "units_assigned": [],
                             "status": IncidentStatus.PENDING,
+                            "survival_clock": 720.0,
                         }
                     ],
                 }

test_out.txt ADDED

Binary file (976 Bytes).