Spaces: garvitsachdeva/911
Abhinav31122006 committed on Apr 8

Commit 0b2675d, 1 Parent(s): c8b8245

feat: exploit analysis, architecture docs, observation depth, citation

Files changed (7)

CITATION.md +27 -0
docs/architecture.md +66 -0
docs/exploit_analysis.md +46 -0
docs/reward_design.md +10 -6
docs/tasks.md +72 -0
src/models.py +4 -0
src/openenv_environment.py +18 -2

CITATION.md ADDED

	@@ -0,0 +1,27 @@

+# Citation
+If you use 911 Dispatch Supervisor in your research, please cite:
+```bibtex
+@software{dispatch_supervisor_2025,
+  title  = {911 Dispatch Supervisor: An Open RL Benchmark for Emergency Dispatch Decision-Making},
+  year   = {2025},
+  url    = {https://huggingface.co/spaces/garvitsachdeva/911},
+  note   = {OpenEnv-compatible environment for training and evaluating LLM agents
+            on real-world emergency resource allocation under time pressure}
+}
+```
+## Research Applications
+This environment is designed to support research in:
+- LLM decision-making under constraint — fixed action budgets, time pressure
+- Multi-objective RL — non-sparse rewards with competing components
+- AI safety evaluation — hard constraints (Safety Gate) that cannot be gamed
+- Human-AI collaboration — dispatch copilot systems for public safety
+## Dataset & Reproducibility
+All episodes are fully deterministic under seed=42. The random agent baseline
+produces identical scores on every run, enabling valid cross-environment comparisons.
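A minimal sketch of why a seeded random baseline is repeatable, assuming the baseline draws its actions from a privately seeded pseudo-random generator. The `random_agent_actions` helper and the action names are hypothetical and are not taken from the repository.

```python
import random


def random_agent_actions(seed: int, legal_actions: list[str], steps: int) -> list[str]:
    """Pick one action per step from a private, seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # private RNG, independent of the global random state
    return [rng.choice(legal_actions) for _ in range(steps)]


# Illustrative action names only; the real action space is defined by the environment.
ACTIONS = ["dispatch", "stage", "cancel", "hold"]

run_a = random_agent_actions(42, ACTIONS, steps=20)
run_b = random_agent_actions(42, ACTIONS, steps=20)
print(run_a == run_b)
```

Two runs with the same seed make the same choices, so with a deterministic simulator they would also produce the same score.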

docs/architecture.md ADDED

	@@ -0,0 +1,66 @@

+# Architecture — 911 Dispatch Supervisor
+## Layer Overview
+OpenEnvEnvironment          ← public API (reset/step/state/legal_actions)
+│
+DispatchStateMachine        ← simulation engine
+├── DispatchProtocolValidator   ← action legality (15+ rules)
+├── RewardCalculator            ← 5-component weighted reward
+└── DispatchScenarioFactory     ← deterministic task fixtures
+│
+Task-Specific Graders          ← episode-level scoring
+## Key Design Decisions
+### Why Manhattan Distance Physics
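+Vehicles on a city grid travel along streets, so travel distance is Manhattan (rectilinear), not straight-line.
+For a randomly oriented trip, the Manhattan distance is about 27% longer than the Euclidean distance
+on average (a factor of 4/π ≈ 1.27), so Euclidean ETAs would be unrealistically optimistic.
+Manhattan physics produce ETAs that are closer to real CAD system calculations.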
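The roughly 27% figure can be checked numerically: averaged over all headings, Manhattan distance exceeds Euclidean distance by a factor of 4/π ≈ 1.27. The sketch below is illustrative only. The function names and the blocks-per-second speed unit are assumptions, not the contents of `src/physics.py`.

```python
import math


def manhattan(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Rectilinear distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def euclidean(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Straight-line distance between two points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def eta_seconds(distance_blocks: float, speed_blocks_per_s: float) -> float:
    """Travel time at constant speed (assumed units: blocks and blocks per second)."""
    return distance_blocks / speed_blocks_per_s


def mean_manhattan_over_euclidean(samples: int = 3600) -> float:
    """Average Manhattan/Euclidean ratio for a unit-length trip over evenly spaced headings."""
    total = 0.0
    for k in range(samples):
        theta = 2 * math.pi * k / samples
        total += abs(math.cos(theta)) + abs(math.sin(theta))
    return total / samples


unit, incident = (0, 0), (3, 4)
print(manhattan(unit, incident), euclidean(unit, incident))
print(mean_manhattan_over_euclidean(), 4 / math.pi)
```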
+### Why Legal Actions Are Pre-filtered
+Rather than letting agents propose arbitrary actions and penalizing illegal
+ones, the environment exposes only currently-valid actions via `legal_actions()`.
+This eliminates wasted LLM budget on invalid action generation and focuses
+evaluation on dispatch decision quality, not action syntax compliance.
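A sketch of the interaction loop this design implies: the agent only ever chooses among actions the environment currently offers. `ToyEnv` is a hypothetical stand-in that borrows the method names `legal_actions()` and `step()` and the `(obs, reward, done)` return shape. Its action format and internals are assumptions for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in, not the real OpenEnvEnvironment."""
    pending: list[str] = field(default_factory=lambda: ["INC-1", "INC-2"])
    available: list[str] = field(default_factory=lambda: ["MEDIC-1", "ENGINE-1"])

    def legal_actions(self) -> list[tuple[str, str]]:
        # Only (unit, incident) pairs that are valid right now are offered.
        return [(u, i) for u in self.available for i in self.pending]

    def step(self, action: tuple[str, str]):
        unit, incident = action
        self.available.remove(unit)
        self.pending.remove(incident)
        done = not self.pending or not self.available
        return {"dispatched": action}, 0.0, done


def run_episode(env: ToyEnv, rng: random.Random) -> list[tuple[str, str]]:
    """Choose uniformly among legal actions until the episode ends."""
    chosen = []
    done = False
    while not done:
        actions = env.legal_actions()
        if not actions:
            break
        action = rng.choice(actions)
        _, _, done = env.step(action)
        chosen.append(action)
    return chosen


history = run_episode(ToyEnv(), random.Random(0))
print(history)
```

Because every candidate comes from `legal_actions()`, no step is spent on an invalid action.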
+### Why the Safety Gate Uses 0.2 Not 0.0
+A hard zero for any P1 failure would make the reward surface completely flat
+for bad agents — no gradient to learn from. Capping at 0.2 preserves partial
+signal (coverage, response time on other incidents) while making P1 failure
+unambiguously catastrophic. Real dispatch accountability works the same way:
+an incident review happens, but other good work is still acknowledged.
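The cap can be sketched as a simple minimum. This assumes the gate is applied to the final episode score and that a P1 failure is available as a boolean flag. The function name is hypothetical and the real grader may structure this differently.

```python
SAFETY_GATE_CAP = 0.2  # cap value stated in the design notes


def apply_safety_gate(raw_score: float, p1_failed: bool, cap: float = SAFETY_GATE_CAP) -> float:
    """Cap the episode score when any P1 incident failed; otherwise pass it through."""
    return min(raw_score, cap) if p1_failed else raw_score


print(apply_safety_gate(0.85, p1_failed=True))   # capped
print(apply_safety_gate(0.85, p1_failed=False))  # unchanged
print(apply_safety_gate(0.10, p1_failed=True))   # already below the cap
```

Scores below the cap still vary with the agent's other behavior, which is the partial signal the section describes.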
+### Why Phraseology Is Scored
+Real dispatchers are evaluated on radio communication clarity. An agent that
+dispatches the right unit but says nothing (or says the wrong thing) is less
+useful as a CAD copilot than one that also generates the correct radio traffic.
+Phraseology scoring creates an incentive for agents to learn domain language, not
+just resource allocation.
+### Why Waves Spawn at Fixed Steps Not Random Times
+Reproducibility is a first-class requirement. Fixed step offsets guarantee
+identical episode structure across all runs, making score comparisons valid.
+The challenge comes from the agent not knowing wave timing in advance — it
+must react, not plan.
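A fixed-step schedule can be as simple as a lookup table keyed by step number. The sketch below uses the `mass_casualty` offsets listed in docs/tasks.md (step 5 and step 12). The incident type strings and function names are illustrative assumptions, not the fixture format used by `DispatchScenarioFactory`.

```python
# Step offset -> incident types spawned at that step (type names are illustrative).
WAVE_SCHEDULE: dict[int, list[str]] = {
    5: ["structure_fire"],
    12: ["cardiac_arrest", "cardiac_arrest"],
}


def incidents_spawning_at(step: int) -> list[str]:
    """Return the incidents that appear at this step; empty for most steps."""
    return list(WAVE_SCHEDULE.get(step, []))


def spawn_timeline(max_steps: int) -> list[tuple[int, list[str]]]:
    """List every (step, incidents) pair up to max_steps, in order."""
    return [
        (step, incidents_spawning_at(step))
        for step in range(max_steps + 1)
        if incidents_spawning_at(step)
    ]


print(spawn_timeline(60))
```

The table involves no randomness, so every run sees the same timeline, while an agent that is never shown the table still has to react when a wave arrives.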
+## State Machine Transitions
+Unit:     AVAILABLE → DISPATCHED → ON_SCENE → AVAILABLE
+↘️ STAGED     ↗️
+Incident: PENDING → RESPONDING → ON_SCENE → RESOLVED
+↘️ ESCALATED (survival clock expired)
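The transitions above can be written as lookup tables. The diagram does not say where the STAGED and ESCALATED branches attach, so this sketch assumes AVAILABLE → STAGED → DISPATCHED and allows escalation from PENDING or RESPONDING. Cancel and out-of-service transitions are omitted. None of this is taken from `src/state_machine.py`.

```python
UNIT_TRANSITIONS: dict[str, set[str]] = {
    "AVAILABLE": {"DISPATCHED", "STAGED"},
    "STAGED": {"DISPATCHED"},      # assumption: staged units leave via dispatch
    "DISPATCHED": {"ON_SCENE"},
    "ON_SCENE": {"AVAILABLE"},
}

INCIDENT_TRANSITIONS: dict[str, set[str]] = {
    "PENDING": {"RESPONDING", "ESCALATED"},    # assumption: may escalate while unanswered
    "RESPONDING": {"ON_SCENE", "ESCALATED"},   # assumption: may escalate before arrival
    "ON_SCENE": {"RESOLVED"},
    "RESOLVED": set(),    # terminal
    "ESCALATED": set(),   # terminal
}


def can_transition(table: dict[str, set[str]], current: str, target: str) -> bool:
    """True if the table allows moving from current to target."""
    return target in table.get(current, set())


print(can_transition(UNIT_TRANSITIONS, "AVAILABLE", "DISPATCHED"))
print(can_transition(UNIT_TRANSITIONS, "ON_SCENE", "DISPATCHED"))
```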
+## File Map
+| File | Responsibility |
+|---|---|
+| `src/models.py` | All Pydantic data contracts |
+| `src/state_machine.py` | Core simulation engine |
+| `src/protocol.py` | Action legality validation |
+| `src/rewards.py` | Reward calculation |
+| `src/physics.py` | Manhattan distance, ETA, coverage |
+| `src/phraseology.py` | Radio language scoring |
+| `src/city_schema.py` | City topology loader |
+| `src/tasks/registry.py` | Task definitions and fixtures |
+| `src/openenv_environment.py` | OpenEnv API wrapper |
+| `server/app.py` | FastAPI server |

docs/exploit_analysis.md ADDED

	@@ -0,0 +1,46 @@

+# Exploit Analysis — 911 Dispatch Supervisor
+## Known Attack Vectors Considered & Closed
+### 1. Reward Farming via Repeated Dispatch-Cancel Cycles
+**Vector:** Agent dispatches a unit, immediately cancels, re-dispatches to collect
+partial response_time reward on each cycle without ever resolving incidents.
+**Mitigation:** Cancel actions return the unit to AVAILABLE but do NOT reset the
+incident survival clock. The incident continues counting down regardless of agent
+action, so farming cancel-dispatch cycles accelerates incident escalation and
+triggers the Safety Gate, collapsing the score to ≤0.2.
+### 2. Safety Gate Bypass via P2-Only Dispatching
+**Vector:** Agent ignores all P1 incidents and only resolves P2/P3 incidents to
+accumulate triage and response_time rewards without triggering the Safety Gate.
+**Mitigation:** The Safety Gate activates if ANY P1 incident existed during the
+episode and its survival score is 0.0. The agent cannot avoid P1 incidents
+existing — they are spawned deterministically by the scenario fixture.
+### 3. Coverage Score Farming via Staging
+**Vector:** Agent repeatedly stages all units in one district to maximize coverage
+score for that district while ignoring active incidents.
+**Mitigation:** Coverage score is computed across ALL districts simultaneously.
+Concentrating units in one district reduces coverage elsewhere, and staged units
+cannot respond to incidents without an explicit dispatch action, allowing incident
+survival clocks to expire and triggering escalation penalties.
+### 4. Phraseology Score Inflation via Notes Stuffing
+**Vector:** Agent fills the notes field with every possible dispatch phrase to
+maximize token overlap with canonical phrases.
+**Mitigation:** PhraseologyJudge uses token overlap normalized by notes length.
+Stuffing long notes with irrelevant text reduces precision, keeping the score low.
+Only notes that match the specific action type and incident type score highly.
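A minimal sketch of length-normalized token overlap, to show why stuffing lowers the score. The actual `PhraseologyJudge` implementation is not shown in this commit, so the tokenizer, the clipped counting, and the canonical phrase below are all assumptions.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (an assumption)."""
    return text.lower().split()


def overlap_precision(notes: str, canonical: str) -> float:
    """Fraction of note tokens that match the canonical phrase, with clipped counts."""
    note_counts = Counter(tokenize(notes))
    total = sum(note_counts.values())
    if total == 0:
        return 0.0
    canon_counts = Counter(tokenize(canonical))
    # Clipping stops a token repeated in the notes from counting more than once per canonical use.
    hits = sum(min(count, canon_counts[token]) for token, count in note_counts.items())
    return hits / total


canonical = "medic 1 respond priority one cardiac arrest"  # hypothetical canonical phrase
concise = "medic 1 respond cardiac arrest"
stuffed = concise + " engine patrol fire shooting stage cancel mutual aid copy that"

print(overlap_precision(concise, canonical))
print(overlap_precision(stuffed, canonical))
```

The stuffed note has the same five matching tokens as the concise one, but they are divided by fifteen tokens instead of five.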
+### 5. Determinism Exploitation
+**Vector:** Agent memorizes the exact incident sequence (seed=42) and hardcodes
+optimal actions rather than learning dispatch reasoning.
+**Mitigation:** This is intentional for reproducibility. However, the wave spawn
+system introduces timing-dependent incident locations with small perturbations,
+meaning hardcoded action sequences fail when unit positions vary. The environment
+is designed for evaluation, not training-time generalization.
+## Conclusion
+No reward farming exploit was found that allows an agent to score >0.6 without
+genuinely resolving Priority-1 incidents with correct unit types within survival
+clock windows.

docs/reward_design.md CHANGED

@@ -31,9 +31,13 @@ This means even a weak agent that dispatches randomly receives informative gradi
 ## Difficulty Gradient
-| Task | Random Score | Design Intent |
-|---|---|---|
-| single_incident | ~0.20 | Baseline — one decision, one unit, one incident |
-| multi_incident | ~0.31 | Triage required — competing P1 and P2 incidents |
-| mass_casualty | ~0.30 | Adaptability — surprise incident waves mid-episode |
-| shift_surge | ~0.32 | Resource scarcity — units going OOS mid-shift |

+| Task | Random Score | LLM Expected | Design Intent |
+|---|---|---|---|
+| single_incident | 0.20 | 0.55–0.75 | One decision, one unit — tests basic triage |
+| multi_incident | 0.31 | 0.40–0.60 | Competing P1s — tests priority ordering |
+| mass_casualty | ~0.28 | 0.30–0.50 | Surprise waves — tests adaptability |
+| shift_surge | ~0.25 | 0.25–0.40 | Resource scarcity — tests planning under constraint |
+The gap between random and LLM scores is the signal this benchmark measures.
+A model that scores 0.70 on single_incident but 0.25 on shift_surge is demonstrating
+exactly the capability boundary the environment is designed to expose.
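The gap described here can be tabulated from the new table. The sketch below takes the midpoint of each "LLM Expected" range as a single summary value. That choice is an assumption made for illustration, and the `mass_casualty` and `shift_surge` random scores are approximate in the source.

```python
# task -> (random score, (LLM expected low, LLM expected high)), copied from the table
BENCHMARK: dict[str, tuple[float, tuple[float, float]]] = {
    "single_incident": (0.20, (0.55, 0.75)),
    "multi_incident": (0.31, (0.40, 0.60)),
    "mass_casualty": (0.28, (0.30, 0.50)),
    "shift_surge": (0.25, (0.25, 0.40)),
}


def expected_gap(task: str) -> float:
    """Midpoint of the expected LLM range minus the random baseline."""
    random_score, (low, high) = BENCHMARK[task]
    return round((low + high) / 2 - random_score, 3)


for task in BENCHMARK:
    print(f"{task}: {expected_gap(task):+.3f}")
```

Under this summary the gap shrinks from `single_incident` to `shift_surge`, consistent with the table's ordering by difficulty.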

docs/tasks.md ADDED

	@@ -0,0 +1,72 @@

+# Task Reference — 911 Dispatch Supervisor
+## Task 1: `single_incident` — Easy
+**Objective:** Dispatch the correct unit to a single cardiac arrest and resolve it
+before the survival clock expires.
+**Initial State:** 1 incident (cardiac arrest, P1), 3 units available (1 MEDIC, 1 ENGINE, 1 PATROL)
+**Max Steps:** 20 | **Survival Clock:** 600s
+**What a good agent does:** Immediately dispatches MEDIC to the cardiac arrest.
+Does not dispatch ENGINE or PATROL (triage mismatch penalty).
+**What a bad agent does:** Dispatches ENGINE (wrong unit type) and wastes steps until
+the patient's survival clock expires → Safety Gate → score capped at 0.2.
+**Scoring:** 50% resolution + 30% correct unit type + 20% response speed
+---
+## Task 2: `multi_incident` — Medium
+**Objective:** Triage 3 simultaneous incidents with competing priorities.
+**Initial State:** 3 incidents (structure fire P2, cardiac arrest P1, shooting P1),
+6 units available
+**Max Steps:** 40
+**What a good agent does:** Immediately dispatches MEDIC to cardiac arrest and
+PATROL to shooting (both P1), then dispatches ENGINE to structure fire (P2).
+**What a bad agent does:** Dispatches to the fire first (visible/dramatic but P2),
+leaving P1 incidents unattended → Safety Gate.
+**Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty
+---
+## Task 3: `mass_casualty` — Hard
+**Objective:** Manage a building collapse with surprise incident waves.
+**Initial State:** 1 incident (building collapse P1, survival 480s), 7 units
+**Max Steps:** 60
+**Wave spawns:** Step 5 → structure fire; Step 12 → 2× cardiac arrests
+**What a good agent does:** Responds to building collapse immediately, pre-stages
+units for anticipated waves, adapts when cardiac arrests spawn at step 12.
+**What a bad agent does:** Commits all units to building collapse, has no
+available units when cardiac arrests spawn → multiple P1 failures → Safety Gate.
+**Scoring:** 60% P1 survival + 30% mean step reward − failure penalty
+---
+## Task 4: `shift_surge` — Hard
+**Objective:** Maintain coverage as units go out of service mid-shift.
+**Initial State:** 5 units, 0 incidents (board starts empty)
+**Max Steps:** 60 | **Wave spawn:** Every 8 steps | **Survival clock:** 720s
+**OOS events:** 3 units go OUT_OF_SERVICE by step 5
+**What a good agent does:** Anticipates resource scarcity, requests mutual aid
+early, stages remaining units strategically, prioritizes P1 incidents as board fills.
+**What a bad agent does:** Dispatches all units freely in early steps, has no
+coverage when OOS events hit and new incidents spawn simultaneously.
+**Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog +
+10% step reward − 25% escalation penalty
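The `shift_surge` formula can be read as a weighted sum minus a penalty term. The sketch below assumes each component and the escalation rate are already normalized to [0, 1], and that the result is clamped to [0, 1]. The component keys and the clamp are assumptions, since the grader code is not part of this commit.

```python
# Weights copied from the shift_surge scoring line; keys are illustrative names.
SHIFT_SURGE_WEIGHTS: dict[str, float] = {
    "resolution": 0.35,
    "p1_survival": 0.25,
    "coverage": 0.15,
    "backlog": 0.15,
    "step_reward": 0.10,
}
ESCALATION_PENALTY_WEIGHT = 0.25


def shift_surge_score(components: dict[str, float], escalation_rate: float) -> float:
    """Weighted sum of components minus the escalation penalty, clamped to [0, 1]."""
    positive = sum(weight * components[name] for name, weight in SHIFT_SURGE_WEIGHTS.items())
    raw = positive - ESCALATION_PENALTY_WEIGHT * escalation_rate
    return max(0.0, min(1.0, raw))  # clamp is an assumption


perfect = {name: 1.0 for name in SHIFT_SURGE_WEIGHTS}
middling = {name: 0.5 for name in SHIFT_SURGE_WEIGHTS}
print(shift_surge_score(perfect, escalation_rate=0.0))
print(shift_surge_score(middling, escalation_rate=0.4))
```

The positive weights sum to 1.0, so the penalty is the only term that can pull an otherwise perfect episode below the maximum.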

src/models.py CHANGED

@@ -80,6 +80,10 @@ class Observation(BaseModel):

     issues: list[str] = Field(default_factory=list)
     reward_breakdown: dict[str, float] | None = None
     phraseology_score: float = 0.0
+    active_p1_count: int = 0
+    units_available: int = 0
+    step_count: int = 0
+    episode_done: bool = False
 class UnitState(BaseModel):

src/openenv_environment.py CHANGED

@@ -64,12 +64,28 @@ class OpenEnvEnvironment:
         self._state.metadata["episode_score"] = episode_score
         done = self._machine.is_terminal(state)
+        active_p1_count = sum(
+            1
+            for i in state.incidents.values()
+            if i.severity.value == "PRIORITY_1" and i.status.value not in {"RESOLVED", "ESCALATED"}
+        )
+        units_available = sum(1 for u in state.units.values() if u.status.value == "AVAILABLE")
         phraseology = 0.0
         if obs.reward_breakdown:
             phraseology = obs.reward_breakdown.get("protocol", 0.0)
-        obs = obs.model_copy(update={"score": episode_score, "phraseology_score": phraseology})
+        obs = obs.model_copy(
+            update={
+                "score": episode_score,
+                "phraseology_score": phraseology,
+                "active_p1_count": active_p1_count,
+                "units_available": units_available,
+                "step_count": state.step_count,
+                "episode_done": done,
+            }
+        )
         self._last_observation = obs
         return obs, step_reward, done
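The two new counters can be exercised in isolation. The sketch below mirrors the counting logic using stand-in enums and dataclasses, since the real models in `src/models.py` are not fully shown here. All class names and enum members are assumptions. It compares enum members directly instead of `.value` strings, which is one way to avoid typos in string literals.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    PRIORITY_1 = "PRIORITY_1"
    PRIORITY_2 = "PRIORITY_2"


class IncidentStatus(Enum):
    PENDING = "PENDING"
    RESPONDING = "RESPONDING"
    RESOLVED = "RESOLVED"
    ESCALATED = "ESCALATED"


class UnitStatus(Enum):
    AVAILABLE = "AVAILABLE"
    DISPATCHED = "DISPATCHED"


@dataclass
class Incident:
    severity: Severity
    status: IncidentStatus


@dataclass
class Unit:
    status: UnitStatus


CLOSED = {IncidentStatus.RESOLVED, IncidentStatus.ESCALATED}


def count_active_p1(incidents: dict[str, Incident]) -> int:
    """Count P1 incidents that are neither resolved nor escalated."""
    return sum(
        1 for i in incidents.values()
        if i.severity is Severity.PRIORITY_1 and i.status not in CLOSED
    )


def count_available(units: dict[str, Unit]) -> int:
    """Count units currently in the AVAILABLE state."""
    return sum(1 for u in units.values() if u.status is UnitStatus.AVAILABLE)


incidents = {
    "a": Incident(Severity.PRIORITY_1, IncidentStatus.PENDING),
    "b": Incident(Severity.PRIORITY_1, IncidentStatus.RESOLVED),
    "c": Incident(Severity.PRIORITY_2, IncidentStatus.RESPONDING),
}
units = {"m1": Unit(UnitStatus.AVAILABLE), "e1": Unit(UnitStatus.DISPATCHED)}
print(count_active_p1(incidents), count_available(units))
```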