Abhinav31122006 committed
Commit 0b2675d · 1 Parent(s): c8b8245

feat: exploit analysis, architecture docs, observation depth, citation

CITATION.md ADDED
@@ -0,0 +1,27 @@
+ # Citation
+
+ If you use 911 Dispatch Supervisor in your research, please cite:
+
+ ```bibtex
+ @software{dispatch_supervisor_2025,
+   title = {911 Dispatch Supervisor: An Open RL Benchmark for Emergency Dispatch Decision-Making},
+   year = {2025},
+   url = {https://huggingface.co/spaces/garvitsachdeva/911},
+   note = {OpenEnv-compatible environment for training and evaluating LLM agents
+           on real-world emergency resource allocation under time pressure}
+ }
+ ```
+
+ ## Research Applications
+
+ This environment is designed to support research in:
+
+ - LLM decision-making under constraint: fixed action budgets, time pressure
+ - Multi-objective RL: non-sparse rewards with competing components
+ - AI safety evaluation: hard constraints (Safety Gate) that cannot be gamed
+ - Human-AI collaboration: dispatch copilot systems for public safety
+
+ ## Dataset & Reproducibility
+
+ All episodes are fully deterministic under seed=42. The random agent baseline
+ produces identical scores on every run, enabling valid cross-environment comparisons.
docs/architecture.md ADDED
@@ -0,0 +1,66 @@
+ # Architecture: 911 Dispatch Supervisor
+
+ ## Layer Overview
+ OpenEnvEnvironment ← public API (reset/step/state/legal_actions)
+ │
+ DispatchStateMachine ← simulation engine
+ ├── DispatchProtocolValidator ← action legality (15+ rules)
+ ├── RewardCalculator ← 5-component weighted reward
+ └── DispatchScenarioFactory ← deterministic task fixtures
+ │
+ Task-Specific Graders ← episode-level scoring
+
+ ## Key Design Decisions
+
+ ### Why Manhattan Distance Physics
+ Travel along real city blocks follows Manhattan (rectilinear) distance.
+ For uniformly distributed headings, Manhattan distance is ~27% longer than
+ Euclidean distance on average (a factor of 4/π), so Euclidean physics would
+ make ETAs unrealistically optimistic. Manhattan physics produce ETAs
+ that match real CAD system calculations.
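
The ~27% figure in the Manhattan-distance rationale can be checked numerically. The sketch below is illustrative only: `manhattan`, `euclidean`, and `mean_ratio` are hypothetical helpers, not functions taken from `src/physics.py`, and it assumes travel headings are uniformly distributed, in which case the mean Manhattan-to-Euclidean ratio is 4/π ≈ 1.27.

```python
import math


def manhattan(a, b):
    """Rectilinear distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def euclidean(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def mean_ratio(samples=3600):
    """Average Manhattan/Euclidean ratio over evenly spaced headings.

    Assumption: every heading is equally likely. Real street grids and
    call distributions will deviate from this.
    """
    origin = (0.0, 0.0)
    total = 0.0
    for k in range(samples):
        theta = 2 * math.pi * k / samples
        target = (math.cos(theta), math.sin(theta))
        total += manhattan(origin, target) / euclidean(origin, target)
    return total / samples


ratio = mean_ratio()
print(f"mean Manhattan/Euclidean ratio: {ratio:.4f}")
print(f"analytic value 4/pi:            {4 / math.pi:.4f}")
```

Read the other way around, a Euclidean estimate is about 21% shorter than the Manhattan distance under the same assumption, which is why the two percentages should not be used interchangeably.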
+
+ ### Why Legal Actions Are Pre-filtered
+ Rather than letting agents propose arbitrary actions and penalizing illegal
+ ones, the environment exposes only currently-valid actions via `legal_actions()`.
+ This eliminates wasted LLM budget on invalid action generation and focuses
+ evaluation on dispatch decision quality, not action syntax compliance.
+
+ ### Why the Safety Gate Uses 0.2 Not 0.0
+ A hard zero for any P1 failure would make the reward surface completely flat
+ for bad agents, with no gradient to learn from. Capping at 0.2 preserves partial
+ signal (coverage, response time on other incidents) while making P1 failure
+ unambiguously catastrophic. Real dispatch accountability works the same way:
+ an incident review happens, but other good work is still acknowledged.
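
A minimal sketch of the capping behavior described in the Safety Gate rationale, assuming the gate triggers when any P1 incident ends with a survival score of 0.0 (the condition stated in `docs/exploit_analysis.md`). The function name, signature, and constant are hypothetical; the actual grader code is not part of this commit.

```python
SAFETY_GATE_CAP = 0.2  # cap value named in the design note


def apply_safety_gate(raw_score, p1_survival_scores):
    """Cap the episode score if any P1 incident has survival score 0.0.

    raw_score: episode score before the gate (assumed to lie in [0, 1]).
    p1_survival_scores: one survival score per P1 incident in the episode.
    """
    gate_triggered = any(score == 0.0 for score in p1_survival_scores)
    if gate_triggered:
        # min() keeps scores already below the cap, so partial signal
        # from other components still orders failing agents.
        return min(raw_score, SAFETY_GATE_CAP)
    return raw_score


capped = apply_safety_gate(0.8, [1.0, 0.0])    # one P1 lost
uncapped = apply_safety_gate(0.8, [1.0, 0.5])  # all P1s survived
```

Using `min` rather than overwriting the score with 0.2 is a design choice of this sketch: it keeps a weak agent that scores 0.1 distinguishable from one that scores 0.2 after a P1 failure.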
+
+ ### Why Phraseology Is Scored
+ Real dispatchers are evaluated on radio communication clarity. An agent that
+ dispatches the right unit but says nothing (or says the wrong thing) is less
+ useful as a CAD copilot than one that also generates the correct radio traffic.
+ Phraseology scoring creates incentive for agents to learn domain language, not
+ just resource allocation.
+
+ ### Why Waves Spawn at Fixed Steps Not Random Times
+ Reproducibility is a first-class requirement. Fixed step offsets guarantee
+ identical episode structure across all runs, making score comparisons valid.
+ The challenge comes from the agent not knowing wave timing in advance: it
+ must react, not plan.
+
+ ## State Machine Transitions
+ Unit: AVAILABLE → DISPATCHED → ON_SCENE → AVAILABLE
+ ↘ STAGED ↗
+ Incident: PENDING → RESPONDING → ON_SCENE → RESOLVED
+ ↘ ESCALATED (survival clock expired)
+
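
One way to encode the transition diagram is as a lookup table of allowed next states. This is a sketch under stated assumptions, not the contents of `src/state_machine.py`: the enum and table names are hypothetical, the diagram does not say exactly where the STAGED and ESCALATED branches attach, so the sketch assumes a unit can be staged from AVAILABLE and leave STAGED for DISPATCHED or AVAILABLE, and that an incident can escalate from any unresolved state. The DISPATCHED → AVAILABLE edge reflects the cancel action described in `docs/exploit_analysis.md`.

```python
from enum import Enum


class UnitStatus(Enum):
    AVAILABLE = "AVAILABLE"
    DISPATCHED = "DISPATCHED"
    ON_SCENE = "ON_SCENE"
    STAGED = "STAGED"


class IncidentStatus(Enum):
    PENDING = "PENDING"
    RESPONDING = "RESPONDING"
    ON_SCENE = "ON_SCENE"
    RESOLVED = "RESOLVED"
    ESCALATED = "ESCALATED"


# Allowed next states per current state (STAGED edges are assumptions).
UNIT_TRANSITIONS = {
    UnitStatus.AVAILABLE: {UnitStatus.DISPATCHED, UnitStatus.STAGED},
    UnitStatus.STAGED: {UnitStatus.DISPATCHED, UnitStatus.AVAILABLE},
    UnitStatus.DISPATCHED: {UnitStatus.ON_SCENE, UnitStatus.AVAILABLE},
    UnitStatus.ON_SCENE: {UnitStatus.AVAILABLE},
}

# RESOLVED and ESCALATED are terminal, so they have no entry here.
INCIDENT_TRANSITIONS = {
    IncidentStatus.PENDING: {IncidentStatus.RESPONDING, IncidentStatus.ESCALATED},
    IncidentStatus.RESPONDING: {IncidentStatus.ON_SCENE, IncidentStatus.ESCALATED},
    IncidentStatus.ON_SCENE: {IncidentStatus.RESOLVED, IncidentStatus.ESCALATED},
}


def can_transition(table, current, target):
    """Return True if the table allows moving from current to target."""
    return target in table.get(current, set())
```

A table like this makes illegal moves cheap to reject, for example `can_transition(UNIT_TRANSITIONS, UnitStatus.ON_SCENE, UnitStatus.DISPATCHED)` is false because a unit must return to AVAILABLE first.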
+ ## File Map
+
+ | File | Responsibility |
+ |---|---|
+ | `src/models.py` | All Pydantic data contracts |
+ | `src/state_machine.py` | Core simulation engine |
+ | `src/protocol.py` | Action legality validation |
+ | `src/rewards.py` | Reward calculation |
+ | `src/physics.py` | Manhattan distance, ETA, coverage |
+ | `src/phraseology.py` | Radio language scoring |
+ | `src/city_schema.py` | City topology loader |
+ | `src/tasks/registry.py` | Task definitions and fixtures |
+ | `src/openenv_environment.py` | OpenEnv API wrapper |
+ | `server/app.py` | FastAPI server |
docs/exploit_analysis.md ADDED
@@ -0,0 +1,46 @@
+ # Exploit Analysis: 911 Dispatch Supervisor
+
+ ## Known Attack Vectors Considered & Closed
+
+ ### 1. Reward Farming via Repeated Dispatch-Cancel Cycles
+ **Vector:** Agent dispatches a unit, immediately cancels, re-dispatches to collect
+ partial response_time reward on each cycle without ever resolving incidents.
+ **Mitigation:** Cancel actions return the unit to AVAILABLE but do NOT reset the
+ incident survival clock. The incident continues counting down regardless of agent
+ action, so farming cancel-dispatch cycles accelerates incident escalation and
+ triggers the Safety Gate, collapsing the score to ≤0.2.
+
+ ### 2. Safety Gate Bypass via P2-Only Dispatching
+ **Vector:** Agent ignores all P1 incidents and only resolves P2/P3 incidents to
+ accumulate triage and response_time rewards without triggering the Safety Gate.
+ **Mitigation:** The Safety Gate activates if ANY P1 incident existed during the
+ episode and its survival score is 0.0. The agent cannot avoid P1 incidents
+ existing: they are spawned deterministically by the scenario fixture.
+
+ ### 3. Coverage Score Farming via Staging
+ **Vector:** Agent repeatedly stages all units in one district to maximize coverage
+ score for that district while ignoring active incidents.
+ **Mitigation:** Coverage score is computed across ALL districts simultaneously.
+ Concentrating units in one district reduces coverage elsewhere, and staged units
+ cannot respond to incidents without an explicit dispatch action, allowing incident
+ survival clocks to expire and triggering escalation penalties.
+
+ ### 4. Phraseology Score Inflation via Notes Stuffing
+ **Vector:** Agent fills the notes field with every possible dispatch phrase to
+ maximize token overlap with canonical phrases.
+ **Mitigation:** PhraseologyJudge uses token overlap normalized by notes length.
+ Stuffing long notes with irrelevant text reduces precision, keeping the score low.
+ Only notes that match the specific action type and incident type score highly.
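
The length-normalized overlap in mitigation 4 can be sketched as a precision score. This is an assumption-laden illustration, not the actual `PhraseologyJudge`: the function name, whitespace tokenization, and the canonical phrase are all hypothetical, and each canonical token is counted at most once so that repeating a matching word does not help either.

```python
def phraseology_precision(notes, canonical_phrase):
    """Share of note tokens that are distinct canonical-phrase tokens.

    Dividing by the full note length means every extra token, relevant
    or not, lowers the score once the canonical tokens are covered.
    """
    notes_tokens = notes.lower().split()
    if not notes_tokens:
        return 0.0
    canonical_tokens = set(canonical_phrase.lower().split())
    matched = len(set(notes_tokens) & canonical_tokens)
    return matched / len(notes_tokens)


# Hypothetical canonical phrase for a MEDIC dispatch to a cardiac arrest.
CANONICAL = "medic 1 respond to cardiac arrest"

focused = phraseology_precision("Medic 1 respond to cardiac arrest", CANONICAL)
stuffed = phraseology_precision(
    "medic 1 respond to cardiac arrest "
    "engine 2 respond to structure fire "
    "patrol 3 respond to shooting",
    CANONICAL,
)
print(focused, stuffed)
```

The stuffed note still contains every canonical token, but those six matches are spread over seventeen tokens, so its score falls well below the focused note's.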
+
+ ### 5. Determinism Exploitation
+ **Vector:** Agent memorizes the exact incident sequence (seed=42) and hardcodes
+ optimal actions rather than learning dispatch reasoning.
+ **Mitigation:** This is intentional for reproducibility. However, the wave spawn
+ system introduces timing-dependent incident locations with small perturbations,
+ meaning hardcoded action sequences fail when unit positions vary. The environment
+ is designed for evaluation, not training-time generalization.
+
+ ## Conclusion
+ No reward farming exploit was found that allows an agent to score >0.6 without
+ genuinely resolving Priority-1 incidents with correct unit types within survival
+ clock windows.
docs/reward_design.md CHANGED
@@ -31,9 +31,13 @@ This means even a weak agent that dispatches randomly receives informative gradi
 
  ## Difficulty Gradient
 
- | Task | Random Score | Design Intent |
- |---|---|---|
- | single_incident | ~0.20 | Baseline: one decision, one unit, one incident |
- | multi_incident | ~0.31 | Triage required: competing P1 and P2 incidents |
- | mass_casualty | ~0.30 | Adaptability: surprise incident waves mid-episode |
- | shift_surge | ~0.32 | Resource scarcity: units going OOS mid-shift |
+ | Task | Random Score | LLM Expected | Design Intent |
+ |---|---|---|---|
+ | single_incident | 0.20 | 0.55–0.75 | One decision, one unit: tests basic triage |
+ | multi_incident | 0.31 | 0.40–0.60 | Competing P1s: tests priority ordering |
+ | mass_casualty | ~0.28 | 0.30–0.50 | Surprise waves: tests adaptability |
+ | shift_surge | ~0.25 | 0.25–0.40 | Resource scarcity: tests planning under constraint |
+
+ The gap between random and LLM scores is the signal this benchmark measures.
+ A model that scores 0.70 on single_incident but 0.25 on shift_surge is demonstrating
+ exactly the capability boundary the environment is designed to expose.
docs/tasks.md ADDED
@@ -0,0 +1,72 @@
+ # Task Reference: 911 Dispatch Supervisor
+
+ ## Task 1: `single_incident` (Easy)
+
+ **Objective:** Dispatch the correct unit to a single cardiac arrest and resolve it
+ before the survival clock expires.
+
+ **Initial State:** 1 incident (cardiac arrest, P1), 3 units available (1 MEDIC, 1 ENGINE, 1 PATROL)
+ **Max Steps:** 20 | **Survival Clock:** 600s
+
+ **What a good agent does:** Immediately dispatches MEDIC to the cardiac arrest.
+ Does not dispatch ENGINE or PATROL (triage mismatch penalty).
+
+ **What a bad agent does:** Dispatches ENGINE (wrong unit type), wastes steps,
+ patient survival clock expires → Safety Gate → score capped at 0.2.
+
+ **Scoring:** 50% resolution + 30% correct unit type + 20% response speed
+
+ ---
+
+ ## Task 2: `multi_incident` (Medium)
+
+ **Objective:** Triage 3 simultaneous incidents with competing priorities.
+
+ **Initial State:** 3 incidents (structure fire P2, cardiac arrest P1, shooting P1),
+ 6 units available
+ **Max Steps:** 40
+
+ **What a good agent does:** Immediately dispatches MEDIC to cardiac arrest and
+ PATROL to shooting (both P1), then dispatches ENGINE to structure fire (P2).
+
+ **What a bad agent does:** Dispatches to the fire first (visible/dramatic but P2),
+ leaving P1 incidents unattended → Safety Gate.
+
+ **Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty
+
+ ---
+
+ ## Task 3: `mass_casualty` (Hard)
+
+ **Objective:** Manage a building collapse with surprise incident waves.
+
+ **Initial State:** 1 incident (building collapse P1, survival 480s), 7 units
+ **Max Steps:** 60
+ **Wave spawns:** Step 5 → structure fire; Step 12 → 2× cardiac arrests
+
+ **What a good agent does:** Responds to building collapse immediately, pre-stages
+ units for anticipated waves, adapts when cardiac arrests spawn at step 12.
+
+ **What a bad agent does:** Commits all units to building collapse, has no
+ available units when cardiac arrests spawn → multiple P1 failures → Safety Gate.
+
+ **Scoring:** 60% P1 survival + 30% mean step reward − failure penalty
+
+ ---
+
+ ## Task 4: `shift_surge` (Hard)
+
+ **Objective:** Maintain coverage as units go out of service mid-shift.
+
+ **Initial State:** 5 units, 0 incidents (board starts empty)
+ **Max Steps:** 60 | **Wave spawn:** Every 8 steps | **Survival clock:** 720s
+ **OOS events:** 3 units go OUT_OF_SERVICE by step 5
+
+ **What a good agent does:** Anticipates resource scarcity, requests mutual aid
+ early, stages remaining units strategically, prioritizes P1 incidents as board fills.
+
+ **What a bad agent does:** Dispatches all units freely in early steps, has no
+ coverage when OOS events hit and new incidents spawn simultaneously.
+
+ **Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog +
+ 10% step reward − 25% escalation penalty
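
The `shift_surge` scoring line can be read as a weighted sum minus a penalty. The sketch below uses the weights listed for Task 4 but is otherwise hypothetical: the function and key names are invented for illustration, each component and the escalation rate are assumed to lie in [0, 1], and clamping the result to [0, 1] is an assumption rather than documented grader behavior.

```python
# Weights copied from the Task 4 scoring line; key names are illustrative.
SHIFT_SURGE_WEIGHTS = {
    "resolution": 0.35,
    "p1_survival": 0.25,
    "coverage": 0.15,
    "backlog": 0.15,
    "step_reward": 0.10,
}
ESCALATION_PENALTY_WEIGHT = 0.25


def shift_surge_score(components, escalation_rate):
    """Weighted sum of components minus the escalation penalty.

    components: dict with one value in [0, 1] per key in SHIFT_SURGE_WEIGHTS.
    escalation_rate: assumed fraction of incidents that escalated, in [0, 1].
    """
    positive = sum(
        weight * components[name] for name, weight in SHIFT_SURGE_WEIGHTS.items()
    )
    raw = positive - ESCALATION_PENALTY_WEIGHT * escalation_rate
    # Clamp so a heavy penalty cannot push the score below zero (assumption).
    return max(0.0, min(1.0, raw))


perfect = shift_surge_score({name: 1.0 for name in SHIFT_SURGE_WEIGHTS}, 0.0)
half_escalated = shift_surge_score({name: 1.0 for name in SHIFT_SURGE_WEIGHTS}, 0.5)
```

Under these assumptions the positive weights sum to 1.0, so an otherwise perfect episode in which half the incidents escalate would land at 0.875 before any Safety Gate cap is applied.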
src/models.py CHANGED
@@ -80,6 +80,10 @@ class Observation(BaseModel):
      issues: list[str] = Field(default_factory=list)
      reward_breakdown: dict[str, float] | None = None
      phraseology_score: float = 0.0
+     active_p1_count: int = 0
+     units_available: int = 0
+     step_count: int = 0
+     episode_done: bool = False
 
 
  class UnitState(BaseModel):
src/openenv_environment.py CHANGED
@@ -64,12 +64,28 @@ class OpenEnvEnvironment:
          self._state.metadata["episode_score"] = episode_score
 
          done = self._machine.is_terminal(state)
+
+         active_p1_count = sum(
+             1
+             for i in state.incidents.values()
+             if i.severity.value == "PRIORITY_1" and i.status.value not in {"RESOLVED", "ESCALATED"}
+         )
+         units_available = sum(1 for u in state.units.values() if u.status.value == "AVAILABLE")
 
          phraseology = 0.0
          if obs.reward_breakdown:
              phraseology = obs.reward_breakdown.get("protocol", 0.0)
-
-         obs = obs.model_copy(update={"score": episode_score, "phraseology_score": phraseology})
+
+         obs = obs.model_copy(
+             update={
+                 "score": episode_score,
+                 "phraseology_score": phraseology,
+                 "active_p1_count": active_p1_count,
+                 "units_available": units_available,
+                 "step_count": state.step_count,
+                 "episode_done": done,
+             }
+         )
          self._last_observation = obs
          return obs, step_reward, done
 
91