Abhinav31122006 committed
Commit · 0b2675d
Parent(s): c8b8245
feat: exploit analysis, architecture docs, observation depth, citation
- CITATION.md +27 -0
- docs/architecture.md +66 -0
- docs/exploit_analysis.md +46 -0
- docs/reward_design.md +10 -6
- docs/tasks.md +72 -0
- src/models.py +4 -0
- src/openenv_environment.py +18 -2
CITATION.md
ADDED
@@ -0,0 +1,27 @@
# Citation

If you use 911 Dispatch Supervisor in your research, please cite:

```bibtex
@software{dispatch_supervisor_2025,
  title = {911 Dispatch Supervisor: An Open RL Benchmark for Emergency Dispatch Decision-Making},
  year = {2025},
  url = {https://huggingface.co/spaces/garvitsachdeva/911},
  note = {OpenEnv-compatible environment for training and evaluating LLM agents
          on real-world emergency resource allocation under time pressure}
}
```

## Research Applications

This environment is designed to support research in:

- LLM decision-making under constraint - fixed action budgets, time pressure
- Multi-objective RL - non-sparse rewards with competing components
- AI safety evaluation - hard constraints (Safety Gate) that cannot be gamed
- Human-AI collaboration - dispatch copilot systems for public safety

## Dataset & Reproducibility

All episodes are fully deterministic under seed=42. The random agent baseline
produces identical scores on every run, enabling valid cross-environment comparisons.
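
One way to check the determinism claim is to run the same seeded agent twice and compare the scores. The sketch below uses a toy episode function as a stand-in for the real environment; `run_random_episode` and its action names are hypothetical and not part of this repository.

```python
import random


def run_random_episode(seed: int, n_steps: int = 20) -> float:
    """Toy stand-in for one episode driven by a seeded random agent."""
    rng = random.Random(seed)  # private RNG, so no global state leaks between runs
    dispatches = 0
    for _ in range(n_steps):
        action = rng.choice(["dispatch", "stage", "cancel", "hold"])
        if action == "dispatch":
            dispatches += 1
    # Hypothetical score: fraction of steps spent dispatching.
    return dispatches / n_steps


first = run_random_episode(seed=42)
second = run_random_episode(seed=42)
assert first == second, "same seed must give the same score"
```

The same pattern applies to the real environment, provided every source of randomness is driven by the seed.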
docs/architecture.md
ADDED
@@ -0,0 +1,66 @@
# Architecture - 911 Dispatch Supervisor

## Layer Overview
OpenEnvEnvironment - public API (reset/step/state/legal_actions)
↓
DispatchStateMachine - simulation engine
├── DispatchProtocolValidator - action legality (15+ rules)
├── RewardCalculator - 5-component weighted reward
└── DispatchScenarioFactory - deterministic task fixtures
↓
Task-Specific Graders - episode-level scoring

## Key Design Decisions

### Why Manhattan Distance Physics
Travel on a city street grid follows Manhattan (rectilinear) distance.
Averaged over travel directions, Manhattan distance is about 27% longer than
Euclidean distance (a factor of 4/π), so Euclidean ETAs would be unrealistically
optimistic. Manhattan physics produces ETAs closer to what real CAD systems
calculate.
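
The difference between the two metrics can be sketched as follows. The helper names and the speed of 2 blocks per minute are assumptions for illustration; the actual `src/physics.py` implementation is not shown in this commit.

```python
import math


def manhattan(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Rectilinear distance: travel along grid streets only."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def euclidean(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Straight-line distance, ignoring the street grid."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def eta_minutes(a, b, speed_blocks_per_min: float = 2.0) -> float:
    """ETA under Manhattan physics; the speed value is hypothetical."""
    return manhattan(a, b) / speed_blocks_per_min


def mean_manhattan_over_euclidean(samples: int = 3600) -> float:
    """Average of |cos t| + |sin t| over evenly spaced travel directions."""
    total = 0.0
    for k in range(samples):
        t = 2 * math.pi * k / samples
        total += abs(math.cos(t)) + abs(math.sin(t))
    return total / samples


station, incident = (0, 0), (3, 4)
print(manhattan(station, incident), euclidean(station, incident))
print(eta_minutes(station, incident))
print(mean_manhattan_over_euclidean())  # close to 4/pi
```

For the 3-by-4 block trip above, the grid route is 7 blocks while the straight line is 5, which is why a Euclidean ETA would look faster than any real route.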

### Why Legal Actions Are Pre-filtered
Rather than letting agents propose arbitrary actions and penalizing illegal
ones, the environment exposes only currently valid actions via `legal_actions()`.
This eliminates wasted LLM budget on invalid action generation and focuses
evaluation on dispatch decision quality, not action syntax compliance.
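
An agent loop built on pre-filtered actions might look like the sketch below. `ToyDispatchEnv` and `Action` are hypothetical stand-ins; only the method names `reset`, `step`, and `legal_actions` and the `(obs, step_reward, done)` return shape are taken from this commit.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str
    unit_id: str
    incident_id: str


class ToyDispatchEnv:
    """Hypothetical stand-in; only the method names mirror the layer overview."""

    def __init__(self) -> None:
        self.available: set[str] = set()
        self.pending: set[str] = set()

    def reset(self) -> dict:
        self.available = {"MEDIC-1", "ENGINE-1"}
        self.pending = {"INC-1", "INC-2"}
        return {"pending": len(self.pending)}

    def legal_actions(self) -> list[Action]:
        # Only pairs of a free unit and an open incident are offered.
        return [
            Action("dispatch", unit, incident)
            for unit in sorted(self.available)
            for incident in sorted(self.pending)
        ]

    def step(self, action: Action):
        if action not in self.legal_actions():
            raise ValueError(f"illegal action: {action}")
        self.available.discard(action.unit_id)
        self.pending.discard(action.incident_id)
        done = not self.pending or not self.available
        return {"pending": len(self.pending)}, 1.0, done


def run_episode(seed: int = 42) -> int:
    """Pick uniformly among legal actions until the episode ends."""
    rng = random.Random(seed)
    env = ToyDispatchEnv()
    env.reset()
    steps, done = 0, False
    while not done:
        action = rng.choice(env.legal_actions())
        _, _, done = env.step(action)
        steps += 1
    return steps


print(run_episode())
```

Because the agent only ever samples from `legal_actions()`, the `ValueError` branch is never reached, so every step is spent on a dispatch decision rather than on action syntax.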

### Why the Safety Gate Uses 0.2 Not 0.0
A hard zero for any P1 failure would make the reward surface completely flat
for bad agents, leaving no gradient to learn from. Capping at 0.2 preserves partial
signal (coverage, response time on other incidents) while making P1 failure
unambiguously catastrophic. Real dispatch accountability works the same way:
an incident review happens, but other good work is still acknowledged.
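
A cap of this kind can be sketched as a `min` over the raw score. The function name and signature are assumptions; the commit states only that the score is capped at 0.2 after a P1 failure.

```python
SAFETY_GATE_CAP = 0.2  # value taken from the docs; the constant name is hypothetical


def apply_safety_gate(raw_score: float, p1_failed: bool, cap: float = SAFETY_GATE_CAP) -> float:
    """Cap the episode score when a P1 incident failed; otherwise pass it through."""
    return min(raw_score, cap) if p1_failed else raw_score


# Below the cap, two failing agents are still distinguishable.
print(apply_safety_gate(0.10, p1_failed=True))
print(apply_safety_gate(0.15, p1_failed=True))
# A strong score with a P1 failure collapses to the cap.
print(apply_safety_gate(0.70, p1_failed=True))
```

With a hard zero, all three calls would return the same value and the weaker agents would receive no signal about which behaviour was less bad.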

### Why Phraseology Is Scored
Real dispatchers are evaluated on radio communication clarity. An agent that
dispatches the right unit but says nothing (or says the wrong thing) is less
useful as a CAD copilot than one that also generates the correct radio traffic.
Phraseology scoring creates an incentive for agents to learn domain language, not
just resource allocation.

### Why Waves Spawn at Fixed Steps Not Random Times
Reproducibility is a first-class requirement. Fixed step offsets guarantee
identical episode structure across all runs, making score comparisons valid.
The challenge comes from the agent not knowing wave timing in advance: it
must react, not plan.

## State Machine Transitions
Unit: AVAILABLE → DISPATCHED → ON_SCENE → AVAILABLE
↘ STAGED ↗
Incident: PENDING → RESPONDING → ON_SCENE → RESOLVED
↘ ESCALATED (survival clock expired)
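
The incident lifecycle can be expressed as a transition table. This is a minimal sketch: the diagram does not say from which states ESCALATED is reachable, so the table below assumes PENDING and RESPONDING, and the `advance` helper is hypothetical.

```python
from enum import Enum


class IncidentStatus(Enum):
    PENDING = "PENDING"
    RESPONDING = "RESPONDING"
    ON_SCENE = "ON_SCENE"
    RESOLVED = "RESOLVED"
    ESCALATED = "ESCALATED"


S = IncidentStatus

# Assumption: escalation can happen before a unit arrives on scene.
ALLOWED = {
    S.PENDING: {S.RESPONDING, S.ESCALATED},
    S.RESPONDING: {S.ON_SCENE, S.ESCALATED},
    S.ON_SCENE: {S.RESOLVED},
    S.RESOLVED: set(),   # terminal
    S.ESCALATED: set(),  # terminal
}


def advance(current: IncidentStatus, target: IncidentStatus) -> IncidentStatus:
    """Return the new status, or raise if the transition is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


status = S.PENDING
for nxt in (S.RESPONDING, S.ON_SCENE, S.RESOLVED):
    status = advance(status, nxt)
print(status.name)
```

Keeping the table as data makes it easy to test that terminal states have no outgoing edges.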

## File Map

| File | Responsibility |
|---|---|
| `src/models.py` | All Pydantic data contracts |
| `src/state_machine.py` | Core simulation engine |
| `src/protocol.py` | Action legality validation |
| `src/rewards.py` | Reward calculation |
| `src/physics.py` | Manhattan distance, ETA, coverage |
| `src/phraseology.py` | Radio language scoring |
| `src/city_schema.py` | City topology loader |
| `src/tasks/registry.py` | Task definitions and fixtures |
| `src/openenv_environment.py` | OpenEnv API wrapper |
| `server/app.py` | FastAPI server |
docs/exploit_analysis.md
ADDED
@@ -0,0 +1,46 @@
# Exploit Analysis - 911 Dispatch Supervisor

## Known Attack Vectors Considered & Closed

### 1. Reward Farming via Repeated Dispatch-Cancel Cycles
**Vector:** Agent dispatches a unit, immediately cancels, re-dispatches to collect
partial response_time reward on each cycle without ever resolving incidents.
**Mitigation:** Cancel actions return the unit to AVAILABLE but do NOT reset the
incident survival clock. The incident continues counting down regardless of agent
action, so farming cancel-dispatch cycles accelerates incident escalation and
triggers the Safety Gate, collapsing the score to ≤0.2.
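
The mitigation can be illustrated with a toy incident whose clock ticks on every step no matter what the agent does. The class, the helper names, and the 30 seconds per step are assumptions for illustration; the 600s clock is borrowed from the `single_incident` task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyIncident:
    survival_clock_s: int
    assigned_unit: Optional[str] = None
    escalated: bool = False


def dispatch(incident: ToyIncident, unit_id: str) -> None:
    incident.assigned_unit = unit_id


def cancel(incident: ToyIncident) -> None:
    # The unit is released, but the survival clock is deliberately left untouched.
    incident.assigned_unit = None


def tick(incident: ToyIncident, seconds: int) -> None:
    incident.survival_clock_s = max(0, incident.survival_clock_s - seconds)
    if incident.survival_clock_s == 0:
        incident.escalated = True


def farm_cycles(incident: ToyIncident, cycles: int, seconds_per_step: int = 30) -> None:
    """Repeat dispatch-then-cancel; each action costs one step of clock time."""
    for _ in range(cycles):
        dispatch(incident, "MEDIC-1")
        tick(incident, seconds_per_step)
        cancel(incident)
        tick(incident, seconds_per_step)


incident = ToyIncident(survival_clock_s=600)
farm_cycles(incident, cycles=10)
print(incident.survival_clock_s, incident.escalated)
```

Ten cycles consume 20 steps of 30 seconds each, which exhausts the 600-second clock and escalates the incident without it ever being resolved.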

### 2. Safety Gate Bypass via P2-Only Dispatching
**Vector:** Agent ignores all P1 incidents and only resolves P2/P3 incidents to
accumulate triage and response_time rewards without triggering the Safety Gate.
**Mitigation:** The Safety Gate activates if ANY P1 incident existed during the
episode and its survival score is 0.0. The agent cannot prevent P1 incidents
from existing: they are spawned deterministically by the scenario fixture.
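
The trigger condition can be sketched as a single `any` over episode outcomes. `IncidentOutcome` and its fields are hypothetical; only the rule itself (any P1 incident with a survival score of 0.0) comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentOutcome:
    priority: str          # "P1", "P2" or "P3"
    survival_score: float  # 0.0 means the survival clock expired


def safety_gate_triggered(outcomes: list[IncidentOutcome]) -> bool:
    """True if any P1 incident in the episode ended with survival score 0.0."""
    return any(o.priority == "P1" and o.survival_score == 0.0 for o in outcomes)


# An agent that resolves only the P2 fire and ignores the P1 cardiac arrest.
p2_only = [IncidentOutcome("P2", 1.0), IncidentOutcome("P1", 0.0)]
# An agent that saves the P1 incident and resolves the fire late.
p1_first = [IncidentOutcome("P1", 0.8), IncidentOutcome("P2", 0.3)]

print(safety_gate_triggered(p2_only), safety_gate_triggered(p1_first))
```

Since the check is driven by incidents that exist rather than by actions the agent took, ignoring a P1 incident is itself enough to trip the gate.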

### 3. Coverage Score Farming via Staging
**Vector:** Agent repeatedly stages all units in one district to maximize coverage
score for that district while ignoring active incidents.
**Mitigation:** Coverage score is computed across ALL districts simultaneously.
Concentrating units in one district reduces coverage elsewhere, and staged units
cannot respond to incidents without an explicit dispatch action, allowing incident
survival clocks to expire and triggering escalation penalties.

### 4. Phraseology Score Inflation via Notes Stuffing
**Vector:** Agent fills the notes field with every possible dispatch phrase to
maximize token overlap with canonical phrases.
**Mitigation:** PhraseologyJudge uses token overlap normalized by notes length.
Stuffing long notes with irrelevant text reduces precision, keeping the score low.
Only notes that match the specific action type and incident type score highly.
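
Length-normalized overlap behaves like precision: every extra irrelevant token lowers the score. The function below is a minimal sketch under that reading and is not the actual PhraseologyJudge; the canonical phrase and tokenization by whitespace are assumptions.

```python
def phraseology_precision(notes: str, canonical: str) -> float:
    """Share of note tokens that also appear in the canonical phrase."""
    note_tokens = notes.lower().split()
    if not note_tokens:
        return 0.0  # saying nothing earns nothing
    canonical_tokens = set(canonical.lower().split())
    hits = sum(1 for token in note_tokens if token in canonical_tokens)
    return hits / len(note_tokens)


# Hypothetical canonical phrase for a MEDIC dispatch to a cardiac arrest.
canonical = "medic 1 respond to cardiac arrest"

concise = "Medic 1 respond to cardiac arrest"
stuffed = concise + " engine patrol fire shooting stage cancel mutual aid standby copy"

print(phraseology_precision(concise, canonical))
print(phraseology_precision(stuffed, canonical))
```

The stuffed note contains the same six matching tokens as the concise one, but they are diluted by ten unrelated tokens, so its score drops well below the concise note's.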

### 5. Determinism Exploitation
**Vector:** Agent memorizes the exact incident sequence (seed=42) and hardcodes
optimal actions rather than learning dispatch reasoning.
**Mitigation:** This is intentional for reproducibility. However, the wave spawn
system introduces timing-dependent incident locations with small perturbations,
meaning hardcoded action sequences fail when unit positions vary. The environment
is designed for evaluation, not training-time generalization.

## Conclusion
No reward farming exploit was found that allows an agent to score >0.6 without
genuinely resolving Priority-1 incidents with correct unit types within survival
clock windows.
docs/reward_design.md
CHANGED
@@ -31,9 +31,13 @@ This means even a weak agent that dispatches randomly receives informative gradi
 
 ## Difficulty Gradient
 
-| Task | Random Score | Design Intent |
-|---|---|---|
-| single_incident |
-| multi_incident |
-| mass_casualty | ~0.
-| shift_surge | ~0.
+| Task | Random Score | LLM Expected | Design Intent |
+|---|---|---|---|
+| single_incident | 0.20 | 0.55–0.75 | One decision, one unit - tests basic triage |
+| multi_incident | 0.31 | 0.40–0.60 | Competing P1s - tests priority ordering |
+| mass_casualty | ~0.28 | 0.30–0.50 | Surprise waves - tests adaptability |
+| shift_surge | ~0.25 | 0.25–0.40 | Resource scarcity - tests planning under constraint |
+
+The gap between random and LLM scores is the signal this benchmark measures.
+A model that scores 0.70 on single_incident but 0.25 on shift_surge is demonstrating
+exactly the capability boundary the environment is designed to expose.
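
The gap described in the added lines can be computed directly from the table. Using the midpoint of each expected LLM range is an assumption made here for illustration, and the `~` values are treated as exact.

```python
# (random score, (expected LLM low, expected LLM high)) copied from the table.
DIFFICULTY_TABLE = {
    "single_incident": (0.20, (0.55, 0.75)),
    "multi_incident": (0.31, (0.40, 0.60)),
    "mass_casualty": (0.28, (0.30, 0.50)),
    "shift_surge": (0.25, (0.25, 0.40)),
}


def midpoint_gap(task: str) -> float:
    """Expected-LLM midpoint minus the random baseline for one task."""
    random_score, (low, high) = DIFFICULTY_TABLE[task]
    return round((low + high) / 2 - random_score, 3)


for task in DIFFICULTY_TABLE:
    print(f"{task}: {midpoint_gap(task):+.3f}")
```

Under this reading the gap shrinks steadily from `single_incident` to `shift_surge`, which is the difficulty gradient the table is meant to show.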
docs/tasks.md
ADDED
@@ -0,0 +1,72 @@
# Task Reference - 911 Dispatch Supervisor

## Task 1: `single_incident` - Easy

**Objective:** Dispatch the correct unit to a single cardiac arrest and resolve it
before the survival clock expires.

**Initial State:** 1 incident (cardiac arrest, P1), 3 units available (1 MEDIC, 1 ENGINE, 1 PATROL)
**Max Steps:** 20 | **Survival Clock:** 600s

**What a good agent does:** Immediately dispatches MEDIC to the cardiac arrest.
Does not dispatch ENGINE or PATROL (triage mismatch penalty).

**What a bad agent does:** Dispatches ENGINE (wrong unit type), wastes steps,
patient survival clock expires → Safety Gate → score capped at 0.2.

**Scoring:** 50% resolution + 30% correct unit type + 20% response speed

---

## Task 2: `multi_incident` - Medium

**Objective:** Triage 3 simultaneous incidents with competing priorities.

**Initial State:** 3 incidents (structure fire P2, cardiac arrest P1, shooting P1),
6 units available
**Max Steps:** 40

**What a good agent does:** Immediately dispatches MEDIC to cardiac arrest and
PATROL to shooting (both P1), then dispatches ENGINE to structure fire (P2).

**What a bad agent does:** Dispatches to the fire first (visible/dramatic but P2),
leaving P1 incidents unattended → Safety Gate.

**Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty
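
The "good agent" ordering for this task amounts to sorting by priority and matching unit type. The sketch below assumes a fixed incident-to-unit mapping taken from the description above; the class, identifiers, and the three-unit roster are hypothetical.

```python
from dataclasses import dataclass

# Unit type expected for each incident kind, as described for this task.
REQUIRED_UNIT = {
    "cardiac_arrest": "MEDIC",
    "shooting": "PATROL",
    "structure_fire": "ENGINE",
}


@dataclass(frozen=True)
class Incident:
    incident_id: str
    kind: str
    priority: int  # 1 means P1 (most urgent)


def plan_dispatches(incidents: list[Incident], units: dict[str, str]) -> list[tuple[str, str]]:
    """Assign matching units to incidents, most urgent first."""
    free = dict(units)  # unit_id -> unit type; copied so the caller's dict is untouched
    plan = []
    for incident in sorted(incidents, key=lambda i: i.priority):
        wanted = REQUIRED_UNIT[incident.kind]
        unit_id = next((u for u, kind in free.items() if kind == wanted), None)
        if unit_id is not None:
            plan.append((unit_id, incident.incident_id))
            del free[unit_id]
    return plan


board = [
    Incident("INC-FIRE", "structure_fire", 2),
    Incident("INC-CARDIAC", "cardiac_arrest", 1),
    Incident("INC-SHOOTING", "shooting", 1),
]
roster = {"MEDIC-1": "MEDIC", "ENGINE-1": "ENGINE", "PATROL-1": "PATROL"}

for unit_id, incident_id in plan_dispatches(board, roster):
    print(unit_id, "->", incident_id)
```

Although the fire is listed first on the board, both P1 incidents are served before it, which is the ordering the Safety Gate rewards.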

---

## Task 3: `mass_casualty` - Hard

**Objective:** Manage a building collapse with surprise incident waves.

**Initial State:** 1 incident (building collapse P1, survival 480s), 7 units
**Max Steps:** 60
**Wave spawns:** Step 5 → structure fire; Step 12 → 2× cardiac arrests

**What a good agent does:** Responds to building collapse immediately, pre-stages
units for anticipated waves, adapts when cardiac arrests spawn at step 12.

**What a bad agent does:** Commits all units to building collapse, has no
available units when cardiac arrests spawn → multiple P1 failures → Safety Gate.

**Scoring:** 60% P1 survival + 30% mean step reward − failure penalty
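
Fixed-step wave spawning can be sketched as a lookup keyed by step number. The schedule below copies the steps listed for this task; the function names and the 1-indexed step counter are assumptions, and the real fixture lives in `src/tasks/registry.py`.

```python
# Step number -> incident kinds spawned at that step (from the task description).
WAVE_SCHEDULE = {
    5: ["structure_fire"],
    12: ["cardiac_arrest", "cardiac_arrest"],
}


def incidents_spawned_at(step: int) -> list[str]:
    """Incidents added to the board at a given step; empty for most steps."""
    return list(WAVE_SCHEDULE.get(step, []))


def simulate_board(max_steps: int = 60) -> list[str]:
    """Replay the schedule; the agent never sees WAVE_SCHEDULE itself."""
    board = ["building_collapse"]
    for step in range(1, max_steps + 1):
        board.extend(incidents_spawned_at(step))
    return board


print(simulate_board())
```

Because the schedule is plain data, every run produces the same board, while the agent still has to react to each wave as it appears.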

---

## Task 4: `shift_surge` - Hard

**Objective:** Maintain coverage as units go out of service mid-shift.

**Initial State:** 5 units, 0 incidents (board starts empty)
**Max Steps:** 60 | **Wave spawn:** Every 8 steps | **Survival clock:** 720s
**OOS events:** 3 units go OUT_OF_SERVICE by step 5

**What a good agent does:** Anticipates resource scarcity, requests mutual aid
early, stages remaining units strategically, prioritizes P1 incidents as board fills.

**What a bad agent does:** Dispatches all units freely in early steps, has no
coverage when OOS events hit and new incidents spawn simultaneously.

**Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog +
10% step reward − 25% escalation penalty
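
The `shift_surge` scoring line reads as a weighted sum with a subtracted penalty. The sketch below uses the listed weights; the component names, the assumption that every input lies in [0, 1], and the final clamp to [0, 1] are not stated in the source.

```python
# Weights copied from the scoring line; the keys are hypothetical names.
WEIGHTS = {
    "resolution": 0.35,
    "p1_survival": 0.25,
    "coverage": 0.15,
    "backlog": 0.15,
    "step_reward": 0.10,
}
ESCALATION_PENALTY_WEIGHT = 0.25


def shift_surge_score(components: dict[str, float], escalation_rate: float) -> float:
    """Weighted sum of components minus the escalation penalty, clamped to [0, 1]."""
    base = sum(weight * components[name] for name, weight in WEIGHTS.items())
    raw = base - ESCALATION_PENALTY_WEIGHT * escalation_rate
    return max(0.0, min(1.0, raw))  # clamping is an assumption


perfect = {name: 1.0 for name in WEIGHTS}
nothing = {name: 0.0 for name in WEIGHTS}

print(shift_surge_score(perfect, escalation_rate=0.0))
print(shift_surge_score(perfect, escalation_rate=1.0))
print(shift_surge_score(nothing, escalation_rate=1.0))
```

The five positive weights sum to 100%, so the penalty is the only term that can pull a perfect run below the top score.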
src/models.py
CHANGED
@@ -80,6 +80,10 @@ class Observation(BaseModel):
     issues: list[str] = Field(default_factory=list)
     reward_breakdown: dict[str, float] | None = None
     phraseology_score: float = 0.0
+    active_p1_count: int = 0
+    units_available: int = 0
+    step_count: int = 0
+    episode_done: bool = False
 
 
 class UnitState(BaseModel):
src/openenv_environment.py
CHANGED
@@ -64,12 +64,28 @@ class OpenEnvEnvironment:
         self._state.metadata["episode_score"] = episode_score
 
         done = self._machine.is_terminal(state)
+
+        active_p1_count = sum(
+            1
+            for i in state.incidents.values()
+            if i.severity.value == "PRIORITY_1" and i.status.value not in {"RESOLVED", "ESCALATED"}
+        )
+        units_available = sum(1 for u in state.units.values() if u.status.value == "AVAILABLE")
 
         phraseology = 0.0
         if obs.reward_breakdown:
             phraseology = obs.reward_breakdown.get("protocol", 0.0)
-
-        obs = obs.model_copy(
+
+        obs = obs.model_copy(
+            update={
+                "score": episode_score,
+                "phraseology_score": phraseology,
+                "active_p1_count": active_p1_count,
+                "units_available": units_available,
+                "step_count": state.step_count,
+                "episode_done": done,
+            }
+        )
         self._last_observation = obs
         return obs, step_reward, done
 
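
The two derived counts added in this hunk can be exercised in isolation. The sketch below reuses the same filter expressions on stand-in dataclasses and enums; those types and the `summarize` helper are hypothetical, since the real models are Pydantic classes in `src/models.py`.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    PRIORITY_1 = "PRIORITY_1"
    PRIORITY_2 = "PRIORITY_2"


class IncidentStatus(Enum):
    PENDING = "PENDING"
    RESPONDING = "RESPONDING"
    RESOLVED = "RESOLVED"
    ESCALATED = "ESCALATED"


class UnitStatus(Enum):
    AVAILABLE = "AVAILABLE"
    DISPATCHED = "DISPATCHED"


@dataclass
class Incident:
    severity: Severity
    status: IncidentStatus


@dataclass
class Unit:
    status: UnitStatus


def summarize(incidents: dict[str, Incident], units: dict[str, Unit]) -> dict[str, int]:
    """Compute the summary fields the same way the hunk above does."""
    active_p1_count = sum(
        1
        for i in incidents.values()
        if i.severity.value == "PRIORITY_1" and i.status.value not in {"RESOLVED", "ESCALATED"}
    )
    units_available = sum(1 for u in units.values() if u.status.value == "AVAILABLE")
    return {"active_p1_count": active_p1_count, "units_available": units_available}


incidents = {
    "a": Incident(Severity.PRIORITY_1, IncidentStatus.PENDING),
    "b": Incident(Severity.PRIORITY_1, IncidentStatus.RESOLVED),
    "c": Incident(Severity.PRIORITY_2, IncidentStatus.RESPONDING),
}
units = {"m1": Unit(UnitStatus.AVAILABLE), "e1": Unit(UnitStatus.DISPATCHED)}

print(summarize(incidents, units))
```

Resolved and escalated P1 incidents are excluded from `active_p1_count`, so the field reflects only incidents the agent can still act on.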