---
title: 911 Dispatch Supervisor
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 7860
tags:
  - openenv
pinned: false
---
# 🚨 911 Dispatch Supervisor

A city-wide emergency dispatch RL environment: train and evaluate LLM agents that manage simultaneous incidents by dispatching police, fire, and EMS units across a city grid under realistic resource constraints.
## Why This Matters

911 dispatch centers in the United States handle over 240 million calls per year. Every dispatcher decision (which unit to send, in what order, with what priority) directly determines survival outcomes. A 90-second delay in dispatching a MEDIC to a cardiac arrest drops survival probability by roughly 10%.

The 911 Dispatch Supervisor is the first open RL benchmark for training and evaluating AI agents on emergency dispatch decisions. It models the exact tradeoffs real dispatchers face, all simultaneously: triage under uncertainty, multi-unit resource allocation, geographic coverage, and protocol compliance.

This fills a direct gap for researchers building AI copilots for public safety systems, and provides immediate evaluation value for any LLM claiming real-world decision-making capability.
## Overview

At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate under time pressure, limited resources, and competing priorities across a 100×100 city grid.

This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that:

- Requires triage: prioritizing life-threatening incidents over property damage
- Demands coverage awareness: keeping geographic zones protected
- Rewards correct unit-type matching: sending a MEDIC vs. an ENGINE
- Punishes delays that cause Priority-1 incidents to escalate
- Scores dispatch phraseology: realistic radio communication language
## Environment Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    OpenEnv Interface                    │
│             reset() · step(action) · state()            │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────┼─────────────────────────────────┐
│                  DispatchStateMachine                   │
│ • Validates actions via DispatchProtocolValidator       │
│ • Moves units toward incidents (Manhattan physics)      │
│ • Advances incident status: PENDING → RESPONDING →      │
│   ON_SCENE → RESOLVED (or ESCALATED if timeout)         │
│ • Spawns incident waves at configured step offsets      │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────┼─────────────────────────────────┐
│                    RewardCalculator                     │
│ • response_time (30%) · triage (25%) · survival (25%)   │
│ • coverage (12%) · protocol (8%)                        │
│ • Safety gate: P1 failure → score capped at 0.2         │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────┼─────────────────────────────────┐
│             Task-Specific Episode Graders               │
│ single_incident · multi_incident · mass_casualty ·      │
│ shift_surge                                             │
└─────────────────────────────────────────────────────────┘
```
## Action Space

Actions are structured Pydantic models (`src.models.Action`); no free-text parsing is required.

| Field | Type | Description |
|---|---|---|
| `action_type` | `DispatchAction` | One of: DISPATCH, CANCEL, REASSIGN, STAGE, MUTUAL_AID, UPGRADE, DOWNGRADE |
| `unit_id` | `str` | Unit identifier, e.g. `MED-1`, `ENG-2` |
| `incident_id` | `str` | Incident identifier, e.g. `INC-001` |
| `notes` | `str \| None` | Optional phraseology text for protocol scoring bonus |
| `priority_override` | `IncidentSeverity \| None` | Required for UPGRADE/DOWNGRADE actions |
### Action Types

| Action | Description | Protocol Rule |
|---|---|---|
| `DISPATCH` | Send an available unit to an incident | Unit must be AVAILABLE; incident must not be RESOLVED |
| `CANCEL` | Release a unit from its current assignment | Unit must be assigned to the specified incident |
| `REASSIGN` | Redirect an assigned unit to a different incident | Unit must be DISPATCHED, ON_SCENE, or TRANSPORTING |
| `STAGE` | Pre-position a unit near an incident without committing | Unit must be AVAILABLE; incident must be PENDING |
| `MUTUAL_AID` | Request an external unit of a given type | Only allowed when all local units of that type are busy |
| `UPGRADE` | Increase incident severity | New severity must be strictly higher than current |
| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |
### Dispatch Phraseology (bonus scoring)

The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to an 8% bonus on their protocol score.

| Action | Example `notes` value |
|---|---|
| Dispatch MEDIC to cardiac | "Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes" |
| Dispatch ENGINE to fire | "Engine 2 responding to structure fire, Code 3, all units advised" |
| Mutual aid request | "Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72" |
| Stage unit | "Engine 1 staging at District 3 perimeter, awaiting scene clear" |
## Observation Space

`src.models.Observation`

| Field | Type | Description |
|---|---|---|
| `result` | `str` | Human-readable result of the last action |
| `score` | `float` | Episode score in [0.0, 1.0] (task-level grade) |
| `protocol_ok` | `bool` | Whether the action passed protocol validation |
| `issues` | `list[str]` | Warnings or error codes from the validator |
| `reward_breakdown` | `dict[str, float] \| None` | Per-component reward scores for dashboard display |
### Full State (`src.models.State`)

| Field | Type | Description |
|---|---|---|
| `units` | `dict[str, UnitState]` | All units with type, status, location, ETA |
| `incidents` | `dict[str, IncidentState]` | All incidents with type, severity, status, assigned units |
| `episode_id` | `str` | Unique episode identifier |
| `step_count` | `int` | Current step number |
| `task_id` | `str` | Active task identifier |
| `city_time` | `float` | Simulated city clock in seconds (30 s per step) |
| `metadata` | `dict` | Schema info, districts, seeds, wave configs, bookkeeping |
### Unit Status Transitions

```
AVAILABLE → DISPATCHED → ON_SCENE → AVAILABLE
    ↓
OUT_OF_SERVICE (shift_surge only)
```

### Incident Status Transitions

```
PENDING → RESPONDING → ON_SCENE → RESOLVED
   ↓                      ↓
ESCALATED             ESCALATED (survival clock expires)
```
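The legal status moves can be written out as a small transition table. This is an illustrative sketch: the authoritative state machine lives in `src/state_machine.py`, so the exact set of escalation edges here is an assumption (any unresolved status is treated as able to time out into ESCALATED).

```python
# Illustrative incident transition table; exact edges are defined in
# src/state_machine.py, and ESCALATED edges here are an assumption.
INCIDENT_TRANSITIONS: dict[str, set[str]] = {
    "PENDING": {"RESPONDING", "ESCALATED"},
    "RESPONDING": {"ON_SCENE", "ESCALATED"},
    "ON_SCENE": {"RESOLVED", "ESCALATED"},
    "RESOLVED": set(),    # terminal
    "ESCALATED": set(),   # terminal
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the incident status change is legal in this sketch."""
    return target in INCIDENT_TRANSITIONS.get(current, set())
```

A protocol validator built this way rejects, for example, resolving an incident that no unit has reached yet.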
## Reward Function

The step-level reward is a weighted combination of five components:

| Component | Weight | Description |
|---|---|---|
| `response_time` | 30% | How quickly dispatched units reach incidents relative to severity benchmarks (P1: 240 s, P2: 480 s, P3: 900 s) |
| `triage` | 25% | Whether the dispatched unit type matches incident requirements (e.g., MEDIC for CARDIAC_ARREST) |
| `survival` | 25% | Fraction of Priority-1 incidents resolved before the survival clock expires |
| `coverage` | 12% | Geographic distribution of available units across city districts |
| `protocol` | 8% | Action legality plus optional phraseology/readback quality via `Action.notes` |

> ⚠️ **Safety Gate:** If any Priority-1 incident (cardiac arrest, shooting, building collapse) results in a zero survival score, the entire episode reward is hard-capped at 0.2 regardless of other performance. This forces agents to treat life-threatening incidents as non-negotiable, exactly as real dispatch protocol requires.

Non-DISPATCH actions receive a neutral 0.5 for `response_time` and `triage`, allowing agents to maintain coverage without penalty.
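The weighting and the safety gate can be sketched in a few lines. This is a documentation-level illustration of the table above, not the canonical code path (which lives in `src/rewards.py`); the function signature is hypothetical.

```python
# Illustrative combination of the five reward components and the P1 safety
# gate; the canonical implementation is src/rewards.py. Weights are taken
# directly from the table above and sum to 1.0.
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}

def episode_reward(components: dict[str, float], p1_survival: float) -> float:
    score = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    # Safety gate: a completely failed Priority-1 incident hard-caps the
    # episode at 0.2 no matter how good the other components are.
    if p1_survival == 0.0:
        score = min(score, 0.2)
    return score

# Perfect component scores but a lost P1 still cap the episode at 0.2:
episode_reward({k: 1.0 for k in WEIGHTS}, p1_survival=0.0)  # → 0.2
```

The gate makes the reward deliberately non-linear: an agent cannot buy back a lost Priority-1 incident with good coverage or phraseology.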
## Tasks

### Task Difficulty Overview

| Task | Difficulty | Max Steps | Key Challenge |
|---|---|---|---|
| `single_incident` | 🟢 Easy | 20 | Dispatch the right unit type quickly |
| `multi_incident` | 🟡 Medium | 40 | Triage 3 simultaneous incidents, protect P1s |
| `mass_casualty` | 🔴 Hard | 60 | Manage a wave-based surge with limited resources |
| `shift_surge` | 🔴 Hard | 60 | Adapt as units fail and incidents stream continuously |
### 🟢 Task 1: `single_incident` – Basic Dispatch (Easy)

**Scenario:** One active incident (CARDIAC_ARREST, Priority-1) in a small city. A MEDIC, an ENGINE, and a PATROL are all available.

**Objective:** Dispatch the correct unit type (MEDIC) to the incident as fast as possible.

**Grader Logic:**

```
score = 0.0
if incident RESOLVED:          score += 0.50
if MEDIC dispatched correctly: score += 0.30
if resolved within 10 steps:   score += 0.20
```

**Why it's easy:** One incident, one correct action, small state space.

**What a good agent does:** Immediately dispatches MED-1 → INC-001.

**Scoring:** 50% resolution + 30% correct unit type + 20% response speed.
### 🟡 Task 2: `multi_incident` – Simultaneous Triage (Medium)

**Scenario:** Three concurrent incidents at episode start (a structure fire (P2), a cardiac arrest (P1), and a shooting (P1)) with 6 available units.

**Objective:** Respond to all incidents with the right unit types, prioritizing P1s.

**Grader Logic:**

```
score = 0.5 × p1_resolution_rate
      + 0.3 × overall_resolution_rate
      - 0.2 × escalation_penalty
```

**Why it's medium:** Multiple incidents compete for units; a wrong-type dispatch wastes coverage; P1s must be addressed before the P2.

**What a good agent does:** Immediately dispatches a MEDIC to the cardiac arrest and a PATROL to the shooting, then handles the fire with ENGINE/LADDER.

**Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty.
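The multi_incident formula can be made concrete as a hedged sketch. The rate and penalty definitions (counts divided by totals) are assumptions here; the real grader is `src/tasks/multi_incident.py`, and clamping to [0, 1] is assumed.

```python
# Hedged sketch of the multi_incident episode grade. The rate definitions
# (resolved/total, escalated/total) are illustrative assumptions; the
# authoritative grader is src/tasks/multi_incident.py.
def grade_multi_incident(p1_resolved: int, p1_total: int,
                         resolved: int, total: int,
                         escalated: int) -> float:
    score = (0.5 * (p1_resolved / p1_total)      # protect Priority-1s first
             + 0.3 * (resolved / total)          # overall throughput
             - 0.2 * (escalated / total))        # escalations are penalized
    return max(0.0, min(1.0, score))

# Everything resolved, nothing escalated: full marks.
grade_multi_incident(p1_resolved=2, p1_total=2, resolved=3, total=3, escalated=0)  # → 0.8
```

Note how the weights cap an otherwise perfect episode at 0.8: the remaining 0.2 is only reachable by avoiding the escalation penalty entirely, which a perfect episode already does, so 0.8 is in fact the maximum of this sketch before clamping.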
### 🔴 Task 3: `mass_casualty` – Wave-Based Surge (Hard)

**Scenario:** One critical incident (BUILDING_COLLAPSE, P1) at step 0. New waves arrive at step 5 (structure fire) and step 12 (two simultaneous cardiac arrests).

**Objective:** Maximize P1 survival across all waves despite resource conflicts.

**Grader Logic:**

```
score = 0.6 × p1_survival_rate
      + 0.3 × mean_step_reward
      - failure_penalty
```

**Why it's hard:** Resources are exhausted when waves arrive. Agents must decide whether to reassign units mid-scene or request mutual aid (at a 120 s ETA penalty). Mutual aid is only legal when local units of the required type are fully committed.

**What a good agent does:** Dispatches immediately to the initial collapse, stages additional units near expected wave-arrival zones, and requests mutual aid for later waves.

**Scoring:** 60% P1 survival + 30% mean step reward − failure penalty if the building collapse receives no response.
### 🔴 Task 4: `shift_surge` – Long-Horizon Degradation (Hard)

**Scenario:** 5 units start available, but 3 go OUT_OF_SERVICE by step 5. Incidents arrive in waves every 8 steps throughout the 60-step episode.

**Objective:** Maintain city-wide throughput and P1 survival despite progressive resource degradation.

**Grader Logic:**

```
score = 0.35 × resolution_ratio
      + 0.25 × p1_survival
      + 0.15 × coverage
      + 0.15 × (1 - backlog_ratio)
      + 0.10 × mean_reward
      - 0.25 × escalation_ratio
```

**Why it's hard:** There is no single optimal strategy; agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.

**Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty.
## Unit Types

| Unit | Code | Speed | Primary Use |
|---|---|---|---|
| Engine | `ENGINE` | 0.8 bl/s | Structure fires, hazmat support |
| Ladder | `LADDER` | 0.6 bl/s | Multi-story fires, rescues |
| Medic | `MEDIC` | 1.0 bl/s | Medical emergencies, trauma |
| Patrol | `PATROL` | 1.2 bl/s | Shootings, MVAs, crowd control |
| Hazmat | `HAZMAT` | 0.5 bl/s | Chemical/biological spills |
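Units move under Manhattan physics on the 100×100 grid at the block-per-second speeds above, so a rough ETA is just Manhattan distance over speed. This back-of-envelope helper is illustrative; the real movement and ETA logic lives in `src/physics.py`.

```python
# Back-of-envelope ETA on the 100x100 grid: Manhattan distance divided by
# the unit's speed in blocks per second (speeds from the table above).
# Illustrative only; the authoritative helpers are in src/physics.py.
SPEEDS = {"ENGINE": 0.8, "LADDER": 0.6, "MEDIC": 1.0, "PATROL": 1.2, "HAZMAT": 0.5}

def eta_seconds(unit_type: str,
                unit_xy: tuple[int, int],
                incident_xy: tuple[int, int]) -> float:
    blocks = abs(unit_xy[0] - incident_xy[0]) + abs(unit_xy[1] - incident_xy[1])
    return blocks / SPEEDS[unit_type]

# A MEDIC 120 blocks away arrives in 120 s, inside the 240 s P1 benchmark:
eta_seconds("MEDIC", (10, 20), (70, 80))  # → 120.0
```

This is also why unit speed matters for triage: the same 120-block run takes a HAZMAT unit 240 s, exactly at the P1 response-time benchmark.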
## Incident Types

| Incident | Recommended Units | Default Severity |
|---|---|---|
| `CARDIAC_ARREST` | MEDIC | P1 |
| `STRUCTURE_FIRE` | ENGINE × 2, LADDER | P2 |
| `SHOOTING` | MEDIC, PATROL × 2 | P1 |
| `MULTI_VEHICLE_ACCIDENT` | MEDIC, PATROL | P2 |
| `BUILDING_COLLAPSE` | ENGINE, LADDER, MEDIC × 2 | P1 |
| `HAZMAT_SPILL` | HAZMAT, ENGINE | P2 |
| `OVERDOSE` | MEDIC | P2 |
| `MISSING_PERSON` | PATROL | P3 |
## OpenEnv Interface

```python
import asyncio

from src.openenv_environment import OpenEnvEnvironment
from src.models import Action, DispatchAction

async def main():
    env = OpenEnvEnvironment(task_id="multi_incident", seed=42)

    # Reset to initial state
    obs = await env.reset()
    print(obs.result)  # "dispatch center online"

    # Get legal actions (protocol-validated)
    legal = env.legal_actions()

    # Take a step
    action = legal[0]
    obs, reward, done = await env.step(action)
    print(f"reward={reward:.3f}, done={done}, protocol_ok={obs.protocol_ok}")

    # Inspect full state
    state = env.state()
    print(f"step={state.step_count}, city_time={state.city_time}s")

asyncio.run(main())
```
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check; returns `{"status": "healthy"}` |
| `/reset` | POST | Reset environment; body: `{"task_id": "...", "seed": 42}` (both optional) |
| `/step` | POST | Execute an action; body: `{"action": {...}}` |
| `/state` | GET | Current full environment state |
| `/tasks` | GET | List all available tasks with metadata |
| `/dashboard/state` | GET | Extended state for the live HTML dashboard |
| `/schema` | GET | JSON schemas for Action, Observation, State |
| `/metadata` | GET | Environment name, version, description |
## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Run the demo (non-interactive, no LLM required)
python demo.py

# Start the API server
python -m src.server.app

# Run the random-agent baseline (no API key required)
USE_RANDOM=true python inference.py

# Run an LLM agent
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
HF_TOKEN=your_token \
python inference.py

# Run the full test suite
pytest tests/ -v
```
## Docker

### Build & Run

```bash
# Build the image
docker build -t citywide-dispatch-supervisor .

# Run on port 7860 (required for HF Spaces)
docker run -p 7860:7860 citywide-dispatch-supervisor

# Health check
curl http://localhost:7860/health

# Reset to a specific task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "multi_incident", "seed": 42}'
```
## Hugging Face Spaces Deployment

This repository is deployed as a Docker-based HF Space.

- Create a new HF Space and select **Docker** as the SDK
- Push this repository to the Space
- The server reads `PORT` from the environment (HF sets `PORT=7860`)
- Once running, the following endpoints are publicly available: `GET /health`, `POST /reset`, `POST /step`, `GET /state`

Validate your deployment with the prevalidation script:

```bash
bash samplematerial/prevalidation.sh https://your-space.hf.space .
```
## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.1-8B-Instruct` |
| `HF_TOKEN` | Hugging Face API key | (none) |
| `USE_RANDOM` | Set `true` for the deterministic random baseline | `false` |
| `PORT` | Server port | `7860` |
## Baseline Scores

Scores are normalized to [0.0, 1.0] using `sum(rewards) / max_steps`.
Run with `USE_RANDOM=true python inference.py` (seed=42, fully deterministic).

| Task | Difficulty | Max Steps | Random Agent Score |
|---|---|---|---|
| `single_incident` | Easy | 20 | 0.2000 |
| `multi_incident` | Medium | 40 | 0.3117 |
| `mass_casualty` | Hard | 60 | 0.4645 |
| `shift_surge` | Hard | 60 | 0.3183 |

> **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.
### What the scores mean

A random agent scoring 0.20 on the easiest task confirms the environment is not trivially solvable: there is no reward for random dispatching. The gradient from 0.20 to 0.46 across tasks reflects genuinely increasing complexity, not just more steps.

A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score 0.55–0.75 on `single_incident` and 0.30–0.45 on `shift_surge`, demonstrating that the environment meaningfully differentiates agent capability.

LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher than random on the easy and medium tasks by correctly prioritizing P1 incidents and matching unit types.
Run the baseline matrix (random + LLM reruns) and emit a JSON report:

```bash
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
HF_TOKEN=your_token \
python scripts/run_baseline_matrix.py --random-runs 1 --llm-runs 3 --output-json baseline_report.json
```

Windows PowerShell shortcut:

```powershell
$env:HF_TOKEN="your_token"
powershell -ExecutionPolicy Bypass -File scripts/run_nemotron_baseline.ps1 -RandomRuns 1 -LlmRuns 3
```
## Project Structure

```
.
├── src/
│   ├── models.py               # Pydantic typed contracts (Action, Observation, State)
│   ├── protocol.py             # Dispatch protocol validator
│   ├── physics.py              # City-grid movement / ETA helpers
│   ├── city_schema.py          # City topology + unit configuration loader
│   ├── state_machine.py        # Core dispatch state machine
│   ├── rewards.py              # Reward engine + episode graders
│   ├── phraseology.py          # Dispatch phraseology renderer/judge
│   ├── api.py                  # REST API client wrapper
│   ├── grading.py              # Centralized episode grading router
│   ├── benchmark.py            # Benchmark runner (list/run all tasks)
│   ├── openenv_environment.py  # OpenEnv-compatible environment wrapper
│   ├── tasks/
│   │   ├── registry.py         # Task registry + deterministic scenario fixtures
│   │   ├── single_incident.py  # Easy task + grader
│   │   ├── multi_incident.py   # Medium task + grader
│   │   ├── mass_casualty.py    # Hard task + grader
│   │   └── shift_surge.py      # Hard task + grader
│   ├── server/
│   │   ├── app.py              # FastAPI server (reset/step/state endpoints)
│   │   ├── requirements.txt
│   │   └── Dockerfile
│   └── visualizer/
│       └── viewer.py           # Read-only 2D Matplotlib visualizer
├── data/
│   ├── metro_city.json         # Large city schema (default)
│   └── city_small.json         # Small city schema (testing)
├── tests/                      # TDD test suite (~20 test modules)
├── samplematerial/
│   └── prevalidation.sh        # HF Space + Docker validation script
├── demo.py                     # Non-interactive demo (no LLM required)
├── inference.py                # Competition inference script
├── live_dashboard.html         # Browser-based live dashboard
├── validate_local.py           # Local pre-submission validation
├── openenv.yaml                # OpenEnv specification
├── pyproject.toml              # uv project config
├── requirements.txt            # pip dependencies
└── Dockerfile                  # Root Docker build
```
## Live Dashboard

After starting the server and calling `/reset`, open `live_dashboard.html` in a browser:

```bash
# Terminal 1: start the server
python -m src.server.app

# Terminal 2: reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "multi_incident"}'

# Browser: open live_dashboard.html
```

The dashboard polls `/dashboard/state` every 500 ms and renders:

- Unit cards (status, ETA, assignment, location)
- Incident cards (type, severity, status, assigned units)
- City map (2D grid with unit and incident markers)
- Per-step reward component bars
## 2D Visualizer (Programmatic)

```python
import asyncio

from src.openenv_environment import OpenEnvEnvironment
from src.visualizer.viewer import Viewer2D

async def main():
    env = OpenEnvEnvironment(task_id="multi_incident", seed=42)
    await env.reset()
    Viewer2D().render_to_file("frame.png", env.state())
    env.close()

asyncio.run(main())
```
## Determinism

All scenarios are deterministic under a fixed seed:

```python
env1 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
env2 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
# env1 and env2 produce identical episodes
```

Incident positions include small seeded perturbations for realism; the overall episode structure (waves, unit positions, incident types) is fully reproducible.
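The pattern behind this guarantee is worth spelling out: every stochastic element is drawn from a per-episode generator seeded once, never from the global RNG. The function below is a hypothetical illustration of that pattern, not the environment's actual spawn code (which lives under `src/tasks/`).

```python
import random

# Illustrative seeding pattern: a dedicated random.Random(seed) per episode
# means two environments built from the same seed draw identical streams.
# spawn_positions is a hypothetical stand-in for the real fixture logic.
def spawn_positions(seed: int, n: int) -> list[tuple[int, int]]:
    rng = random.Random(seed)  # per-episode generator, never the global RNG
    return [(rng.randrange(100), rng.randrange(100)) for _ in range(n)]

assert spawn_positions(42, 5) == spawn_positions(42, 5)  # same seed, same episode
```

Keeping the generator local also means unrelated code calling `random.random()` elsewhere cannot perturb an episode.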
## Running Tests

```bash
# Full test suite
pytest tests/ -v

# Individual modules
pytest tests/test_state_machine.py -v
pytest tests/test_rewards.py -v
pytest tests/test_openenv_integration.py -v
pytest tests/test_inference.py -v
```
## Pre-Submission Validation

```bash
# Full local validation (tests + inference + Docker + benchmark scores)
python validate_local.py

# OpenEnv spec validation
openenv validate

# HF Space validation (requires a deployed Space)
bash samplematerial/prevalidation.sh https://your-space.hf.space .

# Windows (explicit Git Bash)
"C:/Program Files/Git/bin/bash.exe" samplematerial/prevalidation.sh https://your-space.hf.space .
```
## License

MIT License