911 / README.md
SayedZahur786's picture
Merge branch 'main' of https://github.com/garvitsachdevaa/911-Dispatch-Supervisor
d95e083
metadata
title: 911 Dispatch Supervisor
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 7860
tags:
  - openenv
pinned: false

🚨 911 Dispatch Supervisor

A city-wide emergency dispatch RL environment β€” train and evaluate LLM agents to manage simultaneous incidents by dispatching police, fire, and EMS units across a city grid under realistic resource constraints.

OpenEnv Docker HF Space License: MIT


Why This Matters

911 dispatch centers in the United States handle over 240 million calls per year. Every dispatcher decision β€” which unit to send, in what order, with what priority β€” directly determines survival outcomes. A 90-second delay in dispatching a MEDIC to a cardiac arrest drops survival probability by roughly 10%.

The 911 Dispatch Supervisor is the first open RL benchmark for training and evaluating AI agents on emergency dispatch decisions. It models the exact tradeoffs real dispatchers face: triage under uncertainty, multi-unit resource allocation, geographic coverage, and protocol compliance β€” all simultaneously.

This fills a direct gap for researchers building AI copilots for public safety systems, and provides immediate evaluation value for any LLM claiming real-world decision-making capability.

Overview

At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate β€” under time pressure, limited resources, and competing priorities across a 100Γ—100 city grid.

This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that:

  • Requires triage β€” prioritizing life-threatening incidents over property damage
  • Demands coverage awareness β€” keeping geographic zones protected
  • Rewards correct unit-type matching β€” sending a MEDIC vs. an ENGINE
  • Punishes delays that cause Priority-1 incidents to escalate
  • Scores dispatch phraseology β€” realistic radio communication language

Environment Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   OpenEnv Interface                     β”‚
β”‚         reset() Β· step(action) Β· state()                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              DispatchStateMachine                       β”‚
β”‚  β€’ Validates actions via DispatchProtocolValidator      β”‚
β”‚  β€’ Moves units toward incidents (Manhattan physics)     β”‚
β”‚  β€’ Advances incident status: PENDING β†’ RESPONDING β†’     β”‚
β”‚    ON_SCENE β†’ RESOLVED (or ESCALATED if timeout)        β”‚
β”‚  β€’ Spawns incident waves at configured step offsets     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  RewardCalculator                       β”‚
β”‚  β€’ response_time (30%) Β· triage (25%) Β· survival (25%) β”‚
β”‚  β€’ coverage (12%) Β· protocol (8%)                       β”‚
β”‚  β€’ Safety gate: P1 failure β†’ score capped at 0.2        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            Task-Specific Episode Graders                β”‚
β”‚  single_incident Β· multi_incident Β· mass_casualty Β·     β”‚
β”‚                   shift_surge                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Action Space

Actions are structured Pydantic models β€” no free-text parsing required.

src.models.Action

Field Type Description
action_type DispatchAction One of: DISPATCH, CANCEL, REASSIGN, STAGE, MUTUAL_AID, UPGRADE, DOWNGRADE
unit_id str Unit identifier, e.g. MED-1, ENG-2
incident_id str Incident identifier, e.g. INC-001
notes str | None Optional phraseology text for protocol scoring bonus
priority_override IncidentSeverity | None Required for UPGRADE/DOWNGRADE actions

Action Types

Action Description Protocol Rule
DISPATCH Send an available unit to an incident Unit must be AVAILABLE; incident must not be RESOLVED
CANCEL Release a unit from its current assignment Unit must be assigned to the specified incident
REASSIGN Redirect an assigned unit to a different incident Unit must be DISPATCHED, ON_SCENE, or TRANSPORTING
STAGE Pre-position a unit near an incident without committing Unit must be AVAILABLE; incident must be PENDING
MUTUAL_AID Request external unit of a given type Only allowed when all local units of that type are busy
UPGRADE Increase incident severity New severity must be strictly higher than current
DOWNGRADE Decrease incident severity New severity must be strictly lower than current

Dispatch Phraseology (bonus scoring)

The notes field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to 8% bonus on their protocol score.

Action Example notes value
Dispatch MEDIC to cardiac "Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"
Dispatch ENGINE to fire "Engine 2 responding to structure fire, Code 3, all units advised"
Mutual aid request "Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"
Stage unit "Engine 1 staging at District 3 perimeter, awaiting scene clear"

Observation Space

src.models.Observation

Field Type Description
result str Human-readable result of the last action
score float Episode score in [0.0, 1.0] (task-level grade)
protocol_ok bool Whether the action passed protocol validation
issues list[str] Warnings or error codes from the validator
reward_breakdown dict[str, float] | None Per-component reward scores for dashboard display

Full State (src.models.State)

Field Type Description
units dict[str, UnitState] All units with type, status, location, ETA
incidents dict[str, IncidentState] All incidents with type, severity, status, assigned units
episode_id str Unique episode identifier
step_count int Current step number
task_id str Active task identifier
city_time float Simulated city clock in seconds (30s per step)
metadata dict Schema info, districts, seeds, wave configs, bookkeeping

Unit Status Transitions

AVAILABLE β†’ DISPATCHED β†’ ON_SCENE β†’ AVAILABLE
                ↓
         OUT_OF_SERVICE (shift_surge only)

Incident Status Transitions

PENDING β†’ RESPONDING β†’ ON_SCENE β†’ RESOLVED
   ↓           ↓
ESCALATED   ESCALATED    (survival clock expires)

Reward Function

The step-level reward is a weighted combination of five components:

Component Weight Description
response_time 30% How quickly dispatched units reach incidents relative to severity benchmarks (P1: 240s, P2: 480s, P3: 900s)
triage 25% Whether the dispatched unit type matches incident requirements (e.g., MEDIC for CARDIAC_ARREST)
survival 25% Fraction of Priority-1 incidents resolved before the survival clock expires
coverage 12% Geographic distribution of available units across city districts
protocol 8% Action legality + optional phraseology/readback quality via Action.notes

⚠️ Safety Gate: If any Priority-1 incident (cardiac arrest, shooting, building collapse) results in zero survival score, the entire episode reward is hard-capped at 0.2 regardless of other performance. This forces agents to treat life-threatening incidents as non-negotiable β€” exactly as real dispatch protocol requires.

Non-DISPATCH actions receive neutral 0.5 for response_time and triage, allowing agents to maintain coverage without penalty.


Tasks

Task Difficulty Overview

Task Difficulty Max Steps Key Challenge
single_incident 🟒 Easy 20 Dispatch the right unit type quickly
multi_incident 🟑 Medium 40 Triage 3 simultaneous incidents, protect P1s
mass_casualty πŸ”΄ Hard 60 Manage wave-based surge with limited resources
shift_surge πŸ”΄ Hard 60 Adapt as units fail and incidents stream continuously

🟒 Task 1: single_incident β€” Basic Dispatch (Easy)

Scenario: One active incident (CARDIAC_ARREST, Priority-1) in a small city. A MEDIC, ENGINE, and PATROL are all available.

Objective: Dispatch the correct unit type (MEDIC) to the incident as fast as possible.

Grader Logic:

score = 0.0
if incident RESOLVED:          score += 0.50
if MEDIC dispatched correctly: score += 0.30
if resolved within 10 steps:   score += 0.20

Why it's easy: One incident, one correct action, small state space.

What a good agent does: Immediately dispatches MED-1 β†’ INC-001.

Scoring: 50% resolution + 30% correct unit type used + 20% response speed.


🟑 Task 2: multi_incident β€” Simultaneous Triage (Medium)

Scenario: Three concurrent incidents at episode start β€” a structure fire (P2), a cardiac arrest (P1), and a shooting (P1) β€” with 6 available units.

Objective: Respond to all incidents with the right unit types, prioritizing P1s.

Grader Logic:

score = 0.5 Γ— p1_resolution_rate
      + 0.3 Γ— overall_resolution_rate
      - 0.2 Γ— escalation_penalty

Why it's medium: Multiple incidents compete for units; wrong type dispatch wastes coverage; P1s must be addressed before P2.

What a good agent does: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.

Scoring: 50% P1 resolution + 30% overall resolution βˆ’ 20% escalation penalty.


πŸ”΄ Task 3: mass_casualty β€” Wave-Based Surge (Hard)

Scenario: One critical incident (BUILDING_COLLAPSE, P1) at step 0. New waves arrive at steps 5 (structure fire) and 12 (two simultaneous cardiac arrests).

Objective: Maximize P1 survival across all waves despite resource conflicts.

Grader Logic:

score = 0.6 Γ— p1_survival_rate
      + 0.3 Γ— mean_step_reward
      - failure_penalty

Why it's hard: Resources are exhausted when waves arrive. Agents must decide whether to reassign mid-scene or request mutual aid (at a 120s ETA penalty). Mutual aid is only legal when local units of the required type are fully committed.

What a good agent does: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.

Scoring: 60% P1 survival + 30% mean step reward βˆ’ failure penalty if building collapse unresponded.


πŸ”΄ Task 4: shift_surge β€” Long-Horizon Degradation (Hard)

Scenario: 5 units start available, but 3 go OUT_OF_SERVICE by step 5. Incidents arrive in waves every 8 steps throughout the 60-step episode.

Objective: Maintain city-wide throughput and P1 survival despite progressive resource degradation.

Grader Logic:

score = 0.35 Γ— resolution_ratio
      + 0.25 Γ— p1_survival
      + 0.15 Γ— coverage
      + 0.15 Γ— (1 - backlog_ratio)
      + 0.10 Γ— mean_reward
      - 0.25 Γ— escalation_ratio

Why it's hard: No single optimal strategy β€” agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.

Scoring: 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward βˆ’ 25% escalation penalty.


Unit Types

Unit Code Speed Primary Use
Engine ENGINE 0.8 bl/s Structure fires, hazmat support
Ladder LADDER 0.6 bl/s Multi-story fires, rescues
Medic MEDIC 1.0 bl/s Medical emergencies, trauma
Patrol PATROL 1.2 bl/s Shootings, MVAs, crowd control
Hazmat HAZMAT 0.5 bl/s Chemical/biological spills

Incident Types

Incident Recommended Units Default Severity
CARDIAC_ARREST MEDIC P1
STRUCTURE_FIRE ENGINE Γ— 2, LADDER P2
SHOOTING MEDIC, PATROL Γ— 2 P1
MULTI_VEHICLE_ACCIDENT MEDIC, PATROL P2
BUILDING_COLLAPSE ENGINE, LADDER, MEDIC Γ— 2 P1
HAZMAT_SPILL HAZMAT, ENGINE P2
OVERDOSE MEDIC P2
MISSING_PERSON PATROL P3

OpenEnv Interface

import asyncio
from src.openenv_environment import OpenEnvEnvironment
from src.models import Action, DispatchAction

async def main():
    env = OpenEnvEnvironment(task_id="multi_incident", seed=42)

    # Reset to initial state
    obs = await env.reset()
    print(obs.result)  # "dispatch center online"

    # Get legal actions (protocol-validated)
    legal = env.legal_actions()

    # Take a step
    action = legal[0]
    obs, reward, done = await env.step(action)

    print(f"reward={reward:.3f}, done={done}, protocol_ok={obs.protocol_ok}")

    # Inspect full state
    state = env.state()
    print(f"step={state.step_count}, city_time={state.city_time}s")

asyncio.run(main())

API Endpoints

Endpoint Method Description
/health GET Health check β€” returns {"status": "healthy"}
/reset POST Reset environment; body: {"task_id": "...", "seed": 42} (both optional)
/step POST Execute an action; body: {"action": {...}}
/state GET Current full environment state
/tasks GET List all available tasks with metadata
/dashboard/state GET Extended state for live HTML dashboard
/schema GET JSON schemas for Action, Observation, State
/metadata GET Environment name, version, description

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the demo (non-interactive, no LLM required)
python demo.py

# Start the API server
python -m src.server.app

# Run random agent baseline (no API key required)
USE_RANDOM=true python inference.py

# Run LLM agent
API_BASE_URL=https://router.huggingface.co/v1 \
  MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  HF_TOKEN=your_token \
  python inference.py

# Run full test suite
pytest tests/ -v

Docker

Build & Run

# Build image
docker build -t citywide-dispatch-supervisor .

# Run on port 7860 (required for HF Spaces)
docker run -p 7860:7860 citywide-dispatch-supervisor

# Health check
curl http://localhost:7860/health

# Reset to a specific task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "multi_incident", "seed": 42}'

Hugging Face Spaces Deployment

This repository is deployed as a Docker-based HF Space.

  1. Create a new HF Space β†’ select Docker
  2. Push this repository to the Space
  3. The server reads PORT from the environment (HF sets PORT=7860)
  4. Once running, the following endpoints are publicly available:
    • GET /health
    • POST /reset
    • POST /step
    • GET /state

Validate your deployment with the prevalidation script:

bash samplematerial/prevalidation.sh https://your-space.hf.space .

Environment Variables

Variable Description Default
API_BASE_URL LLM API endpoint https://router.huggingface.co/v1
MODEL_NAME Model identifier meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN HuggingFace API key β€”
USE_RANDOM Set true for deterministic random baseline false
PORT Server port 7860

Baseline Scores

Scores normalized to [0.0, 1.0] using sum(rewards) / max_steps.
Run with USE_RANDOM=true python inference.py (seed=42, fully deterministic).

Task Difficulty Max Steps Random Agent Score
single_incident Easy 20 0.2000
multi_incident Medium 40 0.3117
mass_casualty Hard 60 0.4645
shift_surge Hard 60 0.3183

Note: Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (observation.score). These figures use the canonical competition normalization: sum(step_rewards) / max_steps, clamped to [0.0, 1.0].

What the scores mean

A random agent scoring 0.20 on the easiest task confirms the environment is not trivially solvable β€” there is no reward for random dispatching. The gradient from 0.20 β†’ 0.46 across tasks reflects genuine increasing complexity, not just more steps.

A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score 0.55–0.75 on single_incident and 0.30–0.45 on shift_surge, demonstrating the environment meaningfully differentiates agent capability.

LLM agents (meta-llama/Llama-3.1-8B-Instruct via https://router.huggingface.co/v1) are expected to score meaningfully higher on easy and medium tasks by correctly prioritizing P1 incidents and matching unit types.

Run the baseline matrix (random + LLM reruns) and emit a JSON report:

API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
HF_TOKEN=your_token \
python scripts/run_baseline_matrix.py --random-runs 1 --llm-runs 3 --output-json baseline_report.json

Windows PowerShell shortcut:

$env:HF_TOKEN="your_token"
powershell -ExecutionPolicy Bypass -File scripts/run_nemotron_baseline.ps1 -RandomRuns 1 -LlmRuns 3

Project Structure

.
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ models.py               # Pydantic typed contracts (Action, Observation, State)
β”‚   β”œβ”€β”€ protocol.py             # Dispatch protocol validator
β”‚   β”œβ”€β”€ physics.py              # City-grid movement / ETA helpers
β”‚   β”œβ”€β”€ city_schema.py          # City topology + unit configuration loader
β”‚   β”œβ”€β”€ state_machine.py        # Core dispatch state machine
β”‚   β”œβ”€β”€ rewards.py              # Reward engine + episode graders
β”‚   β”œβ”€β”€ phraseology.py          # Dispatch phraseology renderer/judge
β”‚   β”œβ”€β”€ api.py                  # REST API client wrapper
β”‚   β”œβ”€β”€ grading.py              # Centralized episode grading router
β”‚   β”œβ”€β”€ benchmark.py            # Benchmark runner (list/run all tasks)
β”‚   β”œβ”€β”€ openenv_environment.py  # OpenEnv-compatible environment wrapper
β”‚   β”œβ”€β”€ tasks/
β”‚   β”‚   β”œβ”€β”€ registry.py         # Task registry + deterministic scenario fixtures
β”‚   β”‚   β”œβ”€β”€ single_incident.py  # Easy task + grader
β”‚   β”‚   β”œβ”€β”€ multi_incident.py   # Medium task + grader
β”‚   β”‚   β”œβ”€β”€ mass_casualty.py    # Hard task + grader
β”‚   β”‚   └── shift_surge.py      # Hard task + grader
β”‚   β”œβ”€β”€ server/
β”‚   β”‚   β”œβ”€β”€ app.py              # FastAPI server (reset/step/state endpoints)
β”‚   β”‚   β”œβ”€β”€ requirements.txt
β”‚   β”‚   └── Dockerfile
β”‚   └── visualizer/
β”‚       └── viewer.py           # Read-only 2D Matplotlib visualizer
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ metro_city.json         # Large city schema (default)
β”‚   └── city_small.json         # Small city schema (testing)
β”œβ”€β”€ tests/                      # TDD test suite (~20 test modules)
β”œβ”€β”€ samplematerial/
β”‚   └── prevalidation.sh        # HF Space + Docker validation script
β”œβ”€β”€ demo.py                     # Non-interactive demo (no LLM required)
β”œβ”€β”€ inference.py                # Competition inference script
β”œβ”€β”€ live_dashboard.html         # Browser-based live dashboard
β”œβ”€β”€ validate_local.py           # Local pre-submission validation
β”œβ”€β”€ openenv.yaml                # OpenEnv specification
β”œβ”€β”€ pyproject.toml              # uv project config
β”œβ”€β”€ requirements.txt            # pip dependencies
└── Dockerfile                  # Root Docker build

Live Dashboard

After starting the server and calling /reset, open live_dashboard.html in a browser:

# Terminal 1: start server
python -m src.server.app

# Terminal 2: reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "multi_incident"}'

# Browser: open live_dashboard.html

The dashboard polls /dashboard/state every 500ms and renders:

  • Unit cards (status, ETA, assignment, location)
  • Incident cards (type, severity, status, assigned units)
  • City map (2D grid with unit and incident markers)
  • Per-step reward component bars

2D Visualizer (Programmatic)

import asyncio
from src.openenv_environment import OpenEnvEnvironment
from src.visualizer.viewer import Viewer2D

async def main():
    env = OpenEnvEnvironment(task_id="multi_incident", seed=42)
    await env.reset()
    Viewer2D().render_to_file("frame.png", env.state())
    env.close()

asyncio.run(main())


Determinism

All scenarios are deterministic under a fixed seed:

env1 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
env2 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
# env1 and env2 produce identical episodes

Incident positions include small seeded perturbations for realism; the overall episode structure (waves, unit positions, incident types) is fully reproducible.


Running Tests

# Full test suite
pytest tests/ -v

# Individual modules
pytest tests/test_state_machine.py -v
pytest tests/test_rewards.py -v
pytest tests/test_openenv_integration.py -v
pytest tests/test_inference.py -v

Pre-Submission Validation

# Full local validation (tests + inference + docker + benchmark scores)
python validate_local.py

# OpenEnv spec validation
openenv validate

# HF Space validation (requires deployed space)
bash samplematerial/prevalidation.sh https://your-space.hf.space .

# Windows (explicit Git Bash)
"C:/Program Files/Git/bin/bash.exe" samplematerial/prevalidation.sh https://your-space.hf.space .

License

MIT License