---
title: 911 Dispatch Supervisor
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 7860
tags:
  - openenv
pinned: false
---
# 🚨 911 Dispatch Supervisor

A city-wide emergency dispatch RL environment: train and evaluate LLM agents that manage simultaneous incidents by dispatching police, fire, and EMS units across a city grid under realistic resource constraints.
## Why This Matters

911 dispatch centers in the United States handle over 240 million calls per year. Every dispatcher decision (which unit to send, in what order, with what priority) directly determines survival outcomes. A 90-second delay in dispatching a MEDIC to a cardiac arrest drops survival probability by roughly 10%.

The 911 Dispatch Supervisor is the first open RL benchmark for training and evaluating AI agents on emergency dispatch decisions. It models the exact tradeoffs real dispatchers face, all simultaneously: triage under uncertainty, multi-unit resource allocation, geographic coverage, and protocol compliance.

This fills a direct gap for researchers building AI copilots for public safety systems, and provides immediate evaluation value for any LLM claiming real-world decision-making capability.
## Overview

At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate under time pressure, limited resources, and competing priorities across a 100×100 city grid.

This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that:

- Requires triage: prioritizing life-threatening incidents over property damage
- Demands coverage awareness: keeping geographic zones protected
- Rewards correct unit-type matching: sending a MEDIC vs. an ENGINE
- Punishes delays that cause Priority-1 incidents to escalate
- Scores dispatch phraseology: realistic radio communication language
## Environment Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    OpenEnv Interface                    │
│             reset() · step(action) · state()            │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────┼─────────────────────────────────┐
│                  DispatchStateMachine                   │
│ • Validates actions via DispatchProtocolValidator       │
│ • Moves units toward incidents (Manhattan physics)      │
│ • Advances incident status: PENDING → RESPONDING →      │
│   ON_SCENE → RESOLVED (or ESCALATED if timeout)         │
│ • Spawns incident waves at configured step offsets      │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────┼─────────────────────────────────┐
│                    RewardCalculator                     │
│ • response_time (30%) · triage (25%) · survival (25%)   │
│ • coverage (12%) · protocol (8%)                        │
│ • Safety gate: P1 failure → score capped at 0.2         │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────┼─────────────────────────────────┐
│             Task-Specific Episode Graders               │
│ single_incident · multi_incident · mass_casualty ·      │
│ shift_surge                                             │
└─────────────────────────────────────────────────────────┘
```
## Action Space

Actions are structured Pydantic models (`src.models.Action`); no free-text parsing is required.

| Field | Type | Description |
|---|---|---|
| `action_type` | `DispatchAction` | One of: DISPATCH, CANCEL, REASSIGN, STAGE, MUTUAL_AID, UPGRADE, DOWNGRADE |
| `unit_id` | `str` | Unit identifier, e.g. `MED-1`, `ENG-2` |
| `incident_id` | `str` | Incident identifier, e.g. `INC-001` |
| `notes` | `str \| None` | Optional phraseology text for protocol scoring bonus |
| `priority_override` | `IncidentSeverity \| None` | Required for UPGRADE/DOWNGRADE actions |
### Action Types

| Action | Description | Protocol Rule |
|---|---|---|
| `DISPATCH` | Send an available unit to an incident | Unit must be AVAILABLE; incident must not be RESOLVED |
| `CANCEL` | Release a unit from its current assignment | Unit must be assigned to the specified incident |
| `REASSIGN` | Redirect an assigned unit to a different incident | Unit must be DISPATCHED, ON_SCENE, or TRANSPORTING |
| `STAGE` | Pre-position a unit near an incident without committing | Unit must be AVAILABLE; incident must be PENDING |
| `MUTUAL_AID` | Request an external unit of a given type | Only allowed when all local units of that type are busy |
| `UPGRADE` | Increase incident severity | New severity must be strictly higher than current |
| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |
### Dispatch Phraseology (bonus scoring)

The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to an 8% bonus on their protocol score.

| Action | Example `notes` value |
|---|---|
| Dispatch MEDIC to cardiac | "Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes" |
| Dispatch ENGINE to fire | "Engine 2 responding to structure fire, Code 3, all units advised" |
| Mutual aid request | "Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72" |
| Stage unit | "Engine 1 staging at District 3 perimeter, awaiting scene clear" |
## Observation Space

`src.models.Observation`

| Field | Type | Description |
|---|---|---|
| `result` | `str` | Human-readable result of the last action |
| `score` | `float` | Episode score in [0.0, 1.0] (task-level grade) |
| `protocol_ok` | `bool` | Whether the action passed protocol validation |
| `issues` | `list[str]` | Warnings or error codes from the validator |
| `reward_breakdown` | `dict[str, float] \| None` | Per-component reward scores for dashboard display |
### Full State (`src.models.State`)

| Field | Type | Description |
|---|---|---|
| `units` | `dict[str, UnitState]` | All units with type, status, location, ETA |
| `incidents` | `dict[str, IncidentState]` | All incidents with type, severity, status, assigned units |
| `episode_id` | `str` | Unique episode identifier |
| `step_count` | `int` | Current step number |
| `task_id` | `str` | Active task identifier |
| `city_time` | `float` | Simulated city clock in seconds (30 s per step) |
| `metadata` | `dict` | Schema info, districts, seeds, wave configs, bookkeeping |
### Unit Status Transitions

```
AVAILABLE → DISPATCHED → ON_SCENE → AVAILABLE
    ↓
OUT_OF_SERVICE (shift_surge only)
```

### Incident Status Transitions

```
PENDING → RESPONDING → ON_SCENE → RESOLVED
   ↓                      ↓
ESCALATED             ESCALATED (survival clock expires)
```
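The legal status moves can be written out as a small transition table. This is an illustrative sketch: the authoritative state machine lives in `src/state_machine.py`, so the exact set of escalation edges here is an assumption (any unresolved status is treated as able to time out into ESCALATED).

```python
# Illustrative incident transition table; exact edges are defined in
# src/state_machine.py, and ESCALATED edges here are an assumption.
INCIDENT_TRANSITIONS: dict[str, set[str]] = {
    "PENDING": {"RESPONDING", "ESCALATED"},
    "RESPONDING": {"ON_SCENE", "ESCALATED"},
    "ON_SCENE": {"RESOLVED", "ESCALATED"},
    "RESOLVED": set(),    # terminal
    "ESCALATED": set(),   # terminal
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the incident status change is legal in this sketch."""
    return target in INCIDENT_TRANSITIONS.get(current, set())
```

A protocol validator built this way rejects, for example, resolving an incident that no unit has reached yet.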
## Reward Function

The step-level reward is a weighted combination of five components:

| Component | Weight | Description |
|---|---|---|
| `response_time` | 30% | How quickly dispatched units reach incidents relative to severity benchmarks (P1: 240 s, P2: 480 s, P3: 900 s) |
| `triage` | 25% | Whether the dispatched unit type matches incident requirements (e.g., MEDIC for CARDIAC_ARREST) |
| `survival` | 25% | Fraction of Priority-1 incidents resolved before the survival clock expires |
| `coverage` | 12% | Geographic distribution of available units across city districts |
| `protocol` | 8% | Action legality plus optional phraseology/readback quality via `Action.notes` |

> ⚠️ **Safety Gate:** If any Priority-1 incident (cardiac arrest, shooting, building collapse) results in a zero survival score, the entire episode reward is hard-capped at 0.2 regardless of other performance. This forces agents to treat life-threatening incidents as non-negotiable, exactly as real dispatch protocol requires.

Non-DISPATCH actions receive a neutral 0.5 for `response_time` and `triage`, allowing agents to maintain coverage without penalty.
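The weighting and the safety gate can be sketched in a few lines. This is a documentation-level illustration of the table above, not the canonical code path (which lives in `src/rewards.py`); the function signature is hypothetical.

```python
# Illustrative combination of the five reward components and the P1 safety
# gate; the canonical implementation is src/rewards.py. Weights are taken
# directly from the table above and sum to 1.0.
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}

def episode_reward(components: dict[str, float], p1_survival: float) -> float:
    score = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    # Safety gate: a completely failed Priority-1 incident hard-caps the
    # episode at 0.2 no matter how good the other components are.
    if p1_survival == 0.0:
        score = min(score, 0.2)
    return score

# Perfect component scores but a lost P1 still cap the episode at 0.2:
episode_reward({k: 1.0 for k in WEIGHTS}, p1_survival=0.0)  # → 0.2
```

The gate makes the reward deliberately non-linear: an agent cannot buy back a lost Priority-1 incident with good coverage or phraseology.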
## Tasks

### Task Difficulty Overview

| Task | Difficulty | Max Steps | Key Challenge |
|---|---|---|---|
| `single_incident` | 🟢 Easy | 20 | Dispatch the right unit type quickly |
| `multi_incident` | 🟡 Medium | 40 | Triage 3 simultaneous incidents, protect P1s |
| `mass_casualty` | 🔴 Hard | 60 | Manage a wave-based surge with limited resources |
| `shift_surge` | 🔴 Hard | 60 | Adapt as units fail and incidents stream continuously |
### 🟢 Task 1: `single_incident` – Basic Dispatch (Easy)

**Scenario:** One active incident (CARDIAC_ARREST, Priority-1) in a small city. A MEDIC, an ENGINE, and a PATROL are all available.

**Objective:** Dispatch the correct unit type (MEDIC) to the incident as fast as possible.

**Grader Logic:**

```
score = 0.0
if incident RESOLVED:          score += 0.50
if MEDIC dispatched correctly: score += 0.30
if resolved within 10 steps:   score += 0.20
```

**Why it's easy:** One incident, one correct action, small state space.

**What a good agent does:** Immediately dispatches MED-1 → INC-001.

**Scoring:** 50% resolution + 30% correct unit type + 20% response speed.
### 🟡 Task 2: `multi_incident` – Simultaneous Triage (Medium)

**Scenario:** Three concurrent incidents at episode start (a structure fire (P2), a cardiac arrest (P1), and a shooting (P1)) with 6 available units.

**Objective:** Respond to all incidents with the right unit types, prioritizing P1s.

**Grader Logic:**

```
score = 0.5 × p1_resolution_rate
      + 0.3 × overall_resolution_rate
      - 0.2 × escalation_penalty
```

**Why it's medium:** Multiple incidents compete for units; a wrong-type dispatch wastes coverage; P1s must be addressed before the P2.

**What a good agent does:** Immediately dispatches a MEDIC to the cardiac arrest and a PATROL to the shooting, then handles the fire with ENGINE/LADDER.

**Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty.
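The multi_incident formula can be made concrete as a hedged sketch. The rate and penalty definitions (counts divided by totals) are assumptions here; the real grader is `src/tasks/multi_incident.py`, and clamping to [0, 1] is assumed.

```python
# Hedged sketch of the multi_incident episode grade. The rate definitions
# (resolved/total, escalated/total) are illustrative assumptions; the
# authoritative grader is src/tasks/multi_incident.py.
def grade_multi_incident(p1_resolved: int, p1_total: int,
                         resolved: int, total: int,
                         escalated: int) -> float:
    score = (0.5 * (p1_resolved / p1_total)      # protect Priority-1s first
             + 0.3 * (resolved / total)          # overall throughput
             - 0.2 * (escalated / total))        # escalations are penalized
    return max(0.0, min(1.0, score))

# Everything resolved, nothing escalated: full marks.
grade_multi_incident(p1_resolved=2, p1_total=2, resolved=3, total=3, escalated=0)  # → 0.8
```

Note how the weights cap an otherwise perfect episode at 0.8: the remaining 0.2 is only reachable by avoiding the escalation penalty entirely, which a perfect episode already does, so 0.8 is in fact the maximum of this sketch before clamping.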
### 🔴 Task 3: `mass_casualty` – Wave-Based Surge (Hard)

**Scenario:** One critical incident (BUILDING_COLLAPSE, P1) at step 0. New waves arrive at step 5 (structure fire) and step 12 (two simultaneous cardiac arrests).

**Objective:** Maximize P1 survival across all waves despite resource conflicts.

**Grader Logic:**

```
score = 0.6 × p1_survival_rate
      + 0.3 × mean_step_reward
      - failure_penalty
```

**Why it's hard:** Resources are exhausted when waves arrive. Agents must decide whether to reassign units mid-scene or request mutual aid (at a 120 s ETA penalty). Mutual aid is only legal when local units of the required type are fully committed.

**What a good agent does:** Dispatches immediately to the initial collapse, stages additional units near expected wave-arrival zones, and requests mutual aid for later waves.

**Scoring:** 60% P1 survival + 30% mean step reward − failure penalty if the building collapse receives no response.
### 🔴 Task 4: `shift_surge` – Long-Horizon Degradation (Hard)

**Scenario:** 5 units start available, but 3 go OUT_OF_SERVICE by step 5. Incidents arrive in waves every 8 steps throughout the 60-step episode.

**Objective:** Maintain city-wide throughput and P1 survival despite progressive resource degradation.

**Grader Logic:**

```
score = 0.35 × resolution_ratio
      + 0.25 × p1_survival
      + 0.15 × coverage
      + 0.15 × (1 - backlog_ratio)
      + 0.10 × mean_reward
      - 0.25 × escalation_ratio
```

**Why it's hard:** There is no single optimal strategy; agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.

**Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty.
## Unit Types

| Unit | Code | Speed | Primary Use |
|---|---|---|---|
| Engine | `ENGINE` | 0.8 bl/s | Structure fires, hazmat support |
| Ladder | `LADDER` | 0.6 bl/s | Multi-story fires, rescues |
| Medic | `MEDIC` | 1.0 bl/s | Medical emergencies, trauma |
| Patrol | `PATROL` | 1.2 bl/s | Shootings, MVAs, crowd control |
| Hazmat | `HAZMAT` | 0.5 bl/s | Chemical/biological spills |
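Units move under Manhattan physics on the 100×100 grid at the block-per-second speeds above, so a rough ETA is just Manhattan distance over speed. This back-of-envelope helper is illustrative; the real movement and ETA logic lives in `src/physics.py`.

```python
# Back-of-envelope ETA on the 100x100 grid: Manhattan distance divided by
# the unit's speed in blocks per second (speeds from the table above).
# Illustrative only; the authoritative helpers are in src/physics.py.
SPEEDS = {"ENGINE": 0.8, "LADDER": 0.6, "MEDIC": 1.0, "PATROL": 1.2, "HAZMAT": 0.5}

def eta_seconds(unit_type: str,
                unit_xy: tuple[int, int],
                incident_xy: tuple[int, int]) -> float:
    blocks = abs(unit_xy[0] - incident_xy[0]) + abs(unit_xy[1] - incident_xy[1])
    return blocks / SPEEDS[unit_type]

# A MEDIC 120 blocks away arrives in 120 s, inside the 240 s P1 benchmark:
eta_seconds("MEDIC", (10, 20), (70, 80))  # → 120.0
```

This is also why unit speed matters for triage: the same 120-block run takes a HAZMAT unit 240 s, exactly at the P1 response-time benchmark.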
## Incident Types

| Incident | Recommended Units | Default Severity |
|---|---|---|
| `CARDIAC_ARREST` | MEDIC | P1 |
| `STRUCTURE_FIRE` | ENGINE × 2, LADDER | P2 |
| `SHOOTING` | MEDIC, PATROL × 2 | P1 |
| `MULTI_VEHICLE_ACCIDENT` | MEDIC, PATROL | P2 |
| `BUILDING_COLLAPSE` | ENGINE, LADDER, MEDIC × 2 | P1 |
| `HAZMAT_SPILL` | HAZMAT, ENGINE | P2 |
| `OVERDOSE` | MEDIC | P2 |
| `MISSING_PERSON` | PATROL | P3 |
## OpenEnv Interface

```python
import asyncio

from src.openenv_environment import OpenEnvEnvironment
from src.models import Action, DispatchAction

async def main():
    env = OpenEnvEnvironment(task_id="multi_incident", seed=42)

    # Reset to initial state
    obs = await env.reset()
    print(obs.result)  # "dispatch center online"

    # Get legal actions (protocol-validated)
    legal = env.legal_actions()

    # Take a step
    action = legal[0]
    obs, reward, done = await env.step(action)
    print(f"reward={reward:.3f}, done={done}, protocol_ok={obs.protocol_ok}")

    # Inspect full state
    state = env.state()
    print(f"step={state.step_count}, city_time={state.city_time}s")

asyncio.run(main())
```
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check; returns `{"status": "healthy"}` |
| `/reset` | POST | Reset environment; body: `{"task_id": "...", "seed": 42}` (both optional) |
| `/step` | POST | Execute an action; body: `{"action": {...}}` |
| `/state` | GET | Current full environment state |
| `/tasks` | GET | List all available tasks with metadata |
| `/dashboard/state` | GET | Extended state for the live HTML dashboard |
| `/schema` | GET | JSON schemas for Action, Observation, State |
| `/metadata` | GET | Environment name, version, description |
## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Run the demo (non-interactive, no LLM required)
python demo.py

# Start the API server
python -m src.server.app

# Run the random-agent baseline (no API key required)
USE_RANDOM=true python inference.py

# Run an LLM agent
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
HF_TOKEN=your_token \
python inference.py

# Run the full test suite
pytest tests/ -v
```
## Docker

### Build & Run

```bash
# Build the image
docker build -t citywide-dispatch-supervisor .

# Run on port 7860 (required for HF Spaces)
docker run -p 7860:7860 citywide-dispatch-supervisor

# Health check
curl http://localhost:7860/health

# Reset to a specific task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "multi_incident", "seed": 42}'
```
## Hugging Face Spaces Deployment

This repository is deployed as a Docker-based HF Space.

- Create a new HF Space and select **Docker** as the SDK
- Push this repository to the Space
- The server reads `PORT` from the environment (HF sets `PORT=7860`)
- Once running, the following endpoints are publicly available: `GET /health`, `POST /reset`, `POST /step`, `GET /state`

Validate your deployment with the prevalidation script:

```bash
bash samplematerial/prevalidation.sh https://your-space.hf.space .
```
## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.1-8B-Instruct` |
| `HF_TOKEN` | Hugging Face API key | (none) |
| `USE_RANDOM` | Set `true` for the deterministic random baseline | `false` |
| `PORT` | Server port | `7860` |
## Baseline Scores

Scores are normalized to [0.0, 1.0] using `sum(rewards) / max_steps`.
Run with `USE_RANDOM=true python inference.py` (seed=42, fully deterministic).

| Task | Difficulty | Max Steps | Random Agent Score |
|---|---|---|---|
| `single_incident` | Easy | 20 | 0.2000 |
| `multi_incident` | Medium | 40 | 0.3117 |
| `mass_casualty` | Hard | 60 | 0.4645 |
| `shift_surge` | Hard | 60 | 0.3183 |

> **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.
### What the scores mean

A random agent scoring 0.20 on the easiest task confirms the environment is not trivially solvable: there is no reward for random dispatching. The gradient from 0.20 to 0.46 across tasks reflects genuinely increasing complexity, not just more steps.

A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score 0.55–0.75 on `single_incident` and 0.30–0.45 on `shift_surge`, demonstrating that the environment meaningfully differentiates agent capability.

LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher than random on the easy and medium tasks by correctly prioritizing P1 incidents and matching unit types.
Run the baseline matrix (random + LLM reruns) and emit a JSON report:

```bash
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
HF_TOKEN=your_token \
python scripts/run_baseline_matrix.py --random-runs 1 --llm-runs 3 --output-json baseline_report.json
```

Windows PowerShell shortcut:

```powershell
$env:HF_TOKEN="your_token"
powershell -ExecutionPolicy Bypass -File scripts/run_nemotron_baseline.ps1 -RandomRuns 1 -LlmRuns 3
```
## Project Structure

```
.
├── src/
│   ├── models.py               # Pydantic typed contracts (Action, Observation, State)
│   ├── protocol.py             # Dispatch protocol validator
│   ├── physics.py              # City-grid movement / ETA helpers
│   ├── city_schema.py          # City topology + unit configuration loader
│   ├── state_machine.py        # Core dispatch state machine
│   ├── rewards.py              # Reward engine + episode graders
│   ├── phraseology.py          # Dispatch phraseology renderer/judge
│   ├── api.py                  # REST API client wrapper
│   ├── grading.py              # Centralized episode grading router
│   ├── benchmark.py            # Benchmark runner (list/run all tasks)
│   ├── openenv_environment.py  # OpenEnv-compatible environment wrapper
│   ├── tasks/
│   │   ├── registry.py         # Task registry + deterministic scenario fixtures
│   │   ├── single_incident.py  # Easy task + grader
│   │   ├── multi_incident.py   # Medium task + grader
│   │   ├── mass_casualty.py    # Hard task + grader
│   │   └── shift_surge.py      # Hard task + grader
│   ├── server/
│   │   ├── app.py              # FastAPI server (reset/step/state endpoints)
│   │   ├── requirements.txt
│   │   └── Dockerfile
│   └── visualizer/
│       └── viewer.py           # Read-only 2D Matplotlib visualizer
├── data/
│   ├── metro_city.json         # Large city schema (default)
│   └── city_small.json         # Small city schema (testing)
├── tests/                      # TDD test suite (~20 test modules)
├── samplematerial/
│   └── prevalidation.sh        # HF Space + Docker validation script
├── demo.py                     # Non-interactive demo (no LLM required)
├── inference.py                # Competition inference script
├── live_dashboard.html         # Browser-based live dashboard
├── validate_local.py           # Local pre-submission validation
├── openenv.yaml                # OpenEnv specification
├── pyproject.toml              # uv project config
├── requirements.txt            # pip dependencies
└── Dockerfile                  # Root Docker build
```
## Live Dashboard

After starting the server and calling `/reset`, open `live_dashboard.html` in a browser:

```bash
# Terminal 1: start the server
python -m src.server.app

# Terminal 2: reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "multi_incident"}'

# Browser: open live_dashboard.html
```

The dashboard polls `/dashboard/state` every 500 ms and renders:

- Unit cards (status, ETA, assignment, location)
- Incident cards (type, severity, status, assigned units)
- City map (2D grid with unit and incident markers)
- Per-step reward component bars
## 2D Visualizer (Programmatic)

```python
import asyncio

from src.openenv_environment import OpenEnvEnvironment
from src.visualizer.viewer import Viewer2D

async def main():
    env = OpenEnvEnvironment(task_id="multi_incident", seed=42)
    await env.reset()
    Viewer2D().render_to_file("frame.png", env.state())
    env.close()

asyncio.run(main())
```
## Determinism

All scenarios are deterministic under a fixed seed:

```python
env1 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
env2 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
# env1 and env2 produce identical episodes
```

Incident positions include small seeded perturbations for realism; the overall episode structure (waves, unit positions, incident types) is fully reproducible.
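The pattern behind this guarantee is worth spelling out: every stochastic element is drawn from a per-episode generator seeded once, never from the global RNG. The function below is a hypothetical illustration of that pattern, not the environment's actual spawn code (which lives under `src/tasks/`).

```python
import random

# Illustrative seeding pattern: a dedicated random.Random(seed) per episode
# means two environments built from the same seed draw identical streams.
# spawn_positions is a hypothetical stand-in for the real fixture logic.
def spawn_positions(seed: int, n: int) -> list[tuple[int, int]]:
    rng = random.Random(seed)  # per-episode generator, never the global RNG
    return [(rng.randrange(100), rng.randrange(100)) for _ in range(n)]

assert spawn_positions(42, 5) == spawn_positions(42, 5)  # same seed, same episode
```

Keeping the generator local also means unrelated code calling `random.random()` elsewhere cannot perturb an episode.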
## Running Tests

```bash
# Full test suite
pytest tests/ -v

# Individual modules
pytest tests/test_state_machine.py -v
pytest tests/test_rewards.py -v
pytest tests/test_openenv_integration.py -v
pytest tests/test_inference.py -v
```
## Pre-Submission Validation

```bash
# Full local validation (tests + inference + Docker + benchmark scores)
python validate_local.py

# OpenEnv spec validation
openenv validate

# HF Space validation (requires a deployed Space)
bash samplematerial/prevalidation.sh https://your-space.hf.space .

# Windows (explicit Git Bash)
"C:/Program Files/Git/bin/bash.exe" samplematerial/prevalidation.sh https://your-space.hf.space .
```
## License

MIT License