Spaces:

garvitsachdeva
/

911

Sleeping

App Files Files Community

911 / README.md

SayedZahur786

Merge branch 'main' of https://github.com/garvitsachdevaa/911-Dispatch-Supervisor

d95e083 about 1 month ago

preview code

raw

history blame contribute delete

23.6 kB

	---
	title: 911 Dispatch Supervisor
	emoji: "🚨"
	colorFrom: red
	colorTo: gray
	sdk: docker
	app_port: 7860
	tags:
	- openenv
	pinned: false
	---

	# 🚨 911 Dispatch Supervisor

	> A city-wide emergency dispatch RL environment — train and evaluate LLM agents to manage simultaneous incidents by dispatching police, fire, and EMS units across a city grid under realistic resource constraints.

	[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-green)](https://openenv.dev)
	[![Docker](https://img.shields.io/badge/Docker-ready-blue)](https://hub.docker.com)
	[![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces)
	[![License: MIT](https://img.shields.io/badge/License-MIT-lightgrey)](LICENSE)

	---

	## Why This Matters

	911 dispatch centers in the United States handle over 240 million calls per year. Every dispatcher decision — which unit to send, in what order, with what priority — directly determines survival outcomes. A 90-second delay in dispatching a MEDIC to a cardiac arrest drops survival probability by roughly 10%.

	The 911 Dispatch Supervisor is the first open RL benchmark for training and evaluating AI agents on emergency dispatch decisions. It models the exact tradeoffs real dispatchers face: triage under uncertainty, multi-unit resource allocation, geographic coverage, and protocol compliance — all simultaneously.

	This fills a direct gap for researchers building AI copilots for public safety systems, and provides immediate evaluation value for any LLM claiming real-world decision-making capability.

	## Overview

	At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate — under time pressure, limited resources, and competing priorities across a 100×100 city grid.

	This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that:
	- Requires triage — prioritizing life-threatening incidents over property damage
	- Demands coverage awareness — keeping geographic zones protected
	- Rewards correct unit-type matching — sending a MEDIC vs. an ENGINE
	- Punishes delays that cause Priority-1 incidents to escalate
	- Scores dispatch phraseology — realistic radio communication language

	---

	## Environment Architecture

	```
	┌─────────────────────────────────────────────────────────┐
	│ OpenEnv Interface │
	│ reset() · step(action) · state() │
	└──────────────────────┬──────────────────────────────────┘
	│
	┌──────────────────────▼──────────────────────────────────┐
	│ DispatchStateMachine │
	│ • Validates actions via DispatchProtocolValidator │
	│ • Moves units toward incidents (Manhattan physics) │
	│ • Advances incident status: PENDING → RESPONDING → │
	│ ON_SCENE → RESOLVED (or ESCALATED if timeout) │
	│ • Spawns incident waves at configured step offsets │
	└──────────────────────┬──────────────────────────────────┘
	│
	┌──────────────────────▼──────────────────────────────────┐
	│ RewardCalculator │
	│ • response_time (30%) · triage (25%) · survival (25%) │
	│ • coverage (12%) · protocol (8%) │
	│ • Safety gate: P1 failure → score capped at 0.2 │
	└──────────────────────┬──────────────────────────────────┘
	│
	┌──────────────────────▼──────────────────────────────────┐
	│ Task-Specific Episode Graders │
	│ single_incident · multi_incident · mass_casualty · │
	│ shift_surge │
	└─────────────────────────────────────────────────────────┘
	```

	---

	## Action Space

	Actions are structured Pydantic models — no free-text parsing required.

	`src.models.Action`

	\| Field \| Type \| Description \|
	\|---\|---\|---\|
	\| `action_type` \| `DispatchAction` \| One of: `DISPATCH`, `CANCEL`, `REASSIGN`, `STAGE`, `MUTUAL_AID`, `UPGRADE`, `DOWNGRADE` \|
	\| `unit_id` \| `str` \| Unit identifier, e.g. `MED-1`, `ENG-2` \|
	\| `incident_id` \| `str` \| Incident identifier, e.g. `INC-001` \|
	\| `notes` \| `str \\| None` \| Optional phraseology text for protocol scoring bonus \|
	\| `priority_override` \| `IncidentSeverity \\| None` \| Required for `UPGRADE`/`DOWNGRADE` actions \|

	Action Types

	\| Action \| Description \| Protocol Rule \|
	\|---\|---\|---\|
	\| `DISPATCH` \| Send an available unit to an incident \| Unit must be `AVAILABLE`; incident must not be `RESOLVED` \|
	\| `CANCEL` \| Release a unit from its current assignment \| Unit must be assigned to the specified incident \|
	\| `REASSIGN` \| Redirect an assigned unit to a different incident \| Unit must be `DISPATCHED`, `ON_SCENE`, or `TRANSPORTING` \|
	\| `STAGE` \| Pre-position a unit near an incident without committing \| Unit must be `AVAILABLE`; incident must be `PENDING` \|
	\| `MUTUAL_AID` \| Request external unit of a given type \| Only allowed when all local units of that type are busy \|
	\| `UPGRADE` \| Increase incident severity \| New severity must be strictly higher than current \|
	\| `DOWNGRADE` \| Decrease incident severity \| New severity must be strictly lower than current \|

	#### Dispatch Phraseology (bonus scoring)

	The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to 8% bonus on their protocol score.

	\| Action \| Example notes value \|
	\|---\|---\|
	\| Dispatch MEDIC to cardiac \| `"Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"` \|
	\| Dispatch ENGINE to fire \| `"Engine 2 responding to structure fire, Code 3, all units advised"` \|
	\| Mutual aid request \| `"Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"` \|
	\| Stage unit \| `"Engine 1 staging at District 3 perimeter, awaiting scene clear"` \|

	---

	## Observation Space

	`src.models.Observation`

	\| Field \| Type \| Description \|
	\|---\|---\|---\|
	\| `result` \| `str` \| Human-readable result of the last action \|
	\| `score` \| `float` \| Episode score in `[0.0, 1.0]` (task-level grade) \|
	\| `protocol_ok` \| `bool` \| Whether the action passed protocol validation \|
	\| `issues` \| `list[str]` \| Warnings or error codes from the validator \|
	\| `reward_breakdown` \| `dict[str, float] \\| None` \| Per-component reward scores for dashboard display \|

	Full State (`src.models.State`)

	\| Field \| Type \| Description \|
	\|---\|---\|---\|
	\| `units` \| `dict[str, UnitState]` \| All units with type, status, location, ETA \|
	\| `incidents` \| `dict[str, IncidentState]` \| All incidents with type, severity, status, assigned units \|
	\| `episode_id` \| `str` \| Unique episode identifier \|
	\| `step_count` \| `int` \| Current step number \|
	\| `task_id` \| `str` \| Active task identifier \|
	\| `city_time` \| `float` \| Simulated city clock in seconds (30s per step) \|
	\| `metadata` \| `dict` \| Schema info, districts, seeds, wave configs, bookkeeping \|

	Unit Status Transitions

	```
	AVAILABLE → DISPATCHED → ON_SCENE → AVAILABLE
	↓
	OUT_OF_SERVICE (shift_surge only)
	```

	Incident Status Transitions

	```
	PENDING → RESPONDING → ON_SCENE → RESOLVED
	↓ ↓
	ESCALATED ESCALATED (survival clock expires)
	```

	---

	## Reward Function

	The step-level reward is a weighted combination of five components:

	\| Component \| Weight \| Description \|
	\|---\|---\|---\|
	\| `response_time` \| 30% \| How quickly dispatched units reach incidents relative to severity benchmarks (P1: 240s, P2: 480s, P3: 900s) \|
	\| `triage` \| 25% \| Whether the dispatched unit type matches incident requirements (e.g., MEDIC for CARDIAC_ARREST) \|
	\| `survival` \| 25% \| Fraction of Priority-1 incidents resolved before the survival clock expires \|
	\| `coverage` \| 12% \| Geographic distribution of available units across city districts \|
	\| `protocol` \| 8% \| Action legality + optional phraseology/readback quality via `Action.notes` \|

	> ⚠️ Safety Gate: If any Priority-1 incident (cardiac arrest, shooting, building collapse) results in zero survival score, the entire episode reward is hard-capped at 0.2 regardless of other performance. This forces agents to treat life-threatening incidents as non-negotiable — exactly as real dispatch protocol requires.

	Non-DISPATCH actions receive neutral `0.5` for `response_time` and `triage`, allowing agents to maintain coverage without penalty.

	---

	## Tasks

	### Task Difficulty Overview

	\| Task \| Difficulty \| Max Steps \| Key Challenge \|
	\|---\|---\|---\|---\|
	\| `single_incident` \| 🟢 Easy \| 20 \| Dispatch the right unit type quickly \|
	\| `multi_incident` \| 🟡 Medium \| 40 \| Triage 3 simultaneous incidents, protect P1s \|
	\| `mass_casualty` \| 🔴 Hard \| 60 \| Manage wave-based surge with limited resources \|
	\| `shift_surge` \| 🔴 Hard \| 60 \| Adapt as units fail and incidents stream continuously \|

	---

	### 🟢 Task 1: `single_incident` — Basic Dispatch (Easy)

	Scenario: One active incident (`CARDIAC_ARREST`, Priority-1) in a small city. A MEDIC, ENGINE, and PATROL are all available.

	Objective: Dispatch the correct unit type (MEDIC) to the incident as fast as possible.

	Grader Logic:
	```
	score = 0.0
	if incident RESOLVED: score += 0.50
	if MEDIC dispatched correctly: score += 0.30
	if resolved within 10 steps: score += 0.20
	```

	Why it's easy: One incident, one correct action, small state space.

	What a good agent does: Immediately dispatches `MED-1 → INC-001`.

	Scoring: 50% resolution + 30% correct unit type used + 20% response speed.

	---

	### 🟡 Task 2: `multi_incident` — Simultaneous Triage (Medium)

	Scenario: Three concurrent incidents at episode start — a structure fire (P2), a cardiac arrest (P1), and a shooting (P1) — with 6 available units.

	Objective: Respond to all incidents with the right unit types, prioritizing P1s.

	Grader Logic:
	```
	score = 0.5 × p1_resolution_rate
	+ 0.3 × overall_resolution_rate
	- 0.2 × escalation_penalty
	```

	Why it's medium: Multiple incidents compete for units; wrong type dispatch wastes coverage; P1s must be addressed before P2.

	What a good agent does: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER.

	Scoring: 50% P1 resolution + 30% overall resolution − 20% escalation penalty.

	---

	### 🔴 Task 3: `mass_casualty` — Wave-Based Surge (Hard)

	Scenario: One critical incident (`BUILDING_COLLAPSE`, P1) at step 0. New waves arrive at steps 5 (structure fire) and 12 (two simultaneous cardiac arrests).

	Objective: Maximize P1 survival across all waves despite resource conflicts.

	Grader Logic:
	```
	score = 0.6 × p1_survival_rate
	+ 0.3 × mean_step_reward
	- failure_penalty
	```

	Why it's hard: Resources are exhausted when waves arrive. Agents must decide whether to reassign mid-scene or request mutual aid (at a 120s ETA penalty). Mutual aid is only legal when local units of the required type are fully committed.

	What a good agent does: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.

	Scoring: 60% P1 survival + 30% mean step reward − failure penalty if building collapse unresponded.

	---

	### 🔴 Task 4: `shift_surge` — Long-Horizon Degradation (Hard)

	Scenario: 5 units start available, but 3 go `OUT_OF_SERVICE` by step 5. Incidents arrive in waves every 8 steps throughout the 60-step episode.

	Objective: Maintain city-wide throughput and P1 survival despite progressive resource degradation.

	Grader Logic:
	```
	score = 0.35 × resolution_ratio
	+ 0.25 × p1_survival
	+ 0.15 × coverage
	+ 0.15 × (1 - backlog_ratio)
	+ 0.10 × mean_reward
	- 0.25 × escalation_ratio
	```

	Why it's hard: No single optimal strategy — agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.

	Scoring: 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty.

	---

	## Unit Types

	\| Unit \| Code \| Speed \| Primary Use \|
	\|---\|---\|---\|---\|
	\| Engine \| `ENGINE` \| 0.8 bl/s \| Structure fires, hazmat support \|
	\| Ladder \| `LADDER` \| 0.6 bl/s \| Multi-story fires, rescues \|
	\| Medic \| `MEDIC` \| 1.0 bl/s \| Medical emergencies, trauma \|
	\| Patrol \| `PATROL` \| 1.2 bl/s \| Shootings, MVAs, crowd control \|
	\| Hazmat \| `HAZMAT` \| 0.5 bl/s \| Chemical/biological spills \|

	## Incident Types

	\| Incident \| Recommended Units \| Default Severity \|
	\|---\|---\|---\|
	\| `CARDIAC_ARREST` \| MEDIC \| P1 \|
	\| `STRUCTURE_FIRE` \| ENGINE × 2, LADDER \| P2 \|
	\| `SHOOTING` \| MEDIC, PATROL × 2 \| P1 \|
	\| `MULTI_VEHICLE_ACCIDENT` \| MEDIC, PATROL \| P2 \|
	\| `BUILDING_COLLAPSE` \| ENGINE, LADDER, MEDIC × 2 \| P1 \|
	\| `HAZMAT_SPILL` \| HAZMAT, ENGINE \| P2 \|
	\| `OVERDOSE` \| MEDIC \| P2 \|
	\| `MISSING_PERSON` \| PATROL \| P3 \|

	---

	## OpenEnv Interface

	```python
	import asyncio
	from src.openenv_environment import OpenEnvEnvironment
	from src.models import Action, DispatchAction

	async def main():
	env = OpenEnvEnvironment(task_id="multi_incident", seed=42)

	# Reset to initial state
	obs = await env.reset()
	print(obs.result) # "dispatch center online"

	# Get legal actions (protocol-validated)
	legal = env.legal_actions()

	# Take a step
	action = legal[0]
	obs, reward, done = await env.step(action)

	print(f"reward={reward:.3f}, done={done}, protocol_ok={obs.protocol_ok}")

	# Inspect full state
	state = env.state()
	print(f"step={state.step_count}, city_time={state.city_time}s")

	asyncio.run(main())
	```

	---

	## API Endpoints

	\| Endpoint \| Method \| Description \|
	\|---\|---\|---\|
	\| `/health` \| GET \| Health check — returns `{"status": "healthy"}` \|
	\| `/reset` \| POST \| Reset environment; body: `{"task_id": "...", "seed": 42}` (both optional) \|
	\| `/step` \| POST \| Execute an action; body: `{"action": {...}}` \|
	\| `/state` \| GET \| Current full environment state \|
	\| `/tasks` \| GET \| List all available tasks with metadata \|
	\| `/dashboard/state` \| GET \| Extended state for live HTML dashboard \|
	\| `/schema` \| GET \| JSON schemas for Action, Observation, State \|
	\| `/metadata` \| GET \| Environment name, version, description \|

	---

	## Quick Start

	```bash
	# Install dependencies
	pip install -r requirements.txt

	# Run the demo (non-interactive, no LLM required)
	python demo.py

	# Start the API server
	python -m src.server.app

	# Run random agent baseline (no API key required)
	USE_RANDOM=true python inference.py

	# Run LLM agent
	API_BASE_URL=https://router.huggingface.co/v1 \
	MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
	HF_TOKEN=your_token \
	python inference.py

	# Run full test suite
	pytest tests/ -v
	```

	---

	## Docker

	### Build & Run

	```bash
	# Build image
	docker build -t citywide-dispatch-supervisor .

	# Run on port 7860 (required for HF Spaces)
	docker run -p 7860:7860 citywide-dispatch-supervisor

	# Health check
	curl http://localhost:7860/health

	# Reset to a specific task
	curl -X POST http://localhost:7860/reset \
	-H "Content-Type: application/json" \
	-d '{"task_id": "multi_incident", "seed": 42}'
	```

	---

	## Hugging Face Spaces Deployment

	This repository is deployed as a Docker-based HF Space.

	1. Create a new HF Space → select Docker
	2. Push this repository to the Space
	3. The server reads `PORT` from the environment (HF sets `PORT=7860`)
	4. Once running, the following endpoints are publicly available:
	- `GET /health`
	- `POST /reset`
	- `POST /step`
	- `GET /state`

	Validate your deployment with the prevalidation script:

	```bash
	bash samplematerial/prevalidation.sh https://your-space.hf.space .
	```

	---

	## Environment Variables

	\| Variable \| Description \| Default \|
	\|---\|---\|---\|
	\| `API_BASE_URL` \| LLM API endpoint \| `https://router.huggingface.co/v1` \|
	\| `MODEL_NAME` \| Model identifier \| `meta-llama/Llama-3.1-8B-Instruct` \|
	\| `HF_TOKEN` \| HuggingFace API key \| — \|
	\| `USE_RANDOM` \| Set `true` for deterministic random baseline \| `false` \|
	\| `PORT` \| Server port \| `7860` \|

	---

	## Baseline Scores

	Scores normalized to `[0.0, 1.0]` using `sum(rewards) / max_steps`.
	Run with `USE_RANDOM=true python inference.py` (seed=42, fully deterministic).

	\| Task \| Difficulty \| Max Steps \| Random Agent Score \|
	\|---\|---\|---\|---\|
	\| `single_incident` \| Easy \| 20 \| 0.2000 \|
	\| `multi_incident` \| Medium \| 40 \| 0.3117 \|
	\| `mass_casualty` \| Hard \| 60 \| 0.4645 \|
	\| `shift_surge` \| Hard \| 60 \| 0.3183 \|

	> Note: Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.

	### What the scores mean

	A random agent scoring 0.20 on the easiest task confirms the environment is not trivially solvable — there is no reward for random dispatching. The gradient from 0.20 → 0.46 across tasks reflects genuine increasing complexity, not just more steps.

	A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score 0.55–0.75 on single_incident and 0.30–0.45 on shift_surge, demonstrating the environment meaningfully differentiates agent capability.

	LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher on easy and medium tasks by correctly prioritizing P1 incidents and matching unit types.

	Run the baseline matrix (random + LLM reruns) and emit a JSON report:

	```bash
	API_BASE_URL=https://router.huggingface.co/v1 \
	MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
	HF_TOKEN=your_token \
	python scripts/run_baseline_matrix.py --random-runs 1 --llm-runs 3 --output-json baseline_report.json
	```

	Windows PowerShell shortcut:

	```powershell
	$env:HF_TOKEN="your_token"
	powershell -ExecutionPolicy Bypass -File scripts/run_nemotron_baseline.ps1 -RandomRuns 1 -LlmRuns 3
	```

	---

	## Project Structure

	```
	.
	├── src/
	│ ├── models.py # Pydantic typed contracts (Action, Observation, State)
	│ ├── protocol.py # Dispatch protocol validator
	│ ├── physics.py # City-grid movement / ETA helpers
	│ ├── city_schema.py # City topology + unit configuration loader
	│ ├── state_machine.py # Core dispatch state machine
	│ ├── rewards.py # Reward engine + episode graders
	│ ├── phraseology.py # Dispatch phraseology renderer/judge
	│ ├── api.py # REST API client wrapper
	│ ├── grading.py # Centralized episode grading router
	│ ├── benchmark.py # Benchmark runner (list/run all tasks)
	│ ├── openenv_environment.py # OpenEnv-compatible environment wrapper
	│ ├── tasks/
	│ │ ├── registry.py # Task registry + deterministic scenario fixtures
	│ │ ├── single_incident.py # Easy task + grader
	│ │ ├── multi_incident.py # Medium task + grader
	│ │ ├── mass_casualty.py # Hard task + grader
	│ │ └── shift_surge.py # Hard task + grader
	│ ├── server/
	│ │ ├── app.py # FastAPI server (reset/step/state endpoints)
	│ │ ├── requirements.txt
	│ │ └── Dockerfile
	│ └── visualizer/
	│ └── viewer.py # Read-only 2D Matplotlib visualizer
	├── data/
	│ ├── metro_city.json # Large city schema (default)
	│ └── city_small.json # Small city schema (testing)
	├── tests/ # TDD test suite (~20 test modules)
	├── samplematerial/
	│ └── prevalidation.sh # HF Space + Docker validation script
	├── demo.py # Non-interactive demo (no LLM required)
	├── inference.py # Competition inference script
	├── live_dashboard.html # Browser-based live dashboard
	├── validate_local.py # Local pre-submission validation
	├── openenv.yaml # OpenEnv specification
	├── pyproject.toml # uv project config
	├── requirements.txt # pip dependencies
	└── Dockerfile # Root Docker build
	```

	---

	## Live Dashboard

	After starting the server and calling `/reset`, open `live_dashboard.html` in a browser:

	```bash
	# Terminal 1: start server
	python -m src.server.app

	# Terminal 2: reset to a task
	curl -X POST http://localhost:7860/reset \
	-H "Content-Type: application/json" \
	-d '{"task_id": "multi_incident"}'

	# Browser: open live_dashboard.html
	```

	The dashboard polls `/dashboard/state` every 500ms and renders:

	- Unit cards (status, ETA, assignment, location)
	- Incident cards (type, severity, status, assigned units)
	- City map (2D grid with unit and incident markers)
	- Per-step reward component bars

	---

	## 2D Visualizer (Programmatic)

	```python
	import asyncio
	from src.openenv_environment import OpenEnvEnvironment
	from src.visualizer.viewer import Viewer2D

	async def main():
	env = OpenEnvEnvironment(task_id="multi_incident", seed=42)
	await env.reset()
	Viewer2D().render_to_file("frame.png", env.state())
	env.close()

	asyncio.run(main())
	```

	---

	---

	## Determinism

	All scenarios are deterministic under a fixed seed:

	```python
	env1 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
	env2 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
	# env1 and env2 produce identical episodes
	```

	Incident positions include small seeded perturbations for realism; the overall episode structure (waves, unit positions, incident types) is fully reproducible.

	---

	## Running Tests

	```bash
	# Full test suite
	pytest tests/ -v

	# Individual modules
	pytest tests/test_state_machine.py -v
	pytest tests/test_rewards.py -v
	pytest tests/test_openenv_integration.py -v
	pytest tests/test_inference.py -v
	```

	---

	## Pre-Submission Validation

	```bash
	# Full local validation (tests + inference + docker + benchmark scores)
	python validate_local.py

	# OpenEnv spec validation
	openenv validate

	# HF Space validation (requires deployed space)
	bash samplematerial/prevalidation.sh https://your-space.hf.space .

	# Windows (explicit Git Bash)
	"C:/Program Files/Git/bin/bash.exe" samplematerial/prevalidation.sh https://your-space.hf.space .
	```

	---

	## License

	MIT License