--- title: 911 Dispatch Supervisor emoji: "๐Ÿšจ" colorFrom: red colorTo: gray sdk: docker app_port: 7860 tags: - openenv pinned: false --- # ๐Ÿšจ 911 Dispatch Supervisor > **A city-wide emergency dispatch RL environment** โ€” train and evaluate LLM agents to manage simultaneous incidents by dispatching police, fire, and EMS units across a city grid under realistic resource constraints. [![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-green)](https://openenv.dev) [![Docker](https://img.shields.io/badge/Docker-ready-blue)](https://hub.docker.com) [![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces) [![License: MIT](https://img.shields.io/badge/License-MIT-lightgrey)](LICENSE) --- ## Why This Matters 911 dispatch centers in the United States handle over 240 million calls per year. Every dispatcher decision โ€” which unit to send, in what order, with what priority โ€” directly determines survival outcomes. A 90-second delay in dispatching a MEDIC to a cardiac arrest drops survival probability by roughly 10%. The **911 Dispatch Supervisor** is the first open RL benchmark for training and evaluating AI agents on emergency dispatch decisions. It models the exact tradeoffs real dispatchers face: triage under uncertainty, multi-unit resource allocation, geographic coverage, and protocol compliance โ€” all simultaneously. This fills a direct gap for researchers building AI copilots for public safety systems, and provides immediate evaluation value for any LLM claiming real-world decision-making capability. ## Overview At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate โ€” under time pressure, limited resources, and competing priorities across a 100ร—100 city grid. This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that: - Requires **triage** โ€” prioritizing life-threatening incidents over property damage - Demands **coverage awareness** โ€” keeping geographic zones protected - Rewards **correct unit-type matching** โ€” sending a MEDIC vs. an ENGINE - Punishes **delays** that cause Priority-1 incidents to escalate - Scores **dispatch phraseology** โ€” realistic radio communication language --- ## Environment Architecture ``` โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ OpenEnv Interface โ”‚ โ”‚ reset() ยท step(action) ยท state() โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ DispatchStateMachine โ”‚ โ”‚ โ€ข Validates actions via DispatchProtocolValidator โ”‚ โ”‚ โ€ข Moves units toward incidents (Manhattan physics) โ”‚ โ”‚ โ€ข Advances incident status: PENDING โ†’ RESPONDING โ†’ โ”‚ โ”‚ ON_SCENE โ†’ RESOLVED (or ESCALATED if timeout) โ”‚ โ”‚ โ€ข Spawns incident waves at configured step offsets โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ RewardCalculator โ”‚ โ”‚ โ€ข response_time (30%) ยท triage (25%) ยท survival (25%) โ”‚ โ”‚ โ€ข coverage (12%) ยท protocol (8%) โ”‚ โ”‚ โ€ข Safety gate: P1 failure โ†’ score capped at 0.2 โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Task-Specific Episode Graders โ”‚ โ”‚ single_incident ยท multi_incident ยท mass_casualty ยท โ”‚ โ”‚ shift_surge โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` --- ## Action Space Actions are structured Pydantic models โ€” no free-text parsing required. **`src.models.Action`** | Field | Type | Description | |---|---|---| | `action_type` | `DispatchAction` | One of: `DISPATCH`, `CANCEL`, `REASSIGN`, `STAGE`, `MUTUAL_AID`, `UPGRADE`, `DOWNGRADE` | | `unit_id` | `str` | Unit identifier, e.g. `MED-1`, `ENG-2` | | `incident_id` | `str` | Incident identifier, e.g. `INC-001` | | `notes` | `str \| None` | Optional phraseology text for protocol scoring bonus | | `priority_override` | `IncidentSeverity \| None` | Required for `UPGRADE`/`DOWNGRADE` actions | **Action Types** | Action | Description | Protocol Rule | |---|---|---| | `DISPATCH` | Send an available unit to an incident | Unit must be `AVAILABLE`; incident must not be `RESOLVED` | | `CANCEL` | Release a unit from its current assignment | Unit must be assigned to the specified incident | | `REASSIGN` | Redirect an assigned unit to a different incident | Unit must be `DISPATCHED`, `ON_SCENE`, or `TRANSPORTING` | | `STAGE` | Pre-position a unit near an incident without committing | Unit must be `AVAILABLE`; incident must be `PENDING` | | `MUTUAL_AID` | Request external unit of a given type | Only allowed when all local units of that type are busy | | `UPGRADE` | Increase incident severity | New severity must be strictly higher than current | | `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current | #### Dispatch Phraseology (bonus scoring) The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to 8% bonus on their protocol score. | Action | Example notes value | |---|---| | Dispatch MEDIC to cardiac | `"Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"` | | Dispatch ENGINE to fire | `"Engine 2 responding to structure fire, Code 3, all units advised"` | | Mutual aid request | `"Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"` | | Stage unit | `"Engine 1 staging at District 3 perimeter, awaiting scene clear"` | --- ## Observation Space **`src.models.Observation`** | Field | Type | Description | |---|---|---| | `result` | `str` | Human-readable result of the last action | | `score` | `float` | Episode score in `[0.0, 1.0]` (task-level grade) | | `protocol_ok` | `bool` | Whether the action passed protocol validation | | `issues` | `list[str]` | Warnings or error codes from the validator | | `reward_breakdown` | `dict[str, float] \| None` | Per-component reward scores for dashboard display | **Full State (`src.models.State`)** | Field | Type | Description | |---|---|---| | `units` | `dict[str, UnitState]` | All units with type, status, location, ETA | | `incidents` | `dict[str, IncidentState]` | All incidents with type, severity, status, assigned units | | `episode_id` | `str` | Unique episode identifier | | `step_count` | `int` | Current step number | | `task_id` | `str` | Active task identifier | | `city_time` | `float` | Simulated city clock in seconds (30s per step) | | `metadata` | `dict` | Schema info, districts, seeds, wave configs, bookkeeping | **Unit Status Transitions** ``` AVAILABLE โ†’ DISPATCHED โ†’ ON_SCENE โ†’ AVAILABLE โ†“ OUT_OF_SERVICE (shift_surge only) ``` **Incident Status Transitions** ``` PENDING โ†’ RESPONDING โ†’ ON_SCENE โ†’ RESOLVED โ†“ โ†“ ESCALATED ESCALATED (survival clock expires) ``` --- ## Reward Function The step-level reward is a weighted combination of five components: | Component | Weight | Description | |---|---|---| | `response_time` | **30%** | How quickly dispatched units reach incidents relative to severity benchmarks (P1: 240s, P2: 480s, P3: 900s) | | `triage` | **25%** | Whether the dispatched unit type matches incident requirements (e.g., MEDIC for CARDIAC_ARREST) | | `survival` | **25%** | Fraction of Priority-1 incidents resolved before the survival clock expires | | `coverage` | **12%** | Geographic distribution of available units across city districts | | `protocol` | **8%** | Action legality + optional phraseology/readback quality via `Action.notes` | > **โš ๏ธ Safety Gate:** If any Priority-1 incident (cardiac arrest, shooting, building collapse) results in zero survival score, the entire episode reward is hard-capped at **0.2** regardless of other performance. This forces agents to treat life-threatening incidents as non-negotiable โ€” exactly as real dispatch protocol requires. **Non-DISPATCH actions** receive neutral `0.5` for `response_time` and `triage`, allowing agents to maintain coverage without penalty. --- ## Tasks ### Task Difficulty Overview | Task | Difficulty | Max Steps | Key Challenge | |---|---|---|---| | `single_incident` | ๐ŸŸข Easy | 20 | Dispatch the right unit type quickly | | `multi_incident` | ๐ŸŸก Medium | 40 | Triage 3 simultaneous incidents, protect P1s | | `mass_casualty` | ๐Ÿ”ด Hard | 60 | Manage wave-based surge with limited resources | | `shift_surge` | ๐Ÿ”ด Hard | 60 | Adapt as units fail and incidents stream continuously | --- ### ๐ŸŸข Task 1: `single_incident` โ€” Basic Dispatch (Easy) **Scenario**: One active incident (`CARDIAC_ARREST`, Priority-1) in a small city. A MEDIC, ENGINE, and PATROL are all available. **Objective**: Dispatch the correct unit type (MEDIC) to the incident as fast as possible. **Grader Logic**: ``` score = 0.0 if incident RESOLVED: score += 0.50 if MEDIC dispatched correctly: score += 0.30 if resolved within 10 steps: score += 0.20 ``` **Why it's easy**: One incident, one correct action, small state space. **What a good agent does**: Immediately dispatches `MED-1 โ†’ INC-001`. **Scoring:** 50% resolution + 30% correct unit type used + 20% response speed. --- ### ๐ŸŸก Task 2: `multi_incident` โ€” Simultaneous Triage (Medium) **Scenario**: Three concurrent incidents at episode start โ€” a structure fire (P2), a cardiac arrest (P1), and a shooting (P1) โ€” with 6 available units. **Objective**: Respond to all incidents with the right unit types, prioritizing P1s. **Grader Logic**: ``` score = 0.5 ร— p1_resolution_rate + 0.3 ร— overall_resolution_rate - 0.2 ร— escalation_penalty ``` **Why it's medium**: Multiple incidents compete for units; wrong type dispatch wastes coverage; P1s must be addressed before P2. **What a good agent does**: Immediately dispatches MEDIC to cardiac arrest and patrol to shooting, then handles the fire with ENGINE/LADDER. **Scoring:** 50% P1 resolution + 30% overall resolution โˆ’ 20% escalation penalty. --- ### ๐Ÿ”ด Task 3: `mass_casualty` โ€” Wave-Based Surge (Hard) **Scenario**: One critical incident (`BUILDING_COLLAPSE`, P1) at step 0. New waves arrive at steps 5 (structure fire) and 12 (two simultaneous cardiac arrests). **Objective**: Maximize P1 survival across all waves despite resource conflicts. **Grader Logic**: ``` score = 0.6 ร— p1_survival_rate + 0.3 ร— mean_step_reward - failure_penalty ``` **Why it's hard**: Resources are exhausted when waves arrive. Agents must decide whether to reassign mid-scene or request mutual aid (at a 120s ETA penalty). Mutual aid is only legal when local units of the required type are fully committed. **What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves. **Scoring:** 60% P1 survival + 30% mean step reward โˆ’ failure penalty if building collapse unresponded. --- ### ๐Ÿ”ด Task 4: `shift_surge` โ€” Long-Horizon Degradation (Hard) **Scenario**: 5 units start available, but 3 go `OUT_OF_SERVICE` by step 5. Incidents arrive in waves every 8 steps throughout the 60-step episode. **Objective**: Maintain city-wide throughput and P1 survival despite progressive resource degradation. **Grader Logic**: ``` score = 0.35 ร— resolution_ratio + 0.25 ร— p1_survival + 0.15 ร— coverage + 0.15 ร— (1 - backlog_ratio) + 0.10 ร— mean_reward - 0.25 ร— escalation_ratio ``` **Why it's hard**: No single optimal strategy โ€” agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows. **Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward โˆ’ 25% escalation penalty. --- ## Unit Types | Unit | Code | Speed | Primary Use | |---|---|---|---| | Engine | `ENGINE` | 0.8 bl/s | Structure fires, hazmat support | | Ladder | `LADDER` | 0.6 bl/s | Multi-story fires, rescues | | Medic | `MEDIC` | 1.0 bl/s | Medical emergencies, trauma | | Patrol | `PATROL` | 1.2 bl/s | Shootings, MVAs, crowd control | | Hazmat | `HAZMAT` | 0.5 bl/s | Chemical/biological spills | ## Incident Types | Incident | Recommended Units | Default Severity | |---|---|---| | `CARDIAC_ARREST` | MEDIC | P1 | | `STRUCTURE_FIRE` | ENGINE ร— 2, LADDER | P2 | | `SHOOTING` | MEDIC, PATROL ร— 2 | P1 | | `MULTI_VEHICLE_ACCIDENT` | MEDIC, PATROL | P2 | | `BUILDING_COLLAPSE` | ENGINE, LADDER, MEDIC ร— 2 | P1 | | `HAZMAT_SPILL` | HAZMAT, ENGINE | P2 | | `OVERDOSE` | MEDIC | P2 | | `MISSING_PERSON` | PATROL | P3 | --- ## OpenEnv Interface ```python import asyncio from src.openenv_environment import OpenEnvEnvironment from src.models import Action, DispatchAction async def main(): env = OpenEnvEnvironment(task_id="multi_incident", seed=42) # Reset to initial state obs = await env.reset() print(obs.result) # "dispatch center online" # Get legal actions (protocol-validated) legal = env.legal_actions() # Take a step action = legal[0] obs, reward, done = await env.step(action) print(f"reward={reward:.3f}, done={done}, protocol_ok={obs.protocol_ok}") # Inspect full state state = env.state() print(f"step={state.step_count}, city_time={state.city_time}s") asyncio.run(main()) ``` --- ## API Endpoints | Endpoint | Method | Description | |---|---|---| | `/health` | GET | Health check โ€” returns `{"status": "healthy"}` | | `/reset` | POST | Reset environment; body: `{"task_id": "...", "seed": 42}` (both optional) | | `/step` | POST | Execute an action; body: `{"action": {...}}` | | `/state` | GET | Current full environment state | | `/tasks` | GET | List all available tasks with metadata | | `/dashboard/state` | GET | Extended state for live HTML dashboard | | `/schema` | GET | JSON schemas for Action, Observation, State | | `/metadata` | GET | Environment name, version, description | --- ## Quick Start ```bash # Install dependencies pip install -r requirements.txt # Run the demo (non-interactive, no LLM required) python demo.py # Start the API server python -m src.server.app # Run random agent baseline (no API key required) USE_RANDOM=true python inference.py # Run LLM agent API_BASE_URL=https://router.huggingface.co/v1 \ MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \ HF_TOKEN=your_token \ python inference.py # Run full test suite pytest tests/ -v ``` --- ## Docker ### Build & Run ```bash # Build image docker build -t citywide-dispatch-supervisor . # Run on port 7860 (required for HF Spaces) docker run -p 7860:7860 citywide-dispatch-supervisor # Health check curl http://localhost:7860/health # Reset to a specific task curl -X POST http://localhost:7860/reset \ -H "Content-Type: application/json" \ -d '{"task_id": "multi_incident", "seed": 42}' ``` --- ## Hugging Face Spaces Deployment This repository is deployed as a Docker-based HF Space. 1. Create a new HF Space โ†’ select **Docker** 2. Push this repository to the Space 3. The server reads `PORT` from the environment (HF sets `PORT=7860`) 4. Once running, the following endpoints are publicly available: - `GET /health` - `POST /reset` - `POST /step` - `GET /state` Validate your deployment with the prevalidation script: ```bash bash samplematerial/prevalidation.sh https://your-space.hf.space . ``` --- ## Environment Variables | Variable | Description | Default | |---|---|---| | `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` | | `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.1-8B-Instruct` | | `HF_TOKEN` | HuggingFace API key | โ€” | | `USE_RANDOM` | Set `true` for deterministic random baseline | `false` | | `PORT` | Server port | `7860` | --- ## Baseline Scores Scores normalized to `[0.0, 1.0]` using `sum(rewards) / max_steps`. Run with `USE_RANDOM=true python inference.py` (seed=42, fully deterministic). | Task | Difficulty | Max Steps | Random Agent Score | |---|---|---|---| | `single_incident` | Easy | 20 | 0.2000 | | `multi_incident` | Medium | 40 | 0.3117 | | `mass_casualty` | Hard | 60 | 0.4645 | | `shift_surge` | Hard | 60 | 0.3183 | > **Note:** Earlier README versions showed higher scores (~0.30โ€“0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`. ### What the scores mean A random agent scoring **0.20 on the easiest task** confirms the environment is not trivially solvable โ€” there is no reward for random dispatching. The gradient from 0.20 โ†’ 0.46 across tasks reflects genuine increasing complexity, not just more steps. A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score **0.55โ€“0.75 on single_incident** and **0.30โ€“0.45 on shift_surge**, demonstrating the environment meaningfully differentiates agent capability. LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher on easy and medium tasks by correctly prioritizing P1 incidents and matching unit types. Run the baseline matrix (random + LLM reruns) and emit a JSON report: ```bash API_BASE_URL=https://router.huggingface.co/v1 \ MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \ HF_TOKEN=your_token \ python scripts/run_baseline_matrix.py --random-runs 1 --llm-runs 3 --output-json baseline_report.json ``` Windows PowerShell shortcut: ```powershell $env:HF_TOKEN="your_token" powershell -ExecutionPolicy Bypass -File scripts/run_nemotron_baseline.ps1 -RandomRuns 1 -LlmRuns 3 ``` --- ## Project Structure ``` . โ”œโ”€โ”€ src/ โ”‚ โ”œโ”€โ”€ models.py # Pydantic typed contracts (Action, Observation, State) โ”‚ โ”œโ”€โ”€ protocol.py # Dispatch protocol validator โ”‚ โ”œโ”€โ”€ physics.py # City-grid movement / ETA helpers โ”‚ โ”œโ”€โ”€ city_schema.py # City topology + unit configuration loader โ”‚ โ”œโ”€โ”€ state_machine.py # Core dispatch state machine โ”‚ โ”œโ”€โ”€ rewards.py # Reward engine + episode graders โ”‚ โ”œโ”€โ”€ phraseology.py # Dispatch phraseology renderer/judge โ”‚ โ”œโ”€โ”€ api.py # REST API client wrapper โ”‚ โ”œโ”€โ”€ grading.py # Centralized episode grading router โ”‚ โ”œโ”€โ”€ benchmark.py # Benchmark runner (list/run all tasks) โ”‚ โ”œโ”€โ”€ openenv_environment.py # OpenEnv-compatible environment wrapper โ”‚ โ”œโ”€โ”€ tasks/ โ”‚ โ”‚ โ”œโ”€โ”€ registry.py # Task registry + deterministic scenario fixtures โ”‚ โ”‚ โ”œโ”€โ”€ single_incident.py # Easy task + grader โ”‚ โ”‚ โ”œโ”€โ”€ multi_incident.py # Medium task + grader โ”‚ โ”‚ โ”œโ”€โ”€ mass_casualty.py # Hard task + grader โ”‚ โ”‚ โ””โ”€โ”€ shift_surge.py # Hard task + grader โ”‚ โ”œโ”€โ”€ server/ โ”‚ โ”‚ โ”œโ”€โ”€ app.py # FastAPI server (reset/step/state endpoints) โ”‚ โ”‚ โ”œโ”€โ”€ requirements.txt โ”‚ โ”‚ โ””โ”€โ”€ Dockerfile โ”‚ โ””โ”€โ”€ visualizer/ โ”‚ โ””โ”€โ”€ viewer.py # Read-only 2D Matplotlib visualizer โ”œโ”€โ”€ data/ โ”‚ โ”œโ”€โ”€ metro_city.json # Large city schema (default) โ”‚ โ””โ”€โ”€ city_small.json # Small city schema (testing) โ”œโ”€โ”€ tests/ # TDD test suite (~20 test modules) โ”œโ”€โ”€ samplematerial/ โ”‚ โ””โ”€โ”€ prevalidation.sh # HF Space + Docker validation script โ”œโ”€โ”€ demo.py # Non-interactive demo (no LLM required) โ”œโ”€โ”€ inference.py # Competition inference script โ”œโ”€โ”€ live_dashboard.html # Browser-based live dashboard โ”œโ”€โ”€ validate_local.py # Local pre-submission validation โ”œโ”€โ”€ openenv.yaml # OpenEnv specification โ”œโ”€โ”€ pyproject.toml # uv project config โ”œโ”€โ”€ requirements.txt # pip dependencies โ””โ”€โ”€ Dockerfile # Root Docker build ``` --- ## Live Dashboard After starting the server and calling `/reset`, open `live_dashboard.html` in a browser: ```bash # Terminal 1: start server python -m src.server.app # Terminal 2: reset to a task curl -X POST http://localhost:7860/reset \ -H "Content-Type: application/json" \ -d '{"task_id": "multi_incident"}' # Browser: open live_dashboard.html ``` The dashboard polls `/dashboard/state` every 500ms and renders: - Unit cards (status, ETA, assignment, location) - Incident cards (type, severity, status, assigned units) - City map (2D grid with unit and incident markers) - Per-step reward component bars --- ## 2D Visualizer (Programmatic) ```python import asyncio from src.openenv_environment import OpenEnvEnvironment from src.visualizer.viewer import Viewer2D async def main(): env = OpenEnvEnvironment(task_id="multi_incident", seed=42) await env.reset() Viewer2D().render_to_file("frame.png", env.state()) env.close() asyncio.run(main()) ``` --- --- ## Determinism All scenarios are deterministic under a fixed seed: ```python env1 = OpenEnvEnvironment(task_id="shift_surge", seed=42) env2 = OpenEnvEnvironment(task_id="shift_surge", seed=42) # env1 and env2 produce identical episodes ``` Incident positions include small seeded perturbations for realism; the overall episode structure (waves, unit positions, incident types) is fully reproducible. --- ## Running Tests ```bash # Full test suite pytest tests/ -v # Individual modules pytest tests/test_state_machine.py -v pytest tests/test_rewards.py -v pytest tests/test_openenv_integration.py -v pytest tests/test_inference.py -v ``` --- ## Pre-Submission Validation ```bash # Full local validation (tests + inference + docker + benchmark scores) python validate_local.py # OpenEnv spec validation openenv validate # HF Space validation (requires deployed space) bash samplematerial/prevalidation.sh https://your-space.hf.space . # Windows (explicit Git Bash) "C:/Program Files/Git/bin/bash.exe" samplematerial/prevalidation.sh https://your-space.hf.space . ``` --- ## License MIT License