# 911 Dispatch Project - Complete Beginner Guide

## 1. What this project is (in plain language)

This project is a simulator where an AI agent learns to behave like a city emergency dispatch supervisor.

Think of it like a strategy game:
- There are emergencies (incidents).
- There are responders (fire, police, EMS units).
- The agent must choose what to do each turn (dispatch, reassign, cancel, request mutual aid, etc.).
- The simulator gives a score for each decision and a final score for the whole run.

The goal is to train and evaluate decision-making quality under pressure.

## 2. What an RL environment means

RL means Reinforcement Learning.

In RL, four core ideas exist:
- Agent: the decision-maker (your model or baseline policy).
- Environment: the world that reacts to actions (this simulator).
- Reward: a number that says how good/bad the last action outcome was.
- Episode: one complete run from start to finish.

For this project:
- Agent picks an action.
- Environment updates city state.
- Environment returns:
  - updated observation,
  - reward,
  - done flag (whether run is over).

That loop repeats until the episode ends.

## 3. Important clarification: "scheme of electricity" vs "city schema"

There is no electricity scheme in this codebase. What exists is a city schema.

City schema means a configuration blueprint for the simulation:
- city size (grid),
- districts,
- available units,
- unit speeds,
- default recommended unit types for each incident type.

The schema is loaded from data files and used to initialize deterministic, repeatable scenarios.

## 4. Project architecture (high level)

1. Scenario/task setup
   - A task fixture builds initial units/incidents and metadata.
2. State machine update engine
   - Validates actions.
   - Applies action effects.
   - Advances time by one tick.
   - Updates incident statuses and unit statuses.
3. Reward + scoring
   - Computes per-step reward components.
   - Computes episode-level score using task-specific graders.
4. API server
   - Exposes reset/step/state endpoints.
5. Dashboard
   - Polls backend state repeatedly and renders units/incidents + reward bars.

## 5. What is the task?

A task is a scenario type with its own initial conditions, difficulty, and final grading logic.

This project has 4 tasks:

1. single_incident (easy)
   - One incident, small unit pool.
   - Focus: dispatch the right unit fast.
2. multi_incident (medium)
   - Multiple incidents at the same time.
   - Focus: triage/prioritization and handling P1 incidents.
3. mass_casualty (hard)
   - Incident waves with severe emergencies and resource conflicts.
   - Focus: survival outcomes under surge.
4. shift_surge (hard)
   - New incidents arrive over time and some units go out of service.
   - Focus: long-horizon operations and city coverage under degradation.

## 6. What is an episode?

An episode is one full run of a task from reset until terminal condition.

Episode starts when reset is called.
- step_count starts at 0.
- city_time starts at 0 seconds.
- units and incidents are loaded from selected task fixture.

Episode ends when any terminal condition is hit:
- max steps reached,
- at least one incident escalates,
- all incidents resolved.

## 7. What is a step?

A step is one action cycle:

1. Agent sends one action.
2. Validator checks if action is legal.
3. State machine applies action effects.
4. Time advances by 30 seconds.
5. Reward is computed.
6. Observation + reward + done are returned.

Important:
- step_count increases by 1 per step.
- city_time increases by 30 seconds per step.

## 8. At what step are we right now?

Snapshot from the live backend at the time this guide was generated:
- task_id: multi_incident
- episode_id: d2cd525e-2596-44cb-bbe3-af33236264a0
- step_count: 8
- city_time: 240.0 seconds
- cumulative_reward: 1.6
- episode_score: 0.0
- legal_actions currently available: 36

This is a live value, not a constant. If you reset again, step_count returns to 0.

## 9. Action space (what actions exist)

Current action types include:
- DISPATCH
- CANCEL
- REASSIGN
- STAGE
- MUTUAL_AID
- UPGRADE
- DOWNGRADE

Legal actions are generated from current state and filtered by protocol validation, so only valid actions appear in legal_actions.

## 10. How scoring works (complete detail)

There are two scoring layers:

1. Step reward (every action)
2. Episode score (whole run)

### 10.1 Step reward (RewardCalculator)

Step reward uses a weighted sum of 5 components:
- response_time: 30%
- triage: 25%
- survival: 25%
- coverage: 12%
- protocol: 8%

Total formula:
- total = 0.30 * response_time + 0.25 * triage + 0.25 * survival + 0.12 * coverage + 0.08 * protocol
- result is clamped to [0, 1]

Safety rule:
- If any Priority-1 incident existed and survival component is 0, total score is capped at 0.2.

Component details:

1. response_time
   - Only meaningful for DISPATCH.
   - For non-DISPATCH actions it returns neutral 0.5.
   - For DISPATCH: compares ETA to severity benchmark.
2. triage
   - Only meaningful for DISPATCH.
   - Checks if dispatched unit type matches required unit types for incident type.
   - Handles enum-qualified metadata keys safely.
3. survival
   - Based on P1 incidents seen vs resolved without failure.
   - Uses metadata lists: p1_seen, resolved_incidents, failed_incidents.
4. coverage
   - Measures how many districts still have AVAILABLE coverage.
5. protocol
   - If action invalid: 0.0.
   - If valid and no phraseology text in Action.notes: neutral 0.5.
   - If Action.notes provided: uses PhraseologyJudge score + readback correctness.

### 10.2 Episode score (whole run)

Episode score is task-specific via a central grade_episode router.

Why this matters:
- Different tasks need different definitions of success.
- Mean step reward alone is often too weak for real evaluation.

Task-specific episode graders:

1. single_incident
   - +0.50 if incident resolved
   - +0.30 if MEDIC dispatched correctly
   - +0.20 if resolved within first 10 steps
2. multi_incident
   - Uses P1 resolution, overall resolution ratio, and escalation penalty
   - score = 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty
3. mass_casualty
   - Emphasizes P1 survival with penalties for failures
   - score = 0.6 * survival_score + 0.3 * mean_reward - failure_penalty
4. shift_surge (improved)
   - Emphasizes long-horizon operational quality:
     - incident throughput (resolved ratio)
     - P1 survival
     - coverage
     - low backlog
     - mean reward
     - escalation penalty

## 11. Very important score semantics

In the OpenEnv wrapper:
- reward return value from step is per-step reward.
- observation.score is overwritten to episode score.

Also stored in metadata:
- cumulative_reward: running sum of step rewards.
- episode_rewards: list of per-step rewards.
- episode_score: current episode-level grade.

So if you compare values:
- reward = immediate local quality for this action
- observation.score = global task progress quality for the run

## 12. Is the dashboard connected to backend or just static?

It is connected to backend.

How we know:
- The dashboard JavaScript calls API endpoint http://localhost:8000/dashboard/state.
- It polls every 500 ms.
- It renders live units/incidents, step, and reward breakdown from backend response.

Connection behavior:
- If backend is unreachable, dashboard shows disconnected status.
- If backend is running and reset was called, dashboard updates live as step changes.

## 13. Why we used Docker

Docker is used to package the app and dependencies so it runs consistently everywhere.

Benefits:
- Same runtime on your machine, CI, and deployment platforms.
- No "works on my machine" package mismatch issues.
- Easy deployment with a single container image.
- Port compatibility: server reads PORT environment variable (important for hosted platforms).

In this project:
- Root Dockerfile runs uvicorn on 0.0.0.0 and PORT (default 8000).
- That makes it suitable for local run and hosted environments.

## 14. What API key are we using?

The project expects environment variables. Keys are not hardcoded in repository files.

Required for LLM mode:
- API_BASE_URL
- MODEL_NAME
- OPENAI_API_KEY

Compatibility fallback:
- HF_TOKEN is accepted if OPENAI_API_KEY is not set.

No-key mode:
- USE_RANDOM=true bypasses LLM and uses a deterministic random baseline agent.

Practical meaning:
- If USE_RANDOM=true, you can run without any API key.
- If USE_RANDOM is not true, OPENAI_API_KEY (or HF_TOKEN fallback) is needed.

## 15. Backend API endpoints (what each does)

- GET /health - health check
- GET /tasks - list available tasks
- POST /reset - start new episode for selected task
- POST /step - apply one action and move simulation one step
- GET /state - current state
- GET /dashboard/state - extended state for HTML dashboard (includes legal actions + last observation)
- GET /metadata and GET /schema - environment metadata and contracts
- POST /mcp - minimal JSON-RPC endpoint

## 16. What the dashboard shows vs what it does not show

Shows:
- Unit cards (status, assignment, ETA, location)
- Incident cards (type, severity, status, assigned units)
- Map view for units/incidents
- Last step reward component bars
- Header task/episode/step values

Nuance:
- Header "Score" currently uses metadata.cumulative_reward.
- Episode score is available too (metadata.episode_score), but not currently shown as the main header score.

## 17. Beginner glossary

- incident: emergency case to be handled
- unit: responder vehicle/team (EMS, fire, police, etc.)
- legal action: an action that passes protocol checks in current state
- reward: immediate feedback signal for one step
- episode score: overall quality of a full run
- terminal: episode is finished

## 18. Practical "how to think" summary

When you judge behavior quality in this project:
- Use step rewards to understand local tactical quality.
- Use episode score to understand mission success for the selected task.
- Use dashboard to observe live state transitions.
- Use task definitions to interpret what success means in each scenario.

If you remember one thing:
- This is not a generic chatbot app. It is a decision simulator where actions change a world state over time and are graded both step-by-step and across full episodes.
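
To make the step and episode bookkeeping from sections 6 to 8 concrete, here is a minimal sketch in Python. It is not the project's code: `ToyClock`, `is_terminal`, the status strings `"ESCALATED"` and `"RESOLVED"`, and the `max_steps` parameter are hypothetical names chosen for illustration. Only the 30-second tick and the three terminal conditions come from the guide.

```python
from dataclasses import dataclass
from typing import Sequence

TICK_SECONDS = 30.0  # from the guide: each step advances city time by 30 s


@dataclass
class ToyClock:
    """Hypothetical stand-in for the simulator's episode counters."""

    step_count: int = 0
    city_time: float = 0.0

    def reset(self) -> None:
        # An episode starts with both counters at zero.
        self.step_count = 0
        self.city_time = 0.0

    def step(self) -> None:
        # One action cycle: one step, one 30-second tick.
        self.step_count += 1
        self.city_time += TICK_SECONDS


def is_terminal(step_count: int, max_steps: int, statuses: Sequence[str]) -> bool:
    """Return True when any terminal condition listed in section 6 holds.

    The status strings are assumptions, not the project's real enum values.
    An empty incident list is treated as "not all resolved" (also an assumption).
    """
    escalated = any(s == "ESCALATED" for s in statuses)
    all_resolved = len(statuses) > 0 and all(s == "RESOLVED" for s in statuses)
    return step_count >= max_steps or escalated or all_resolved


clock = ToyClock()
for _ in range(8):
    clock.step()
print(clock.step_count, clock.city_time)  # 8 240.0
```

Eight steps give 240.0 seconds, which matches the live snapshot in section 8 (step_count 8, city_time 240.0).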
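
The step reward formula from section 10.1 can be sketched as follows. This is an illustration under stated assumptions rather than the actual RewardCalculator: the function name `step_reward`, its argument shapes, and the order of clamping before capping are hypothetical. The weights, the [0, 1] clamp, and the 0.2 cap are taken from the guide.

```python
from typing import Dict

# Component weights from section 10.1 (they sum to 1.0).
WEIGHTS: Dict[str, float] = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}
P1_FAILURE_CAP = 0.2  # safety rule cap from the guide


def step_reward(components: Dict[str, float], p1_seen: bool) -> float:
    """Weighted sum of the five components, clamped, then capped.

    `components` maps each component name to a value in [0, 1];
    `p1_seen` says whether any Priority-1 incident existed.
    Both parameter shapes are assumptions made for this sketch.
    """
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    total = max(0.0, min(1.0, total))  # clamp to [0, 1]
    if p1_seen and components["survival"] == 0.0:
        total = min(total, P1_FAILURE_CAP)  # safety rule
    return total


good = {"response_time": 0.8, "triage": 1.0, "survival": 1.0,
        "coverage": 0.5, "protocol": 0.5}
print(round(step_reward(good, p1_seen=True), 2))  # 0.84

failed = {"response_time": 1.0, "triage": 1.0, "survival": 0.0,
          "coverage": 1.0, "protocol": 1.0}
print(step_reward(failed, p1_seen=True))  # 0.2
```

In the second case the uncapped sum would be 0.75, so the cap is what pulls the result down to 0.2. This shows why a tactically clean step can still score low once a P1 incident has been lost.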
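
The task-specific graders in section 10.2 can be illustrated the same way. The sketch below covers only the three tasks for which the guide gives explicit numbers. The function names, the argument names, and the reading of "within first 10 steps" as `resolved_step <= 10` are assumptions. No clamping is applied, because the guide does not state one for episode scores.

```python
from typing import Optional


def grade_single_incident(resolved: bool, medic_dispatched: bool,
                          resolved_step: Optional[int]) -> float:
    """+0.50 resolved, +0.30 correct MEDIC dispatch, +0.20 fast resolution."""
    score = 0.0
    if resolved:
        score += 0.50
    if medic_dispatched:
        score += 0.30
    # Assumption: "within first 10 steps" means step index <= 10.
    if resolved and resolved_step is not None and resolved_step <= 10:
        score += 0.20
    return score


def grade_multi_incident(p1_score: float, resolution_score: float,
                         failure_penalty: float) -> float:
    """score = 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty"""
    return 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty


def grade_mass_casualty(survival_score: float, mean_reward: float,
                        failure_penalty: float) -> float:
    """score = 0.6 * survival_score + 0.3 * mean_reward - failure_penalty"""
    return 0.6 * survival_score + 0.3 * mean_reward - failure_penalty


print(grade_single_incident(True, True, resolved_step=7))
print(grade_multi_incident(p1_score=1.0, resolution_score=0.5, failure_penalty=0.25))
print(grade_mass_casualty(survival_score=0.5, mean_reward=0.6, failure_penalty=0.1))
```

A perfect single_incident run earns the full 1.0 (0.50 + 0.30 + 0.20). The multi_incident example works out to 0.5 + 0.15 - 0.05 = 0.6.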
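
The environment-variable rules from section 14 can be expressed as one small function. This is a hedged sketch, not the project's loader: `resolve_llm_config`, the returned dictionary keys, and the case-insensitive parsing of USE_RANDOM are hypothetical. The variable names and the HF_TOKEN fallback come from the guide.

```python
import os
from typing import Mapping, Optional


def resolve_llm_config(env: Mapping[str, str]) -> Optional[dict]:
    """Return LLM settings, or None when the random baseline is selected.

    Raises RuntimeError if LLM mode is requested but a variable is missing.
    """
    # Assumption: "true" is matched case-insensitively.
    if env.get("USE_RANDOM", "").strip().lower() == "true":
        return None  # no-key mode: no API key needed

    # OPENAI_API_KEY is preferred; HF_TOKEN is the compatibility fallback.
    api_key = env.get("OPENAI_API_KEY") or env.get("HF_TOKEN")
    missing = [name for name in ("API_BASE_URL", "MODEL_NAME") if not env.get(name)]
    if not api_key:
        missing.append("OPENAI_API_KEY (or HF_TOKEN)")
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    return {
        "base_url": env["API_BASE_URL"],
        "model": env["MODEL_NAME"],
        "api_key": api_key,
    }


if __name__ == "__main__":
    try:
        config = resolve_llm_config(os.environ)
    except RuntimeError as exc:
        print(exc)
    else:
        print("random baseline" if config is None else "LLM mode: " + config["model"])
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the logic easy to check with plain dictionaries.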