911 / PROJECT_COMPLETE_GUIDE.md
garvitsachdeva's picture
docs: polish README; remove emoji
984aa3b

911 Dispatch Project - Complete Beginner Guide

1. What this project is (in plain language)

This project is a simulator where an AI agent learns to behave like a city emergency dispatch supervisor.

Think of it like a strategy game:

  • There are emergencies (incidents).
  • There are responders (fire, police, EMS units).
  • The agent must choose what to do each turn (dispatch, reassign, cancel, request mutual aid, etc.).
  • The simulator gives a score for each decision and a final score for the whole run.

The goal is to train and evaluate decision-making quality under pressure.

2. What an RL environment means

RL means Reinforcement Learning.

In RL, four core ideas exist:

  • Agent: the decision-maker (your model or baseline policy).
  • Environment: the world that reacts to actions (this simulator).
  • Reward: a number that says how good/bad the last action outcome was.
  • Episode: one complete run from start to finish.

For this project:

  • Agent picks an action.
  • Environment updates city state.
  • Environment returns:
    • updated observation,
    • reward,
    • done flag (whether run is over).

That loop repeats until the episode ends.

3. Important clarification: "scheme of electricity" vs "city schema"

There is no electricity scheme in this codebase.

What exists is a city schema.

City schema means a configuration blueprint for the simulation:

  • city size (grid),
  • districts,
  • available units,
  • unit speeds,
  • default recommended unit types for each incident type.

The schema is loaded from data files and used to initialize deterministic, repeatable scenarios.

4. Project architecture (high level)

  1. Scenario/task setup
  • A task fixture builds initial units/incidents and metadata.
  1. State machine update engine
  • Validates actions.
  • Applies action effects.
  • Advances time by one tick.
  • Updates incident statuses and unit statuses.
  1. Reward + scoring
  • Computes per-step reward components.
  • Computes episode-level score using task-specific graders.
  1. API server
  • Exposes reset/step/state endpoints.
  1. Dashboard
  • Polls backend state repeatedly and renders units/incidents + reward bars.

5. What is the task?

A task is a scenario type with its own initial conditions, difficulty, and final grading logic.

This project has 4 tasks:

  1. single_incident (easy)
  • One incident, small unit pool.
  • Focus: dispatch the right unit fast.
  1. multi_incident (medium)
  • Multiple incidents at the same time.
  • Focus: triage/prioritization and handling P1 incidents.
  1. mass_casualty (hard)
  • Incident waves with severe emergencies and resource conflicts.
  • Focus: survival outcomes under surge.
  1. shift_surge (hard)
  • New incidents arrive over time and some units go out of service.
  • Focus: long-horizon operations and city coverage under degradation.

6. What is an episode?

An episode is one full run of a task from reset until terminal condition.

Episode starts when reset is called.

  • step_count starts at 0.
  • city_time starts at 0 seconds.
  • units and incidents are loaded from selected task fixture.

Episode ends when any terminal condition is hit:

  • max steps reached,
  • at least one incident escalates,
  • all incidents resolved.

7. What is a step?

A step is one action cycle:

  1. Agent sends one action.
  2. Validator checks if action is legal.
  3. State machine applies action effects.
  4. Time advances by 30 seconds.
  5. Reward is computed.
  6. Observation + reward + done are returned.

Important:

  • step_count increases by 1 per step.
  • city_time increases by 30 seconds per step.

8. At what step are we right now?

Snapshot from the live backend at the time this guide was generated:

  • task_id: multi_incident
  • episode_id: d2cd525e-2596-44cb-bbe3-af33236264a0
  • step_count: 8
  • city_time: 240.0 seconds
  • cumulative_reward: 1.6
  • episode_score: 0.0
  • legal_actions currently available: 36

This is a live value, not a constant. If you reset again, step_count returns to 0.

9. Action space (what actions exist)

Current action types include:

  • DISPATCH
  • CANCEL
  • REASSIGN
  • STAGE
  • MUTUAL_AID
  • UPGRADE
  • DOWNGRADE

Legal actions are generated from current state and filtered by protocol validation, so only valid actions appear in legal_actions.

10. How scoring works (complete detail)

There are two scoring layers:

  1. Step reward (every action)
  2. Episode score (whole run)

10.1 Step reward (RewardCalculator)

Step reward uses a weighted sum of 5 components:

  • response_time: 30%
  • triage: 25%
  • survival: 25%
  • coverage: 12%
  • protocol: 8%

Total formula:

  • total = 0.30 * response_time + 0.25 * triage + 0.25 * survival + 0.12 * coverage + 0.08 * protocol
  • result is clamped to [0, 1]

Safety rule:

  • If any Priority-1 incident existed and survival component is 0, total score is capped at 0.2.

Component details:

  1. response_time
  • Only meaningful for DISPATCH.
  • For non-DISPATCH actions it returns neutral 0.5.
  • For DISPATCH: compares ETA to severity benchmark.
  1. triage
  • Only meaningful for DISPATCH.
  • Checks if dispatched unit type matches required unit types for incident type.
  • Handles enum-qualified metadata keys safely.
  1. survival
  • Based on P1 incidents seen vs resolved without failure.
  • Uses metadata lists: p1_seen, resolved_incidents, failed_incidents.
  1. coverage
  • Measures how many districts still have AVAILABLE coverage.
  1. protocol
  • If action invalid: 0.0.
  • If valid and no phraseology text in Action.notes: neutral 0.5.
  • If Action.notes provided: uses PhraseologyJudge score + readback correctness.

10.2 Episode score (whole run)

Episode score is task-specific via a central grade_episode router.

Why this matters:

  • Different tasks need different definitions of success.
  • Mean step reward alone is often too weak for real evaluation.

Task-specific episode graders:

  1. single_incident
  • +0.50 if incident resolved
  • +0.30 if MEDIC dispatched correctly
  • +0.20 if resolved within first 10 steps
  1. multi_incident
  • Uses P1 resolution, overall resolution ratio, and escalation penalty
  • score = 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty
  1. mass_casualty
  • Emphasizes P1 survival with penalties for failures
  • score = 0.6 * survival_score + 0.3 * mean_reward - failure_penalty
  1. shift_surge (improved)
  • Emphasizes long-horizon operational quality:
    • incident throughput (resolved ratio)
    • P1 survival
    • coverage
    • low backlog
    • mean reward
    • escalation penalty

11. Very important score semantics

In the OpenEnv wrapper:

  • reward return value from step is per-step reward.
  • observation.score is overwritten to episode score.

Also stored in metadata:

  • cumulative_reward: running sum of step rewards.
  • episode_rewards: list of per-step rewards.
  • episode_score: current episode-level grade.

So if you compare values:

  • reward = immediate local quality for this action
  • observation.score = global task progress quality for the run

12. Is the dashboard connected to backend or just static?

It is connected to backend.

How we know:

  • The dashboard JavaScript calls API endpoint http://localhost:8000/dashboard/state.
  • It polls every 500 ms.
  • It renders live units/incidents, step, and reward breakdown from backend response.

Connection behavior:

  • If backend is unreachable, dashboard shows disconnected status.
  • If backend is running and reset was called, dashboard updates live as step changes.

13. Why we used Docker

Docker is used to package the app and dependencies so it runs consistently everywhere.

Benefits:

  • Same runtime on your machine, CI, and deployment platforms.
  • No "works on my machine" package mismatch issues.
  • Easy deployment with a single container image.
  • Port compatibility: server reads PORT environment variable (important for hosted platforms).

In this project:

  • Root Dockerfile runs uvicorn on 0.0.0.0 and PORT (default 8000).
  • That makes it suitable for local run and hosted environments.

14. What API key are we using?

The project expects environment variables. Keys are not hardcoded in repository files.

Required for LLM mode:

  • API_BASE_URL
  • MODEL_NAME
  • OPENAI_API_KEY

Compatibility fallback:

  • HF_TOKEN is accepted if OPENAI_API_KEY is not set.

No-key mode:

  • USE_RANDOM=true bypasses LLM and uses a deterministic random baseline agent.

Practical meaning:

  • If USE_RANDOM=true, you can run without any API key.
  • If USE_RANDOM is not true, OPENAI_API_KEY (or HF_TOKEN fallback) is needed.

15. Backend API endpoints (what each does)

  • GET /health

    • health check
  • GET /tasks

    • list available tasks
  • POST /reset

    • start new episode for selected task
  • POST /step

    • apply one action and move simulation one step
  • GET /state

    • current state
  • GET /dashboard/state

    • extended state for HTML dashboard (includes legal actions + last observation)
  • GET /metadata and GET /schema

    • environment metadata and contracts
  • POST /mcp

    • minimal JSON-RPC endpoint

16. What the dashboard shows vs what it does not show

Shows:

  • Unit cards (status, assignment, ETA, location)
  • Incident cards (type, severity, status, assigned units)
  • Map view for units/incidents
  • Last step reward component bars
  • Header task/episode/step values

Nuance:

  • Header "Score" currently uses metadata.cumulative_reward.
  • Episode score is available too (metadata.episode_score), but not currently shown as the main header score.

17. Beginner glossary

  • incident: emergency case to be handled
  • unit: responder vehicle/team (EMS, fire, police, etc.)
  • legal action: an action that passes protocol checks in current state
  • reward: immediate feedback signal for one step
  • episode score: overall quality of a full run
  • terminal: episode is finished

18. Practical "how to think" summary

When you judge behavior quality in this project:

  • Use step rewards to understand local tactical quality.
  • Use episode score to understand mission success for the selected task.
  • Use dashboard to observe live state transitions.
  • Use task definitions to interpret what success means in each scenario.

If you remember one thing:

  • This is not a generic chatbot app. It is a decision simulator where actions change a world state over time and are graded both step-by-step and across full episodes.