# 911 Dispatch Project - Complete Beginner Guide

## 1. What this project is (in plain language)

This project is a simulator where an AI agent learns to behave like a city emergency dispatch supervisor.

Think of it like a strategy game:
- There are emergencies (incidents).
- There are responders (fire, police, EMS units).
- The agent must choose what to do each turn (dispatch, reassign, cancel, request mutual aid, etc.).
- The simulator gives a score for each decision and a final score for the whole run.

The goal is to train and evaluate decision-making quality under pressure.

## 2. What an RL environment means

RL stands for reinforcement learning.

RL rests on four core ideas:
- Agent: the decision-maker (your model or baseline policy).
- Environment: the world that reacts to actions (this simulator).
- Reward: a number that says how good or bad the outcome of the last action was.
- Episode: one complete run from start to finish.

For this project:
- The agent picks an action.
- The environment updates the city state.
- The environment returns:
  - the updated observation,
  - the reward,
  - the done flag (whether the run is over).

That loop repeats until the episode ends.
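
The loop above can be sketched in a few lines of Python. This is only an illustration of the agent/environment cycle: `ToyEnv`, its `reset`/`step` signatures, and the reward values are hypothetical stand-ins, not the project's actual classes.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical stand-in for the simulator: the episode ends after max_steps."""
    max_steps: int = 5
    step_count: int = 0

    def reset(self) -> dict:
        self.step_count = 0
        return {"step_count": self.step_count}

    def step(self, action: str) -> tuple[dict, float, bool]:
        self.step_count += 1
        # Toy reward: DISPATCH scores 1.0, anything else a neutral 0.5.
        reward = 1.0 if action == "DISPATCH" else 0.5
        done = self.step_count >= self.max_steps
        return {"step_count": self.step_count}, reward, done


def run_episode(env: ToyEnv, rng: random.Random) -> list[float]:
    """Run one episode with a random policy and return the per-step rewards."""
    rewards = []
    env.reset()
    done = False
    while not done:
        action = rng.choice(["DISPATCH", "STAGE"])     # agent picks an action
        _observation, reward, done = env.step(action)  # environment reacts
        rewards.append(reward)
    return rewards


rewards = run_episode(ToyEnv(max_steps=5), random.Random(0))
print(len(rewards), sum(rewards))
```

A real agent would replace the random choice with a policy that reads the observation before acting.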

## 3. Important clarification: "scheme of electricity" vs "city schema"

There is no electricity scheme in this codebase.

What exists is a city schema.

City schema means a configuration blueprint for the simulation:
- city size (grid),
- districts,
- available units,
- unit speeds,
- default recommended unit types for each incident type.

The schema is loaded from data files and used to initialize deterministic, repeatable scenarios.
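
To make the idea of a schema concrete, here is a minimal sketch of parsing one. Every field name, the JSON layout, and the `ENGINE` unit type are assumptions made for illustration; the project's real data files may be organized differently.

```python
import json
from dataclasses import dataclass

# Hypothetical schema file contents (not copied from the project).
RAW_SCHEMA = """
{
  "grid": {"width": 20, "height": 20},
  "districts": ["north", "south"],
  "unit_speeds": {"MEDIC": 1.5, "ENGINE": 1.0},
  "recommended_units": {
    "cardiac_arrest": ["MEDIC"],
    "structure_fire": ["ENGINE", "MEDIC"]
  }
}
"""


@dataclass(frozen=True)
class CitySchema:
    grid_width: int
    grid_height: int
    districts: tuple
    unit_speeds: dict        # unit type -> speed
    recommended_units: dict  # incident type -> tuple of unit types


def parse_schema(text: str) -> CitySchema:
    """Parse the JSON text and check that it is internally consistent."""
    raw = json.loads(text)
    schema = CitySchema(
        grid_width=raw["grid"]["width"],
        grid_height=raw["grid"]["height"],
        districts=tuple(raw["districts"]),
        unit_speeds=dict(raw["unit_speeds"]),
        recommended_units={k: tuple(v) for k, v in raw["recommended_units"].items()},
    )
    # Sanity check: every recommended unit type must have a known speed.
    for incident_type, unit_types in schema.recommended_units.items():
        for unit_type in unit_types:
            if unit_type not in schema.unit_speeds:
                raise ValueError(f"{incident_type}: unknown unit type {unit_type}")
    return schema


schema = parse_schema(RAW_SCHEMA)
print(schema.recommended_units["structure_fire"])
```

Because the schema is plain data, parsing the same file twice yields equal objects, which is what makes scenarios repeatable.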

## 4. Project architecture (high level)

1. Scenario/task setup
- A task fixture builds initial units/incidents and metadata.

2. State machine update engine
- Validates actions.
- Applies action effects.
- Advances time by one tick.
- Updates incident statuses and unit statuses.

3. Reward + scoring
- Computes per-step reward components.
- Computes episode-level score using task-specific graders.

4. API server
- Exposes reset/step/state endpoints.

5. Dashboard
- Polls backend state repeatedly and renders units/incidents + reward bars.

## 5. What is the task?

A task is a scenario type with its own initial conditions, difficulty, and final grading logic.

This project has 4 tasks:

1. single_incident (easy)
- One incident, small unit pool.
- Focus: dispatch the right unit fast.

2. multi_incident (medium)
- Multiple incidents at the same time.
- Focus: triage/prioritization and handling P1 (Priority-1) incidents.

3. mass_casualty (hard)
- Incident waves with severe emergencies and resource conflicts.
- Focus: survival outcomes under surge.

4. shift_surge (hard)
- New incidents arrive over time and some units go out of service.
- Focus: long-horizon operations and city coverage under degradation.

## 6. What is an episode?

An episode is one full run of a task, from reset until a terminal condition is reached.

An episode starts when reset is called:
- step_count starts at 0.
- city_time starts at 0 seconds.
- units and incidents are loaded from the selected task fixture.

Episode ends when any terminal condition is hit:
- max steps reached,
- at least one incident escalates,
- all incidents resolved.
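
The three terminal conditions can be expressed as a single check. This is a sketch under assumptions: the function name and the status strings `"ESCALATED"` and `"RESOLVED"` are hypothetical, and the real state machine may represent statuses differently.

```python
def is_terminal(step_count: int, max_steps: int, statuses: list[str]) -> bool:
    """Return True when any of the three terminal conditions holds."""
    if step_count >= max_steps:   # max steps reached
        return True
    if "ESCALATED" in statuses:   # at least one incident escalated
        return True
    # All incidents resolved. An empty incident list is treated as "not done"
    # here, which is a design assumption of this sketch.
    return bool(statuses) and all(status == "RESOLVED" for status in statuses)


print(is_terminal(3, 20, ["PENDING", "RESOLVED"]))
print(is_terminal(3, 20, ["RESOLVED", "RESOLVED"]))
```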

## 7. What is a step?

A step is one action cycle:

1. Agent sends one action.
2. Validator checks if action is legal.
3. State machine applies action effects.
4. Time advances by 30 seconds.
5. Reward is computed.
6. Observation + reward + done are returned.

Important:
- step_count increases by 1 per step.
- city_time increases by 30 seconds per step.
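
Since both counters start at 0 and advance together, city_time is always step_count times 30 seconds. A tiny sketch (the constant and function names are illustrative):

```python
TICK_SECONDS = 30.0  # each step advances simulated time by 30 seconds


def city_time_after(step_count: int) -> float:
    """Simulated seconds elapsed after the given number of steps."""
    return step_count * TICK_SECONDS


print(city_time_after(8))  # 240.0
```

This matches the snapshot in Section 8, where step_count 8 corresponds to a city_time of 240.0 seconds.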

## 8. At what step are we right now?

Snapshot from the live backend at the time this guide was generated:

- task_id: multi_incident
- episode_id: d2cd525e-2596-44cb-bbe3-af33236264a0
- step_count: 8
- city_time: 240.0 seconds
- cumulative_reward: 1.6
- episode_score: 0.0
- legal_actions currently available: 36

These are live values, not constants. If you reset again, step_count returns to 0.

## 9. Action space (what actions exist)

Current action types include:
- DISPATCH
- CANCEL
- REASSIGN
- STAGE
- MUTUAL_AID
- UPGRADE
- DOWNGRADE

Legal actions are generated from current state and filtered by protocol validation, so only valid actions appear in legal_actions.
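
The "generate, then filter" idea can be sketched as follows. The enum members come from the list above; the `Action` fields, the unit IDs, and the toy validator are hypothetical and much simpler than real protocol validation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class ActionType(Enum):
    DISPATCH = "DISPATCH"
    CANCEL = "CANCEL"
    REASSIGN = "REASSIGN"
    STAGE = "STAGE"
    MUTUAL_AID = "MUTUAL_AID"
    UPGRADE = "UPGRADE"
    DOWNGRADE = "DOWNGRADE"


@dataclass(frozen=True)
class Action:
    """Hypothetical action shape; the project's real Action model may differ."""
    action_type: ActionType
    unit_id: str
    incident_id: str


def filter_legal(candidates: Iterable[Action],
                 is_valid: Callable[[Action], bool]) -> list[Action]:
    """Keep only the candidate actions that pass the validator."""
    return [action for action in candidates if is_valid(action)]


# Toy validator: a unit can only be dispatched if it is currently available.
available_units = {"medic-1"}


def toy_validator(action: Action) -> bool:
    if action.action_type is ActionType.DISPATCH:
        return action.unit_id in available_units
    return True


candidates = [
    Action(ActionType.DISPATCH, "medic-1", "inc-1"),
    Action(ActionType.DISPATCH, "medic-2", "inc-1"),  # medic-2 is not available
]
legal = filter_legal(candidates, toy_validator)
print([action.unit_id for action in legal])
```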

## 10. How scoring works (complete detail)

There are two scoring layers:

1. Step reward (every action)
2. Episode score (whole run)

### 10.1 Step reward (RewardCalculator)

Step reward uses a weighted sum of 5 components:
- response_time: 30%
- triage: 25%
- survival: 25%
- coverage: 12%
- protocol: 8%

Total formula:
- total = 0.30 * response_time + 0.25 * triage + 0.25 * survival + 0.12 * coverage + 0.08 * protocol
- result is clamped to [0, 1]

Safety rule:
- If any Priority-1 (P1) incident existed and the survival component is 0, the step reward total is capped at 0.2.
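
The weighted sum, the clamp, and the safety rule fit in a short function. The weights are the ones listed above; the function name, the dictionary-based signature, and the order of clamping before capping are assumptions of this sketch rather than a copy of RewardCalculator.

```python
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}


def step_reward(components: dict[str, float], p1_seen: bool) -> float:
    """Combine the five component scores (each in [0, 1]) into one reward."""
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    total = min(1.0, max(0.0, total))  # clamp to [0, 1]
    # Safety rule: a P1 incident existed but survival is 0, so cap the total.
    if p1_seen and components["survival"] == 0.0:
        total = min(total, 0.2)
    return total


good_but_fatal = {
    "response_time": 1.0,
    "triage": 1.0,
    "survival": 0.0,
    "coverage": 1.0,
    "protocol": 1.0,
}
print(step_reward(good_but_fatal, p1_seen=True))   # capped at 0.2
print(step_reward(good_but_fatal, p1_seen=False))  # uncapped weighted sum
```

The example shows why the cap exists: without it, an action that scores well on everything except P1 survival would still earn 0.75.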

Component details:

1. response_time
- Only meaningful for DISPATCH.
- For non-DISPATCH actions it returns neutral 0.5.
- For DISPATCH: compares ETA to severity benchmark.

2. triage
- Only meaningful for DISPATCH.
- Checks whether the dispatched unit type matches the required unit types for the incident type.
- Handles enum-qualified metadata keys safely.

3. survival
- Based on P1 incidents seen vs resolved without failure.
- Uses metadata lists: p1_seen, resolved_incidents, failed_incidents.

4. coverage
- Measures how many districts still have AVAILABLE coverage.

5. protocol
- If action invalid: 0.0.
- If valid and no phraseology text in Action.notes: neutral 0.5.
- If Action.notes provided: uses PhraseologyJudge score + readback correctness.

### 10.2 Episode score (whole run)

The episode score is task-specific: a central grade_episode router selects the grader for the current task.

Why this matters:
- Different tasks need different definitions of success.
- Mean step reward alone is often too weak for real evaluation.

Task-specific episode graders:

1. single_incident
- +0.50 if incident resolved
- +0.30 if MEDIC dispatched correctly
- +0.20 if resolved within first 10 steps

2. multi_incident
- Uses P1 resolution, overall resolution ratio, and escalation penalty
- score = 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty

3. mass_casualty
- Emphasizes P1 survival with penalties for failures
- score = 0.6 * survival_score + 0.3 * mean_reward - failure_penalty

4. shift_surge (improved)
- Emphasizes long-horizon operational quality:
  - incident throughput (resolved ratio)
  - P1 survival
  - coverage
  - low backlog
  - mean reward
  - escalation penalty
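
The first three graders can be written directly from the formulas above. This is a sketch, not the project's grader code: the function names and parameters are hypothetical, "within first 10 steps" is read as step 10 or earlier, and no clamping is applied because the guide does not state whether the real graders clamp. The shift_surge grader is omitted because its weights are not given.

```python
from typing import Optional


def grade_single_incident(resolved: bool, medic_dispatched: bool,
                          resolved_at_step: Optional[int]) -> float:
    """+0.50 resolved, +0.30 correct MEDIC dispatch, +0.20 resolved by step 10."""
    score = 0.0
    if resolved:
        score += 0.50
    if medic_dispatched:
        score += 0.30
    if resolved and resolved_at_step is not None and resolved_at_step <= 10:
        score += 0.20
    return score


def grade_multi_incident(p1_score: float, resolution_score: float,
                         failure_penalty: float) -> float:
    return 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty


def grade_mass_casualty(survival_score: float, mean_reward: float,
                        failure_penalty: float) -> float:
    return 0.6 * survival_score + 0.3 * mean_reward - failure_penalty


print(grade_single_incident(True, True, resolved_at_step=4))
print(grade_multi_incident(p1_score=1.0, resolution_score=0.5, failure_penalty=0.0))
print(grade_mass_casualty(survival_score=0.5, mean_reward=0.5, failure_penalty=0.1))
```

Note that the multi_incident formula tops out at 0.8 when read literally, since its positive weights sum to 0.8.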

## 11. Very important score semantics

In the OpenEnv wrapper:
- The reward value returned by step is the per-step reward.
- observation.score is overwritten with the episode score.

Also stored in metadata:
- cumulative_reward: running sum of step rewards.
- episode_rewards: list of per-step rewards.
- episode_score: current episode-level grade.

So if you compare values:
- reward = immediate local quality for this action
- observation.score = global task progress quality for the run
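
How the three metadata fields relate can be shown with a small bookkeeping sketch. The key names come from the list above; the `record_step` helper is hypothetical, and the episode score is passed in because it comes from the task grader, not from the step rewards.

```python
def record_step(metadata: dict, reward: float, episode_score: float) -> dict:
    """Update the running totals after one step."""
    metadata["episode_rewards"].append(reward)
    metadata["cumulative_reward"] = sum(metadata["episode_rewards"])
    metadata["episode_score"] = episode_score
    return metadata


metadata = {"episode_rewards": [], "cumulative_reward": 0.0, "episode_score": 0.0}
record_step(metadata, reward=0.5, episode_score=0.0)
record_step(metadata, reward=0.25, episode_score=0.0)
print(metadata["cumulative_reward"])  # 0.75
```

A growing cumulative_reward next to an episode_score of 0.0 is therefore not a contradiction: the two numbers answer different questions.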

## 12. Is the dashboard connected to backend or just static?

It is connected to the backend.

How we know:
- The dashboard JavaScript calls the API endpoint http://localhost:8000/dashboard/state.
- It polls every 500 ms.
- It renders live units/incidents, step, and reward breakdown from backend response.

Connection behavior:
- If the backend is unreachable, the dashboard shows a disconnected status.
- If the backend is running and reset was called, the dashboard updates live as the step changes.
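
The dashboard itself is JavaScript, but the same polling behavior can be sketched in Python for readers who want to watch the backend from a script. The URL and the 500 ms interval come from this section; the function names and the "connected"/"disconnected" strings are assumptions of the sketch.

```python
import json
import time
import urllib.request

DASHBOARD_URL = "http://localhost:8000/dashboard/state"  # endpoint named above


def fetch_state(url: str = DASHBOARD_URL, timeout: float = 1.0) -> dict:
    """Fetch and decode the dashboard state (raises OSError if unreachable)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def poll(fetch, polls: int = 3, interval_s: float = 0.5, sleep=time.sleep) -> list[str]:
    """Call fetch repeatedly and record the connection status of each attempt."""
    statuses = []
    for _ in range(polls):
        try:
            fetch()
            statuses.append("connected")
        except (OSError, ValueError):  # unreachable backend or invalid JSON
            statuses.append("disconnected")
        sleep(interval_s)
    return statuses
```

With the backend running, `poll(fetch_state)` should report "connected" for each attempt; with it stopped, every attempt reports "disconnected", mirroring what the dashboard header shows.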

## 13. Why we used Docker

Docker is used to package the app and dependencies so it runs consistently everywhere.

Benefits:
- Same runtime on your machine, CI, and deployment platforms.
- No "works on my machine" package mismatch issues.
- Easy deployment with a single container image.
- Port compatibility: server reads PORT environment variable (important for hosted platforms).

In this project:
- Root Dockerfile runs uvicorn on 0.0.0.0 and PORT (default 8000).
- That makes it suitable for local run and hosted environments.
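
The PORT behavior can be illustrated in Python. This is a sketch of the idea only: the Dockerfile may well pass the port on the uvicorn command line instead, and `resolve_bind` is a hypothetical helper.

```python
import os
from typing import Mapping


def resolve_bind(env: Mapping[str, str] = os.environ) -> tuple[str, int]:
    """Return the host and port to listen on, defaulting to port 8000."""
    host = "0.0.0.0"  # listen on all interfaces so the container is reachable
    port = int(env.get("PORT", "8000"))
    return host, port


print(resolve_bind({}))                # local run, no PORT set
print(resolve_bind({"PORT": "7860"}))  # hosted platform that injects PORT
```

A Python entry point could then hand the pair to `uvicorn.run(app, host=host, port=port)`.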

## 14. What API key are we using?

The project expects environment variables. Keys are not hardcoded in repository files.

Required for LLM mode:
- API_BASE_URL
- MODEL_NAME
- OPENAI_API_KEY

Compatibility fallback:
- HF_TOKEN is accepted if OPENAI_API_KEY is not set.

No-key mode:
- USE_RANDOM=true bypasses LLM and uses a deterministic random baseline agent.

Practical meaning:
- If USE_RANDOM=true, you can run without any API key.
- If USE_RANDOM is not true, OPENAI_API_KEY (or HF_TOKEN fallback) is needed.
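
The rules above amount to a small decision function. The variable names are the ones listed in this section; the helper itself, the case-insensitive comparison with "true", and the error message are assumptions of this sketch.

```python
from typing import Mapping, Optional


def resolve_llm_config(env: Mapping[str, str]) -> Optional[dict]:
    """Return LLM settings, or None when the random baseline should be used."""
    if env.get("USE_RANDOM", "").strip().lower() == "true":
        return None  # no-key mode: no API key required

    # HF_TOKEN is only used when OPENAI_API_KEY is missing or empty.
    api_key = env.get("OPENAI_API_KEY") or env.get("HF_TOKEN")
    settings = {
        "API_BASE_URL": env.get("API_BASE_URL"),
        "MODEL_NAME": env.get("MODEL_NAME"),
        "OPENAI_API_KEY": api_key,
    }
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return settings


print(resolve_llm_config({"USE_RANDOM": "true"}))
```

In a real script the argument would be `os.environ`; passing a plain dictionary keeps the sketch easy to test.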

## 15. Backend API endpoints (what each does)

- GET /health
  - health check

- GET /tasks
  - list available tasks

- POST /reset
  - start new episode for selected task

- POST /step
  - apply one action and move simulation one step

- GET /state
  - current state

- GET /dashboard/state
  - extended state for HTML dashboard (includes legal actions + last observation)

- GET /metadata and GET /schema
  - environment metadata and contracts

- POST /mcp
  - minimal JSON-RPC endpoint
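
A script can talk to these endpoints with the standard library alone. The paths and HTTP methods come from the list above; the base URL assumes the default port, and the request body for /reset (a `task_id` field) is a guess, so check GET /schema for the real contract before relying on it.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumes the default port


def build_request(method: str, path: str,
                  payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a request, encoding the payload as JSON when one is given."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def send(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (needs a running backend)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


health = build_request("GET", "/health")
# The body shape below is an assumption, not the documented contract.
reset = build_request("POST", "/reset", {"task_id": "multi_incident"})
print(health.get_method(), health.full_url)
print(reset.get_method(), reset.full_url)
```

With the server running, `send(health)` performs the actual call; building and sending are kept separate so the request can be inspected first.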

## 16. What the dashboard shows vs what it does not show

Shows:
- Unit cards (status, assignment, ETA, location)
- Incident cards (type, severity, status, assigned units)
- Map view for units/incidents
- Last step reward component bars
- Header task/episode/step values

Nuance:
- Header "Score" currently uses metadata.cumulative_reward.
- Episode score is available too (metadata.episode_score), but not currently shown as the main header score.

## 17. Beginner glossary

- incident: emergency case to be handled
- unit: responder vehicle/team (EMS, fire, police, etc.)
- legal action: an action that passes protocol checks in current state
- reward: immediate feedback signal for one step
- episode score: overall quality of a full run
- terminal: episode is finished

## 18. Practical "how to think" summary

When you judge behavior quality in this project:
- Use step rewards to understand local tactical quality.
- Use episode score to understand mission success for the selected task.
- Use dashboard to observe live state transitions.
- Use task definitions to interpret what success means in each scenario.

If you remember one thing:
- This is not a generic chatbot app. It is a decision simulator where actions change a world state over time and are graded both step-by-step and across full episodes.