911 / PROJECT_COMPLETE_GUIDE.md

garvitsachdeva

docs: polish README; remove emoji

984aa3b about 1 month ago


	# 911 Dispatch Project - Complete Beginner Guide

	## 1. What this project is (in plain language)

	This project is a simulator where an AI agent learns to behave like a city emergency dispatch supervisor.

	Think of it like a strategy game:
	- There are emergencies (incidents).
	- There are responders (fire, police, EMS units).
	- The agent must choose what to do each turn (dispatch, reassign, cancel, request mutual aid, etc.).
	- The simulator gives a score for each decision and a final score for the whole run.

	The goal is to train and evaluate decision-making quality under pressure.

	## 2. What an RL environment means

	RL means Reinforcement Learning.

	In RL, four core ideas exist:
	- Agent: the decision-maker (your model or baseline policy).
	- Environment: the world that reacts to actions (this simulator).
	- Reward: a number that says how good/bad the last action outcome was.
	- Episode: one complete run from start to finish.

	For this project:
	- Agent picks an action.
	- Environment updates city state.
	- Environment returns an updated observation, a reward, and a done flag (whether the run is over).

	That loop repeats until the episode ends.
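The loop above can be sketched in a few lines of Python. This is only an illustration: `ToyEnv`, `StepResult`, and `run_episode` are hypothetical names invented for this example, and the toy reward rule has nothing to do with the project's real scoring.

```python
import random
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: dict  # what the agent sees next
    reward: float      # feedback for the last action
    done: bool         # True once the episode is over


class ToyEnv:
    """Hypothetical stand-in for the simulator, not the project's real class."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self.step_count = 0

    def reset(self) -> dict:
        self.step_count = 0
        return {"step_count": 0}

    def step(self, action: str) -> StepResult:
        self.step_count += 1
        reward = 1.0 if action == "DISPATCH" else 0.5  # toy rule, illustration only
        done = self.step_count >= self.max_steps
        return StepResult({"step_count": self.step_count}, reward, done)


def run_episode(env: ToyEnv, rng: random.Random) -> float:
    """Run one episode with a random policy and return the summed reward."""
    env.reset()
    total = 0.0
    done = False
    while not done:
        action = rng.choice(["DISPATCH", "STAGE"])  # agent picks an action
        result = env.step(action)                   # environment reacts
        total += result.reward
        done = result.done
    return total


demo_env = ToyEnv(max_steps=5)
total_reward = run_episode(demo_env, random.Random(0))
print(total_reward)
```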

	## 3. Important clarification: "scheme of electricity" vs "city schema"

	There is no electricity scheme in this codebase.

	What exists is a city schema.

	City schema means a configuration blueprint for the simulation:
	- city size (grid),
	- districts,
	- available units,
	- unit speeds,
	- default recommended unit types for each incident type.

	The schema is loaded from data files and used to initialize deterministic, repeatable scenarios.
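To make the idea of a schema concrete, here is a minimal sketch of what such a blueprint could look like once loaded. Everything in it is an assumption made for illustration: the field names, the JSON layout, the unit IDs, and the incident type names are hypothetical and are not taken from the project's data files.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CitySchema:
    grid_size: tuple[int, int]                 # city size as (width, height)
    districts: tuple[str, ...]                 # district names
    units: dict[str, str]                      # unit id -> unit type
    unit_speeds: dict[str, float]              # unit type -> speed
    default_units: dict[str, tuple[str, ...]]  # incident type -> recommended unit types


def load_schema(text: str) -> CitySchema:
    """Parse a JSON document into an immutable schema object."""
    raw = json.loads(text)
    return CitySchema(
        grid_size=(int(raw["grid_size"][0]), int(raw["grid_size"][1])),
        districts=tuple(raw["districts"]),
        units=dict(raw["units"]),
        unit_speeds={k: float(v) for k, v in raw["unit_speeds"].items()},
        default_units={k: tuple(v) for k, v in raw["default_units"].items()},
    )


# Hypothetical example data, invented for this sketch.
EXAMPLE = """
{
  "grid_size": [20, 20],
  "districts": ["north", "south"],
  "units": {"MEDIC-1": "EMS", "ENGINE-1": "FIRE"},
  "unit_speeds": {"EMS": 1.5, "FIRE": 1.0},
  "default_units": {"cardiac_arrest": ["EMS"], "structure_fire": ["FIRE", "EMS"]}
}
"""

schema = load_schema(EXAMPLE)
print(schema.default_units["structure_fire"])
```

Loading the same text twice yields equal objects, which is the property that makes scenarios repeatable.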

	## 4. Project architecture (high level)

	1. Scenario/task setup
	- A task fixture builds initial units/incidents and metadata.

	2. State machine update engine
	- Validates actions.
	- Applies action effects.
	- Advances time by one tick.
	- Updates incident statuses and unit statuses.

	3. Reward + scoring
	- Computes per-step reward components.
	- Computes episode-level score using task-specific graders.

	4. API server
	- Exposes reset/step/state endpoints.

	5. Dashboard
	- Polls backend state repeatedly and renders units/incidents + reward bars.

	## 5. What is the task?

	A task is a scenario type with its own initial conditions, difficulty, and final grading logic.

	This project has 4 tasks:

	1. single_incident (easy)
	- One incident, small unit pool.
	- Focus: dispatch the right unit fast.

	2. multi_incident (medium)
	- Multiple incidents at the same time.
	- Focus: triage/prioritization and handling Priority-1 (P1) incidents.

	3. mass_casualty (hard)
	- Incident waves with severe emergencies and resource conflicts.
	- Focus: survival outcomes under surge.

	4. shift_surge (hard)
	- New incidents arrive over time and some units go out of service.
	- Focus: long-horizon operations and city coverage under degradation.

	## 6. What is an episode?

	An episode is one full run of a task from reset until a terminal condition.

	An episode starts when reset is called:
	- step_count starts at 0.
	- city_time starts at 0 seconds.
	- units and incidents are loaded from the selected task fixture.

	Episode ends when any terminal condition is hit:
	- max steps reached,
	- at least one incident escalates,
	- all incidents resolved.
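The three terminal conditions can be expressed as a single check. The sketch below is an assumption-laden illustration: the function name `is_terminal` and the status strings `"ESCALATED"` and `"RESOLVED"` are hypothetical, and treating an empty incident list as non-terminal is a choice made for this example only.

```python
def is_terminal(step_count: int, max_steps: int, incident_statuses: list[str]) -> bool:
    """Return True if any of the three terminal conditions holds."""
    if step_count >= max_steps:
        return True  # max steps reached
    if any(status == "ESCALATED" for status in incident_statuses):
        return True  # at least one incident escalated
    # All incidents resolved; an empty list is treated as "not finished" here.
    return bool(incident_statuses) and all(
        status == "RESOLVED" for status in incident_statuses
    )


print(is_terminal(3, 10, ["RESOLVED", "ACTIVE"]))
```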

	## 7. What is a step?

	A step is one action cycle:

	1. Agent sends one action.
	2. Validator checks if action is legal.
	3. State machine applies action effects.
	4. Time advances by 30 seconds.
	5. Reward is computed.
	6. Observation + reward + done are returned.

	Important:
	- step_count increases by 1 per step.
	- city_time increases by 30 seconds per step.
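Because both counters advance together, city_time is always step_count times 30 seconds. A tiny sketch of that bookkeeping follows; the `Clock` class and `TICK_SECONDS` constant are hypothetical names used only for this example.

```python
from dataclasses import dataclass

TICK_SECONDS = 30.0  # simulated seconds per step, as stated in this guide


@dataclass
class Clock:
    step_count: int = 0
    city_time: float = 0.0

    def tick(self) -> None:
        """Advance the simulation clock by one step."""
        self.step_count += 1
        self.city_time += TICK_SECONDS


clock = Clock()
for _ in range(8):
    clock.tick()
print(clock.step_count, clock.city_time)  # prints: 8 240.0
```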

	## 8. At what step are we right now?

	Snapshot from the live backend at the time this guide was generated:

	- task_id: multi_incident
	- episode_id: d2cd525e-2596-44cb-bbe3-af33236264a0
	- step_count: 8
	- city_time: 240.0 seconds
	- cumulative_reward: 1.6
	- episode_score: 0.0
	- legal_actions currently available: 36

	These are live values, not constants. If you reset again, step_count returns to 0.

	## 9. Action space (what actions exist)

	Current action types include:
	- DISPATCH
	- CANCEL
	- REASSIGN
	- STAGE
	- MUTUAL_AID
	- UPGRADE
	- DOWNGRADE

	Legal actions are generated from current state and filtered by protocol validation, so only valid actions appear in legal_actions.
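One way to picture this is an enum of action types plus a filter over candidate actions. In the sketch below the seven enum members come from the list above, but the `Action` fields, the `is_valid` rule (a DISPATCH needs an AVAILABLE unit), and the function names are assumptions made for illustration, not the project's actual validator.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(str, Enum):
    DISPATCH = "DISPATCH"
    CANCEL = "CANCEL"
    REASSIGN = "REASSIGN"
    STAGE = "STAGE"
    MUTUAL_AID = "MUTUAL_AID"
    UPGRADE = "UPGRADE"
    DOWNGRADE = "DOWNGRADE"


@dataclass(frozen=True)
class Action:
    action_type: ActionType
    unit_id: Optional[str] = None       # hypothetical field
    incident_id: Optional[str] = None   # hypothetical field


def is_valid(action: Action, unit_status: dict[str, str]) -> bool:
    """Toy protocol rule: a DISPATCH needs an AVAILABLE unit; other types pass."""
    if action.action_type is ActionType.DISPATCH:
        return unit_status.get(action.unit_id) == "AVAILABLE"
    return True


def legal_actions(candidates: list[Action], unit_status: dict[str, str]) -> list[Action]:
    """Keep only the candidates that pass validation in the current state."""
    return [action for action in candidates if is_valid(action, unit_status)]


status = {"MEDIC-1": "AVAILABLE", "MEDIC-2": "EN_ROUTE"}
candidates = [
    Action(ActionType.DISPATCH, "MEDIC-1", "INC-1"),
    Action(ActionType.DISPATCH, "MEDIC-2", "INC-1"),
    Action(ActionType.MUTUAL_AID),
]
legal = legal_actions(candidates, status)
print([(a.action_type.value, a.unit_id) for a in legal])
```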

	## 10. How scoring works (complete detail)

	There are two scoring layers:

	1. Step reward (every action)
	2. Episode score (whole run)

	### 10.1 Step reward (RewardCalculator)

	Step reward uses a weighted sum of 5 components:
	- response_time: 30%
	- triage: 25%
	- survival: 25%
	- coverage: 12%
	- protocol: 8%

	Total formula:
	- total = 0.30 * response_time + 0.25 * triage + 0.25 * survival + 0.12 * coverage + 0.08 * protocol
	- result is clamped to [0, 1]

	Safety rule:
	- If any Priority-1 incident has been seen and the survival component is 0, the step reward total is capped at 0.2.
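The formula, the clamp, and the safety rule can be written out as follows. The weights and the 0.2 cap are the ones listed above; the function name `step_reward`, the `p1_seen` flag, and the order in which the clamp and cap are applied are assumptions of this sketch rather than a copy of RewardCalculator.

```python
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}


def step_reward(components: dict[str, float], p1_seen: bool) -> float:
    """Weighted sum of the five components, clamped to [0, 1], with the P1 cap."""
    total = sum(weight * components[name] for name, weight in WEIGHTS.items())
    total = min(1.0, max(0.0, total))
    if p1_seen and components["survival"] == 0.0:
        total = min(total, 0.2)  # safety rule: cap when P1 survival is zero
    return total


neutral = {name: 0.5 for name in WEIGHTS}
no_survival = {**{name: 1.0 for name in WEIGHTS}, "survival": 0.0}

print(step_reward(neutral, p1_seen=False))
print(step_reward(no_survival, p1_seen=True))
```

In the second call the uncapped weighted sum would be 0.75, but the safety rule limits it to 0.2.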

	Component details:

	1. response_time
	- Only meaningful for DISPATCH.
	- For non-DISPATCH actions it returns neutral 0.5.
	- For DISPATCH: compares ETA to severity benchmark.

	2. triage
	- Only meaningful for DISPATCH.
	- Checks if dispatched unit type matches required unit types for incident type.
	- Handles enum-qualified metadata keys safely.

	3. survival
	- Based on P1 incidents seen vs resolved without failure.
	- Uses metadata lists: p1_seen, resolved_incidents, failed_incidents.

	4. coverage
	- Measures how many districts still have AVAILABLE coverage.

	5. protocol
	- If action invalid: 0.0.
	- If valid and no phraseology text in Action.notes: neutral 0.5.
	- If Action.notes provided: uses PhraseologyJudge score + readback correctness.
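As one example of a component, coverage could be computed as the fraction of districts that still have at least one AVAILABLE unit. This normalization is an assumption: the guide only says the component measures how many districts have AVAILABLE coverage, and the function name and the input layout below are hypothetical.

```python
def coverage_score(districts: list[str], units: list[tuple[str, str]]) -> float:
    """Fraction of districts with at least one AVAILABLE unit.

    `units` is a list of (district, status) pairs, a layout assumed for this sketch.
    """
    if not districts:
        return 0.0
    covered = {district for district, status in units if status == "AVAILABLE"}
    return len(covered & set(districts)) / len(districts)


districts = ["north", "south", "east", "west"]
units = [
    ("north", "AVAILABLE"),
    ("south", "EN_ROUTE"),
    ("east", "AVAILABLE"),
    ("east", "AVAILABLE"),
]
print(coverage_score(districts, units))  # prints: 0.5
```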

	### 10.2 Episode score (whole run)

	Episode score is task-specific via a central grade_episode router.

	Why this matters:
	- Different tasks need different definitions of success.
	- Mean step reward alone is often too weak for real evaluation.

	Task-specific episode graders:

	1. single_incident
	- +0.50 if incident resolved
	- +0.30 if MEDIC dispatched correctly
	- +0.20 if resolved within first 10 steps

	2. multi_incident
	- Uses P1 resolution, overall resolution ratio, and escalation penalty
	- score = 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty

	3. mass_casualty
	- Emphasizes P1 survival with penalties for failures
	- score = 0.6 * survival_score + 0.3 * mean_reward - failure_penalty

	4. shift_surge (improved)
	- Emphasizes long-horizon operational quality: incident throughput (resolved ratio), P1 survival, coverage, low backlog, and mean reward, with an escalation penalty.
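The three graders with explicit formulas can be sketched as plain functions behind a small router. The coefficients are the ones listed above. The function signatures, the reading of "within first 10 steps" as `steps_to_resolve <= 10`, and the absence of any final clamping are assumptions. shift_surge is left out because its weights are not given here.

```python
from typing import Optional


def grade_single_incident(
    resolved: bool, medic_dispatched: bool, steps_to_resolve: Optional[int]
) -> float:
    score = 0.0
    if resolved:
        score += 0.50
    if medic_dispatched:
        score += 0.30
    if resolved and steps_to_resolve is not None and steps_to_resolve <= 10:
        score += 0.20  # assumption: "within first 10 steps" means <= 10
    return score


def grade_multi_incident(
    p1_score: float, resolution_score: float, failure_penalty: float
) -> float:
    return 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty


def grade_mass_casualty(
    survival_score: float, mean_reward: float, failure_penalty: float
) -> float:
    return 0.6 * survival_score + 0.3 * mean_reward - failure_penalty


# Hypothetical router keyed by task id; the real grade_episode may differ.
GRADERS = {
    "single_incident": grade_single_incident,
    "multi_incident": grade_multi_incident,
    "mass_casualty": grade_mass_casualty,
}


def grade_episode(task_id: str, **inputs) -> float:
    if task_id not in GRADERS:
        raise ValueError(f"no grader sketched for task {task_id!r}")
    return GRADERS[task_id](**inputs)


print(grade_episode("single_incident", resolved=True, medic_dispatched=True, steps_to_resolve=4))
print(grade_episode("multi_incident", p1_score=1.0, resolution_score=1.0, failure_penalty=0.0))
```

Note that with these coefficients a perfect multi_incident run tops out at 0.8 and a perfect mass_casualty run at 0.9, so scores are not directly comparable across tasks.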

	## 11. Very important score semantics

	In the OpenEnv wrapper:
	- reward return value from step is per-step reward.
	- observation.score is overwritten to episode score.

	Also stored in metadata:
	- cumulative_reward: running sum of step rewards.
	- episode_rewards: list of per-step rewards.
	- episode_score: current episode-level grade.

	So if you compare values:
	- reward = immediate local quality for this action
	- observation.score = global task progress quality for the run
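A small bookkeeping sketch may help keep the values apart. The attribute names mirror the metadata keys above, but the `EpisodeMetadata` class, its `record` method, and the reward values are hypothetical and only illustrate how the numbers relate.

```python
class EpisodeMetadata:
    """Hypothetical container mirroring the metadata keys described above."""

    def __init__(self) -> None:
        self.episode_rewards: list[float] = []  # one entry per step
        self.episode_score: float = 0.0         # latest episode-level grade

    @property
    def cumulative_reward(self) -> float:
        return sum(self.episode_rewards)        # running sum of step rewards

    def record(self, reward: float, episode_score: float) -> None:
        self.episode_rewards.append(reward)
        self.episode_score = episode_score


meta = EpisodeMetadata()
# Made-up (step reward, episode score) pairs for illustration.
for reward, score in [(0.5, 0.0), (0.25, 0.0), (0.75, 0.5)]:
    meta.record(reward, score)

print(meta.cumulative_reward, meta.episode_score)  # prints: 1.5 0.5
```

The cumulative reward grows with every step, while the episode score can stay at 0.0 for many steps and only move once the task-specific grader sees progress.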

	## 12. Is the dashboard connected to backend or just static?

	It is connected to backend.

	How we know:
	- The dashboard JavaScript calls API endpoint http://localhost:8000/dashboard/state.
	- It polls every 500 ms.
	- It renders live units/incidents, step, and reward breakdown from backend response.

	Connection behavior:
	- If backend is unreachable, dashboard shows disconnected status.
	- If backend is running and reset was called, dashboard updates live as step changes.
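The dashboard itself does this in JavaScript. The same polling behavior can be sketched in Python as shown below. The URL and the 500 ms interval come from the text above; the function names, the `(status, state)` tuples, and the injected `fetch` and `sleep` parameters are choices made for this example so that it can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

DASHBOARD_URL = "http://localhost:8000/dashboard/state"


def fetch_state(url: str = DASHBOARD_URL, timeout: float = 1.0) -> dict:
    """Fetch one JSON snapshot from the backend."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def poll(
    fetch: Callable[[], dict],
    polls: int,
    interval: float = 0.5,  # 500 ms, as in the dashboard
    sleep: Callable[[float], None] = time.sleep,
) -> list[tuple[str, Optional[dict]]]:
    """Poll a fixed number of times, recording connected/disconnected status."""
    results: list[tuple[str, Optional[dict]]] = []
    for _ in range(polls):
        try:
            results.append(("connected", fetch()))
        except OSError:  # urllib.error.URLError is a subclass of OSError
            results.append(("disconnected", None))
        sleep(interval)
    return results
```

With the server running locally, `poll(fetch_state, polls=4)` would take about two seconds and return four snapshots.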

	## 13. Why we used Docker

	Docker is used to package the app and dependencies so it runs consistently everywhere.

	Benefits:
	- Same runtime on your machine, CI, and deployment platforms.
	- No "works on my machine" package mismatch issues.
	- Easy deployment with a single container image.
	- Port compatibility: server reads PORT environment variable (important for hosted platforms).

	In this project:
	- Root Dockerfile runs uvicorn on 0.0.0.0 and PORT (default 8000).
	- That makes it suitable for local run and hosted environments.
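The PORT convention amounts to reading one environment variable with a fallback. A minimal sketch follows, assuming a hypothetical helper named `resolve_bind`; the actual Dockerfile may pass these values to uvicorn differently.

```python
import os
from typing import Mapping, Optional


def resolve_bind(env: Optional[Mapping[str, str]] = None) -> tuple[str, int]:
    """Return the (host, port) pair the server should bind to."""
    env = os.environ if env is None else env
    host = "0.0.0.0"                    # listen on all interfaces inside the container
    port = int(env.get("PORT", "8000"))  # hosted platforms inject PORT; default is 8000
    return host, port


print(resolve_bind({}))                # prints: ('0.0.0.0', 8000)
print(resolve_bind({"PORT": "7860"}))  # prints: ('0.0.0.0', 7860)
```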

	## 14. What API key are we using?

	The project expects environment variables. Keys are not hardcoded in repository files.

	Required for LLM mode:
	- API_BASE_URL
	- MODEL_NAME
	- OPENAI_API_KEY

	Compatibility fallback:
	- HF_TOKEN is accepted if OPENAI_API_KEY is not set.

	No-key mode:
	- USE_RANDOM=true bypasses LLM and uses a deterministic random baseline agent.

	Practical meaning:
	- If USE_RANDOM=true, you can run without any API key.
	- If USE_RANDOM is not true, OPENAI_API_KEY (or HF_TOKEN fallback) is needed.
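The rules above translate into a short resolution function. The variable names are the ones listed in this section; the function name `resolve_llm_config`, the returned dictionary shape, and the case-insensitive comparison of USE_RANDOM are assumptions of this sketch.

```python
from typing import Mapping


def resolve_llm_config(env: Mapping[str, str]) -> dict:
    """Decide between random-baseline mode and LLM mode from environment values."""
    if env.get("USE_RANDOM", "").strip().lower() == "true":
        return {"mode": "random"}  # no API key needed

    api_key = env.get("OPENAI_API_KEY") or env.get("HF_TOKEN")  # HF_TOKEN is the fallback
    missing = [name for name in ("API_BASE_URL", "MODEL_NAME") if not env.get(name)]
    if not api_key:
        missing.append("OPENAI_API_KEY (or HF_TOKEN)")
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))

    return {
        "mode": "llm",
        "api_base_url": env["API_BASE_URL"],
        "model_name": env["MODEL_NAME"],
        "api_key": api_key,
    }


print(resolve_llm_config({"USE_RANDOM": "true"}))  # prints: {'mode': 'random'}
```

In practice the function would be called with `os.environ`; passing a plain dictionary keeps the example easy to try.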

	## 15. Backend API endpoints (what each does)

	- GET /health: health check
	- GET /tasks: list available tasks
	- POST /reset: start a new episode for the selected task
	- POST /step: apply one action and advance the simulation by one step
	- GET /state: current state
	- GET /dashboard/state: extended state for the HTML dashboard (includes legal actions + last observation)
	- GET /metadata and GET /schema: environment metadata and contracts
	- POST /mcp: minimal JSON-RPC endpoint
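A client for these endpoints needs little more than the standard library. The sketch below builds requests for /health, /reset, and /step. The paths and the local base URL come from this guide, but the JSON payload field names (`task_id`, `action_type`, `unit_id`, `incident_id`) and the IDs are hypothetical; check GET /schema for the real contract before relying on them.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # local default used elsewhere in this guide


def build_request(
    method: str, path: str, payload: Optional[dict] = None
) -> urllib.request.Request:
    """Build an HTTP request, JSON-encoding the payload if one is given."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


health_request = build_request("GET", "/health")
# Payload shapes below are assumptions for illustration only.
reset_request = build_request("POST", "/reset", {"task_id": "single_incident"})
step_request = build_request(
    "POST",
    "/step",
    {"action_type": "DISPATCH", "unit_id": "MEDIC-1", "incident_id": "INC-1"},
)
print(reset_request.get_method(), reset_request.full_url)
```

With the backend running, `call(reset_request)` followed by repeated `call(step_request)` would drive one episode.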

	## 16. What the dashboard shows vs what it does not show

	Shows:
	- Unit cards (status, assignment, ETA, location)
	- Incident cards (type, severity, status, assigned units)
	- Map view for units/incidents
	- Last step reward component bars
	- Header task/episode/step values

	Nuance:
	- Header "Score" currently uses metadata.cumulative_reward.
	- Episode score is available too (metadata.episode_score), but not currently shown as the main header score.

	## 17. Beginner glossary

	- incident: emergency case to be handled
	- unit: responder vehicle/team (EMS, fire, police, etc.)
	- legal action: an action that passes protocol checks in current state
	- reward: immediate feedback signal for one step
	- episode score: overall quality of a full run
	- terminal: episode is finished

	## 18. Practical "how to think" summary

	When you judge behavior quality in this project:
	- Use step rewards to understand local tactical quality.
	- Use episode score to understand mission success for the selected task.
	- Use dashboard to observe live state transitions.
	- Use task definitions to interpret what success means in each scenario.

	If you remember one thing:
	- This is not a generic chatbot app. It is a decision simulator where actions change a world state over time and are graded both step-by-step and across full episodes.