# 911 Dispatch Project - Complete Beginner Guide
## 1. What this project is (in plain language)
This project is a simulator where an AI agent learns to behave like a city emergency dispatch supervisor.
Think of it like a strategy game:
- There are emergencies (incidents).
- There are responders (fire, police, EMS units).
- The agent must choose what to do each turn (dispatch, reassign, cancel, request mutual aid, etc.).
- The simulator gives a score for each decision and a final score for the whole run.
The goal is to train and evaluate decision-making quality under pressure.
## 2. What an RL environment means
RL stands for Reinforcement Learning.
RL rests on four core ideas:
- Agent: the decision-maker (your model or baseline policy).
- Environment: the world that reacts to actions (this simulator).
- Reward: a number that rates how good or bad the outcome of the last action was.
- Episode: one complete run from start to finish.
For this project:
- Agent picks an action.
- Environment updates city state.
- Environment returns:
  - updated observation,
  - reward,
  - done flag (whether the run is over).
That loop repeats until the episode ends.
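The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: `ToyDispatchEnv`, `StepResult`, and the placeholder reward values are hypothetical stand-ins chosen only to show the agent/environment cycle.

```python
import random
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool


class ToyDispatchEnv:
    """Hypothetical stand-in for the simulator; not the project's real class."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.step_count = 0

    def reset(self) -> dict:
        self.step_count = 0
        return {"step_count": self.step_count}

    def step(self, action: str) -> StepResult:
        self.step_count += 1
        # Placeholder reward; the real environment derives it from city state.
        reward = 0.5 if action == "DISPATCH" else 0.1
        done = self.step_count >= self.max_steps
        return StepResult({"step_count": self.step_count}, reward, done)


def run_episode(env: ToyDispatchEnv, rng: random.Random) -> float:
    """Run the agent/environment loop until done; return the summed reward."""
    env.reset()
    total = 0.0
    done = False
    while not done:
        action = rng.choice(["DISPATCH", "STAGE"])  # the agent picks an action
        result = env.step(action)                   # the environment reacts
        total += result.reward
        done = result.done
    return total


total_reward = run_episode(ToyDispatchEnv(max_steps=5), random.Random(0))
print(f"episode finished, total reward = {total_reward:.2f}")
```

The agent here is just a random choice between two action names; a trained model would replace that single line.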
## 3. Important clarification: "scheme of electricity" vs "city schema"
There is no electricity scheme in this codebase.
What exists is a city schema.
City schema means a configuration blueprint for the simulation:
- city size (grid),
- districts,
- available units,
- unit speeds,
- default recommended unit types for each incident type.
The schema is loaded from data files and used to initialize deterministic, repeatable scenarios.
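To make the idea of a schema concrete, here is a hedged sketch of what such a blueprint could look like and how it might be loaded. Every field name, unit ID, and incident type below is an assumption made for illustration; the project's actual data files may be organized differently.

```python
import json
from dataclasses import dataclass

# Hypothetical schema content; the project's real data files may differ.
EXAMPLE_SCHEMA_JSON = """
{
  "grid": {"width": 10, "height": 10},
  "districts": ["north", "south"],
  "units": [
    {"id": "M1", "type": "MEDIC", "speed": 1.5, "district": "north"},
    {"id": "E1", "type": "ENGINE", "speed": 1.0, "district": "south"}
  ],
  "recommended_units": {"MEDICAL": ["MEDIC"], "FIRE": ["ENGINE"]}
}
"""


@dataclass(frozen=True)
class CitySchema:
    grid_width: int
    grid_height: int
    districts: tuple
    units: tuple
    recommended_units: dict


def load_schema(text: str) -> CitySchema:
    """Parse schema text into an immutable blueprint object."""
    raw = json.loads(text)
    return CitySchema(
        grid_width=raw["grid"]["width"],
        grid_height=raw["grid"]["height"],
        districts=tuple(raw["districts"]),
        units=tuple(raw["units"]),
        recommended_units=raw["recommended_units"],
    )


schema = load_schema(EXAMPLE_SCHEMA_JSON)
print(schema.districts, len(schema.units))
```

Loading the same text twice yields equal objects, which is the property that makes scenarios repeatable.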
## 4. Project architecture (high level)
1. Scenario/task setup
   - A task fixture builds initial units/incidents and metadata.
2. State machine update engine
   - Validates actions.
   - Applies action effects.
   - Advances time by one tick.
   - Updates incident statuses and unit statuses.
3. Reward + scoring
   - Computes per-step reward components.
   - Computes episode-level score using task-specific graders.
4. API server
   - Exposes reset/step/state endpoints.
5. Dashboard
   - Polls backend state repeatedly and renders units/incidents + reward bars.
## 5. What is the task?
A task is a scenario type with its own initial conditions, difficulty, and final grading logic.
This project has 4 tasks:
1. single_incident (easy)
   - One incident, small unit pool.
   - Focus: dispatch the right unit fast.
2. multi_incident (medium)
   - Multiple incidents at the same time.
   - Focus: triage/prioritization and handling P1 (Priority-1) incidents.
3. mass_casualty (hard)
   - Incident waves with severe emergencies and resource conflicts.
   - Focus: survival outcomes under surge.
4. shift_surge (hard)
   - New incidents arrive over time and some units go out of service.
   - Focus: long-horizon operations and city coverage under degradation.
## 6. What is an episode?
An episode is one full run of a task from reset until terminal condition.
An episode starts when reset is called:
- step_count starts at 0.
- city_time starts at 0 seconds.
- units and incidents are loaded from the selected task fixture.
An episode ends when any terminal condition is hit:
- max steps reached,
- at least one incident escalates,
- all incidents resolved.
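The three terminal conditions can be expressed as a single check. The sketch below is an assumption about shape, not the project's implementation: the function name `is_terminal` and the status strings `"ESCALATED"` and `"RESOLVED"` are hypothetical.

```python
def is_terminal(step_count: int, max_steps: int, incident_statuses: list) -> bool:
    """Return True when any of the three terminal conditions holds."""
    if step_count >= max_steps:
        return True  # max steps reached
    if any(status == "ESCALATED" for status in incident_statuses):
        return True  # at least one incident escalated
    # All incidents resolved. Note that all() is True for an empty list,
    # so an episode with no incidents counts as finished in this sketch.
    return all(status == "RESOLVED" for status in incident_statuses)


print(is_terminal(3, 20, ["RESOLVED", "PENDING"]))   # still running
print(is_terminal(3, 20, ["RESOLVED", "RESOLVED"]))  # everything resolved
```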
## 7. What is a step?
A step is one action cycle:
1. Agent sends one action.
2. Validator checks if action is legal.
3. State machine applies action effects.
4. Time advances by 30 seconds.
5. Reward is computed.
6. Observation + reward + done are returned.
Important:
- step_count increases by 1 per step.
- city_time increases by 30 seconds per step.
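Because both counters advance in lockstep, city_time can always be derived from step_count. A small sketch of that arithmetic, where `TICK_SECONDS` and `city_time_after` are illustrative names rather than identifiers from the codebase:

```python
TICK_SECONDS = 30.0  # each step advances simulated time by 30 seconds


def city_time_after(step_count: int) -> float:
    """Simulated seconds elapsed after the given number of steps."""
    return step_count * TICK_SECONDS


for steps in (0, 1, 8):
    print(steps, city_time_after(steps))
```

For example, 8 steps correspond to 240.0 seconds, which matches the snapshot values reported in section 8.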
## 8. At what step are we right now?
Snapshot from the live backend at the time this guide was generated:
- task_id: multi_incident
- episode_id: d2cd525e-2596-44cb-bbe3-af33236264a0
- step_count: 8
- city_time: 240.0 seconds
- cumulative_reward: 1.6
- episode_score: 0.0
- legal_actions currently available: 36
These are live values, not constants. If you reset again, step_count returns to 0.
## 9. Action space (what actions exist)
Current action types include:
- DISPATCH
- CANCEL
- REASSIGN
- STAGE
- MUTUAL_AID
- UPGRADE
- DOWNGRADE
Legal actions are generated from current state and filtered by protocol validation, so only valid actions appear in legal_actions.
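One way to picture "generate, then filter" is shown below for DISPATCH only. This is a simplified sketch under stated assumptions: the `ActionType` enum mirrors the list above, but `legal_dispatches`, the tuple shape of an action, and the filtering rule (unit is AVAILABLE, incident is not RESOLVED) are hypothetical and far simpler than real protocol validation.

```python
from enum import Enum
from itertools import product


class ActionType(str, Enum):
    DISPATCH = "DISPATCH"
    CANCEL = "CANCEL"
    REASSIGN = "REASSIGN"
    STAGE = "STAGE"
    MUTUAL_AID = "MUTUAL_AID"
    UPGRADE = "UPGRADE"
    DOWNGRADE = "DOWNGRADE"


def legal_dispatches(units: dict, incidents: dict) -> list:
    """Pair every unit with every incident, then keep only valid pairs.

    units maps unit_id -> status; incidents maps incident_id -> status.
    """
    return [
        (ActionType.DISPATCH, unit_id, incident_id)
        for (unit_id, unit_status), (incident_id, incident_status)
        in product(units.items(), incidents.items())
        if unit_status == "AVAILABLE" and incident_status != "RESOLVED"
    ]


actions = legal_dispatches(
    {"M1": "AVAILABLE", "E1": "EN_ROUTE"},
    {"I1": "PENDING", "I2": "RESOLVED"},
)
print(actions)
```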
## 10. How scoring works (complete detail)
There are two scoring layers:
1. Step reward (every action)
2. Episode score (whole run)
### 10.1 Step reward (RewardCalculator)
Step reward uses a weighted sum of 5 components:
- response_time: 30%
- triage: 25%
- survival: 25%
- coverage: 12%
- protocol: 8%
Total formula:
- total = 0.30 * response_time + 0.25 * triage + 0.25 * survival + 0.12 * coverage + 0.08 * protocol
- result is clamped to [0, 1]
Safety rule:
- If any Priority-1 (P1) incident existed and the survival component is 0, the total is capped at 0.2.
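The weighted sum, the clamp, and the safety cap fit in one short function. The sketch below follows the formula as stated; the function name `step_reward` and the dictionary layout are assumptions, and the real RewardCalculator may order or implement these steps differently.

```python
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}


def step_reward(components: dict, p1_seen: bool) -> float:
    """Weighted sum of the five components, clamped, with the P1 safety cap."""
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    total = min(1.0, max(0.0, total))  # clamp to [0, 1]
    if p1_seen and components["survival"] == 0.0:
        total = min(total, 0.2)  # safety rule: cap when P1 survival is zero
    return total


strong_but_fatal = {
    "response_time": 1.0,
    "triage": 1.0,
    "survival": 0.0,
    "coverage": 1.0,
    "protocol": 1.0,
}
print(step_reward(strong_but_fatal, p1_seen=False))
print(step_reward(strong_but_fatal, p1_seen=True))
```

In the example, the uncapped total is 0.30 + 0.25 + 0.12 + 0.08 = 0.75, but once a P1 incident is involved the safety rule pulls it down to 0.2.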
Component details:
1. response_time
   - Only meaningful for DISPATCH.
   - For non-DISPATCH actions it returns a neutral 0.5.
   - For DISPATCH: compares ETA to the severity benchmark.
2. triage
   - Only meaningful for DISPATCH.
   - Checks whether the dispatched unit type matches the required unit types for the incident type.
   - Handles enum-qualified metadata keys safely.
3. survival
   - Based on P1 incidents seen vs. resolved without failure.
   - Uses metadata lists: p1_seen, resolved_incidents, failed_incidents.
4. coverage
   - Measures how many districts still have AVAILABLE coverage.
5. protocol
   - If the action is invalid: 0.0.
   - If valid and there is no phraseology text in Action.notes: neutral 0.5.
   - If Action.notes is provided: uses the PhraseologyJudge score + readback correctness.
### 10.2 Episode score (whole run)
Episode score is task-specific via a central grade_episode router.
Why this matters:
- Different tasks need different definitions of success.
- Mean step reward alone is often too weak for real evaluation.
Task-specific episode graders:
1. single_incident
   - +0.50 if incident resolved
   - +0.30 if MEDIC dispatched correctly
   - +0.20 if resolved within first 10 steps
2. multi_incident
   - Uses P1 resolution, overall resolution ratio, and escalation penalty
   - score = 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty
3. mass_casualty
   - Emphasizes P1 survival with penalties for failures
   - score = 0.6 * survival_score + 0.3 * mean_reward - failure_penalty
4. shift_surge (improved)
   - Emphasizes long-horizon operational quality:
     - incident throughput (resolved ratio)
     - P1 survival
     - coverage
     - low backlog
     - mean reward
     - escalation penalty
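The three graders with explicit formulas can be sketched as plain functions behind a small router. Several things here are assumptions: the function and parameter names, reading "within first 10 steps" as `steps_to_resolve <= 10`, and leaving results unclamped (the guide does not say whether episode scores are clamped). shift_surge is omitted because its weights are not given.

```python
from typing import Optional


def grade_single_incident(resolved: bool, medic_dispatched: bool,
                          steps_to_resolve: Optional[int]) -> float:
    score = 0.0
    if resolved:
        score += 0.50
    if medic_dispatched:
        score += 0.30
    if resolved and steps_to_resolve is not None and steps_to_resolve <= 10:
        score += 0.20
    return score


def grade_multi_incident(p1_score: float, resolution_score: float,
                         failure_penalty: float) -> float:
    return 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty


def grade_mass_casualty(survival_score: float, mean_reward: float,
                        failure_penalty: float) -> float:
    return 0.6 * survival_score + 0.3 * mean_reward - failure_penalty


# Hypothetical router table; shift_surge is left out on purpose.
GRADERS = {
    "single_incident": grade_single_incident,
    "multi_incident": grade_multi_incident,
    "mass_casualty": grade_mass_casualty,
}


def grade_episode(task_id: str, **inputs) -> float:
    """Dispatch to the grader registered for the given task."""
    return GRADERS[task_id](**inputs)


print(grade_episode("single_incident", resolved=True,
                    medic_dispatched=True, steps_to_resolve=4))
```

Keeping the graders in a lookup table means adding a task only requires registering one more function, which is one plausible reason for routing through a central function.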
## 11. Very important score semantics
In the OpenEnv wrapper:
- The reward value returned by step is the per-step reward.
- observation.score is overwritten with the episode score.
Also stored in metadata:
- cumulative_reward: running sum of step rewards.
- episode_rewards: list of per-step rewards.
- episode_score: current episode-level grade.
So if you compare values:
- reward = immediate local quality for this action
- observation.score = global task progress quality for the run
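The bookkeeping behind these three metadata fields can be sketched as follows. The helper `record_step` is hypothetical; only the field names cumulative_reward, episode_rewards, and episode_score come from the guide, and the reward values are made up for the example.

```python
def new_metadata() -> dict:
    return {"cumulative_reward": 0.0, "episode_rewards": [], "episode_score": 0.0}


def record_step(metadata: dict, reward: float, episode_score: float) -> None:
    """Update the running totals after one step."""
    metadata["episode_rewards"].append(reward)   # per-step history
    metadata["cumulative_reward"] += reward      # running sum of step rewards
    metadata["episode_score"] = episode_score    # latest episode-level grade


metadata = new_metadata()
for step_reward_value in [0.2] * 8:
    # The episode score is graded separately, so it can stay at 0.0
    # even while the cumulative reward keeps growing.
    record_step(metadata, step_reward_value, episode_score=0.0)

print(round(metadata["cumulative_reward"], 2), metadata["episode_score"])
```

This illustrates why a run can show a positive cumulative reward next to an episode score of 0.0: the two numbers measure different things.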
## 12. Is the dashboard connected to backend or just static?
It is connected to backend.
How we know:
- The dashboard JavaScript calls API endpoint http://localhost:8000/dashboard/state.
- It polls every 500 ms.
- It renders live units/incidents, step, and reward breakdown from backend response.
Connection behavior:
- If backend is unreachable, dashboard shows disconnected status.
- If backend is running and reset was called, dashboard updates live as step changes.
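The dashboard itself is JavaScript, but the same polling pattern can be reproduced in Python to inspect the backend from a terminal. This is a sketch, not the dashboard's code: `fetch_state` and `poll` are hypothetical helpers, and only the URL and the 500 ms interval come from the description above.

```python
import json
import time
import urllib.request

DASHBOARD_URL = "http://localhost:8000/dashboard/state"


def fetch_state(url: str = DASHBOARD_URL, timeout: float = 2.0) -> dict:
    """Fetch and decode one state snapshot from the backend."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def poll(fetch, interval_s: float = 0.5, max_polls: int = 3, sleep=time.sleep) -> list:
    """Call fetch repeatedly and record whether each attempt succeeded."""
    statuses = []
    for _ in range(max_polls):
        try:
            fetch()
            statuses.append("connected")
        except (OSError, ValueError):
            # URLError and connection failures are OSError subclasses;
            # malformed JSON raises a ValueError subclass.
            statuses.append("disconnected")
        sleep(interval_s)
    return statuses


def unreachable_backend() -> dict:
    """Simulates a backend that is not running."""
    raise ConnectionRefusedError("backend not running")


print(poll(unreachable_backend, max_polls=2, sleep=lambda seconds: None))
```

Against a live server, `poll(fetch_state)` would make real requests at the 500 ms cadence; the fake fetcher keeps the example runnable without one.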
## 13. Why we used Docker
Docker is used to package the app and dependencies so it runs consistently everywhere.
Benefits:
- Same runtime on your machine, CI, and deployment platforms.
- No "works on my machine" package mismatch issues.
- Easy deployment with a single container image.
- Port compatibility: server reads PORT environment variable (important for hosted platforms).
In this project:
- Root Dockerfile runs uvicorn on 0.0.0.0 and PORT (default 8000).
- That makes it suitable for local run and hosted environments.
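The PORT convention amounts to "read the variable, fall back to 8000". A minimal sketch of that lookup, where `resolve_bind` is a hypothetical helper and not a function from the repository:

```python
import os
from typing import Optional


def resolve_bind(env: Optional[dict] = None) -> tuple:
    """Return the (host, port) pair the server should bind to."""
    env = os.environ if env is None else env
    host = "0.0.0.0"  # listen on all interfaces so the container is reachable
    port = int(env.get("PORT", "8000"))  # hosted platforms usually inject PORT
    return host, port


print(resolve_bind({}))
print(resolve_bind({"PORT": "7860"}))

# With uvicorn installed, the pair would typically be passed along as
# uvicorn.run(app, host=host, port=port), where `app` is the ASGI application.
```

The port value 7860 in the example is arbitrary; any value the platform sets would be picked up the same way.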
## 14. What API key are we using?
The project expects environment variables. Keys are not hardcoded in repository files.
Required for LLM mode:
- API_BASE_URL
- MODEL_NAME
- OPENAI_API_KEY
Compatibility fallback:
- HF_TOKEN is accepted if OPENAI_API_KEY is not set.
No-key mode:
- USE_RANDOM=true bypasses the LLM and uses a random baseline agent that is described as deterministic (presumably because it is seeded, so runs are repeatable).
Practical meaning:
- If USE_RANDOM=true, you can run without any API key.
- If USE_RANDOM is not true, OPENAI_API_KEY (or HF_TOKEN fallback) is needed.
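The rules above reduce to a short decision function. The sketch below is an assumption about how such resolution could look: `resolve_agent_config` and its return shape are hypothetical, and treating USE_RANDOM case-insensitively is a guess rather than documented behavior.

```python
def resolve_agent_config(env: dict) -> dict:
    """Decide between the random baseline and LLM mode from environment values."""
    if env.get("USE_RANDOM", "").strip().lower() == "true":
        return {"mode": "random"}  # no API key needed

    # OPENAI_API_KEY wins; HF_TOKEN is only a fallback.
    api_key = env.get("OPENAI_API_KEY") or env.get("HF_TOKEN")
    missing = [name for name in ("API_BASE_URL", "MODEL_NAME") if not env.get(name)]
    if not api_key:
        missing.append("OPENAI_API_KEY (or HF_TOKEN)")
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))

    return {
        "mode": "llm",
        "api_base_url": env["API_BASE_URL"],
        "model_name": env["MODEL_NAME"],
        "api_key": api_key,
    }


print(resolve_agent_config({"USE_RANDOM": "true"}))
```

In a real run the function would be called with `os.environ`; passing a plain dictionary keeps the example free of side effects.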
## 15. Backend API endpoints (what each does)
- GET /health
  - health check
- GET /tasks
  - list available tasks
- POST /reset
  - start new episode for selected task
- POST /step
  - apply one action and move simulation one step
- GET /state
  - current state
- GET /dashboard/state
  - extended state for HTML dashboard (includes legal actions + last observation)
- GET /metadata and GET /schema
  - environment metadata and contracts
- POST /mcp
  - minimal JSON-RPC endpoint
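A client for these endpoints needs little more than the standard library. The sketch below only builds the requests so it can run without a server. The endpoint paths come from the list above, but the request body shape (`{"task_id": ...}`) is an assumption; check GET /schema on a running server for the actual contract.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # default local address used in this guide


def build_request(method: str, path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Create an HTTP request object without sending it."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


health_request = build_request("GET", "/health")
# Hypothetical body: the real /reset payload may use different field names.
reset_request = build_request("POST", "/reset", {"task_id": "multi_incident"})

print(health_request.get_method(), health_request.full_url)
print(reset_request.get_method(), reset_request.full_url)
```

With the backend running, `call(health_request)` would perform the actual health check.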
## 16. What the dashboard shows vs what it does not show
Shows:
- Unit cards (status, assignment, ETA, location)
- Incident cards (type, severity, status, assigned units)
- Map view for units/incidents
- Last step reward component bars
- Header task/episode/step values
Nuance:
- Header "Score" currently uses metadata.cumulative_reward.
- Episode score is available too (metadata.episode_score), but not currently shown as the main header score.
## 17. Beginner glossary
- incident: emergency case to be handled
- unit: responder vehicle/team (EMS, fire, police, etc.)
- legal action: an action that passes protocol checks in current state
- reward: immediate feedback signal for one step
- episode score: overall quality of a full run
- terminal: episode is finished
## 18. Practical "how to think" summary
When you judge behavior quality in this project:
- Use step rewards to understand local tactical quality.
- Use episode score to understand mission success for the selected task.
- Use dashboard to observe live state transitions.
- Use task definitions to interpret what success means in each scenario.
If you remember one thing:
- This is not a generic chatbot app. It is a decision simulator where actions change a world state over time and are graded both step-by-step and across full episodes.