# 911 Dispatch Project - Complete Beginner Guide

## 1. What this project is (in plain language)

This project is a simulator where an AI agent learns to behave like a city emergency dispatch supervisor.

Think of it like a strategy game:
- There are emergencies (incidents).
- There are responders (fire, police, EMS units).
- The agent must choose what to do each turn (dispatch, reassign, cancel, request mutual aid, etc.).
- The simulator gives a score for each decision and a final score for the whole run.

The goal is to train and evaluate decision-making quality under pressure.

## 2. What an RL environment means

RL stands for reinforcement learning.

RL is built on four core ideas:
- Agent: the decision-maker (your model or baseline policy).
- Environment: the world that reacts to actions (this simulator).
- Reward: a number that says how good or bad the outcome of the last action was.
- Episode: one complete run from start to finish.

For this project:
- Agent picks an action.
- Environment updates city state.
- Environment returns:
  - the updated observation,
  - the reward,
  - a done flag (whether the run is over).

That loop repeats until the episode ends.
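
The loop described above can be sketched in a few lines of Python. This is only an illustration: `ToyDispatchEnv`, `StepResult`, and the reward rule inside `step` are hypothetical stand-ins, not the project's real classes or scoring.

```python
import random
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool


class ToyDispatchEnv:
    """Hypothetical stand-in for the simulator, not the project's real class."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self.step_count = 0

    def _observe(self) -> dict:
        return {"step_count": self.step_count, "legal_actions": ["DISPATCH", "STAGE"]}

    def reset(self) -> dict:
        self.step_count = 0
        return self._observe()

    def step(self, action: str) -> StepResult:
        self.step_count += 1
        reward = 1.0 if action == "DISPATCH" else 0.5  # made-up reward rule
        done = self.step_count >= self.max_steps
        return StepResult(self._observe(), reward, done)


def run_episode(env: ToyDispatchEnv, seed: int = 0) -> list:
    """Run one episode with a seeded random agent and return the step rewards."""
    rng = random.Random(seed)
    observation = env.reset()
    rewards = []
    done = False
    while not done:
        action = rng.choice(observation["legal_actions"])  # agent picks an action
        result = env.step(action)                          # environment reacts
        rewards.append(result.reward)
        observation, done = result.observation, result.done
    return rewards
```

Calling `run_episode(ToyDispatchEnv(max_steps=5))` returns five rewards, one per step, and stops when the toy environment sets `done`.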

## 3. Important clarification: "scheme of electricity" vs "city schema"

There is no electricity scheme in this codebase.

What exists is a city schema.

City schema means a configuration blueprint for the simulation:
- city size (grid),
- districts,
- available units,
- unit speeds,
- default recommended unit types for each incident type.

The schema is loaded from data files and used to initialize deterministic, repeatable scenarios.
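
To make the idea of a schema concrete, here is a minimal sketch of what such a blueprint could look like once loaded. Every field name, the JSON layout, the `ENGINE` unit type, and the incident type names are assumptions for illustration; the project's actual data files may be organized differently.

```python
import json
from dataclasses import dataclass

# Hypothetical schema file contents; the real data files may differ.
EXAMPLE_SCHEMA_JSON = """
{
  "grid": [10, 10],
  "districts": ["north", "south"],
  "units": [
    {"unit_id": "MEDIC-1", "unit_type": "MEDIC", "district": "north"},
    {"unit_id": "ENGINE-1", "unit_type": "ENGINE", "district": "south"}
  ],
  "unit_speeds": {"MEDIC": 1.5, "ENGINE": 1.0},
  "default_unit_types": {"CARDIAC_ARREST": ["MEDIC"], "STRUCTURE_FIRE": ["ENGINE"]}
}
"""


@dataclass(frozen=True)
class CitySchema:
    grid: tuple                # (width, height) of the city grid
    districts: tuple           # district names
    units: tuple               # available units, one dict per unit
    unit_speeds: dict          # unit type -> speed
    default_unit_types: dict   # incident type -> recommended unit types


def load_city_schema(text: str) -> CitySchema:
    """Parse schema JSON into an immutable blueprint object."""
    raw = json.loads(text)
    return CitySchema(
        grid=tuple(raw["grid"]),
        districts=tuple(raw["districts"]),
        units=tuple(raw["units"]),
        unit_speeds=dict(raw["unit_speeds"]),
        default_unit_types=dict(raw["default_unit_types"]),
    )
```

Because the same input text always produces the same `CitySchema`, scenarios built from it start from identical conditions on every run.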

## 4. Project architecture (high level)

1. Scenario/task setup
- A task fixture builds initial units/incidents and metadata.

2. State machine update engine
- Validates actions.
- Applies action effects.
- Advances time by one tick.
- Updates incident statuses and unit statuses.

3. Reward + scoring
- Computes per-step reward components.
- Computes episode-level score using task-specific graders.

4. API server
- Exposes reset/step/state endpoints.

5. Dashboard
- Polls backend state repeatedly and renders units/incidents + reward bars.

## 5. What is the task?

A task is a scenario type with its own initial conditions, difficulty, and final grading logic.

This project has 4 tasks:

1. single_incident (easy)
- One incident, small unit pool.
- Focus: dispatch the right unit fast.

2. multi_incident (medium)
- Multiple incidents at the same time.
- Focus: triage/prioritization and handling P1 incidents.

3. mass_casualty (hard)
- Incident waves with severe emergencies and resource conflicts.
- Focus: survival outcomes under surge.

4. shift_surge (hard)
- New incidents arrive over time and some units go out of service.
- Focus: long-horizon operations and city coverage under degradation.

## 6. What is an episode?

An episode is one full run of a task from reset until terminal condition.

Episode starts when reset is called.
- step_count starts at 0.
- city_time starts at 0 seconds.
- units and incidents are loaded from the selected task's fixture.

Episode ends when any terminal condition is hit:
- max steps reached,
- at least one incident escalates,
- all incidents resolved.
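
The three terminal conditions can be expressed as one small check. This is a sketch under assumptions: the status strings `"ESCALATED"` and `"RESOLVED"` and the function name are hypothetical, and treating an empty incident list as "not finished" is a choice made here, not something the project confirms.

```python
def is_terminal(step_count: int, max_steps: int, incident_statuses: list) -> bool:
    """Return True if any of the three terminal conditions holds."""
    if step_count >= max_steps:
        return True  # max steps reached
    if any(status == "ESCALATED" for status in incident_statuses):
        return True  # at least one incident escalated
    # All incidents resolved (an empty list does not count as finished here).
    return bool(incident_statuses) and all(
        status == "RESOLVED" for status in incident_statuses
    )
```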

## 7. What is a step?

A step is one action cycle:

1. Agent sends one action.
2. Validator checks if action is legal.
3. State machine applies action effects.
4. Time advances by 30 seconds.
5. Reward is computed.
6. Observation + reward + done are returned.

Important:
- step_count increases by 1 per step.
- city_time increases by 30 seconds per step.
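
Since both counters advance together, city_time can be derived from step_count. A tiny sketch (the constant and function names are made up for illustration):

```python
TICK_SECONDS = 30  # simulated seconds per step, as stated above


def city_time_after(step_count: int) -> int:
    """Return the simulated seconds elapsed after `step_count` steps."""
    if step_count < 0:
        raise ValueError("step_count must be non-negative")
    return step_count * TICK_SECONDS
```

For example, `city_time_after(8)` returns 240, which is consistent with a run at step 8 reporting a city_time of 240.0 seconds.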

## 8. At what step are we right now?

Snapshot from the live backend at the time this guide was generated:

- task_id: multi_incident
- episode_id: d2cd525e-2596-44cb-bbe3-af33236264a0
- step_count: 8
- city_time: 240.0 seconds
- cumulative_reward: 1.6
- episode_score: 0.0
- legal_actions currently available: 36

These are live values, not constants. If you reset again, step_count returns to 0.

## 9. Action space (what actions exist)

Current action types include:
- DISPATCH
- CANCEL
- REASSIGN
- STAGE
- MUTUAL_AID
- UPGRADE
- DOWNGRADE

Legal actions are generated from current state and filtered by protocol validation, so only valid actions appear in legal_actions.
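
One way to picture "generate, then filter" is the sketch below. The action type names come from the list above; the dictionary layout of an action, the `is_valid` rule (only AVAILABLE units may be dispatched), and the `EN_ROUTE` status are assumptions for illustration, not the project's real validator.

```python
from enum import Enum


class ActionType(str, Enum):
    DISPATCH = "DISPATCH"
    CANCEL = "CANCEL"
    REASSIGN = "REASSIGN"
    STAGE = "STAGE"
    MUTUAL_AID = "MUTUAL_AID"
    UPGRADE = "UPGRADE"
    DOWNGRADE = "DOWNGRADE"


def is_valid(action: dict, unit_status: dict) -> bool:
    """Hypothetical protocol check: only AVAILABLE units may be dispatched."""
    if action["type"] is ActionType.DISPATCH:
        return unit_status.get(action["unit_id"]) == "AVAILABLE"
    return True


def legal_actions(candidates: list, unit_status: dict) -> list:
    """Keep only the candidate actions that pass validation."""
    return [action for action in candidates if is_valid(action, unit_status)]
```

With one available unit and one busy unit, a DISPATCH for the busy unit is filtered out while the other candidates remain.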

## 10. How scoring works (complete detail)

There are two scoring layers:

1. Step reward (every action)
2. Episode score (whole run)

### 10.1 Step reward (RewardCalculator)

Step reward uses a weighted sum of 5 components:
- response_time: 30%
- triage: 25%
- survival: 25%
- coverage: 12%
- protocol: 8%

Total formula:
- total = 0.30 * response_time + 0.25 * triage + 0.25 * survival + 0.12 * coverage + 0.08 * protocol
- result is clamped to [0, 1]

Safety rule:
- If any Priority-1 incident existed and survival component is 0, total score is capped at 0.2.
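
The formula and the safety rule can be written out as a short function. The weights and the 0.2 cap come from the text above; the function name, the dictionary input, and the `p1_seen` flag are assumptions, so treat this as a sketch rather than the actual RewardCalculator code.

```python
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}

P1_FAILURE_CAP = 0.2  # cap applied when a P1 incident existed and survival is 0


def total_reward(components: dict, p1_seen: bool) -> float:
    """Weighted sum of the five components, clamped to [0, 1], then capped."""
    raw = sum(weight * components[name] for name, weight in WEIGHTS.items())
    total = min(1.0, max(0.0, raw))
    if p1_seen and components["survival"] == 0.0:
        total = min(total, P1_FAILURE_CAP)
    return total
```

Worked example: with every component at 1.0 except survival at 0.0, the weighted sum is 0.30 + 0.25 + 0.12 + 0.08 = 0.75. If a Priority-1 incident existed, the safety rule lowers that to 0.2; otherwise it stays at 0.75.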

Component details:

1. response_time
- Only meaningful for DISPATCH.
- For non-DISPATCH actions it returns neutral 0.5.
- For DISPATCH: compares ETA to severity benchmark.

2. triage
- Only meaningful for DISPATCH.
- Checks if dispatched unit type matches required unit types for incident type.
- Handles enum-qualified metadata keys safely.

3. survival
- Based on P1 incidents seen vs resolved without failure.
- Uses metadata lists: p1_seen, resolved_incidents, failed_incidents.

4. coverage
- Measures how many districts still have AVAILABLE coverage.

5. protocol
- If action invalid: 0.0.
- If valid and no phraseology text in Action.notes: neutral 0.5.
- If Action.notes provided: uses PhraseologyJudge score + readback correctness.
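
As an illustration of the triage check, the sketch below matches a dispatched unit type against the required types for an incident, and tolerates keys written either as plain names or in an enum-qualified form such as `IncidentType.CARDIAC_ARREST`. The helper names, the example incident type, and the normalization rule are assumptions; the real component may handle these cases differently.

```python
def normalize_key(key: str) -> str:
    """Strip an optional enum prefix: 'UnitType.MEDIC' -> 'MEDIC'."""
    return key.rsplit(".", 1)[-1].upper()


def unit_matches_incident(unit_type: str, incident_type: str, required_by_incident: dict) -> bool:
    """Return True if the unit type is one of the required types for the incident."""
    required = {
        normalize_key(incident): {normalize_key(unit) for unit in units}
        for incident, units in required_by_incident.items()
    }
    return normalize_key(unit_type) in required.get(normalize_key(incident_type), set())
```

An unknown incident type simply yields no match instead of raising, which keeps the check safe when metadata keys are missing.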

### 10.2 Episode score (whole run)

Episode score is task-specific via a central grade_episode router.

Why this matters:
- Different tasks need different definitions of success.
- Mean step reward alone can look healthy even when the task's real goal (for example, resolving every P1 incident) was missed.

Task-specific episode graders:

1. single_incident
- +0.50 if incident resolved
- +0.30 if MEDIC dispatched correctly
- +0.20 if resolved within first 10 steps

2. multi_incident
- Uses P1 resolution, overall resolution ratio, and escalation penalty
- score = 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty

3. mass_casualty
- Emphasizes P1 survival with penalties for failures
- score = 0.6 * survival_score + 0.3 * mean_reward - failure_penalty

4. shift_surge (improved)
- Emphasizes long-horizon operational quality:
  - incident throughput (resolved ratio)
  - P1 survival
  - coverage
  - low backlog
  - mean reward
  - escalation penalty
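
The first two graders are simple enough to sketch directly from the numbers above. The function names and argument shapes are assumptions, "within first 10 steps" is read here as `resolved_step <= 10`, and no clamping is applied because the text does not mention one for episode scores.

```python
from typing import Optional


def grade_single_incident(resolved: bool, medic_dispatched: bool, resolved_step: Optional[int]) -> float:
    """Additive grade: resolution, correct MEDIC dispatch, and speed bonus."""
    score = 0.0
    if resolved:
        score += 0.50
    if medic_dispatched:
        score += 0.30
    if resolved and resolved_step is not None and resolved_step <= 10:
        score += 0.20
    return score


def grade_multi_incident(p1_score: float, resolution_score: float, failure_penalty: float) -> float:
    """score = 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty"""
    return 0.5 * p1_score + 0.3 * resolution_score - 0.2 * failure_penalty
```

Worked example: a single_incident run resolved at step 12 without a MEDIC dispatch scores 0.50. A multi_incident run with `p1_score = 1.0`, `resolution_score = 1.0`, and `failure_penalty = 0.0` scores 0.5 + 0.3 = 0.8, which is the maximum this formula can produce.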

## 11. Very important score semantics

In the OpenEnv wrapper:
- The reward value returned by step is the per-step reward.
- observation.score is overwritten with the episode score.

Also stored in metadata:
- cumulative_reward: running sum of step rewards.
- episode_rewards: list of per-step rewards.
- episode_score: current episode-level grade.

So if you compare values:
- reward = immediate local quality for this action
- observation.score = global task progress quality for the run
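
The bookkeeping behind these three metadata fields can be sketched as follows. The key names come from the list above; the `record_step` helper and its arguments are hypothetical, and the episode score is passed in because it is computed separately by the task's grader.

```python
def record_step(metadata: dict, step_reward: float, episode_score: float) -> dict:
    """Update the running reward bookkeeping after one step."""
    rewards = metadata.setdefault("episode_rewards", [])
    rewards.append(step_reward)
    metadata["cumulative_reward"] = sum(rewards)   # running sum of step rewards
    metadata["episode_score"] = episode_score      # task-level grade so far
    return metadata
```

After two steps with rewards 0.5 and 0.7, `cumulative_reward` is about 1.2 while `episode_score` holds whatever the grader last reported, so the two numbers are not expected to match.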

## 12. Is the dashboard connected to backend or just static?

It is connected to the backend.

How we know:
- The dashboard JavaScript calls API endpoint http://localhost:8000/dashboard/state.
- It polls every 500 ms.
- It renders live units/incidents, step, and reward breakdown from backend response.

Connection behavior:
- If backend is unreachable, dashboard shows disconnected status.
- If backend is running and reset was called, dashboard updates live as step changes.
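
The dashboard itself is JavaScript, but the same polling pattern can be sketched in Python for anyone who wants to watch the state from a script. The URL and the 500 ms interval come from the text above; the function names, the status labels, and the injectable `fetch` and `sleep` arguments (added here so the loop can be exercised without a running server) are assumptions.

```python
import json
import time
import urllib.request

DASHBOARD_URL = "http://localhost:8000/dashboard/state"


def fetch_dashboard_state(url: str = DASHBOARD_URL, timeout: float = 2.0) -> dict:
    """Fetch and decode the dashboard state JSON from the backend."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def poll(fetch, polls: int, interval_s: float = 0.5, sleep=time.sleep) -> list:
    """Call `fetch` repeatedly and record a status for each attempt."""
    results = []
    for _ in range(polls):
        try:
            results.append(("connected", fetch()))
        except (OSError, ValueError):
            # Network failures (URLError is an OSError) or invalid JSON.
            results.append(("disconnected", None))
        sleep(interval_s)
    return results
```

With the backend running, `poll(fetch_dashboard_state, polls=10)` would collect ten snapshots about half a second apart; if the server is down, each attempt is recorded as disconnected instead of raising.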

## 13. Why we used Docker

Docker is used to package the app and dependencies so it runs consistently everywhere.

Benefits:
- Same runtime on your machine, CI, and deployment platforms.
- No "works on my machine" package mismatch issues.
- Easy deployment with a single container image.
- Port compatibility: server reads PORT environment variable (important for hosted platforms).

In this project:
- Root Dockerfile runs uvicorn on 0.0.0.0 and PORT (default 8000).
- That makes it suitable for local run and hosted environments.
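
The PORT handling amounts to "use the environment variable if present, otherwise 8000". A minimal sketch of that lookup (the function name is made up, and the real Dockerfile may do this in a shell command instead of Python):

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_bind(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Return the (host, port) pair the server should bind to."""
    env = os.environ if env is None else env
    port = int(env.get("PORT", "8000"))  # hosted platforms usually inject PORT
    return "0.0.0.0", port
```

Binding to 0.0.0.0 rather than 127.0.0.1 matters inside a container, because traffic arrives from outside the container's own loopback interface.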

## 14. What API key are we using?

The project expects environment variables. Keys are not hardcoded in repository files.

Required for LLM mode:
- API_BASE_URL
- MODEL_NAME
- OPENAI_API_KEY

Compatibility fallback:
- HF_TOKEN is accepted if OPENAI_API_KEY is not set.

No-key mode:
- USE_RANDOM=true bypasses LLM and uses a deterministic random baseline agent.

Practical meaning:
- If USE_RANDOM=true, you can run without any API key.
- If USE_RANDOM is not true, OPENAI_API_KEY (or HF_TOKEN fallback) is needed.
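
The rules above can be sketched as one configuration resolver. The environment variable names come from the text; the function name, the returned dictionary shape, the case-insensitive comparison for `USE_RANDOM`, and the error message are assumptions for illustration.

```python
from typing import Mapping


def resolve_agent_config(env: Mapping[str, str]) -> dict:
    """Decide between the random baseline and LLM mode from environment values."""
    if env.get("USE_RANDOM", "").strip().lower() == "true":
        return {"mode": "random"}  # no API key needed

    api_key = env.get("OPENAI_API_KEY") or env.get("HF_TOKEN")  # HF_TOKEN is the fallback
    missing = [name for name in ("API_BASE_URL", "MODEL_NAME") if not env.get(name)]
    if not api_key:
        missing.append("OPENAI_API_KEY (or HF_TOKEN)")
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    return {
        "mode": "llm",
        "base_url": env["API_BASE_URL"],
        "model": env["MODEL_NAME"],
        "api_key": api_key,
    }
```

In practice this would be called as `resolve_agent_config(os.environ)`; passing a plain dictionary instead makes the behavior easy to check without touching real credentials.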

## 15. Backend API endpoints (what each does)

- GET /health
  - health check

- GET /tasks
  - list available tasks

- POST /reset
  - start new episode for selected task

- POST /step
  - apply one action and move simulation one step

- GET /state
  - current state

- GET /dashboard/state
  - extended state for HTML dashboard (includes legal actions + last observation)

- GET /metadata and GET /schema
  - environment metadata and contracts

- POST /mcp
  - minimal JSON-RPC endpoint
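
A script can talk to these endpoints with nothing but the standard library. In the sketch below, the paths and methods come from the list above and the base URL assumes the default local port; the request body for /reset (`{"task_id": ...}`) is a guess at the payload shape, so check GET /schema for the real contract before relying on it.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumes the default local port


def build_request(method: str, path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build an HTTP request for one of the backend endpoints."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def send(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (needs a running backend)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Hypothetical payload shape for starting an episode.
reset_request = build_request("POST", "/reset", {"task_id": "multi_incident"})
health_request = build_request("GET", "/health")
```

With the server running, `send(health_request)` performs the health check and `send(reset_request)` would start a new episode, provided the payload matches what the server expects.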

## 16. What the dashboard shows vs what it does not show

Shows:
- Unit cards (status, assignment, ETA, location)
- Incident cards (type, severity, status, assigned units)
- Map view for units/incidents
- Last step reward component bars
- Header task/episode/step values

Nuance:
- Header "Score" currently uses metadata.cumulative_reward.
- Episode score is available too (metadata.episode_score), but not currently shown as the main header score.

## 17. Beginner glossary

- incident: emergency case to be handled
- unit: responder vehicle/team (EMS, fire, police, etc.)
- legal action: an action that passes protocol checks in current state
- reward: immediate feedback signal for one step
- episode score: overall quality of a full run
- terminal: episode is finished

## 18. Practical "how to think" summary

When you judge behavior quality in this project:
- Use step rewards to understand local tactical quality.
- Use episode score to understand mission success for the selected task.
- Use dashboard to observe live state transitions.
- Use task definitions to interpret what success means in each scenario.

If you remember one thing:
- This is not a generic chatbot app. It is a decision simulator where actions change a world state over time and are graded both step-by-step and across full episodes.