---
title: 911 Dispatch Supervisor
emoji: "🚨"
colorFrom: red
colorTo: gray
sdk: docker
app_port: 7860
tags:
  - openenv
pinned: false
---

# 🚨 911 Dispatch Supervisor

> **A city-wide emergency dispatch RL environment**: train and evaluate LLM agents to manage simultaneous incidents by dispatching police, fire, and EMS units across a city grid under realistic resource constraints.

[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-green)](https://openenv.dev)
[![Docker](https://img.shields.io/badge/Docker-ready-blue)](https://hub.docker.com)
[![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces)
[![License: MIT](https://img.shields.io/badge/License-MIT-lightgrey)](LICENSE)

---

## Why This Matters

911 dispatch centers in the United States handle over 240 million calls per year. Every dispatcher decision (which unit to send, in what order, with what priority) directly determines survival outcomes. A 90-second delay in dispatching a MEDIC to a cardiac arrest drops survival probability by roughly 10%.

The **911 Dispatch Supervisor** is the first open RL benchmark for training and evaluating AI agents on emergency dispatch decisions. It models the tradeoffs real dispatchers face, all at once: triage under uncertainty, multi-unit resource allocation, geographic coverage, and protocol compliance.

This fills a direct gap for researchers building AI copilots for public safety systems, and provides immediate evaluation value for any LLM claiming real-world decision-making capability.

## Overview

At every step, an LLM agent plays the role of a city-wide dispatch supervisor, deciding which units to dispatch, reassign, cancel, stage, or escalate under time pressure, limited resources, and competing priorities across a 100×100 city grid.

This is not a toy environment. Emergency dispatch is a high-stakes, multi-objective decision problem that:
- Requires **triage**: prioritizing life-threatening incidents over property damage
- Demands **coverage awareness**: keeping geographic zones protected
- Rewards **correct unit-type matching**: sending a MEDIC vs. an ENGINE
- Punishes **delays** that cause Priority-1 incidents to escalate
- Scores **dispatch phraseology**: realistic radio communication language

---

## Environment Architecture

```
┌─────────────────────────────────────────────────────────┐
│                   OpenEnv Interface                     │
│         reset() · step(action) · state()                │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│              DispatchStateMachine                       │
│  • Validates actions via DispatchProtocolValidator      │
│  • Moves units toward incidents (Manhattan physics)     │
│  • Advances incident status: PENDING → RESPONDING →     │
│    ON_SCENE → RESOLVED (or ESCALATED if timeout)        │
│  • Spawns incident waves at configured step offsets     │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│                  RewardCalculator                       │
│  • response_time (30%) · triage (25%) · survival (25%)  │
│  • coverage (12%) · protocol (8%)                       │
│  • Safety gate: P1 failure → score capped at 0.2        │
└──────────────────────┬──────────────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────────────┐
│            Task-Specific Episode Graders                │
│  single_incident · multi_incident · mass_casualty ·     │
│                   shift_surge                           │
└─────────────────────────────────────────────────────────┘
```

---

## Action Space

Actions are structured Pydantic models, so no free-text parsing is required.

**`src.models.Action`**

| Field | Type | Description |
|---|---|---|
| `action_type` | `DispatchAction` | One of: `DISPATCH`, `CANCEL`, `REASSIGN`, `STAGE`, `MUTUAL_AID`, `UPGRADE`, `DOWNGRADE` |
| `unit_id` | `str` | Unit identifier, e.g. `MED-1`, `ENG-2` |
| `incident_id` | `str` | Incident identifier, e.g. `INC-001` |
| `notes` | `str \| None` | Optional phraseology text for protocol scoring bonus |
| `priority_override` | `IncidentSeverity \| None` | Required for `UPGRADE`/`DOWNGRADE` actions |

**Action Types**

| Action | Description | Protocol Rule |
|---|---|---|
| `DISPATCH` | Send an available unit to an incident | Unit must be `AVAILABLE`; incident must not be `RESOLVED` |
| `CANCEL` | Release a unit from its current assignment | Unit must be assigned to the specified incident |
| `REASSIGN` | Redirect an assigned unit to a different incident | Unit must be `DISPATCHED`, `ON_SCENE`, or `TRANSPORTING` |
| `STAGE` | Pre-position a unit near an incident without committing | Unit must be `AVAILABLE`; incident must be `PENDING` |
| `MUTUAL_AID` | Request external unit of a given type | Only allowed when all local units of that type are busy |
| `UPGRADE` | Increase incident severity | New severity must be strictly higher than current |
| `DOWNGRADE` | Decrease incident severity | New severity must be strictly lower than current |
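
The protocol rules in the table read naturally as predicates over unit and incident state. The sketch below is a minimal illustration of three of them (`DISPATCH`, `STAGE`, and `UPGRADE`/`DOWNGRADE`). It is not the project's `DispatchProtocolValidator`: the function name `check_action`, the plain-string statuses, and the `SEVERITY_RANK` convention (P1 is the most severe level) are assumptions made for the example.

```python
from typing import Optional

# Assumed ranking: a smaller number means a more severe incident (P1 is most severe).
SEVERITY_RANK = {"P1": 1, "P2": 2, "P3": 3}


def check_action(
    action_type: str,
    unit_status: str,
    incident_status: str,
    current_severity: str = "P2",
    priority_override: Optional[str] = None,
) -> list[str]:
    """Return rule violations for one action; an empty list means it is legal."""
    issues: list[str] = []
    if action_type == "DISPATCH":
        if unit_status != "AVAILABLE":
            issues.append("unit must be AVAILABLE")
        if incident_status == "RESOLVED":
            issues.append("incident must not be RESOLVED")
    elif action_type == "STAGE":
        if unit_status != "AVAILABLE":
            issues.append("unit must be AVAILABLE")
        if incident_status != "PENDING":
            issues.append("incident must be PENDING")
    elif action_type in ("UPGRADE", "DOWNGRADE"):
        if priority_override is None:
            issues.append("priority_override is required")
        else:
            # Positive when the override is more severe than the current level.
            gain = SEVERITY_RANK[current_severity] - SEVERITY_RANK[priority_override]
            if action_type == "UPGRADE" and gain <= 0:
                issues.append("new severity must be strictly higher")
            if action_type == "DOWNGRADE" and gain >= 0:
                issues.append("new severity must be strictly lower")
    return issues
```

Returning a list of issues rather than a boolean mirrors the `issues` field of the observation, which reports warnings alongside `protocol_ok`.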

#### Dispatch Phraseology (bonus scoring)

The `notes` field is scored for realistic radio communication language. Agents that use proper dispatch phraseology receive up to 8% bonus on their protocol score.

| Action | Example notes value |
|---|---|
| Dispatch MEDIC to cardiac | `"Medic 1 en route to cardiac arrest, Code 3, ETA 4 minutes"` |
| Dispatch ENGINE to fire | `"Engine 2 responding to structure fire, Code 3, all units advised"` |
| Mutual aid request | `"Requesting mutual aid, all local MEDICs committed, Priority 1 cardiac at grid 45-72"` |
| Stage unit | `"Engine 1 staging at District 3 perimeter, awaiting scene clear"` |

---

## Observation Space

**`src.models.Observation`**

| Field | Type | Description |
|---|---|---|
| `result` | `str` | Human-readable result of the last action |
| `score` | `float` | Episode score in `[0.0, 1.0]` (task-level grade) |
| `protocol_ok` | `bool` | Whether the action passed protocol validation |
| `issues` | `list[str]` | Warnings or error codes from the validator |
| `reward_breakdown` | `dict[str, float] \| None` | Per-component reward scores for dashboard display |

**Full State (`src.models.State`)**

| Field | Type | Description |
|---|---|---|
| `units` | `dict[str, UnitState]` | All units with type, status, location, ETA |
| `incidents` | `dict[str, IncidentState]` | All incidents with type, severity, status, assigned units |
| `episode_id` | `str` | Unique episode identifier |
| `step_count` | `int` | Current step number |
| `task_id` | `str` | Active task identifier |
| `city_time` | `float` | Simulated city clock in seconds (30s per step) |
| `metadata` | `dict` | Schema info, districts, seeds, wave configs, bookkeeping |

**Unit Status Transitions**

```
AVAILABLE → DISPATCHED → ON_SCENE → AVAILABLE
                ↓
         OUT_OF_SERVICE (shift_surge only)
```

**Incident Status Transitions**

```
PENDING → RESPONDING → ON_SCENE → RESOLVED
   ↓           ↓
ESCALATED   ESCALATED    (survival clock expires)
```
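
The incident diagram can be written down as a transition table, which makes it easy to check whether a recorded status history is legal. This is a sketch transcribed from the diagram only; the names `INCIDENT_TRANSITIONS`, `can_transition`, and `is_valid_history` are hypothetical, and treating `ESCALATED` as terminal is an assumption.

```python
# Allowed incident status transitions, transcribed from the diagram above.
INCIDENT_TRANSITIONS: dict[str, set[str]] = {
    "PENDING": {"RESPONDING", "ESCALATED"},
    "RESPONDING": {"ON_SCENE", "ESCALATED"},
    "ON_SCENE": {"RESOLVED"},
    "RESOLVED": set(),   # terminal
    "ESCALATED": set(),  # assumption: treated as terminal in this sketch
}


def can_transition(current: str, target: str) -> bool:
    """True if the diagram allows moving directly from current to target."""
    return target in INCIDENT_TRANSITIONS.get(current, set())


def is_valid_history(statuses: list[str]) -> bool:
    """True if every consecutive pair in the history is an allowed transition."""
    return all(can_transition(a, b) for a, b in zip(statuses, statuses[1:]))
```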

---

## Reward Function

The step-level reward is a weighted combination of five components:

| Component | Weight | Description |
|---|---|---|
| `response_time` | **30%** | How quickly dispatched units reach incidents relative to severity benchmarks (P1: 240s, P2: 480s, P3: 900s) |
| `triage` | **25%** | Whether the dispatched unit type matches incident requirements (e.g., MEDIC for CARDIAC_ARREST) |
| `survival` | **25%** | Fraction of Priority-1 incidents resolved before the survival clock expires |
| `coverage` | **12%** | Geographic distribution of available units across city districts |
| `protocol` | **8%** | Action legality + optional phraseology/readback quality via `Action.notes` |

> **⚠️ Safety Gate:** If any Priority-1 incident (cardiac arrest, shooting, building collapse) results in zero survival score, the entire episode reward is hard-capped at **0.2** regardless of other performance. This forces agents to treat life-threatening incidents as non-negotiable, as real dispatch protocol requires.

**Non-DISPATCH actions** receive neutral `0.5` for `response_time` and `triage`, allowing agents to maintain coverage without penalty.
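
As a worked illustration of the weighting and the safety gate, the sketch below combines five component scores with the documented weights and applies the 0.2 cap. It assumes each component is already a value in `[0.0, 1.0]`; the function names and the `p1_survival_failed` flag are hypothetical and are not taken from `src/rewards.py`.

```python
# Component weights from the reward table (they sum to 1.0).
WEIGHTS = {
    "response_time": 0.30,
    "triage": 0.25,
    "survival": 0.25,
    "coverage": 0.12,
    "protocol": 0.08,
}
SAFETY_CAP = 0.2


def step_reward(components: dict[str, float]) -> float:
    """Weighted sum of the five components, each assumed to lie in [0, 1]."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


def apply_safety_gate(episode_reward: float, p1_survival_failed: bool) -> float:
    """Cap the episode reward at 0.2 when a Priority-1 incident scored zero survival."""
    if p1_survival_failed:
        return min(episode_reward, SAFETY_CAP)
    return episode_reward
```

For a non-DISPATCH action with neutral `0.5` for `response_time` and `triage` and `1.0` elsewhere, the weighted sum is `0.15 + 0.125 + 0.25 + 0.12 + 0.08 = 0.725`.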

---

## Tasks

### Task Difficulty Overview

| Task | Difficulty | Max Steps | Key Challenge |
|---|---|---|---|
| `single_incident` | 🟢 Easy | 20 | Dispatch the right unit type quickly |
| `multi_incident` | 🟡 Medium | 40 | Triage 3 simultaneous incidents, protect P1s |
| `mass_casualty` | 🔴 Hard | 60 | Manage wave-based surge with limited resources |
| `shift_surge` | 🔴 Hard | 60 | Adapt as units fail and incidents stream continuously |

---

### 🟢 Task 1: `single_incident` - Basic Dispatch (Easy)

**Scenario**: One active incident (`CARDIAC_ARREST`, Priority-1) in a small city. A MEDIC, ENGINE, and PATROL are all available.

**Objective**: Dispatch the correct unit type (MEDIC) to the incident as fast as possible.

**Grader Logic**:
```
score = 0.0
if incident RESOLVED:          score += 0.50
if MEDIC dispatched correctly: score += 0.30
if resolved within 10 steps:   score += 0.20
```

**Why it's easy**: One incident, one correct action, small state space.

**What a good agent does**: Immediately dispatches `MED-1 → INC-001`.

**Scoring:** 50% resolution + 30% correct unit type used + 20% response speed.

---

### 🟡 Task 2: `multi_incident` - Simultaneous Triage (Medium)

**Scenario**: Three concurrent incidents at episode start: a structure fire (P2), a cardiac arrest (P1), and a shooting (P1). Six units are available.

**Objective**: Respond to all incidents with the right unit types, prioritizing P1s.

**Grader Logic**:
```
score = 0.5 × p1_resolution_rate
      + 0.3 × overall_resolution_rate
      - 0.2 × escalation_penalty
```

**Why it's medium**: Multiple incidents compete for units; wrong type dispatch wastes coverage; P1s must be addressed before P2.

**What a good agent does**: Immediately dispatches a MEDIC to the cardiac arrest and a PATROL to the shooting, then handles the fire with ENGINE/LADDER.

**Scoring:** 50% P1 resolution + 30% overall resolution − 20% escalation penalty.
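
To make the formula concrete, here is a sketch that derives the three terms from a list of final incident outcomes. The `IncidentOutcome` record, the reading of `escalation_penalty` as the fraction of incidents that escalated, and the final clamp to `[0.0, 1.0]` are assumptions for illustration; the actual grader may define these terms differently.

```python
from dataclasses import dataclass


@dataclass
class IncidentOutcome:
    severity: str  # "P1", "P2", or "P3"
    status: str    # final status, e.g. "RESOLVED" or "ESCALATED"


def _rate(outcomes: list[IncidentOutcome], status: str) -> float:
    """Fraction of outcomes with the given final status (0.0 for an empty list)."""
    if not outcomes:
        return 0.0
    return sum(o.status == status for o in outcomes) / len(outcomes)


def grade_multi_incident(outcomes: list[IncidentOutcome]) -> float:
    p1 = [o for o in outcomes if o.severity == "P1"]
    p1_resolution_rate = _rate(p1, "RESOLVED")
    overall_resolution_rate = _rate(outcomes, "RESOLVED")
    escalation_penalty = _rate(outcomes, "ESCALATED")  # assumed definition
    raw = (0.5 * p1_resolution_rate
           + 0.3 * overall_resolution_rate
           - 0.2 * escalation_penalty)
    return max(0.0, min(1.0, raw))  # assumed clamp to the score range
```

Read literally, the positive weights sum to 0.8, so an episode that resolves every incident with no escalations scores 0.8 under this formula.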

---

### 🔴 Task 3: `mass_casualty` - Wave-Based Surge (Hard)

**Scenario**: One critical incident (`BUILDING_COLLAPSE`, P1) at step 0. New waves arrive at steps 5 (structure fire) and 12 (two simultaneous cardiac arrests).

**Objective**: Maximize P1 survival across all waves despite resource conflicts.

**Grader Logic**:
```
score = 0.6 × p1_survival_rate
      + 0.3 × mean_step_reward
      - failure_penalty
```

**Why it's hard**: Resources are exhausted when waves arrive. Agents must decide whether to reassign mid-scene or request mutual aid (at a 120s ETA penalty). Mutual aid is only legal when local units of the required type are fully committed.

**What a good agent does**: Dispatches immediately to initial collapse, stages additional units near expected wave arrival zones, requests mutual aid for later waves.

**Scoring:** 60% P1 survival + 30% mean step reward − a failure penalty if the building collapse receives no response.
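
The same formula as a small function. The size of the failure penalty is not stated here, so the sketch takes it as a required argument instead of guessing a value; the function name, the `collapse_responded` flag, and the clamp to `[0.0, 1.0]` are likewise assumptions.

```python
def grade_mass_casualty(
    p1_survival_rate: float,
    step_rewards: list[float],
    collapse_responded: bool,
    *,
    failure_penalty: float,  # magnitude is not documented; the caller must supply it
) -> float:
    """0.6 * P1 survival + 0.3 * mean step reward, minus a penalty if the collapse was ignored."""
    mean_step_reward = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    raw = 0.6 * p1_survival_rate + 0.3 * mean_step_reward
    if not collapse_responded:
        raw -= failure_penalty
    return max(0.0, min(1.0, raw))  # assumed clamp to the score range
```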

---

### 🔴 Task 4: `shift_surge` - Long-Horizon Degradation (Hard)

**Scenario**: 5 units start available, but 3 go `OUT_OF_SERVICE` by step 5. Incidents arrive in waves every 8 steps throughout the 60-step episode.

**Objective**: Maintain city-wide throughput and P1 survival despite progressive resource degradation.

**Grader Logic**:
```
score = 0.35 × resolution_ratio
      + 0.25 × p1_survival
      + 0.15 × coverage
      + 0.15 × (1 - backlog_ratio)
      + 0.10 × mean_reward
      - 0.25 × escalation_ratio
```

**Why it's hard**: There is no single optimal strategy. Agents must continuously rebalance between throughput and coverage as available resources shrink and incident demand grows.

**Scoring:** 35% resolution + 25% P1 survival + 15% coverage + 15% backlog management + 10% step reward − 25% escalation penalty.
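
With six terms, it helps to see how the weights trade off numerically. The sketch below evaluates the formula for inputs that are each assumed to lie in `[0.0, 1.0]`; the function name and the final clamp are assumptions, not the project's grader.

```python
def grade_shift_surge(
    resolution_ratio: float,
    p1_survival: float,
    coverage: float,
    backlog_ratio: float,
    mean_reward: float,
    escalation_ratio: float,
) -> float:
    """Evaluate the shift_surge formula; every input is assumed to be in [0, 1]."""
    raw = (
        0.35 * resolution_ratio
        + 0.25 * p1_survival
        + 0.15 * coverage
        + 0.15 * (1.0 - backlog_ratio)
        + 0.10 * mean_reward
        - 0.25 * escalation_ratio
    )
    return max(0.0, min(1.0, raw))  # assumed clamp to the score range
```

The positive weights sum to 1.0, so a perfect episode scores 1.0, while an escalation ratio of 0.2 alone costs 0.05.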

---

## Unit Types

| Unit | Code | Speed | Primary Use |
|---|---|---|---|
| Engine | `ENGINE` | 0.8 bl/s | Structure fires, hazmat support |
| Ladder | `LADDER` | 0.6 bl/s | Multi-story fires, rescues |
| Medic | `MEDIC` | 1.0 bl/s | Medical emergencies, trauma |
| Patrol | `PATROL` | 1.2 bl/s | Shootings, MVAs, crowd control |
| Hazmat | `HAZMAT` | 0.5 bl/s | Chemical/biological spills |
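
The speeds above, combined with Manhattan movement and the 30-second step, give a rough travel-time estimate. The sketch below assumes `bl/s` means grid blocks per second and that positions are `(x, y)` block coordinates; the helper names are hypothetical and `src/physics.py` may compute ETAs differently.

```python
import math

# Speeds from the unit table, read as grid blocks per second (assumption).
SPEED_BLOCKS_PER_SEC = {
    "ENGINE": 0.8,
    "LADDER": 0.6,
    "MEDIC": 1.0,
    "PATROL": 1.2,
    "HAZMAT": 0.5,
}
SECONDS_PER_STEP = 30.0  # city_time advances 30 s per step

Position = tuple[int, int]


def manhattan_distance(a: Position, b: Position) -> int:
    """Grid distance when movement is restricted to the two axes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def eta_seconds(unit_type: str, unit_pos: Position, incident_pos: Position) -> float:
    """Travel time in simulated seconds at the unit's constant speed."""
    return manhattan_distance(unit_pos, incident_pos) / SPEED_BLOCKS_PER_SEC[unit_type]


def eta_steps(unit_type: str, unit_pos: Position, incident_pos: Position) -> int:
    """Number of whole environment steps needed to arrive."""
    return math.ceil(eta_seconds(unit_type, unit_pos, incident_pos) / SECONDS_PER_STEP)
```

Under these assumptions, a MEDIC at `(10, 10)` responding to an incident at `(45, 72)` travels 97 blocks in 97 seconds, which fits inside the 240-second P1 benchmark, while a HAZMAT unit on the same route needs 194 seconds.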

## Incident Types

| Incident | Recommended Units | Default Severity |
|---|---|---|
| `CARDIAC_ARREST` | MEDIC | P1 |
| `STRUCTURE_FIRE` | ENGINE × 2, LADDER | P2 |
| `SHOOTING` | MEDIC, PATROL × 2 | P1 |
| `MULTI_VEHICLE_ACCIDENT` | MEDIC, PATROL | P2 |
| `BUILDING_COLLAPSE` | ENGINE, LADDER, MEDIC × 2 | P1 |
| `HAZMAT_SPILL` | HAZMAT, ENGINE | P2 |
| `OVERDOSE` | MEDIC | P2 |
| `MISSING_PERSON` | PATROL | P3 |
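
The recommended-units column can be encoded as a lookup table, which is one way an agent or a triage check might decide whether a unit type fits an incident and what is still missing. The mapping below is transcribed from the table; the helper names `is_type_match` and `missing_units` are hypothetical.

```python
from collections import Counter

# Recommended unit counts per incident type, transcribed from the table above.
RECOMMENDED_UNITS: dict[str, dict[str, int]] = {
    "CARDIAC_ARREST": {"MEDIC": 1},
    "STRUCTURE_FIRE": {"ENGINE": 2, "LADDER": 1},
    "SHOOTING": {"MEDIC": 1, "PATROL": 2},
    "MULTI_VEHICLE_ACCIDENT": {"MEDIC": 1, "PATROL": 1},
    "BUILDING_COLLAPSE": {"ENGINE": 1, "LADDER": 1, "MEDIC": 2},
    "HAZMAT_SPILL": {"HAZMAT": 1, "ENGINE": 1},
    "OVERDOSE": {"MEDIC": 1},
    "MISSING_PERSON": {"PATROL": 1},
}


def is_type_match(incident_type: str, unit_type: str) -> bool:
    """True if the unit type appears among the recommended units for the incident."""
    return unit_type in RECOMMENDED_UNITS[incident_type]


def missing_units(incident_type: str, assigned_unit_types: list[str]) -> dict[str, int]:
    """Recommended units that have not been assigned yet, with the shortfall per type."""
    have = Counter(assigned_unit_types)
    needed = RECOMMENDED_UNITS[incident_type]
    return {t: n - have[t] for t, n in needed.items() if have[t] < n}
```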

---

## OpenEnv Interface

```python
import asyncio
from src.openenv_environment import OpenEnvEnvironment
from src.models import Action, DispatchAction

async def main():
    env = OpenEnvEnvironment(task_id="multi_incident", seed=42)

    # Reset to initial state
    obs = await env.reset()
    print(obs.result)  # "dispatch center online"

    # Get legal actions (protocol-validated)
    legal = env.legal_actions()

    # Take a step
    action = legal[0]
    obs, reward, done = await env.step(action)

    print(f"reward={reward:.3f}, done={done}, protocol_ok={obs.protocol_ok}")

    # Inspect full state
    state = env.state()
    print(f"step={state.step_count}, city_time={state.city_time}s")

asyncio.run(main())
```

---

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check; returns `{"status": "healthy"}` |
| `/reset` | POST | Reset environment; body: `{"task_id": "...", "seed": 42}` (both optional) |
| `/step` | POST | Execute an action; body: `{"action": {...}}` |
| `/state` | GET | Current full environment state |
| `/tasks` | GET | List all available tasks with metadata |
| `/dashboard/state` | GET | Extended state for live HTML dashboard |
| `/schema` | GET | JSON schemas for Action, Observation, State |
| `/metadata` | GET | Environment name, version, description |
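
A client can reach these endpoints with the standard library alone. The sketch below builds the `/reset` and `/step` requests with the body shapes listed in the table and sends them with `urllib`. The base URL assumes a local server on port 7860, the helper names are hypothetical, and serializing `action_type` as the string `"DISPATCH"` is an assumption about how the enum appears in JSON; the `/schema` endpoint is the place to confirm the exact format.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://localhost:7860"  # assumption: server running locally


def _post(url: str, body: dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_reset_request(task_id: Optional[str] = None, seed: Optional[int] = None,
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Both fields are optional, so they are only included when given."""
    body: dict[str, Any] = {}
    if task_id is not None:
        body["task_id"] = task_id
    if seed is not None:
        body["seed"] = seed
    return _post(f"{base_url}/reset", body)


def build_step_request(action: dict[str, Any],
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Wrap the action under the "action" key, as the /step body requires."""
    return _post(f"{base_url}/step", {"action": action})


def send(request: urllib.request.Request) -> Any:
    """Send a request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_reset_request("multi_incident", 42))` followed by `send(build_step_request({"action_type": "DISPATCH", "unit_id": "MED-1", "incident_id": "INC-001"}))` would perform one reset and one step.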

---

## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Run the demo (non-interactive, no LLM required)
python demo.py

# Start the API server
python -m src.server.app

# Run random agent baseline (no API key required)
USE_RANDOM=true python inference.py

# Run LLM agent
API_BASE_URL=https://router.huggingface.co/v1 \
  MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  HF_TOKEN=your_token \
  python inference.py

# Run full test suite
pytest tests/ -v
```

---

## Docker

### Build & Run

```bash
# Build image
docker build -t citywide-dispatch-supervisor .

# Run on port 7860 (required for HF Spaces)
docker run -p 7860:7860 citywide-dispatch-supervisor

# Health check
curl http://localhost:7860/health

# Reset to a specific task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "multi_incident", "seed": 42}'
```

---

## Hugging Face Spaces Deployment

This repository is deployed as a Docker-based HF Space.

1. Create a new HF Space → select **Docker**
2. Push this repository to the Space
3. The server reads `PORT` from the environment (HF sets `PORT=7860`)
4. Once running, the following endpoints are publicly available:
   - `GET /health`
   - `POST /reset`
   - `POST /step`
   - `GET /state`

Validate your deployment with the prevalidation script:

```bash
bash samplematerial/prevalidation.sh https://your-space.hf.space .
```

---

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.1-8B-Instruct` |
| `HF_TOKEN` | HuggingFace API key | (none) |
| `USE_RANDOM` | Set `true` for deterministic random baseline | `false` |
| `PORT` | Server port | `7860` |
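
For reference, the variables and defaults in the table could be read as follows. This is a sketch, not the code in `inference.py` or the server: the `Settings` class and `load_settings` function are hypothetical, and treating only the case-insensitive string `true` as enabling `USE_RANDOM` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_base_url: str
    model_name: str
    hf_token: Optional[str]
    use_random: bool
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        api_base_url=env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        model_name=env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        hf_token=env.get("HF_TOKEN"),  # no default
        use_random=env.get("USE_RANDOM", "false").strip().lower() == "true",
        port=int(env.get("PORT", "7860")),
    )
```

Passing the environment as a mapping keeps the function testable without touching the real process environment.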

---

## Baseline Scores

Scores normalized to `[0.0, 1.0]` using `sum(rewards) / max_steps`.  
Run with `USE_RANDOM=true python inference.py` (seed=42, fully deterministic).

| Task | Difficulty | Max Steps | Random Agent Score |
|---|---|---|---|
| `single_incident` | Easy | 20 | 0.2000 |
| `multi_incident` | Medium | 40 | 0.3117 |
| `mass_casualty` | Hard | 60 | 0.4645 |
| `shift_surge` | Hard | 60 | 0.3183 |

> **Note:** Earlier README versions showed higher scores (~0.30–0.74) from a different scoring path (`observation.score`). These figures use the canonical competition normalization: `sum(step_rewards) / max_steps`, clamped to `[0.0, 1.0]`.
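
The normalization itself is a one-liner; the sketch below spells it out, with the hypothetical name `normalized_score`.

```python
def normalized_score(step_rewards: list[float], max_steps: int) -> float:
    """sum(step_rewards) / max_steps, clamped to [0.0, 1.0]."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    return max(0.0, min(1.0, sum(step_rewards) / max_steps))
```

Because the divisor is `max_steps` and not the number of steps actually taken, an episode that ends early contributes fewer rewards to the sum: ten steps at `0.5` each on a 20-step task normalize to `0.25`.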

### What the scores mean

A random agent scoring **0.20 on the easiest task** confirms the environment is not trivially solvable: random dispatching earns little reward. The range from 0.20 to 0.46 across tasks reflects genuine differences in task complexity, not just more steps.

A well-prompted frontier LLM (GPT-4o, Llama-3.1-70B) is expected to score **0.55–0.75 on single_incident** and **0.30–0.45 on shift_surge**, demonstrating the environment meaningfully differentiates agent capability.

LLM agents (`meta-llama/Llama-3.1-8B-Instruct` via `https://router.huggingface.co/v1`) are expected to score meaningfully higher on easy and medium tasks by correctly prioritizing P1 incidents and matching unit types.

Run the baseline matrix (random + LLM reruns) and emit a JSON report:

```bash
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
HF_TOKEN=your_token \
python scripts/run_baseline_matrix.py --random-runs 1 --llm-runs 3 --output-json baseline_report.json
```

Windows PowerShell shortcut:

```powershell
$env:HF_TOKEN="your_token"
powershell -ExecutionPolicy Bypass -File scripts/run_nemotron_baseline.ps1 -RandomRuns 1 -LlmRuns 3
```

---

## Project Structure

```
.
├── src/
│   ├── models.py               # Pydantic typed contracts (Action, Observation, State)
│   ├── protocol.py             # Dispatch protocol validator
│   ├── physics.py              # City-grid movement / ETA helpers
│   ├── city_schema.py          # City topology + unit configuration loader
│   ├── state_machine.py        # Core dispatch state machine
│   ├── rewards.py              # Reward engine + episode graders
│   ├── phraseology.py          # Dispatch phraseology renderer/judge
│   ├── api.py                  # REST API client wrapper
│   ├── grading.py              # Centralized episode grading router
│   ├── benchmark.py            # Benchmark runner (list/run all tasks)
│   ├── openenv_environment.py  # OpenEnv-compatible environment wrapper
│   ├── tasks/
│   │   ├── registry.py         # Task registry + deterministic scenario fixtures
│   │   ├── single_incident.py  # Easy task + grader
│   │   ├── multi_incident.py   # Medium task + grader
│   │   ├── mass_casualty.py    # Hard task + grader
│   │   └── shift_surge.py      # Hard task + grader
│   ├── server/
│   │   ├── app.py              # FastAPI server (reset/step/state endpoints)
│   │   ├── requirements.txt
│   │   └── Dockerfile
│   └── visualizer/
│       └── viewer.py           # Read-only 2D Matplotlib visualizer
├── data/
│   ├── metro_city.json         # Large city schema (default)
│   └── city_small.json         # Small city schema (testing)
├── tests/                      # TDD test suite (~20 test modules)
├── samplematerial/
│   └── prevalidation.sh        # HF Space + Docker validation script
├── demo.py                     # Non-interactive demo (no LLM required)
├── inference.py                # Competition inference script
├── live_dashboard.html         # Browser-based live dashboard
├── validate_local.py           # Local pre-submission validation
├── openenv.yaml                # OpenEnv specification
├── pyproject.toml              # uv project config
├── requirements.txt            # pip dependencies
└── Dockerfile                  # Root Docker build
```

---

## Live Dashboard

After starting the server and calling `/reset`, open `live_dashboard.html` in a browser:

```bash
# Terminal 1: start server
python -m src.server.app

# Terminal 2: reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "multi_incident"}'

# Browser: open live_dashboard.html
```

The dashboard polls `/dashboard/state` every 500ms and renders:

- Unit cards (status, ETA, assignment, location)
- Incident cards (type, severity, status, assigned units)
- City map (2D grid with unit and incident markers)
- Per-step reward component bars

---

## 2D Visualizer (Programmatic)

```python
import asyncio
from src.openenv_environment import OpenEnvEnvironment
from src.visualizer.viewer import Viewer2D

async def main():
    env = OpenEnvEnvironment(task_id="multi_incident", seed=42)
    await env.reset()
    Viewer2D().render_to_file("frame.png", env.state())
    env.close()

asyncio.run(main())
```

---

## Determinism

All scenarios are deterministic under a fixed seed:

```python
from src.openenv_environment import OpenEnvEnvironment

env1 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
env2 = OpenEnvEnvironment(task_id="shift_surge", seed=42)
# env1 and env2 produce identical episodes
```

Incident positions include small seeded perturbations for realism; the overall episode structure (waves, unit positions, incident types) is fully reproducible.
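
One common way to get perturbations that are both random-looking and reproducible is to derive a private random generator from the episode seed. The sketch below illustrates that idea only; the function name, the `max_offset` of 2 blocks, the seed-mixing constant, and the clamping to a 100×100 grid are all assumptions, and the environment's own perturbation logic may differ.

```python
import random


def perturbed_position(
    base: tuple[int, int],
    seed: int,
    incident_index: int,
    max_offset: int = 2,   # hypothetical perturbation radius in blocks
    grid_size: int = 100,  # the city grid is 100×100
) -> tuple[int, int]:
    """Shift a base position by a small offset that depends only on (seed, incident_index)."""
    # A dedicated generator keeps the result independent of any global random state.
    rng = random.Random(seed * 1_000_003 + incident_index)
    dx = rng.randint(-max_offset, max_offset)
    dy = rng.randint(-max_offset, max_offset)
    x = min(max(base[0] + dx, 0), grid_size - 1)
    y = min(max(base[1] + dy, 0), grid_size - 1)
    return (x, y)
```

Calling the function twice with the same seed and index returns the same position, which is the property the fixed-seed guarantee relies on.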

---

## Running Tests

```bash
# Full test suite
pytest tests/ -v

# Individual modules
pytest tests/test_state_machine.py -v
pytest tests/test_rewards.py -v
pytest tests/test_openenv_integration.py -v
pytest tests/test_inference.py -v
```

---

## Pre-Submission Validation

```bash
# Full local validation (tests + inference + docker + benchmark scores)
python validate_local.py

# OpenEnv spec validation
openenv validate

# HF Space validation (requires deployed space)
bash samplematerial/prevalidation.sh https://your-space.hf.space .

# Windows (explicit Git Bash)
"C:/Program Files/Git/bin/bash.exe" samplematerial/prevalidation.sh https://your-space.hf.space .
```

---

## License

MIT License