# Overflow Environment — Low-Level Design Document

## Table of Contents

1. [Architecture Overview](#1-architecture-overview)
2. [File-by-File Breakdown](#2-file-by-file-breakdown)
3. [Data Models (Wire Format)](#3-data-models-wire-format)
4. [Simulation Internals](#4-simulation-internals)
5. [Step-by-Step Execution Pipeline](#5-step-by-step-execution-pipeline)
6. [Distance and Collision Model](#6-distance-and-collision-model)
7. [Reward Function — Complete Breakdown](#7-reward-function--complete-breakdown)
8. [Scripted Car AI](#8-scripted-car-ai)
9. [Action Parsing — How LLM Output Becomes a Decision](#9-action-parsing--how-llm-output-becomes-a-decision)
10. [Observation Text Format](#10-observation-text-format)
11. [Server Protocol — What Training Scripts Must Send](#11-server-protocol--what-training-scripts-must-send)
12. [Training Integration — GRPO / TRL](#12-training-integration--grpo--trl)
13. [Episode Dynamics and RL Characteristics](#13-episode-dynamics-and-rl-characteristics)
14. [Configuration Constants](#14-configuration-constants)
15. [Docker and Deployment](#15-docker-and-deployment)

---

## 1. Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                   Training Script (GRPO)                │
│  calls reset(), reads observation, calls step(action)   │
└────────────────────────┬────────────────────────────────┘
                         │ WebSocket (persistent session)
                         │ JSON messages over ws://host:8000/ws
                         ▼
┌─────────────────────────────────────────────────────────┐
│              FastAPI Server (app.py)                    │
│  create_app(OverflowEnvironment, OverflowAction,        │
│             OverflowObservation)                        │
│                                                         │
│  Endpoints:                                             │
│    WS  /ws       ← primary (stateful session)           │
│    POST /reset   ← HTTP fallback                        │
│    POST /step    ← HTTP fallback                        │
│    GET  /state   ← HTTP fallback                        │
│    GET  /health  ← health check                         │
│    GET  /schema  ← JSON schemas for action/obs/state    │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│         OverflowEnvironment (pure Python)               │
│                                                         │
│  Internal state:                                        │
│    _cars: List[Car]     (5 cars, car 0 = agent)         │
│    _state: OverflowState (episode tracking)             │
│    _rng: random.Random  (seeded per episode)            │
│    _done: bool                                          │
│                                                         │
│  Methods:                                               │
│    reset(seed, episode_id) → OverflowObservation        │
│    step(OverflowAction)    → OverflowObservation        │
│    state (property)        → OverflowState              │
└─────────────────────────────────────────────────────────┘
```

**Key invariant**: The training loop calls `reset()`. The LLM agent only calls `step()` via the training harness. Agents can never reset — if they could undo consequences, training breaks.

**Session model**: Each WebSocket connection gets its own `OverflowEnvironment` instance. The `create_app` function receives the class (factory), not an instance. When a WebSocket connects, the server instantiates a fresh environment for that session.

---

## 2. File-by-File Breakdown

### `models.py` — Pydantic data models

Defines three classes inheriting from OpenEnv core types:

| Class | Parent | Purpose |
|-------|--------|---------|
| `OverflowAction(Action)` | `openenv.core.env_server.types.Action` | What the LLM sends each step |
| `OverflowObservation(Observation)` | `openenv.core.env_server.types.Observation` | What the environment returns |
| `OverflowState(State)` | `openenv.core.env_server.types.State` | Internal state exposed via `/state` |

All three are Pydantic `BaseModel` subclasses. The parent classes provide `metadata: Dict[str, Any]` (on Action and Observation) and `episode_id: str`, `step_count: int` (on State). The parent `Observation` provides `done: bool` and `reward: float | None`.

### `server/overflow_environment.py` — All game logic

Contains:
- `Car` dataclass — per-car state (id, lane, position, speed, goal, is_agent, reached_goal)
- `_parse_decision()` — tolerant action parser
- `_compute_reasoning_bonus()` — reasoning quality scorer
- `_scripted_car_action()` — NPC car AI
- `_apply_action()` — mutates a car's speed/lane
- `_generate_scene_description()` — builds the text observation
- `OverflowEnvironment(Environment)` — the main class with `reset()`, `step()`, `state`

### `server/app.py` — FastAPI wiring

Introspects `create_app` to determine if it expects a factory (class) or an instance. Passes `OverflowEnvironment`, `OverflowAction`, `OverflowObservation` to `create_app`. The resulting `app` object is what uvicorn serves.

### `client.py` — WebSocket client

`OverflowEnv(EnvClient[OverflowAction, OverflowObservation, OverflowState])` with three required methods:
- `_step_payload(action)` — serializes `OverflowAction` to `{"decision": ..., "reasoning": ...}`
- `_parse_result(payload)` — deserializes server JSON into `StepResult[OverflowObservation]`
- `_parse_state(payload)` — deserializes server JSON into `OverflowState`

### `__init__.py` — Public API

Exports: `OverflowAction`, `OverflowObservation`, `OverflowState`, `OverflowEnv`.

---

## 3. Data Models (Wire Format)

### OverflowAction — What the training script sends to `/step`

```json
{
  "action": {
    "decision": "brake",
    "reasoning": "Car 3 is ahead in my lane, 15 units away, going slower. I should brake."
  }
}
```

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `decision` | `str` | No | `"maintain"` | One of: `accelerate`, `brake`, `lane_change_left`, `lane_change_right`, `maintain` |
| `reasoning` | `str` | No | `""` | Free-text chain-of-thought. Affects reward via reasoning bonus (0.0–2.0). |

The `decision` field is parsed tolerantly — see Section 9.

### OverflowObservation — What the server returns

Each observation carries **both** text (for the LLM) and structured data (for the frontend/viz).

```json
{
  "observation": {
    "scene_description": "You are Car 0 in lane 2, position 45, speed 60.\n...",
    "incident_report": "Observer: No incidents this step.",
    "done": false,
    "reward": 1.45,
    "cars": [
      {"carId": 0, "lane": 2, "position": {"x": 45.0, "y": 7.4}, "speed": 60.0, "acceleration": 5.0},
      {"carId": 1, "lane": 1, "position": {"x": 43.0, "y": 3.7}, "speed": 55.0, "acceleration": 0.0}
    ],
    "proximities": [
      {"carA": 0, "carB": 1, "distance": 10.5}
    ],
    "lane_occupancies": [
      {"lane": 1, "carIds": [1]},
      {"lane": 2, "carIds": [0]}
    ],
    "metadata": {}
  },
  "reward": 1.45,
  "done": false
}
```

#### Text fields (for the LLM)

| Field | Type | Description |
|-------|------|-------------|
| `scene_description` | `str` | Multi-line text describing all cars. This is what the LLM reads. |
| `incident_report` | `str` | Observer output. Either `"Observer: No incidents this step."` or a list of CRASH/NEAR MISS events. |

#### Structured fields (for the frontend — compatible with Overflow frontend types)

| Field | Type | Frontend equivalent |
|-------|------|---------------------|
| `cars` | `CarStateData[]` | `CarState[]` — `{carId, lane, position: {x, y}, speed, acceleration}` |
| `proximities` | `ProximityData[]` | `{carA, carB, distance}[]` — pairwise distances for close cars |
| `lane_occupancies` | `LaneOccupancyData[]` | `{lane, carIds}[]` — which cars are in each lane |

Position `y` is computed as `lane * 3.7` (lane width in metres), matching the frontend's `makeCar` convention.

#### Common fields

| Field | Type | Description |
|-------|------|-------------|
| `done` | `bool` | `true` if episode ended (crash, goal reached, or max steps). |
| `reward` | `float` | Scalar reward for this step. Sum of all reward components. |

The `reward` and `done` appear both inside `observation` and at the top level of the response (OpenEnv convention).

### OverflowState — What `/state` returns

```json
{
  "episode_id": "a1b2c3d4-...",
  "step_count": 17,
  "crash_count": 0,
  "near_miss_count": 23,
  "cars_reached_goal": 1,
  "total_cars": 5
}
```

| Field | Type | Description |
|-------|------|-------------|
| `episode_id` | `str` | UUID for this episode. Set on `reset()`. |
| `step_count` | `int` | How many `step()` calls have been made. |
| `crash_count` | `int` | Cumulative crash events (each pair counts as 1). |
| `near_miss_count` | `int` | Cumulative near-miss events (each pair counts as 1). |
| `cars_reached_goal` | `int` | How many cars (including scripted) reached their goal. |
| `total_cars` | `int` | Always 5. |

---

## 4. Simulation Internals

### The Road

- 3 lanes, numbered 1, 2, 3 (1 = leftmost, 3 = rightmost)
- Road length: ~200 position units
- No wrapping — cars move forward from low positions toward high positions
- Lanes are conceptually 10 units apart for distance calculations

### Car State

Each car is a `Car` dataclass:

```python
@dataclass
class Car:
    car_id: int          # 0 = agent, 1–4 = scripted
    lane: int            # 1, 2, or 3
    position: float      # 0.0 to ~200.0 (along the road)
    speed: float         # 20.0 to 90.0
    goal_position: float # 160.0 to 195.0
    is_agent: bool       # True only for car 0
    reached_goal: bool   # True once position >= goal_position
```

### Initialization (reset)

On `reset(seed=N)`:
1. A `random.Random(seed)` RNG is created (deterministic replays if same seed).
2. 5 cars are spawned:
   - **Lane**: random 1–3
   - **Position**: random 10–80 (spread across the first half of the road)
   - **Speed**: random 40–70
   - **Goal**: random 160–195
3. No two cars occupy the same 10-unit segment in the same lane at spawn (deconflicted via `(lane, position // 10)` hash).
4. Car 0 is the agent. Cars 1–4 are scripted.
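The spawn deconfliction amounts to rejection sampling over `(lane, segment)` keys. The sketch below is an illustrative reconstruction, not the repository's actual `reset()` (the real sampling order and field names may differ):

```python
import random

def spawn_cars(seed: int, num_cars: int = 5) -> list:
    """Spawn cars with no two sharing a 10-unit segment in the same lane."""
    rng = random.Random(seed)
    cars, taken = [], set()
    for car_id in range(num_cars):
        # Resample until this car lands in an unoccupied (lane, segment) slot
        while True:
            lane = rng.randint(1, 3)
            position = rng.uniform(10, 80)
            key = (lane, int(position // 10))  # 10-unit segment hash
            if key not in taken:
                taken.add(key)
                break
        cars.append({
            "car_id": car_id,
            "lane": lane,
            "position": position,
            "speed": rng.uniform(40, 70),
            "goal_position": rng.uniform(160, 195),
            "is_agent": car_id == 0,
        })
    return cars
```

With 24 available slots (3 lanes × 8 segments) and only 5 cars, the rejection loop terminates quickly in practice.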

### Movement

Each step, every active (non-goal-reached) car moves forward:

```
car.position += car.speed * 0.1
```

This means a car at speed 60 moves 6.0 units per step. At that rate, traversing the ~120-unit gap from starting zone (10–80) to goal zone (160–195) takes roughly 20 steps. Faster cars (speed 90) move 9.0 units/step and reach goals sooner.
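The step-count arithmetic above can be checked with a tiny helper (illustrative only; `TICK` names the 0.1 multiplier from the movement rule, and constant speed is assumed, which real episodes won't have):

```python
import math

TICK = 0.1  # position advances by speed * TICK each step

def steps_to_goal(position: float, speed: float, goal: float) -> int:
    """Steps needed to cover the remaining distance at constant speed."""
    return math.ceil((goal - position) / (speed * TICK))
```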

---

## 5. Step-by-Step Execution Pipeline

When `step(action)` is called, the following happens **in this exact order**:

```
1. GUARD: if episode is already done → return stale observation with reward=0.0
2. INCREMENT step_count
3. PARSE the agent's action → one of {accelerate, brake, lane_change_left, lane_change_right, maintain}
4. APPLY action to Car 0 (mutate speed or lane)
5. COMPUTE scripted actions for Cars 1–4 and APPLY them
6. MOVE all active cars forward: position += speed * 0.1
7. COLLISION DETECTION (pairwise over all active cars):
   - distance < 5.0 → CRASH (reward -5.0, episode ends)
   - distance < 15.0 → NEAR MISS (reward -1.0 per pair)
8. If no crash:
   a. Check if Car 0 reached its goal → reward +3.0, episode ends
   b. Check if scripted cars reached their goals (state tracking only)
   c. If episode not ending → SAFE STEP bonus: reward +0.5
9. REASONING BONUS: score the reasoning text → reward +0.0 to +2.0
10. MAX STEPS CHECK: if step_count >= 100 → episode ends
11. BUILD observation text and incident report
12. RETURN OverflowObservation(scene_description, incident_report, done, reward)
```

**Important ordering detail**: Actions are applied (step 4–5) **before** movement (step 6). This means the agent's speed/lane change takes effect for this step's movement. Collision detection (step 7) happens **after** movement, on the new positions.

**Reward accumulation within a step**: A single step's reward is the **sum** of all applicable components. Note that near misses do not end the episode, so the safe-step bonus still applies alongside them. For example, with 2 near-miss pairs, a surviving agent, and good reasoning (1.5 bonus), the reward is: `(-1.0 * 2) + 0.5 + 1.5 = 0.0`.
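To make the summation concrete, here is a small helper mirroring the components listed in steps 7–9 (a sketch using the Section 14 constants, not the environment's actual code; how near-miss penalties interact with a crash on the same step is simplified here by letting the crash short-circuit):

```python
# Constants from Section 14
REWARD_CRASH = -5.0
REWARD_NEAR_MISS = -1.0
REWARD_SAFE_STEP = 0.5
REWARD_REACHED_GOAL = 3.0

def step_reward(crashed: bool, near_miss_pairs: int,
                reached_goal: bool, reasoning_bonus: float) -> float:
    """Sum the applicable reward components for one step (sketch)."""
    if crashed:
        # Crash ends the episode; only the reasoning bonus stacks on top
        return REWARD_CRASH + reasoning_bonus
    reward = REWARD_NEAR_MISS * near_miss_pairs + reasoning_bonus
    if reached_goal:
        reward += REWARD_REACHED_GOAL  # episode ends, no safe-step bonus
    else:
        reward += REWARD_SAFE_STEP     # episode continues
    return reward
```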

---

## 6. Distance and Collision Model

Distance between two cars uses a weighted Euclidean formula:

```python
def distance_to(self, other):
    lane_diff = abs(self.lane - other.lane) * 10.0
    pos_diff = abs(self.position - other.position)
    return sqrt(lane_diff**2 + pos_diff**2)
```

**Implications**:
- Two cars in the **same lane** at positions 45 and 50: distance = 5.0 (exactly at the crash threshold; not a crash, since the check is a strict `< 5.0`, but still a near miss)
- Two cars in **adjacent lanes** (e.g., lane 1 and lane 2) at the same position: distance = 10.0 (near miss, not crash)
- Two cars **two lanes apart** at the same position: distance = 20.0 (safe, no incident)
- Two cars in adjacent lanes, 10 units apart longitudinally: distance = sqrt(100 + 100) ≈ 14.1 (near miss)

**Key insight for the agent**: Lane changes provide safety via the 10-unit lane multiplier. Staying in the same lane as another car is the primary crash risk. The agent should use lane changes proactively to maintain distance from cars in its lane.
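The formula and thresholds can be exercised directly. This reproduces `distance_to` as a free function, with the threshold constants from Section 14; `classify` is an illustrative helper, not part of the repository:

```python
import math

CRASH_DISTANCE = 5.0
NEAR_MISS_DISTANCE = 15.0

def distance(lane_a: int, pos_a: float, lane_b: int, pos_b: float) -> float:
    """Weighted Euclidean distance: lanes count as 10 units apart."""
    lane_diff = abs(lane_a - lane_b) * 10.0
    pos_diff = abs(pos_a - pos_b)
    return math.sqrt(lane_diff ** 2 + pos_diff ** 2)

def classify(d: float) -> str:
    """Apply the strict-inequality thresholds from the collision check."""
    if d < CRASH_DISTANCE:
        return "crash"
    if d < NEAR_MISS_DISTANCE:
        return "near_miss"
    return "safe"
```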

### Collision detection scope

Detection is **pairwise over ALL active cars**, not just agent-involving pairs. If Car 2 and Car 3 crash, the episode still ends with -5.0 reward. This means the agent is implicitly responsible for the overall traffic flow — it should avoid creating situations where its actions cause chain reactions among scripted cars.

---

## 7. Reward Function — Complete Breakdown

### Per-step reward components

| Component | Value | Condition | Stacks? |
|-----------|-------|-----------|---------|
| **Crash** | -5.0 | Any pair distance < 5.0 | Once (episode ends) |
| **Near miss** | -1.0 | Per pair with distance < 15.0 | Yes, per pair (can be -2.0, -3.0, etc.) |
| **Safe step** | +0.5 | No crash and episode not ending this step | Once per step |
| **Goal reached** | +3.0 | Car 0's position >= goal_position | Once (episode ends) |
| **Reasoning bonus** | +0.0 to +2.0 | Based on reasoning text quality | Once per step |

### Reasoning bonus scoring

The bonus has three sub-components capped at 2.0 total:

**Length bonus** (up to 0.5):
- `len > 20` chars → +0.2
- `len > 50` chars → +0.15
- `len > 100` chars → +0.15

**Keyword awareness** (up to 1.0):
Each keyword found → +0.2, capped at 1.0. Keywords: `ahead`, `behind`, `lane`, `speed`, `distance`, `safe`, `danger`, `collision`, `brake`, `gap`, `close`, `slow`, `fast`, `goal`, `position`.

**Structure bonus** (up to 0.5):
- Contains `<think>` or `because` → +0.25
- Contains `therefore`, `so i should`, `best option`, or `i will` → +0.25
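Putting the three sub-components together gives a scorer like the following. This is a faithful-in-spirit sketch of `_compute_reasoning_bonus()`; the repository's implementation may differ in detail:

```python
KEYWORDS = ("ahead", "behind", "lane", "speed", "distance", "safe", "danger",
            "collision", "brake", "gap", "close", "slow", "fast", "goal",
            "position")

def reasoning_bonus(text: str) -> float:
    """Score reasoning text: length + keyword awareness + structure, cap 2.0."""
    t = text.lower()
    bonus = 0.0
    # Length bonus (cumulative, up to 0.5)
    if len(t) > 20:
        bonus += 0.2
    if len(t) > 50:
        bonus += 0.15
    if len(t) > 100:
        bonus += 0.15
    # Keyword awareness: +0.2 per keyword found, capped at 1.0
    bonus += min(1.0, 0.2 * sum(1 for k in KEYWORDS if k in t))
    # Structure bonus (up to 0.5)
    if "<think>" in t or "because" in t:
        bonus += 0.25
    if any(p in t for p in ("therefore", "so i should", "best option", "i will")):
        bonus += 0.25
    return min(bonus, 2.0)
```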

### Typical reward ranges per step

| Scenario | Typical reward |
|----------|---------------|
| Safe step, no reasoning | +0.5 |
| Safe step, decent reasoning | +1.0 to +2.0 |
| Safe step, excellent reasoning | +2.0 to +2.5 |
| 1 near miss, decent reasoning | -0.5 to +0.5 |
| 2 near misses, decent reasoning | -1.5 to -0.5 |
| Crash (any) | -5.0 + reasoning bonus |
| Goal reached, good reasoning | +3.0 + reasoning bonus |

### Episode return (total reward) characteristics

Based on testing with seed=42:
- A "maintain" strategy with decent reasoning gets ~1.1 per step × ~17 steps ≈ 18.7 total, minus near-miss penalties
- Aggressive "accelerate" strategies reach the goal faster but accumulate more near misses
- Smart strategies that use lane changes and braking to avoid near misses can maximize total reward

---

## 8. Scripted Car AI

Cars 1–4 use `_scripted_car_action(car, all_cars, rng)`:

```
1. Find the nearest car AHEAD in the SAME LANE
2. If that car is < 20 units ahead → "brake"
3. Else if speed < 60 and 10% random chance → "accelerate"
4. Else if 5% random chance → lane change (random left/right, respecting boundaries)
5. Else → "maintain"
```

**Characteristics**:
- Scripted cars are mostly passive — they maintain speed
- They brake reactively when blocked (but only for same-lane, ahead)
- They rarely change lanes (5% per step), making their behavior somewhat predictable
- They never intentionally avoid the agent — only react to cars directly ahead
- They can accumulate near misses and crashes among themselves

This creates an environment where a smart agent can learn to navigate around largely predictable but occasionally erratic traffic.
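A minimal reconstruction of `_scripted_car_action()` following the five rules above (illustrative; the real implementation's RNG call order, `Car` fields, and left/right lane mapping are assumptions here, with lane 1 taken as leftmost):

```python
import random
from dataclasses import dataclass

@dataclass
class Car:
    car_id: int
    lane: int
    position: float
    speed: float

def scripted_car_action(car: Car, all_cars: list, rng: random.Random) -> str:
    # 1-2. Brake if the nearest same-lane car ahead is within 20 units
    gaps = [c.position - car.position for c in all_cars
            if c.car_id != car.car_id and c.lane == car.lane
            and c.position > car.position]
    if gaps and min(gaps) < 20.0:
        return "brake"
    # 3. Slow cars occasionally speed up (10% chance)
    if car.speed < 60 and rng.random() < 0.10:
        return "accelerate"
    # 4. Rare random lane change (5% chance), staying within lanes 1-3
    if rng.random() < 0.05:
        options = [name for name, lane in
                   (("lane_change_left", car.lane - 1),
                    ("lane_change_right", car.lane + 1))
                   if 1 <= lane <= 3]
        if options:
            return rng.choice(options)
    # 5. Default: cruise
    return "maintain"
```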

---

## 9. Action Parsing — How LLM Output Becomes a Decision

The parser `_parse_decision(action)` is intentionally forgiving. It tries three strategies in order:

### Strategy 1: Direct field match
```python
decision = action.decision.strip().lower().replace(" ", "_")
# If it's one of {accelerate, brake, lane_change_left, lane_change_right, maintain} → use it
```

### Strategy 2: XML tag extraction
```python
text = f"{action.decision} {action.reasoning}".lower()
match = re.search(r"<action>\s*(\w+)\s*</action>", text)
# If found and valid → use it
```

This handles LLM outputs like:
```
decision: "think about it"
reasoning: "<think>Car ahead is close</think><action>brake</action>"
```

### Strategy 3: Keyword scan
```python
# A tuple (not a set) keeps the scan order deterministic when the
# text mentions more than one decision word
for v in ("accelerate", "brake", "lane_change_left", "lane_change_right", "maintain"):
    if v in text:
        return v
```

This handles outputs like `decision: "I want to accelerate now"`.

### Fallback
If nothing matches → `"maintain"` (safe default).

**For training scripts**: The cleanest format is to put the exact decision string in the `decision` field. The tolerant parsing is there so that LLMs in early training (before they learn the format) still produce valid actions rather than crashing.
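The three strategies compose into a single function. This sketch mirrors the documented behavior; the repository's `_parse_decision()` takes an action object, while plain strings are used here for clarity:

```python
import re

VALID = ("accelerate", "brake", "lane_change_left", "lane_change_right", "maintain")

def parse_decision(decision: str, reasoning: str = "") -> str:
    """Tolerantly map free-form LLM output to one of the five decisions."""
    # Strategy 1: normalize the decision field directly
    d = decision.strip().lower().replace(" ", "_")
    if d in VALID:
        return d
    # Strategy 2: look for an <action>...</action> tag anywhere
    text = f"{decision} {reasoning}".lower()
    m = re.search(r"<action>\s*(\w+)\s*</action>", text)
    if m and m.group(1) in VALID:
        return m.group(1)
    # Strategy 3: scan for the first valid keyword in a fixed order
    for v in VALID:
        if v in text:
            return v
    # Fallback: safe default
    return "maintain"
```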

---

## 10. Observation Text Format

The `scene_description` field is a multi-line string that the LLM reads as its input. Example:

```
You are Car 0 in lane 2, position 45, speed 60.
Goal: reach position 180.
Nearby cars:
- Car 1: lane 1, position 43, speed 55
- Car 2: lane 3, position 48, speed 70
- Car 3: lane 2, position 65, speed 50 [AHEAD IN YOUR LANE - 20 units away]
- Car 4: lane 1, position 30, speed 65
```

**Annotations added**:
- `[AHEAD IN YOUR LANE - N units away]` — same lane, ahead of agent
- `[BEHIND IN YOUR LANE - N units away]` — same lane, behind agent
- `[REACHED GOAL]` — car has finished

The `incident_report` is separate:
- No incidents: `"Observer: No incidents this step."`
- With incidents: One line per event, e.g.:
  ```
  NEAR MISS between Car 0 and Car 3 (distance: 12.5)
  Car 0 reached its goal at position 180!
  ```
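A sketch of how `_generate_scene_description()` could assemble this text, using plain dicts for cars (illustrative; the field names and exact wording are assumptions, chosen to match the example above):

```python
def scene_description(agent: dict, others: list) -> str:
    """Build the multi-line scene text with same-lane annotations."""
    lines = [
        f"You are Car {agent['car_id']} in lane {agent['lane']}, "
        f"position {agent['position']:.0f}, speed {agent['speed']:.0f}.",
        f"Goal: reach position {agent['goal']:.0f}.",
        "Nearby cars:",
    ]
    for c in others:
        line = (f"- Car {c['car_id']}: lane {c['lane']}, "
                f"position {c['position']:.0f}, speed {c['speed']:.0f}")
        if c.get("reached_goal"):
            line += " [REACHED GOAL]"
        elif c["lane"] == agent["lane"]:
            gap = c["position"] - agent["position"]
            where = "AHEAD" if gap > 0 else "BEHIND"
            line += f" [{where} IN YOUR LANE - {abs(gap):.0f} units away]"
        lines.append(line)
    return "\n".join(lines)
```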

---

## 11. Server Protocol — What Training Scripts Must Send

### WebSocket Protocol (Primary — for training)

Connect to `ws://host:8000/ws`. All messages are JSON.

#### Reset

**Send:**
```json
{"type": "reset", "data": {"seed": 42}}
```

`data` can include `seed` (int) and/or `episode_id` (str). Both are optional.

**Receive:**
```json
{
  "type": "observation",
  "data": {
    "observation": {
      "scene_description": "You are Car 0 in lane 3, position 24, speed 40.\n...",
      "incident_report": "",
      "done": false,
      "reward": 0.0,
      "metadata": {}
    },
    "reward": 0.0,
    "done": false
  }
}
```

#### Step

**Send:**
```json
{
  "type": "step",
  "data": {
    "decision": "brake",
    "reasoning": "Car ahead is close, braking to maintain safe distance."
  }
}
```

**Receive:**
```json
{
  "type": "observation",
  "data": {
    "observation": {
      "scene_description": "You are Car 0 in lane 3, position 27, speed 35.\n...",
      "incident_report": "Observer: No incidents this step.",
      "done": false,
      "reward": 2.25,
      "metadata": {}
    },
    "reward": 2.25,
    "done": false
  }
}
```

#### State

**Send:**
```json
{"type": "state"}
```

**Receive:**
```json
{
  "type": "state",
  "data": {
    "episode_id": "a1b2c3d4-...",
    "step_count": 7,
    "crash_count": 0,
    "near_miss_count": 3,
    "cars_reached_goal": 0,
    "total_cars": 5
  }
}
```

#### Close

**Send:**
```json
{"type": "close"}
```

### HTTP Protocol (Fallback — for simple testing)

Note: The HTTP API creates a **new environment instance per endpoint** in factory mode. The `/reset` and `/step` calls hit separate instances. Use WebSocket for stateful multi-step episodes.

```
POST /reset     Body: {"seed": 42}              → {"observation": {...}, "reward": 0.0, "done": false}
POST /step      Body: {"action": {"decision": "brake", "reasoning": "..."}}  → {"observation": {...}, "reward": ..., "done": ...}
GET  /state     → {"episode_id": ..., "step_count": ..., ...}
GET  /health    → {"status": "healthy"}
GET  /schema    → {"action": {...}, "observation": {...}, "state": {...}}
```

### Using the Python Client

```python
from overflow_env import OverflowEnv, OverflowAction

with OverflowEnv(base_url="http://localhost:8000") as env:
    result = env.reset(seed=42)
    # result is StepResult[OverflowObservation]
    # result.observation.scene_description  — the text for the LLM
    # result.observation.incident_report    — observer output
    # result.reward                         — float
    # result.done                           — bool

    while not result.done:
        # Feed scene_description to LLM, get decision + reasoning back
        llm_decision, llm_reasoning = call_llm(result.observation.scene_description)

        action = OverflowAction(decision=llm_decision, reasoning=llm_reasoning)
        result = env.step(action)

    # Episode over
    state = env.state()
    print(f"Steps: {state.step_count}, Crashes: {state.crash_count}")
```

---

## 12. Training Integration — GRPO / TRL

### System prompt for the LLM

The training script should set a system prompt like:

```
You are an autonomous vehicle controller. Each turn you receive a traffic scene description.
You must output a driving decision and your reasoning.

Available decisions: accelerate, brake, lane_change_left, lane_change_right, maintain

Output format:
<think>Your reasoning about the traffic situation</think>
<action>your_decision</action>
```

### What the training loop does each episode

```python
# 1. Reset environment
result = env.reset(seed=episode_seed)

# 2. Build initial prompt
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": result.observation.scene_description}
]

trajectory_rewards = []

# 3. Loop until done
while not result.done:
    # 3a. Get LLM completion
    completion = model.generate(messages)  # the text the LLM produces

    # 3b. Parse LLM output into action
    #     The environment's parser is tolerant, but for clean training
    #     you might also parse on the client side
    action = OverflowAction(
        decision=extract_decision(completion),
        reasoning=completion  # pass full text as reasoning
    )

    # 3c. Step
    result = env.step(action)
    trajectory_rewards.append(result.reward)

    # 3d. Append to conversation for next turn
    messages.append({"role": "assistant", "content": completion})
    messages.append({"role": "user", "content": (
        result.observation.scene_description + "\n" +
        result.observation.incident_report
    )})

# 4. Compute episode return for GRPO
episode_return = sum(trajectory_rewards)
```

### GRPO reward signal

For GRPO (Group Relative Policy Optimization), the reward signal is the **episode return** — the sum of all per-step rewards across the episode. The environment is designed so that:

- **Positive episode returns** (agent reached goal safely with good reasoning) indicate good behavior
- **Negative episode returns** (crashes, many near misses) indicate bad behavior
- The **reasoning bonus** provides per-step reward shaping that encourages the LLM to explain its thinking, which improves interpretability and can speed up learning
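GRPO compares several rollouts of the same prompt and scores each relative to the group. A minimal sketch of that normalization (a hypothetical helper for illustration, not part of this environment or of TRL's API):

```python
def group_relative_advantages(returns: list) -> list:
    """Normalize episode returns within one rollout group (GRPO-style).

    Each rollout's advantage is its return minus the group mean,
    scaled by the group standard deviation (epsilon-guarded).
    """
    n = len(returns)
    mean = sum(returns) / n
    std = (sum((r - mean) ** 2 for r in returns) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in returns]
```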

### Constructing the reward for TRL

If using TRL's `OnlineDPOTrainer` or `GRPOTrainer`:

```python
# Per-step reward is already in result.reward
# For token-level reward (assign to last token of each turn):
rewards_per_turn = trajectory_rewards  # list of floats, one per step

# For episode-level reward (assign to last token of episode):
episode_reward = sum(trajectory_rewards)
```

---

## 13. Episode Dynamics and RL Characteristics

### Episode length distribution

| Scenario | Typical length |
|----------|---------------|
| Aggressive accelerate → goal | 12–20 steps |
| Moderate maintain → goal | 18–30 steps |
| Conservative braking | 30–50+ steps |
| Crash (bad luck or bad driving) | 5–15 steps |
| Max steps timeout | 100 steps |

### What makes this environment learnable

1. **Clear signal**: Crashes give -5.0, goals give +3.0. The agent quickly learns that crashing is bad and reaching the goal is good.

2. **Gradual improvement**: Near misses (-1.0 each) provide intermediate signal. An agent that learns to avoid near misses gets higher returns than one that just avoids crashes.

3. **Speed-accuracy tradeoff**: Accelerating reaches the goal faster (more +3.0 episodes) but increases crash/near-miss risk. The optimal policy is to accelerate when safe and brake/change lanes when needed.

4. **Reasoning is rewarded**: The reasoning bonus (up to +2.0/step) means that over a 20-step episode, reasoning alone can contribute up to +40.0. This incentivizes the LLM to produce structured, situation-aware reasoning.

5. **Stochasticity**: Scripted cars have random elements (10% accelerate, 5% lane change). This means the same seed produces the same episode, but different seeds produce different traffic patterns, forcing the agent to generalize.

6. **All-pairs collision**: The agent is rewarded/punished for the entire traffic system, not just its own car. This means the agent must be aware of the overall traffic flow.

### Typical learning progression

1. **Random policy**: Mostly "maintain", occasional random actions. Episode return: 0 to 15 (depending on luck).
2. **Basic safety**: Agent learns to brake when car ahead is close. Fewer crashes, more goals. Episode return: 10 to 25.
3. **Strategic driving**: Agent learns to change lanes proactively, accelerate when clear, brake early. Episode return: 20 to 40.
4. **Optimized reasoning**: Agent produces structured reasoning with relevant keywords, maximizing the reasoning bonus. Episode return: 30 to 60.

### Reproducibility

Passing `seed=N` to `reset()` produces deterministic initial conditions and scripted car behavior (since the `random.Random` instance is seeded). The same seed + same agent actions = same trajectory. This is critical for GRPO, which compares multiple rollouts of the same prompt.

---

## 14. Configuration Constants

All constants are defined at the top of `server/overflow_environment.py`:

```python
NUM_LANES = 3              # Number of road lanes
ROAD_LENGTH = 200          # Conceptual road length (units)
NUM_CARS = 5               # Total cars (1 agent + 4 scripted)
MAX_STEPS = 100            # Maximum steps before forced termination
CRASH_DISTANCE = 5.0       # Distance threshold for crash
NEAR_MISS_DISTANCE = 15.0  # Distance threshold for near miss

REWARD_CRASH = -5.0        # Reward for any crash
REWARD_NEAR_MISS = -1.0    # Reward per near-miss pair
REWARD_SAFE_STEP = 0.5     # Reward for surviving a step
REWARD_REACHED_GOAL = 3.0  # Reward for reaching goal
REWARD_REASONING_MAX = 2.0 # Maximum reasoning quality bonus

MIN_SPEED = 20             # Minimum car speed
MAX_SPEED = 90             # Maximum car speed
SPEED_DELTA = 5            # Speed change per accelerate/brake
```

To tune difficulty:
- **Easier**: Increase `CRASH_DISTANCE` and `NEAR_MISS_DISTANCE`, decrease `NUM_CARS`, widen starting positions
- **Harder**: Decrease distances, increase `NUM_CARS`, narrow starting positions, increase `MAX_SPEED`
- **Longer episodes**: Increase `ROAD_LENGTH` or decrease starting speeds
- **More reasoning incentive**: Increase `REWARD_REASONING_MAX`

---

## 15. Docker and Deployment

### Local development

```bash
uvicorn overflow_env.server.app:app --host 0.0.0.0 --port 8000 --reload
```

### Docker build

```bash
# From the overflow_env/ directory:
docker build -t overflow-env:latest -f server/Dockerfile .
docker run -p 8000:8000 overflow-env:latest
```

The Dockerfile uses a multi-stage build:
1. **Builder stage**: Installs dependencies with `uv sync` into a `.venv`
2. **Runtime stage**: Copies the `.venv` and source code, runs uvicorn

Base image: `ghcr.io/meta-pytorch/openenv-base:latest`

### Push to HuggingFace Spaces

```bash
openenv push --repo-id username/overflow-env
```

### Connect from training script

```python
# Local
env = OverflowEnv(base_url="http://localhost:8000")

# Docker
env = OverflowEnv.from_docker_image("overflow-env:latest")

# HuggingFace Space
env = OverflowEnv.from_env("username/overflow-env")
```

### openenv.yaml manifest

```yaml
spec_version: 1
name: overflow_env
type: space
runtime: fastapi
app: server.app:app
port: 8000
```

This tells OpenEnv tooling how to find and run the environment.