aparekh02 committed · Commit cb054fe · verified · Parent(s): 1c6e7aa

initial push: overflow_env with Gradio RL demo UI

DESIGN.md ADDED
# Overflow Environment — Low-Level Design Document

## Table of Contents

1. [Architecture Overview](#1-architecture-overview)
2. [File-by-File Breakdown](#2-file-by-file-breakdown)
3. [Data Models (Wire Format)](#3-data-models-wire-format)
4. [Simulation Internals](#4-simulation-internals)
5. [Step-by-Step Execution Pipeline](#5-step-by-step-execution-pipeline)
6. [Distance and Collision Model](#6-distance-and-collision-model)
7. [Reward Function — Complete Breakdown](#7-reward-function--complete-breakdown)
8. [Scripted Car AI](#8-scripted-car-ai)
9. [Action Parsing — How LLM Output Becomes a Decision](#9-action-parsing--how-llm-output-becomes-a-decision)
10. [Observation Text Format](#10-observation-text-format)
11. [Server Protocol — What Training Scripts Must Send](#11-server-protocol--what-training-scripts-must-send)
12. [Training Integration — GRPO / TRL](#12-training-integration--grpo--trl)
13. [Episode Dynamics and RL Characteristics](#13-episode-dynamics-and-rl-characteristics)
14. [Configuration Constants](#14-configuration-constants)
15. [Docker and Deployment](#15-docker-and-deployment)

---

## 1. Architecture Overview

```
┌──────────────────────────────────────────────────────────┐
│                  Training Script (GRPO)                  │
│   calls reset(), reads observation, calls step(action)   │
└──────────────────────────┬───────────────────────────────┘
                           │ WebSocket (persistent session)
                           │ JSON messages over ws://host:8000/ws
                           ▼
┌──────────────────────────────────────────────────────────┐
│                 FastAPI Server (app.py)                  │
│  create_app(OverflowEnvironment, OverflowAction,         │
│             OverflowObservation)                         │
│                                                          │
│  Endpoints:                                              │
│    WS   /ws      ← primary (stateful session)            │
│    POST /reset   ← HTTP fallback                         │
│    POST /step    ← HTTP fallback                         │
│    GET  /state   ← HTTP fallback                         │
│    GET  /health  ← health check                          │
│    GET  /schema  ← JSON schemas for action/obs/state     │
└──────────────────────────┬───────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────┐
│            OverflowEnvironment (pure Python)             │
│                                                          │
│  Internal state:                                         │
│    _cars: List[Car]        (5 cars, car 0 = agent)       │
│    _state: OverflowState   (episode tracking)            │
│    _rng: random.Random     (seeded per episode)          │
│    _done: bool                                           │
│                                                          │
│  Methods:                                                │
│    reset(seed, episode_id) → OverflowObservation         │
│    step(OverflowAction)    → OverflowObservation         │
│    state (property)        → OverflowState               │
└──────────────────────────────────────────────────────────┘
```

**Key invariant**: The training loop calls `reset()`. The LLM agent only calls `step()` via the training harness. Agents can never reset — if they could undo consequences, training breaks.

**Session model**: Each WebSocket connection gets its own `OverflowEnvironment` instance. The `create_app` function receives the class (factory), not an instance. When a WebSocket connects, the server instantiates a fresh environment for that session.
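The per-connection session model can be sketched as a small factory registry. This is an illustrative sketch, not the actual OpenEnv server code; `SessionManager` and `DummyEnv` are hypothetical names.

```python
# Hypothetical sketch of the per-WebSocket session model: the server holds
# the environment CLASS and instantiates a fresh environment per connection.
from typing import Dict, Type


class SessionManager:
    def __init__(self, env_cls: Type) -> None:
        self._env_cls = env_cls              # factory (class), not an instance
        self._sessions: Dict[str, object] = {}

    def connect(self, session_id: str) -> object:
        # New environment per connection: sessions never share episode state.
        self._sessions[session_id] = self._env_cls()
        return self._sessions[session_id]

    def disconnect(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)


class DummyEnv:
    """Stand-in for OverflowEnvironment."""


manager = SessionManager(DummyEnv)
a = manager.connect("ws-1")
b = manager.connect("ws-2")
assert a is not b  # independent episodes per connection
```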
---

## 2. File-by-File Breakdown

### `models.py` — Pydantic data models

Defines three classes inheriting from OpenEnv core types:

| Class | Parent | Purpose |
|-------|--------|---------|
| `OverflowAction(Action)` | `openenv.core.env_server.types.Action` | What the LLM sends each step |
| `OverflowObservation(Observation)` | `openenv.core.env_server.types.Observation` | What the environment returns |
| `OverflowState(State)` | `openenv.core.env_server.types.State` | Internal state exposed via `/state` |

All three are Pydantic `BaseModel` subclasses. The parent classes provide `metadata: Dict[str, Any]` (on Action and Observation) and `episode_id: str`, `step_count: int` (on State). The parent `Observation` provides `done: bool` and `reward: float | None`.

### `server/overflow_environment.py` — All game logic

Contains:
- `Car` dataclass — per-car state (id, lane, position, speed, goal, is_agent, reached_goal)
- `_parse_decision()` — tolerant action parser
- `_compute_reasoning_bonus()` — reasoning quality scorer
- `_scripted_car_action()` — NPC car AI
- `_apply_action()` — mutates a car's speed/lane
- `_generate_scene_description()` — builds the text observation
- `OverflowEnvironment(Environment)` — the main class with `reset()`, `step()`, `state`

### `server/app.py` — FastAPI wiring

Inspects `create_app` to determine whether it expects a factory (class) or an instance, then passes `OverflowEnvironment`, `OverflowAction`, and `OverflowObservation` to it. The resulting `app` object is what uvicorn serves.

### `client.py` — WebSocket client

`OverflowEnv(EnvClient[OverflowAction, OverflowObservation, OverflowState])` with three required methods:
- `_step_payload(action)` — serializes `OverflowAction` to `{"decision": ..., "reasoning": ...}`
- `_parse_result(payload)` — deserializes server JSON into `StepResult[OverflowObservation]`
- `_parse_state(payload)` — deserializes server JSON into `OverflowState`

### `__init__.py` — Public API

Exports: `OverflowAction`, `OverflowObservation`, `OverflowState`, `OverflowEnv`.

---

## 3. Data Models (Wire Format)

### OverflowAction — What the training script sends to `/step`

```json
{
  "action": {
    "decision": "brake",
    "reasoning": "Car 3 is ahead in my lane, 15 units away, going slower. I should brake."
  }
}
```

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `decision` | `str` | No | `"maintain"` | One of: `accelerate`, `brake`, `lane_change_left`, `lane_change_right`, `maintain` |
| `reasoning` | `str` | No | `""` | Free-text chain-of-thought. Affects reward via the reasoning bonus (0.0–2.0). |

The `decision` field is parsed tolerantly — see Section 9.

### OverflowObservation — What the server returns

Each observation carries **both** text (for the LLM) and structured data (for the frontend/viz).

```json
{
  "observation": {
    "scene_description": "You are Car 0 in lane 2, position 45, speed 60.\n...",
    "incident_report": "Observer: No incidents this step.",
    "done": false,
    "reward": 1.45,
    "cars": [
      {"carId": 0, "lane": 2, "position": {"x": 45.0, "y": 7.4}, "speed": 60.0, "acceleration": 5.0},
      {"carId": 1, "lane": 1, "position": {"x": 43.0, "y": 3.7}, "speed": 55.0, "acceleration": 0.0}
    ],
    "proximities": [
      {"carA": 0, "carB": 1, "distance": 10.5}
    ],
    "lane_occupancies": [
      {"lane": 1, "carIds": [1]},
      {"lane": 2, "carIds": [0]}
    ],
    "metadata": {}
  },
  "reward": 1.45,
  "done": false
}
```

#### Text fields (for the LLM)

| Field | Type | Description |
|-------|------|-------------|
| `scene_description` | `str` | Multi-line text describing all cars. This is what the LLM reads. |
| `incident_report` | `str` | Observer output. Either `"Observer: No incidents this step."` or a list of CRASH/NEAR MISS events. |

#### Structured fields (for the frontend — compatible with Overflow frontend types)

| Field | Type | Frontend equivalent |
|-------|------|---------------------|
| `cars` | `CarStateData[]` | `CarState[]` — `{carId, lane, position: {x, y}, speed, acceleration}` |
| `proximities` | `ProximityData[]` | `{carA, carB, distance}[]` — pairwise distances for close cars |
| `lane_occupancies` | `LaneOccupancyData[]` | `{lane, carIds}[]` — which cars are in each lane |

Position `y` is computed as `lane * 3.7` (lane width in metres), matching the frontend's `makeCar` convention.
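The lane-to-`y` convention above can be sketched as a small builder for the `cars` field. This is a sketch under the stated assumption that lane width is 3.7 metres; `to_car_state` is an illustrative helper, not the actual server code.

```python
LANE_WIDTH_M = 3.7  # assumed lane width, matching the frontend's makeCar

def to_car_state(car_id: int, lane: int, position: float,
                 speed: float, acceleration: float) -> dict:
    """Build one CarStateData dict as sent in the observation's `cars` field."""
    return {
        "carId": car_id,
        "lane": lane,
        # y is derived from the lane index, x is the along-road position
        "position": {"x": position, "y": round(lane * LANE_WIDTH_M, 1)},
        "speed": speed,
        "acceleration": acceleration,
    }

print(to_car_state(0, 2, 45.0, 60.0, 5.0)["position"])  # {'x': 45.0, 'y': 7.4}
```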
#### Common fields

| Field | Type | Description |
|-------|------|-------------|
| `done` | `bool` | `true` if the episode ended (crash, goal reached, or max steps). |
| `reward` | `float` | Scalar reward for this step. Sum of all reward components. |

The `reward` and `done` fields appear both inside `observation` and at the top level of the response (OpenEnv convention).

### OverflowState — What `/state` returns

```json
{
  "episode_id": "a1b2c3d4-...",
  "step_count": 17,
  "crash_count": 0,
  "near_miss_count": 23,
  "cars_reached_goal": 1,
  "total_cars": 5
}
```

| Field | Type | Description |
|-------|------|-------------|
| `episode_id` | `str` | UUID for this episode. Set on `reset()`. |
| `step_count` | `int` | How many `step()` calls have been made. |
| `crash_count` | `int` | Cumulative crash events (each pair counts as 1). |
| `near_miss_count` | `int` | Cumulative near-miss events (each pair counts as 1). |
| `cars_reached_goal` | `int` | How many cars (including scripted) reached their goal. |
| `total_cars` | `int` | Always 5. |

---

## 4. Simulation Internals

### The Road

- 3 lanes, numbered 1, 2, 3 (1 = leftmost, 3 = rightmost)
- Road length: ~200 position units
- No wrapping — cars move forward from low positions toward high positions
- Lanes are conceptually 10 units apart for distance calculations

### Car State

Each car is a `Car` dataclass:

```python
@dataclass
class Car:
    car_id: int           # 0 = agent, 1–4 = scripted
    lane: int             # 1, 2, or 3
    position: float       # 0.0 to ~200.0 (along the road)
    speed: float          # 20.0 to 90.0
    goal_position: float  # 160.0 to 195.0
    is_agent: bool        # True only for car 0
    reached_goal: bool    # True once position >= goal_position
```

### Initialization (reset)

On `reset(seed=N)`:
1. A `random.Random(seed)` RNG is created (deterministic replays with the same seed).
2. 5 cars are spawned:
   - **Lane**: random 1–3
   - **Position**: random 10–80 (spread across the first half of the road)
   - **Speed**: random 40–70
   - **Goal**: random 160–195
3. No two cars occupy the same 10-unit segment in the same lane at spawn (deconflicted via a `(lane, position // 10)` hash).
4. Car 0 is the agent. Cars 1–4 are scripted.

### Movement

Each step, every active (non-goal-reached) car moves forward:

```
car.position += car.speed * 0.1
```

This means a car at speed 60 moves 6.0 units per step. At that rate, traversing the ~120-unit gap from the starting zone (10–80) to the goal zone (160–195) takes roughly 20 steps. Faster cars (speed 90) move 9.0 units/step and reach goals sooner.
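The steps-to-goal arithmetic above can be checked with a tiny simulation of the movement rule. This sketch only replays `position += speed * 0.1`; it ignores braking, acceleration, and collisions.

```python
def steps_to_goal(position: float, goal: float, speed: float) -> int:
    """Count steps until `position` reaches `goal` at a constant speed."""
    steps = 0
    while position < goal:
        position += speed * 0.1  # the per-step movement rule
        steps += 1
    return steps

# A 120-unit gap at speed 60 (6 units/step) takes exactly 20 steps.
print(steps_to_goal(60.0, 180.0, 60.0))  # 20
```

At speed 90 the same worst-case spread (spawn 10, goal 195) closes in about 21 steps, well under the 100-step limit.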
---

## 5. Step-by-Step Execution Pipeline

When `step(action)` is called, the following happens **in this exact order**:

```
1.  GUARD: if episode is already done → return stale observation with reward=0.0
2.  INCREMENT step_count
3.  PARSE the agent's action → one of {accelerate, brake, lane_change_left, lane_change_right, maintain}
4.  APPLY action to Car 0 (mutate speed or lane)
5.  COMPUTE scripted actions for Cars 1–4 and APPLY them
6.  MOVE all active cars forward: position += speed * 0.1
7.  COLLISION DETECTION (pairwise over all active cars):
    - distance < 5.0  → CRASH (reward -5.0, episode ends)
    - distance < 15.0 → NEAR MISS (reward -1.0 per pair)
8.  If no crash:
    a. Check if Car 0 reached its goal → reward +3.0, episode ends
    b. Check if scripted cars reached their goals (state tracking only)
    c. If the episode is not ending → SAFE STEP bonus: reward +0.5
9.  REASONING BONUS: score the reasoning text → reward +0.0 to +2.0
10. MAX STEPS CHECK: if step_count >= 100 → episode ends
11. BUILD observation text and incident report
12. RETURN OverflowObservation(scene_description, incident_report, done, reward)
```

**Important ordering detail**: Actions are applied (steps 4–5) **before** movement (step 6), so the agent's speed/lane change takes effect for this step's movement. Collision detection (step 7) happens **after** movement, on the new positions.

**Reward accumulation within a step**: A single step's reward is the **sum** of all applicable components. For example, with 2 near-miss pairs, a surviving agent, and a reasoning bonus of 1.5, the reward is `(-1.0 * 2) + 0.5 + 1.5 = 0.0`.
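The per-step sum can be reproduced with simple arithmetic. This is a simplified sketch of how the components combine (constants mirror Section 14); the exact interaction of crash and near-miss penalties in the real implementation may differ.

```python
REWARD_CRASH = -5.0
REWARD_NEAR_MISS = -1.0
REWARD_SAFE_STEP = 0.5
REWARD_REACHED_GOAL = 3.0

def step_reward(near_miss_pairs: int, crashed: bool, reached_goal: bool,
                reasoning_bonus: float) -> float:
    """Sum the per-step reward components described above (sketch)."""
    reward = 0.0
    if crashed:
        reward += REWARD_CRASH
    else:
        reward += REWARD_NEAR_MISS * near_miss_pairs
        if reached_goal:
            reward += REWARD_REACHED_GOAL
        else:
            reward += REWARD_SAFE_STEP
    reward += reasoning_bonus  # applies even on crash/goal steps
    return reward

print(step_reward(2, False, False, 1.5))  # (-1.0 * 2) + 0.5 + 1.5 = 0.0
```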
---

## 6. Distance and Collision Model

Distance between two cars uses a weighted Euclidean formula:

```python
from math import sqrt

def distance_to(self, other):
    lane_diff = abs(self.lane - other.lane) * 10.0
    pos_diff = abs(self.position - other.position)
    return sqrt(lane_diff**2 + pos_diff**2)
```

**Implications**:
- Two cars in the **same lane** at positions 45 and 50: distance = 5.0 (exactly at the crash threshold)
- Two cars in **adjacent lanes** (e.g., lane 1 and lane 2) at the same position: distance = 10.0 (near miss, not crash)
- Two cars **two lanes apart** at the same position: distance = 20.0 (safe, no incident)
- Two cars in adjacent lanes, 10 units apart longitudinally: distance = sqrt(100 + 100) ≈ 14.1 (near miss)
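The four cases listed above can be verified with a standalone version of the formula:

```python
from math import sqrt

def distance(lane_a: int, pos_a: float, lane_b: int, pos_b: float) -> float:
    """Weighted Euclidean distance: lanes count as 10 units apart."""
    lane_diff = abs(lane_a - lane_b) * 10.0
    pos_diff = abs(pos_a - pos_b)
    return sqrt(lane_diff ** 2 + pos_diff ** 2)

print(distance(1, 45.0, 1, 50.0))            # 5.0  — exactly at the crash threshold
print(distance(1, 45.0, 2, 45.0))            # 10.0 — near miss, not crash
print(distance(1, 45.0, 3, 45.0))            # 20.0 — safe
print(round(distance(1, 45.0, 2, 55.0), 1))  # 14.1 — near miss
```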
**Key insight for the agent**: Lane changes provide safety via the 10-unit lane multiplier. Staying in the same lane as another car is the primary crash risk. The agent should use lane changes proactively to maintain distance from cars in its lane.

### Collision detection scope

Detection is **pairwise over ALL active cars**, not just agent-involving pairs. If Car 2 and Car 3 crash, the episode still ends with -5.0 reward. This means the agent is implicitly responsible for the overall traffic flow — it should avoid creating situations where its actions cause chain reactions among scripted cars.

---

## 7. Reward Function — Complete Breakdown

### Per-step reward components

| Component | Value | Condition | Stacks? |
|-----------|-------|-----------|---------|
| **Crash** | -5.0 | Any pair distance < 5.0 | Once (episode ends) |
| **Near miss** | -1.0 | Per pair with distance < 15.0 | Yes, per pair (can be -2.0, -3.0, etc.) |
| **Safe step** | +0.5 | No crash and episode not ending this step | Once per step |
| **Goal reached** | +3.0 | Car 0's position >= goal_position | Once (episode ends) |
| **Reasoning bonus** | +0.0 to +2.0 | Based on reasoning text quality | Once per step |

### Reasoning bonus scoring

The bonus has three sub-components, capped at 2.0 total:

**Length bonus** (up to 0.5):
- `len > 20` chars → +0.2
- `len > 50` chars → +0.15
- `len > 100` chars → +0.15

**Keyword awareness** (up to 1.0):
Each keyword found → +0.2, capped at 1.0. Keywords: `ahead`, `behind`, `lane`, `speed`, `distance`, `safe`, `danger`, `collision`, `brake`, `gap`, `close`, `slow`, `fast`, `goal`, `position`.

**Structure bonus** (up to 0.5):
- Contains `<think>` or `because` → +0.25
- Contains `therefore`, `so i should`, `best option`, or `i will` → +0.25
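The scoring rules above translate to a scorer like the following. This is a sketch consistent with the stated rules; the actual `_compute_reasoning_bonus()` may differ in details.

```python
KEYWORDS = ["ahead", "behind", "lane", "speed", "distance", "safe", "danger",
            "collision", "brake", "gap", "close", "slow", "fast", "goal",
            "position"]

def reasoning_bonus(reasoning: str) -> float:
    """Score reasoning text per the length/keyword/structure rules (sketch)."""
    text = reasoning.lower()
    bonus = 0.0
    # Length bonus (up to 0.5)
    if len(reasoning) > 20:
        bonus += 0.2
    if len(reasoning) > 50:
        bonus += 0.15
    if len(reasoning) > 100:
        bonus += 0.15
    # Keyword awareness (up to 1.0)
    bonus += min(1.0, 0.2 * sum(1 for kw in KEYWORDS if kw in text))
    # Structure bonus (up to 0.5)
    if "<think>" in text or "because" in text:
        bonus += 0.25
    if any(p in text for p in ("therefore", "so i should", "best option", "i will")):
        bonus += 0.25
    return min(2.0, bonus)

print(reasoning_bonus(""))  # 0.0 — empty reasoning earns nothing
```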
### Typical reward ranges per step

| Scenario | Typical reward |
|----------|---------------|
| Safe step, no reasoning | +0.5 |
| Safe step, decent reasoning | +1.0 to +2.0 |
| Safe step, excellent reasoning | +2.0 to +2.5 |
| 1 near miss, decent reasoning | -0.5 to +0.5 |
| 2 near misses, decent reasoning | -1.5 to -0.5 |
| Crash (any) | -5.0 + reasoning bonus |
| Goal reached, good reasoning | +3.0 + reasoning bonus |

### Episode return (total reward) characteristics

Based on testing with seed=42:
- A "maintain" strategy with decent reasoning earns ~1.1 per step × ~17 steps ≈ 18.7 total, minus near-miss penalties
- Aggressive "accelerate" strategies reach the goal faster but accumulate more near misses
- Smart strategies that use lane changes and braking to avoid near misses maximize total reward

---

## 8. Scripted Car AI

Cars 1–4 use `_scripted_car_action(car, all_cars, rng)`:

```
1. Find the nearest car AHEAD in the SAME LANE
2. If that car is < 20 units ahead → "brake"
3. Else if speed < 60 and 10% random chance → "accelerate"
4. Else if 5% random chance → lane change (random left/right, respecting boundaries)
5. Else → "maintain"
```
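The decision ladder above can be sketched as follows. `CarView` and `scripted_action` are illustrative stand-ins for the real `Car` dataclass and `_scripted_car_action()`; this sketch omits the lane-boundary check on step 4.

```python
import random
from dataclasses import dataclass


@dataclass
class CarView:
    """Minimal stand-in for the Car fields the NPC policy reads."""
    lane: int
    position: float
    speed: float


def scripted_action(car: CarView, all_cars: list, rng: random.Random) -> str:
    # 1. Nearest car ahead in the same lane
    ahead = [c for c in all_cars
             if c is not car and c.lane == car.lane and c.position > car.position]
    # 2. Brake if blocked within 20 units
    if ahead and min(c.position for c in ahead) - car.position < 20:
        return "brake"
    # 3. Occasionally speed up when slow
    if car.speed < 60 and rng.random() < 0.10:
        return "accelerate"
    # 4. Rarely change lanes (boundary check omitted in this sketch)
    if rng.random() < 0.05:
        return rng.choice(["lane_change_left", "lane_change_right"])
    # 5. Default
    return "maintain"


blocked = CarView(lane=1, position=40.0, speed=50.0)
leader = CarView(lane=1, position=55.0, speed=45.0)
print(scripted_action(blocked, [blocked, leader], random.Random(0)))  # "brake"
```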
**Characteristics**:
- Scripted cars are mostly passive — they maintain speed
- They brake reactively when blocked (but only for same-lane cars ahead)
- They rarely change lanes (5% per step), making their behavior somewhat predictable
- They never intentionally avoid the agent — they only react to cars directly ahead
- They can accumulate near misses and crashes among themselves

This creates an environment where a smart agent can learn to navigate around largely predictable but occasionally erratic traffic.

---

## 9. Action Parsing — How LLM Output Becomes a Decision

The parser `_parse_decision(action)` is intentionally forgiving. It tries three strategies in order:

### Strategy 1: Direct field match
```python
decision = action.decision.strip().lower().replace(" ", "_")
# If it's one of {accelerate, brake, lane_change_left, lane_change_right, maintain} → use it
```

### Strategy 2: XML tag extraction
```python
text = f"{action.decision} {action.reasoning}".lower()
match = re.search(r"<action>\s*(\w+)\s*</action>", text)
# If found and valid → use it
```

This handles LLM outputs like:
```
decision: "think about it"
reasoning: "<think>Car ahead is close</think><action>brake</action>"
```

### Strategy 3: Keyword scan
```python
for v in {"accelerate", "brake", "lane_change_left", "lane_change_right", "maintain"}:
    if v in text:
        return v
```

This handles outputs like `decision: "I want to accelerate now"`.

### Fallback
If nothing matches → `"maintain"` (safe default).

**For training scripts**: The cleanest format is to put the exact decision string in the `decision` field. The tolerant parsing exists so that LLMs early in training (before they learn the format) still produce valid actions rather than crashing.
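The three strategies combine into a parser like this standalone sketch (the real `_parse_decision` takes an action object rather than two strings):

```python
import re

VALID = {"accelerate", "brake", "lane_change_left", "lane_change_right", "maintain"}

def parse_decision(decision: str, reasoning: str = "") -> str:
    """Tolerant three-strategy parser sketch."""
    d = decision.strip().lower().replace(" ", "_")
    if d in VALID:                            # Strategy 1: direct field match
        return d
    text = f"{decision} {reasoning}".lower()
    m = re.search(r"<action>\s*(\w+)\s*</action>", text)
    if m and m.group(1) in VALID:             # Strategy 2: XML tag extraction
        return m.group(1)
    for v in VALID:                           # Strategy 3: keyword scan
        if v in text:
            return v
    return "maintain"                         # Fallback: safe default

print(parse_decision("Lane Change Left"))                 # lane_change_left
print(parse_decision("think", "<action>brake</action>"))  # brake
print(parse_decision("I want to accelerate now"))         # accelerate
print(parse_decision("dunno"))                            # maintain
```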
---

## 10. Observation Text Format

The `scene_description` field is a multi-line string that the LLM reads as its input. Example:

```
You are Car 0 in lane 2, position 45, speed 60.
Goal: reach position 180.
Nearby cars:
- Car 1: lane 1, position 43, speed 55
- Car 2: lane 3, position 48, speed 70
- Car 3: lane 2, position 65, speed 50 [AHEAD IN YOUR LANE - 20 units away]
- Car 4: lane 1, position 30, speed 65
```

**Annotations added**:
- `[AHEAD IN YOUR LANE - N units away]` — same lane, ahead of the agent
- `[BEHIND IN YOUR LANE - N units away]` — same lane, behind the agent
- `[REACHED GOAL]` — car has finished

The `incident_report` is separate:
- No incidents: `"Observer: No incidents this step."`
- With incidents: one line per event, e.g.:
```
NEAR MISS between Car 0 and Car 3 (distance: 12.5)
Car 0 reached its goal at position 180!
```

---

## 11. Server Protocol — What Training Scripts Must Send

### WebSocket Protocol (Primary — for training)

Connect to `ws://host:8000/ws`. All messages are JSON.

#### Reset

**Send:**
```json
{"type": "reset", "data": {"seed": 42}}
```

`data` can include `seed` (int) and/or `episode_id` (str). Both are optional.

**Receive:**
```json
{
  "type": "observation",
  "data": {
    "observation": {
      "scene_description": "You are Car 0 in lane 3, position 24, speed 40.\n...",
      "incident_report": "",
      "done": false,
      "reward": 0.0,
      "metadata": {}
    },
    "reward": 0.0,
    "done": false
  }
}
```

#### Step

**Send:**
```json
{
  "type": "step",
  "data": {
    "decision": "brake",
    "reasoning": "Car ahead is close, braking to maintain safe distance."
  }
}
```

**Receive:**
```json
{
  "type": "observation",
  "data": {
    "observation": {
      "scene_description": "You are Car 0 in lane 3, position 27, speed 35.\n...",
      "incident_report": "Observer: No incidents this step.",
      "done": false,
      "reward": 2.25,
      "metadata": {}
    },
    "reward": 2.25,
    "done": false
  }
}
```

#### State

**Send:**
```json
{"type": "state"}
```

**Receive:**
```json
{
  "type": "state",
  "data": {
    "episode_id": "a1b2c3d4-...",
    "step_count": 7,
    "crash_count": 0,
    "near_miss_count": 3,
    "cars_reached_goal": 0,
    "total_cars": 5
  }
}
```

#### Close

**Send:**
```json
{"type": "close"}
```
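The message shapes above can be built with small helpers; this sketch only constructs the JSON payloads (the `reset_msg`/`step_msg` names are illustrative). Sending them over an actual WebSocket connection, e.g. with the `websockets` package, is left to the training script.

```python
import json
from typing import Optional

def reset_msg(seed: Optional[int] = None) -> str:
    """Build the JSON text for a reset message to ws://host:8000/ws."""
    data = {} if seed is None else {"seed": seed}
    return json.dumps({"type": "reset", "data": data})

def step_msg(decision: str, reasoning: str = "") -> str:
    """Build the JSON text for a step message."""
    return json.dumps({"type": "step",
                       "data": {"decision": decision, "reasoning": reasoning}})

print(reset_msg(42))  # {"type": "reset", "data": {"seed": 42}}
print(step_msg("brake", "Car ahead is close."))
```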
### HTTP Protocol (Fallback — for simple testing)

Note: In factory mode the HTTP API creates a **new environment instance per endpoint**, so `/reset` and `/step` calls hit separate instances. Use the WebSocket for stateful multi-step episodes.

```
POST /reset   Body: {"seed": 42}                                          → {"observation": {...}, "reward": 0.0, "done": false}
POST /step    Body: {"action": {"decision": "brake", "reasoning": "..."}} → {"observation": {...}, "reward": ..., "done": ...}
GET  /state                                                               → {"episode_id": ..., "step_count": ..., ...}
GET  /health                                                              → {"status": "healthy"}
GET  /schema                                                              → {"action": {...}, "observation": {...}, "state": {...}}
```

### Using the Python Client

```python
from overflow_env import OverflowEnv, OverflowAction

with OverflowEnv(base_url="http://localhost:8000") as env:
    result = env.reset(seed=42)
    # result is StepResult[OverflowObservation]
    #   result.observation.scene_description — the text for the LLM
    #   result.observation.incident_report   — observer output
    #   result.reward — float
    #   result.done   — bool

    while not result.done:
        # Feed scene_description to the LLM, get decision + reasoning back
        llm_decision, llm_reasoning = call_llm(result.observation.scene_description)

        action = OverflowAction(decision=llm_decision, reasoning=llm_reasoning)
        result = env.step(action)

    # Episode over
    state = env.state()
    print(f"Steps: {state.step_count}, Crashes: {state.crash_count}")
```

---

## 12. Training Integration — GRPO / TRL

### System prompt for the LLM

The training script should set a system prompt like:

```
You are an autonomous vehicle controller. Each turn you receive a traffic scene description.
You must output a driving decision and your reasoning.

Available decisions: accelerate, brake, lane_change_left, lane_change_right, maintain

Output format:
<think>Your reasoning about the traffic situation</think>
<action>your_decision</action>
```

### What the training loop does each episode

```python
# 1. Reset environment
result = env.reset(seed=episode_seed)

# 2. Build initial prompt
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": result.observation.scene_description}
]

trajectory_rewards = []

# 3. Loop until done
while not result.done:
    # 3a. Get LLM completion
    completion = model.generate(messages)  # the text the LLM produces

    # 3b. Parse LLM output into an action
    # The environment's parser is tolerant, but for clean training
    # you might also parse on the client side
    action = OverflowAction(
        decision=extract_decision(completion),
        reasoning=completion  # pass full text as reasoning
    )

    # 3c. Step
    result = env.step(action)
    trajectory_rewards.append(result.reward)

    # 3d. Append to conversation for next turn
    messages.append({"role": "assistant", "content": completion})
    messages.append({"role": "user", "content": (
        result.observation.scene_description + "\n" +
        result.observation.incident_report
    )})

# 4. Compute episode return for GRPO
episode_return = sum(trajectory_rewards)
```

### GRPO reward signal

For GRPO (Group Relative Policy Optimization), the reward signal is the **episode return** — the sum of all per-step rewards across the episode. The environment is designed so that:

- **Positive episode returns** (agent reached the goal safely with good reasoning) indicate good behavior
- **Negative episode returns** (crashes, many near misses) indicate bad behavior
- The **reasoning bonus** provides per-step reward shaping that encourages the LLM to explain its thinking, which improves interpretability and can speed up learning

### Constructing the reward for TRL

If using TRL's `OnlineDPOTrainer` or `GRPOTrainer`:

```python
# Per-step reward is already in result.reward
# For token-level reward (assign to last token of each turn):
rewards_per_turn = trajectory_rewards  # list of floats, one per step

# For episode-level reward (assign to last token of episode):
episode_reward = sum(trajectory_rewards)
```
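GRPO compares each rollout's return against the other rollouts of the same prompt. The sketch below shows the group-relative baseline idea for illustration only; TRL's `GRPOTrainer` computes this normalization internally.

```python
def group_advantages(returns: list) -> list:
    """Normalize episode returns within a group of rollouts (GRPO-style)."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    std = var ** 0.5 or 1.0  # guard against identical returns (std == 0)
    return [(r - mean) / std for r in returns]

# Four rollouts of the same seed: the crash episode (-3.0) gets a strongly
# negative advantage, the best episode a positive one.
adv = group_advantages([18.7, 25.0, -3.0, 31.2])
print([round(a, 2) for a in adv])
```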
---

## 13. Episode Dynamics and RL Characteristics

### Episode length distribution

| Scenario | Typical length |
|----------|---------------|
| Aggressive accelerate → goal | 12–20 steps |
| Moderate maintain → goal | 18–30 steps |
| Conservative braking | 30–50+ steps |
| Crash (bad luck or bad driving) | 5–15 steps |
| Max steps timeout | 100 steps |

### What makes this environment learnable

1. **Clear signal**: Crashes give -5.0, goals give +3.0. The agent quickly learns that crashing is bad and reaching the goal is good.

2. **Gradual improvement**: Near misses (-1.0 each) provide an intermediate signal. An agent that learns to avoid near misses gets higher returns than one that only avoids crashes.

3. **Speed-accuracy tradeoff**: Accelerating reaches the goal faster (more +3.0 episodes) but increases crash/near-miss risk. The optimal policy accelerates when safe and brakes or changes lanes when needed.

4. **Reasoning is rewarded**: The reasoning bonus (up to +2.0/step) means that over a 20-step episode, reasoning alone can contribute up to +40.0. This incentivizes the LLM to produce structured, situation-aware reasoning.

5. **Stochasticity**: Scripted cars have random elements (10% accelerate, 5% lane change). The same seed produces the same episode, but different seeds produce different traffic patterns, forcing the agent to generalize.

6. **All-pairs collision**: The agent is rewarded and punished for the entire traffic system, not just its own car, so it must be aware of the overall traffic flow.

### Typical learning progression

1. **Random policy**: Mostly "maintain", occasional random actions. Episode return: 0 to 15 (depending on luck).
2. **Basic safety**: The agent learns to brake when the car ahead is close. Fewer crashes, more goals. Episode return: 10 to 25.
3. **Strategic driving**: The agent learns to change lanes proactively, accelerate when clear, and brake early. Episode return: 20 to 40.
4. **Optimized reasoning**: The agent produces structured reasoning with relevant keywords, maximizing the reasoning bonus. Episode return: 30 to 60.

### Reproducibility

Passing `seed=N` to `reset()` produces deterministic initial conditions and scripted car behavior (since the `random.Random` instance is seeded). The same seed plus the same agent actions yields the same trajectory. This is critical for GRPO, which compares multiple rollouts of the same prompt.
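The seeding property can be demonstrated with a toy spawn function; this is illustrative, not the actual spawn code.

```python
import random

def spawn_positions(seed: int, n: int = 5) -> list:
    """Illustrative: same seed -> same spawn layout, mirroring reset(seed=N)."""
    rng = random.Random(seed)
    return [round(rng.uniform(10, 80), 2) for _ in range(n)]

assert spawn_positions(42) == spawn_positions(42)  # deterministic replay
assert spawn_positions(42) != spawn_positions(43)  # different traffic patterns
```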
706
+ ---
707
+
708
+ ## 14. Configuration Constants
709
+
710
+ All constants are defined at the top of `server/overflow_environment.py`:
711
+
712
+ ```python
713
+ NUM_LANES = 3 # Number of road lanes
714
+ ROAD_LENGTH = 200 # Conceptual road length (units)
715
+ NUM_CARS = 5 # Total cars (1 agent + 4 scripted)
716
+ MAX_STEPS = 100 # Maximum steps before forced termination
717
+ CRASH_DISTANCE = 5.0 # Distance threshold for crash
718
+ NEAR_MISS_DISTANCE = 15.0 # Distance threshold for near miss
719
+
720
+ REWARD_CRASH = -5.0 # Reward for any crash
721
+ REWARD_NEAR_MISS = -1.0 # Reward per near-miss pair
722
+ REWARD_SAFE_STEP = 0.5 # Reward for surviving a step
723
+ REWARD_REACHED_GOAL = 3.0 # Reward for reaching goal
724
+ REWARD_REASONING_MAX = 2.0 # Maximum reasoning quality bonus
725
+
726
+ MIN_SPEED = 20 # Minimum car speed
727
+ MAX_SPEED = 90 # Maximum car speed
728
+ SPEED_DELTA = 5 # Speed change per accelerate/brake
729
+ ```
730
+
731
+ To tune difficulty:
732
+ - **Easier**: Increase `CRASH_DISTANCE` and `NEAR_MISS_DISTANCE`, decrease `NUM_CARS`, widen starting positions
733
+ - **Harder**: Decrease distances, increase `NUM_CARS`, narrow starting positions, increase `MAX_SPEED`
734
+ - **Longer episodes**: Increase `ROAD_LENGTH` or decrease starting speeds
735
+ - **More reasoning incentive**: Increase `REWARD_REASONING_MAX`
736
+
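Since these are plain module-level constants, a training script can keep named difficulty presets and merge them over the defaults. A sketch; the preset values are illustrative, not tuned settings:

```python
# Baseline values for a subset of the constants above.
DEFAULTS = {"CRASH_DISTANCE": 5.0, "NEAR_MISS_DISTANCE": 15.0, "NUM_CARS": 5}

# Illustrative presets following the tuning guidance above.
PRESETS = {
    "easy": {"CRASH_DISTANCE": 7.0, "NEAR_MISS_DISTANCE": 20.0, "NUM_CARS": 3},
    "hard": {"CRASH_DISTANCE": 3.0, "NEAR_MISS_DISTANCE": 10.0, "NUM_CARS": 7},
}

def config_for(name: str) -> dict:
    """Merge a named preset over the defaults; unknown names fall back to defaults."""
    return {**DEFAULTS, **PRESETS.get(name, {})}

print(config_for("easy")["NUM_CARS"])  # 3
```

The resulting dict can then be applied however the environment is constructed, e.g. by setting the corresponding attributes on the module before instantiation.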
737
+ ---
738
+
739
+ ## 15. Docker and Deployment
740
+
741
+ ### Local development
742
+
743
+ ```bash
744
+ uvicorn overflow_env.server.app:app --host 0.0.0.0 --port 8000 --reload
745
+ ```
746
+
747
+ ### Docker build
748
+
749
+ ```bash
750
+ # From the overflow_env/ directory:
751
+ docker build -t overflow-env:latest -f server/Dockerfile .
752
+ docker run -p 8000:8000 overflow-env:latest
753
+ ```
754
+
755
+ The Dockerfile uses a multi-stage build:
756
+ 1. **Builder stage**: Installs dependencies with `uv sync` into a `.venv`
757
+ 2. **Runtime stage**: Copies the `.venv` and source code, runs uvicorn
758
+
759
+ Base image: `ghcr.io/meta-pytorch/openenv-base:latest`
760
+
761
+ ### Push to HuggingFace Spaces
762
+
763
+ ```bash
764
+ openenv push --repo-id username/overflow-env
765
+ ```
766
+
767
+ ### Connect from training script
768
+
769
+ ```python
770
+ # Local
771
+ env = OverflowEnv(base_url="http://localhost:8000")
772
+
773
+ # Docker
774
+ env = OverflowEnv.from_docker_image("overflow-env:latest")
775
+
776
+ # HuggingFace Space
777
+ env = OverflowEnv.from_env("username/overflow-env")
778
+ ```
779
+
780
+ ### openenv.yaml manifest
781
+
782
+ ```yaml
783
+ spec_version: 1
784
+ name: overflow_env
785
+ type: space
786
+ runtime: fastapi
787
+ app: server.app:app
788
+ port: 8000
789
+ ```
790
+
791
+ This tells OpenEnv tooling how to find and run the environment.
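Regardless of deployment target, a training script can wait for the server to become ready before connecting. A minimal sketch, assuming the `/health` endpoint referenced by the Docker HEALTHCHECK; the helper name is hypothetical:

```python
import time
import urllib.error
import urllib.request

def wait_for_health(base_url: str, timeout_s: float = 30.0) -> bool:
    """Poll GET {base_url}/health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=3) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(0.5)
    return False
```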
Dockerfile ADDED
@@ -0,0 +1,22 @@
1
+ FROM python:3.11-slim
2
+
3
+ WORKDIR /app
4
+
5
+ # Install system deps
6
+ RUN apt-get update && \
7
+ apt-get install -y --no-install-recommends git curl && \
8
+ rm -rf /var/lib/apt/lists/*
9
+
10
+ # Copy environment code into a proper package directory
11
+ COPY . /app/overflow_env
12
+
13
+ # Install dependencies via pip using requirements.txt
14
+ RUN pip install --no-cache-dir -r /app/overflow_env/server/requirements.txt
15
+
16
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
17
+ CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
18
+
19
+ EXPOSE 8000
20
+
21
+ ENV ENABLE_WEB_INTERFACE=true
22
+ CMD ["uvicorn", "overflow_env.server.app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,12 +1,70 @@
1
  ---
2
- title: Overflow Openenv
3
- emoji: πŸ“Š
4
- colorFrom: red
5
- colorTo: gray
6
  sdk: gradio
7
- sdk_version: 6.9.0
8
  app_file: app.py
9
  pinned: false
 
 
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: Overflow OpenENV
3
+ emoji: πŸš—
4
+ colorFrom: blue
5
+ colorTo: green
6
  sdk: gradio
7
+ sdk_version: 4.44.0
8
  app_file: app.py
9
  pinned: false
10
+ tags:
11
+ - openenv
12
  ---
13
 
14
+ # Overflow Environment
15
+
16
+ An autonomous vehicle fleet oversight environment for [OpenEnv](https://github.com/meta-pytorch/OpenEnv).
17
+
18
+ ## Overview
19
+
20
+ A 2D road grid with N cars. One car (Car 0) is controlled by an LLM agent, while other cars follow simple scripted driving rules. An observer detects crashes and near-misses each step and computes rewards based on safety.
21
+
22
+ ## Quick Start
23
+
24
+ ```bash
25
+ # Install dependencies
26
+ pip install -e .
27
+
28
+ # Run the server
29
+ uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
30
+ ```
31
+
32
+ ```python
33
+ from overflow_env import OverflowEnv, OverflowAction
34
+
35
+ with OverflowEnv(base_url="http://localhost:8000") as env:
36
+ result = env.reset()
37
+ print(result.observation.scene_description)
38
+
39
+ action = OverflowAction(decision="maintain", reasoning="Road is clear ahead.")
40
+ result = env.step(action)
41
+ print(result.observation.incident_report)
42
+ print(f"Reward: {result.reward}, Done: {result.done}")
43
+ ```
44
+
45
+ ## Action Space
46
+
47
+ | Decision | Effect |
48
+ |----------|--------|
49
+ | `accelerate` | Increase speed by 5 |
50
+ | `brake` | Decrease speed by 5 |
51
+ | `lane_change_left` | Move to left lane |
52
+ | `lane_change_right` | Move to right lane |
53
+ | `maintain` | Keep current speed and lane |
54
+
55
+ ## Reward Structure
56
+
57
+ | Event | Reward |
58
+ |-------|--------|
59
+ | Crash (distance < 5) | -5.0 |
60
+ | Near miss (distance < 15) | -1.0 |
61
+ | Safe step toward goal | +0.5 |
62
+ | Reached goal | +3.0 |
63
+ | Reasoning quality bonus | +0.0 to +2.0 |
64
+
65
+ ## Environment Details
66
+
67
+ - **Road**: 3 lanes, ~200 units long
68
+ - **Cars**: 5 total (1 agent + 4 scripted)
69
+ - **Max steps**: 100 per episode
70
+ - **Speed range**: 20–90 units
__init__.py ADDED
@@ -0,0 +1,27 @@
1
+ """Overflow Environment β€” Autonomous vehicle fleet oversight for OpenEnv."""
2
+
3
+ try:
4
+ from .client import OverflowEnv
5
+ except ImportError:
6
+ OverflowEnv = None # openenv-core not installed; training-only mode
7
+
8
+ from .models import (
9
+ CarStateData,
10
+ LaneOccupancyData,
11
+ OverflowAction,
12
+ OverflowObservation,
13
+ OverflowState,
14
+ Position,
15
+ ProximityData,
16
+ )
17
+
18
+ __all__ = [
19
+ "OverflowAction",
20
+ "OverflowObservation",
21
+ "OverflowState",
22
+ "OverflowEnv",
23
+ "CarStateData",
24
+ "Position",
25
+ "ProximityData",
26
+ "LaneOccupancyData",
27
+ ]
app.py ADDED
@@ -0,0 +1,424 @@
1
+ """
2
+ OpenENV RL Demo β€” Gradio UI entrypoint for HuggingFace Spaces.
3
+
4
+ Runs inside the overflow_env package root. All imports use absolute paths
5
+ so they work both as a package (installed) and as a Space (flat root).
6
+ """
7
+
8
+ import sys, os
9
+ # When running as HF Space, make server/ importable with absolute paths
10
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
11
+
12
+ import math, time, threading
13
+ import numpy as np
14
+ import torch
15
+ import torch.optim as optim
16
+ import matplotlib
17
+ matplotlib.use("Agg")
18
+ import matplotlib.pyplot as plt
19
+ import matplotlib.patches as patches
20
+ import gradio as gr
21
+
22
+ from server.overflow_environment import OverflowEnvironment
23
+ from models import OverflowAction
24
+ from policies.flat_mlp_policy import FlatMLPPolicy
25
+ from policies.policy_spec import build_obs, build_ticket_vector, OBS_DIM
26
+
27
+
28
+ STEPS_PER_EPISODE = 20
29
+ NUM_LANES = 3
30
+ ROAD_LENGTH = 200
31
+
32
+
33
+ # ── Observation adapter ───────────────────────────────────────────────────────
34
+
35
+ def obs_to_vec(overflow_obs) -> np.ndarray:
36
+ cars = overflow_obs.cars
37
+ if not cars:
38
+ return np.zeros(OBS_DIM, dtype=np.float32)
39
+ ego = next((c for c in cars if c.carId == 0), cars[0])
40
+ ego_spd = ego.speed / 4.5
41
+ ego_x = ego.position.x
42
+ ego_y = (ego.lane - 2) * 3.7
43
+ tickets = []
44
+ for car in cars:
45
+ if car.carId == 0:
46
+ continue
47
+ rx = car.position.x - ego.position.x
48
+ ry = (car.lane - ego.lane) * 3.7
49
+ cs = car.speed / 4.5
50
+ d = math.sqrt(rx**2 + ry**2)
51
+ if d > 80:
52
+ continue
53
+ cl = max(ego_spd - cs * math.copysign(1, max(rx, 0.01)), 0.1)
54
+ tickets.append(build_ticket_vector(
55
+ severity_weight=1.0 if d < 8 else 0.75 if d < 15 else 0.5,
56
+ ttl=5.0, pos_x=rx, pos_y=ry, pos_z=0.0,
57
+ vel_x=cs, vel_y=0.0, vel_z=0.0, heading=0.0,
58
+ size_length=4.0, size_width=2.0, size_height=1.5,
59
+ distance=d, time_to_collision=min(d / cl, 30.0),
60
+ bearing=math.atan2(ry, max(rx, 0.01)),
61
+ ticket_type="collision_risk", entity_type="vehicle", confidence=1.0,
62
+ ))
63
+ tv = np.array(tickets, dtype=np.float32) if tickets else None
64
+ return build_obs(ego_x=ego_x, ego_y=ego_y, ego_z=0.0,
65
+ ego_vx=ego_spd, ego_vy=0.0,
66
+ heading=0.0, speed=ego_spd,
67
+ steer=0.0, throttle=0.5, brake=0.0,
68
+ ticket_vectors=tv)
69
+
70
+
71
+ def action_to_decision(a: np.ndarray) -> str:
72
+ s, t, b = float(a[0]), float(a[1]), float(a[2])
73
+ if abs(s) > 0.35: return "lane_change_left" if s < 0 else "lane_change_right"
74
+ if b > 0.25: return "brake"
75
+ if t > 0.20: return "accelerate"
76
+ return "maintain"
77
+
78
+
79
+ # ── Global training state ─────────────────────────────────────────────────────
80
+
81
+ policy = FlatMLPPolicy(obs_dim=OBS_DIM)
82
+ optimizer = optim.Adam(policy.parameters(), lr=3e-4, eps=1e-5)
83
+
84
+ _buf_obs = []
85
+ _buf_acts = []
86
+ _buf_rews = []
87
+ _buf_logps = []
88
+ _buf_vals = []
89
+ _buf_dones = []
90
+
91
+ episode_history = []
92
+ step_log = []
93
+ _running = False
94
+ _lock = threading.Lock()
95
+
96
+
97
+ def _ppo_mini_update():
98
+ if len(_buf_obs) < 2:
99
+ return
100
+ obs_t = torch.tensor(np.array(_buf_obs), dtype=torch.float32)
101
+ acts_t = torch.tensor(np.array(_buf_acts), dtype=torch.float32)
102
+ rews_t = torch.tensor(_buf_rews, dtype=torch.float32)
103
+ logp_t = torch.tensor(_buf_logps, dtype=torch.float32)
104
+ vals_t = torch.tensor(_buf_vals, dtype=torch.float32)
105
+ done_t = torch.tensor(_buf_dones, dtype=torch.float32)
106
+
107
+ gamma, lam = 0.99, 0.95
108
+ adv = torch.zeros_like(rews_t)
109
+ gae = 0.0
110
+ for t in reversed(range(len(rews_t))):
111
+ nv = 0.0 if t == len(rews_t) - 1 else float(vals_t[t + 1])
112
+ d = rews_t[t] + gamma * nv * (1 - done_t[t]) - vals_t[t]
113
+ gae = d + gamma * lam * (1 - done_t[t]) * gae
114
+ adv[t] = gae
115
+ ret = adv + vals_t
116
+ adv = (adv - adv.mean()) / (adv.std() + 1e-8)
117
+
118
+ policy.train()
119
+ act_mean, val = policy(obs_t)
120
+ val = val.squeeze(-1)
121
+ dist = torch.distributions.Normal(act_mean, torch.ones_like(act_mean) * 0.3)
122
+ logp = dist.log_prob(acts_t).sum(dim=-1)
123
+ entropy = dist.entropy().sum(dim=-1).mean()
124
+ ratio = torch.exp(logp - logp_t)
125
+ pg = torch.max(-adv * ratio, -adv * ratio.clamp(0.8, 1.2)).mean()
126
+ vf = 0.5 * ((val - ret) ** 2).mean()
127
+ loss = pg + 0.5 * vf - 0.02 * entropy
128
+ optimizer.zero_grad()
129
+ loss.backward()
130
+ torch.nn.utils.clip_grad_norm_(policy.parameters(), 0.5)
131
+ optimizer.step()
132
+
133
+
134
+ def run_episodes_loop():
135
+ global _running
136
+ ep_num = 0
137
+ env = OverflowEnvironment()
138
+
139
+ while _running:
140
+ ep_num += 1
141
+ obs = env.reset()
142
+ ep_rew = 0.0
143
+ outcome = "timeout"
144
+
145
+ _buf_obs.clear(); _buf_acts.clear(); _buf_rews.clear()
146
+ _buf_logps.clear(); _buf_vals.clear(); _buf_dones.clear()
147
+
148
+ for step in range(1, STEPS_PER_EPISODE + 1):
149
+ if not _running:
150
+ break
151
+
152
+ obs_vec = obs_to_vec(obs)
153
+ policy.eval()
154
+ with torch.no_grad():
155
+ obs_t = torch.tensor(obs_vec, dtype=torch.float32).unsqueeze(0)
156
+ act_mean, val = policy(obs_t)
157
+ dist = torch.distributions.Normal(act_mean.squeeze(0),
158
+ torch.ones(3) * 0.3)
159
+ action = dist.sample().clamp(-1, 1)
160
+ logp = dist.log_prob(action).sum()
161
+
162
+ decision = action_to_decision(action.numpy())
163
+ obs = env.step(OverflowAction(decision=decision, reasoning=""))
164
+ reward = float(obs.reward or 0.0)
165
+ done = obs.done
166
+ ep_rew += reward
167
+
168
+ _buf_obs.append(obs_vec)
169
+ _buf_acts.append(action.numpy())
170
+ _buf_rews.append(reward)
171
+ _buf_logps.append(float(logp))
172
+ _buf_vals.append(float(val.squeeze()))
173
+ _buf_dones.append(float(done))
174
+
175
+ with _lock:
176
+ step_log.append({
177
+ "ep": ep_num,
178
+ "step": step,
179
+ "decision": decision,
180
+ "reward": round(reward, 2),
181
+ "ep_reward": round(ep_rew, 2),
182
+ "incident": obs.incident_report or "",
183
+ "cars": [(c.carId, c.lane, c.position.x, c.speed)
184
+ for c in obs.cars],
185
+ })
186
+
187
+ if done:
188
+ outcome = "CRASH" if "CRASH" in (obs.incident_report or "") else "GOAL"
189
+ break
190
+
191
+ time.sleep(0.6)
192
+
193
+ _ppo_mini_update()
194
+
195
+ with _lock:
196
+ episode_history.append({
197
+ "ep": ep_num,
198
+ "steps": step,
199
+ "reward": round(ep_rew, 2),
200
+ "outcome": outcome,
201
+ })
202
+
203
+
204
+ # ── Plot helpers ──────────────────────────────────────────────────────────────
205
+
206
+ DECISION_COLORS = {
207
+ "accelerate": "#22c55e",
208
+ "brake": "#ef4444",
209
+ "lane_change_left": "#f59e0b",
210
+ "lane_change_right": "#f59e0b",
211
+ "maintain": "#60a5fa",
212
+ }
213
+
214
+
215
+ def render_road(cars_snapshot, last_decision, last_incident):
216
+ fig, ax = plt.subplots(figsize=(10, 2.8))
217
+ fig.patch.set_facecolor("#0f172a")
218
+ ax.set_facecolor("#1e293b")
219
+
220
+ ax.set_xlim(0, ROAD_LENGTH)
221
+ ax.set_ylim(0, NUM_LANES + 1)
222
+ ax.set_yticks([])
223
+ ax.set_xlabel("Position", color="#94a3b8", fontsize=9)
224
+ ax.tick_params(colors="#94a3b8")
225
+ for spine in ax.spines.values():
226
+ spine.set_edgecolor("#334155")
227
+
228
+ for lane in range(1, NUM_LANES):
229
+ ax.axhline(y=lane + 0.5, color="#334155", linewidth=1, linestyle="--", alpha=0.6)
230
+
231
+ for lane in range(1, NUM_LANES + 1):
232
+ ax.text(2, lane, f"L{lane}", color="#475569", fontsize=8, va="center")
233
+
234
+ ax.axvspan(160, ROAD_LENGTH, alpha=0.12, color="#22c55e")
235
+ ax.text(162, NUM_LANES + 0.6, "GOAL ZONE", color="#22c55e", fontsize=7, alpha=0.8)
236
+
237
+ car_w, car_h = 8, 0.55
238
+ for car_id, lane, pos_x, speed in cars_snapshot:
239
+ is_ego = car_id == 0
240
+ color = "#3b82f6" if is_ego else "#94a3b8"
241
+ outline = "#60a5fa" if is_ego else "#475569"
242
+ lw = 2.0 if is_ego else 1.0
243
+ rect = patches.FancyBboxPatch(
244
+ (pos_x - car_w / 2, lane - car_h / 2),
245
+ car_w, car_h,
246
+ boxstyle="round,pad=0.05",
247
+ facecolor=color, edgecolor=outline, linewidth=lw, alpha=0.92,
248
+ )
249
+ ax.add_patch(rect)
250
+ label = f"{'EGO' if is_ego else f'C{car_id}'}\n{speed:.0f}"
251
+ ax.text(pos_x, lane, label, ha="center", va="center",
252
+ fontsize=6.5, color="white", fontweight="bold" if is_ego else "normal")
253
+
254
+ dec_color = DECISION_COLORS.get(last_decision, "#60a5fa")
255
+ ax.text(ROAD_LENGTH - 2, NUM_LANES + 0.65,
256
+ f"Action: {last_decision.replace('_', ' ').upper()}",
257
+ color=dec_color, fontsize=8, fontweight="bold", ha="right")
258
+
259
+ if "CRASH" in last_incident:
260
+ ax.text(ROAD_LENGTH / 2, NUM_LANES + 0.65, "CRASH",
261
+ color="#ef4444", fontsize=10, fontweight="bold", ha="center")
262
+ elif "NEAR MISS" in last_incident:
263
+ ax.text(ROAD_LENGTH / 2, NUM_LANES + 0.65, "NEAR MISS",
264
+ color="#f59e0b", fontsize=9, fontweight="bold", ha="center")
265
+ elif "GOAL" in last_incident:
266
+ ax.text(ROAD_LENGTH / 2, NUM_LANES + 0.65, "GOAL REACHED",
267
+ color="#22c55e", fontsize=10, fontweight="bold", ha="center")
268
+
269
+ plt.tight_layout(pad=0.3)
270
+ return fig
271
+
272
+
273
+ def render_reward_curve(eps):
274
+ fig, ax = plt.subplots(figsize=(10, 2.8))
275
+ fig.patch.set_facecolor("#0f172a")
276
+ ax.set_facecolor("#1e293b")
277
+ for spine in ax.spines.values():
278
+ spine.set_edgecolor("#334155")
279
+ ax.tick_params(colors="#94a3b8")
280
+ ax.set_xlabel("Episode", color="#94a3b8", fontsize=9)
281
+ ax.set_ylabel("Total Reward", color="#94a3b8", fontsize=9)
282
+
283
+ if not eps:
284
+ ax.text(0.5, 0.5, "Waiting for episodes...", transform=ax.transAxes,
285
+ ha="center", va="center", color="#475569", fontsize=11)
286
+ plt.tight_layout(pad=0.3)
287
+ return fig
288
+
289
+ xs = [e["ep"] for e in eps]
290
+ ys = [e["reward"] for e in eps]
291
+ outcome_colors = {"CRASH": "#ef4444", "GOAL": "#22c55e", "timeout": "#60a5fa"}
292
+ for x, y, e in zip(xs, ys, eps):
293
+ ax.bar(x, y, color=outcome_colors.get(e["outcome"], "#60a5fa"), alpha=0.6, width=0.7)
294
+
295
+ if len(ys) >= 3:
296
+ w = min(5, len(ys))
297
+ smoothed = np.convolve(ys, np.ones(w) / w, mode="valid")
298
+ ax.plot(xs[w - 1:], smoothed, color="#f8fafc", linewidth=2)
299
+
300
+ ax.axhline(0, color="#334155", linewidth=0.8)
301
+
302
+ from matplotlib.patches import Patch
303
+ legend_els = [Patch(facecolor="#ef4444", label="crash"),
304
+ Patch(facecolor="#22c55e", label="goal"),
305
+ Patch(facecolor="#60a5fa", label="timeout")]
306
+ ax.legend(handles=legend_els, facecolor="#1e293b", labelcolor="#94a3b8",
307
+ fontsize=8, framealpha=0.6, edgecolor="#334155", loc="upper left")
308
+
309
+ plt.tight_layout(pad=0.3)
310
+ return fig
311
+
312
+
313
+ # ── Gradio UI ─────────────────────────────────────────────────────────────────
314
+
315
+ def start_training():
316
+ global _running
317
+ if not _running:
318
+ _running = True
319
+ step_log.clear()
320
+ episode_history.clear()
321
+ threading.Thread(target=run_episodes_loop, daemon=True).start()
322
+ return gr.update(value="Running...", interactive=False), gr.update(interactive=True)
323
+
324
+
325
+ def stop_training():
326
+ global _running
327
+ _running = False
328
+ return gr.update(value="Start", interactive=True), gr.update(interactive=False)
329
+
330
+
331
+ def get_updates():
332
+ with _lock:
333
+ logs = list(step_log[-20:])
334
+ eps = list(episode_history[-50:])
335
+ last = step_log[-1] if step_log else None
336
+
337
+ road_fig = render_road(last["cars"], last["decision"], last["incident"]) if last \
338
+ else render_road([], "maintain", "")
339
+ reward_fig = render_reward_curve(eps)
340
+
341
+ lines = []
342
+ for e in reversed(logs):
343
+ flag = ""
344
+ if "CRASH" in e["incident"]: flag = " πŸ’₯"
345
+ elif "GOAL" in e["incident"]: flag = " βœ“"
346
+ elif "NEAR MISS" in e["incident"]: flag = " ⚠"
347
+ lines.append(
348
+ f"ep {e['ep']:>3d} | step {e['step']:>2d} | "
349
+ f"{e['decision']:<20} | r={e['reward']:>+6.2f} | "
350
+ f"ep_total={e['ep_reward']:>7.2f}{flag}"
351
+ )
352
+ step_text = "\n".join(lines) if lines else "Waiting for first episode..."
353
+
354
+ ep_lines = ["Episode | Steps | Total Reward | Outcome", "-" * 44]
355
+ for e in reversed(eps[-15:]):
356
+ ep_lines.append(
357
+ f" {e['ep']:>4d} | {e['steps']:>3d} | "
358
+ f" {e['reward']:>+8.2f} | {e['outcome']}"
359
+ )
360
+ ep_text = "\n".join(ep_lines) if eps else "No episodes completed yet."
361
+
362
+ if len(eps) >= 2:
363
+ rewards = [e["reward"] for e in eps]
364
+ n = len(rewards)
365
+ half = max(n // 2, 1)
366
+ early = sum(rewards[:half]) / half
367
+ late = sum(rewards[half:]) / max(n - half, 1)
368
+ arrow = "↑ improving" if late > early else "↓ declining"
369
+ trend_text = f"Early {half} eps: {early:+.2f} β†’ Last {n-half} eps: {late:+.2f} {arrow}"
370
+ else:
371
+ trend_text = "Collecting data..."
372
+
373
+ status = "● RUNNING" if _running else "β–  STOPPED"
374
+ return road_fig, reward_fig, step_text, ep_text, trend_text, status
375
+
376
+
377
+ _EMPTY_ROAD = render_road([], "maintain", "")
378
+ _EMPTY_REWARD = render_reward_curve([])
379
+
380
+ with gr.Blocks(title="OpenENV RL Demo", theme=gr.themes.Base()) as demo:
381
+ gr.Markdown(
382
+ "# OpenENV RL β€” Live Policy Training\n"
383
+ "**FlatMLPPolicy** drives Car 0 on a 3-lane road for 20 steps per episode. "
384
+ "PPO mini-update after each episode β€” watch rewards trend upward over time."
385
+ )
386
+
387
+ with gr.Row():
388
+ start_btn = gr.Button("Start", variant="primary", scale=1)
389
+ stop_btn = gr.Button("Stop", variant="stop", interactive=False, scale=1)
390
+ status_box = gr.Textbox(value="β–  STOPPED", label="Status",
391
+ interactive=False, scale=0, min_width=130)
392
+
393
+ gr.Markdown("### Road View")
394
+ road_plot = gr.Plot(value=_EMPTY_ROAD, show_label=False)
395
+
396
+ gr.Markdown("### Episode Reward Curve")
397
+ reward_plot = gr.Plot(value=_EMPTY_REWARD, show_label=False)
398
+
399
+ gr.Markdown("### Live Step Feed (last 20 steps)")
400
+ step_display = gr.Textbox(
401
+ value="Press Start to begin...",
402
+ lines=14, max_lines=14, interactive=False,
403
+ )
404
+
405
+ with gr.Row():
406
+ with gr.Column():
407
+ gr.Markdown("### Episode History")
408
+ ep_display = gr.Textbox(lines=10, interactive=False)
409
+ with gr.Column():
410
+ gr.Markdown("### Reward Trend")
411
+ trend_display = gr.Textbox(lines=3, interactive=False)
412
+
413
+ timer = gr.Timer(value=1.0)
414
+ timer.tick(
415
+ fn=get_updates,
416
+ outputs=[road_plot, reward_plot, step_display, ep_display, trend_display, status_box],
417
+ )
418
+
419
+ start_btn.click(fn=start_training, outputs=[start_btn, stop_btn])
420
+ stop_btn.click(fn=stop_training, outputs=[start_btn, stop_btn])
421
+
422
+
423
+ if __name__ == "__main__":
424
+ demo.launch()
client.py ADDED
@@ -0,0 +1,92 @@
1
+ """
2
+ Overflow Environment Client.
3
+
4
+ Provides the client for connecting to an Overflow Environment server
5
+ via WebSocket for persistent sessions.
6
+ """
7
+
8
+ from typing import Any, Dict, List
9
+
10
+ from openenv.core.client_types import StepResult
11
+ from openenv.core.env_client import EnvClient
12
+
13
+ from .models import (
14
+ CarStateData,
15
+ LaneOccupancyData,
16
+ OverflowAction,
17
+ OverflowObservation,
18
+ OverflowState,
19
+ Position,
20
+ ProximityData,
21
+ )
22
+
23
+
24
+ class OverflowEnv(EnvClient[OverflowAction, OverflowObservation, OverflowState]):
25
+ """
26
+ WebSocket client for the Overflow Environment.
27
+
28
+ Example:
29
+ >>> with OverflowEnv(base_url="http://localhost:8000") as env:
30
+ ... result = env.reset()
31
+ ... print(result.observation.scene_description)
32
+ ... print(result.observation.cars) # structured car data
33
+ ... action = OverflowAction(decision="maintain", reasoning="Safe for now")
34
+ ... result = env.step(action)
35
+ """
36
+
37
+ def _step_payload(self, action: OverflowAction) -> Dict[str, Any]:
38
+ """Convert OverflowAction to JSON payload for step request."""
39
+ return {
40
+ "decision": action.decision,
41
+ "reasoning": action.reasoning,
42
+ }
43
+
44
+ def _parse_result(self, payload: Dict[str, Any]) -> StepResult[OverflowObservation]:
45
+ """Parse server response into StepResult[OverflowObservation]."""
46
+ obs_data = payload.get("observation", {})
47
+
48
+ # Parse structured car data
49
+ cars = [
50
+ CarStateData(
51
+ carId=c["carId"],
52
+ lane=c["lane"],
53
+ position=Position(**c["position"]),
54
+ speed=c["speed"],
55
+ acceleration=c.get("acceleration", 0.0),
56
+ )
57
+ for c in obs_data.get("cars", [])
58
+ ]
59
+
60
+ proximities = [
61
+ ProximityData(**p) for p in obs_data.get("proximities", [])
62
+ ]
63
+
64
+ lane_occupancies = [
65
+ LaneOccupancyData(**lo) for lo in obs_data.get("lane_occupancies", [])
66
+ ]
67
+
68
+ observation = OverflowObservation(
69
+ scene_description=obs_data.get("scene_description", ""),
70
+ incident_report=obs_data.get("incident_report", ""),
71
+ done=payload.get("done", False),
72
+ reward=payload.get("reward"),
73
+ cars=cars,
74
+ proximities=proximities,
75
+ lane_occupancies=lane_occupancies,
76
+ )
77
+ return StepResult(
78
+ observation=observation,
79
+ reward=payload.get("reward"),
80
+ done=payload.get("done", False),
81
+ )
82
+
83
+ def _parse_state(self, payload: Dict[str, Any]) -> OverflowState:
84
+ """Parse server response into OverflowState."""
85
+ return OverflowState(
86
+ episode_id=payload.get("episode_id"),
87
+ step_count=payload.get("step_count", 0),
88
+ crash_count=payload.get("crash_count", 0),
89
+ near_miss_count=payload.get("near_miss_count", 0),
90
+ cars_reached_goal=payload.get("cars_reached_goal", 0),
91
+ total_cars=payload.get("total_cars", 5),
92
+ )
models.py ADDED
@@ -0,0 +1,134 @@
1
+ """
2
+ Data models for the Overflow Environment.
3
+
4
+ An autonomous vehicle fleet oversight environment where an LLM agent
5
+ controls one car on a 2D road grid while other cars follow scripted rules.
6
+
7
+ Structured observation fields (cars, proximities, lane_occupancies) are
8
+ compatible with the Overflow frontend's CarState / AnomalyObservation types.
9
+ """
10
+
11
+ from typing import Any, Dict, List, Optional
12
+
13
+ from pydantic import BaseModel, Field
14
+
15
+ try:
16
+ from openenv.core.env_server.types import Action, Observation, State
17
+ except ImportError:
18
+ class Action(BaseModel): pass
19
+ class Observation(BaseModel):
20
+ done: bool = False
21
+ reward: float = 0.0
22
+ class State(BaseModel):
23
+ episode_id: str = ""
24
+ step_count: int = 0
25
+
26
+ # ── Structured sub-models (frontend-compatible) ─────────────────────────
27
+
28
+
29
+ class Position(BaseModel):
30
+ """2D position on the road. x = longitudinal, y = lateral."""
31
+
32
+ x: float = 0.0
33
+ y: float = 0.0
34
+
35
+
36
+ class CarStateData(BaseModel):
37
+ """
38
+ Structured per-car snapshot β€” matches the frontend CarState interface.
39
+
40
+ Frontend type:
41
+ interface CarState {
42
+ carId: number; lane: number;
43
+ position: { x: number; y: number };
44
+ speed: number; acceleration: number;
45
+ }
46
+ """
47
+
48
+ carId: int
49
+ lane: int
50
+ position: Position
51
+ speed: float
52
+ acceleration: float = 0.0
53
+
54
+
55
+ class ProximityData(BaseModel):
56
+ """Pairwise distance between two cars."""
57
+
58
+ carA: int
59
+ carB: int
60
+ distance: float
61
+
62
+
63
+ class LaneOccupancyData(BaseModel):
64
+ """Which cars are in a given lane."""
65
+
66
+ lane: int
67
+ carIds: List[int]
68
+
69
+
70
+ # ── OpenEnv core models ─────────────────────────────────────────────────
71
+
72
+
73
+ class OverflowAction(Action):
74
+ """
75
+ Action for the Overflow environment.
76
+
77
+ The LLM agent outputs a driving decision and optional reasoning.
78
+ """
79
+
80
+ decision: str = Field(
81
+ default="maintain",
82
+ description="Driving decision: accelerate, brake, lane_change_left, lane_change_right, maintain",
83
+ )
84
+ reasoning: str = Field(
85
+ default="",
86
+ description="The LLM's chain-of-thought reasoning for this decision",
87
+ )
88
+
89
+
90
+ class OverflowObservation(Observation):
91
+ """
92
+ Observation from the Overflow environment.
93
+
94
+ Contains both:
95
+ - Text fields (scene_description, incident_report) for the LLM to read.
96
+ - Structured fields (cars, proximities, lane_occupancies) for the frontend
97
+ to render, matching the Overflow frontend AnomalyObservation shape.
98
+ """
99
+
100
+ # ── Text (for the LLM) ──
101
+ scene_description: str = Field(
102
+ default="", description="Text description of the traffic scene"
103
+ )
104
+ incident_report: str = Field(
105
+ default="", description="Observer's incident report, empty if no incident"
106
+ )
107
+
108
+ # ── Structured (for the frontend / viz) ──
109
+ cars: List[CarStateData] = Field(
110
+ default_factory=list, description="Structured state of every car"
111
+ )
112
+ proximities: List[ProximityData] = Field(
113
+ default_factory=list, description="Pairwise proximity measurements"
114
+ )
115
+ lane_occupancies: List[LaneOccupancyData] = Field(
116
+ default_factory=list, description="Per-lane vehicle occupancy"
117
+ )
118
+
119
+
120
+ class OverflowState(State):
121
+ """
122
+ Internal state for the Overflow environment.
123
+ """
124
+
125
+ crash_count: int = Field(default=0, description="Number of crashes this episode")
126
+ near_miss_count: int = Field(
127
+ default=0, description="Number of near misses this episode"
128
+ )
129
+ cars_reached_goal: int = Field(
130
+ default=0, description="Number of cars that reached their goal"
131
+ )
132
+ total_cars: int = Field(
133
+ default=5, description="Total number of cars in the simulation"
134
+ )
openenv.yaml ADDED
@@ -0,0 +1,6 @@
1
+ spec_version: 1
2
+ name: overflow_env
3
+ type: space
4
+ runtime: fastapi
5
+ app: server.app:app
6
+ port: 8000
policies/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ from .base_policy import BasePolicy
2
+ from .flat_mlp_policy import FlatMLPPolicy
3
+ from .ticket_attention_policy import TicketAttentionPolicy
4
+
5
+ __all__ = ["BasePolicy", "FlatMLPPolicy", "TicketAttentionPolicy"]
policies/__pycache__/__init__.cpython-314.pyc ADDED
Binary file (381 Bytes). View file
 
policies/__pycache__/base_policy.cpython-314.pyc ADDED
Binary file (4.05 kB). View file
 
policies/__pycache__/flat_mlp_policy.cpython-314.pyc ADDED
Binary file (3.71 kB). View file
 
policies/__pycache__/policy_spec.cpython-314.pyc ADDED
Binary file (18.4 kB). View file
 
policies/__pycache__/ticket_attention_policy.cpython-314.pyc ADDED
Binary file (11.3 kB). View file
 
policies/base_policy.py ADDED
@@ -0,0 +1,66 @@
1
+ """
2
+ BasePolicy β€” abstract interface all policies implement.
3
+
4
+ All policies expose the same predict() and train_step() API so the
5
+ curriculum trainer can swap them out transparently.
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ import abc
11
+ from typing import Any, Dict, Optional, Tuple
12
+
13
+ import numpy as np
14
+ import torch
15
+ import torch.nn as nn
16
+
17
+
18
+ class BasePolicy(nn.Module, abc.ABC):
19
+ """
20
+ Abstract base for all driving policies.
21
+
22
+ Subclasses implement:
23
+ forward(obs_tensor) β†’ action_tensor, value_tensor
24
+ encode_obs(obs_np) β†’ torch.Tensor
25
+ """
26
+
27
+ def __init__(self, obs_dim: int, action_dim: int = 3):
28
+ super().__init__()
29
+ self.obs_dim = obs_dim
30
+ self.action_dim = action_dim
31
+
32
+ @abc.abstractmethod
33
+ def forward(
34
+ self, obs: torch.Tensor
35
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
36
+ """
37
+ Returns:
38
+ action_mean β€” shape (B, action_dim)
39
+ value β€” shape (B, 1)
40
+ """
41
+ ...
42
+
43
+ def predict(
44
+ self,
45
+ obs: np.ndarray,
46
+ deterministic: bool = False,
47
+ ) -> np.ndarray:
48
+ """Numpy in, numpy out. Used by the env during rollout."""
49
+ self.eval()
50
+ with torch.no_grad():
51
+ t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
52
+ mean, _ = self.forward(t)
53
+ if deterministic:
54
+ action = mean
55
+ else:
56
+ action = mean + torch.randn_like(mean) * 0.1
57
+ return action.squeeze(0).numpy()
58
+
59
+ @staticmethod
60
+ def _mlp(dims: list[int], activation=nn.Tanh) -> nn.Sequential:
61
+ layers = []
62
+ for i in range(len(dims) - 1):
63
+ layers.append(nn.Linear(dims[i], dims[i + 1]))
64
+ if i < len(dims) - 2:
65
+ layers.append(activation())
66
+ return nn.Sequential(*layers)
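The `_mlp` helper interleaves an activation after every Linear layer except the last, so the network's final output is unsquashed. A framework-free sketch of that loop (the `mlp_layout` name is illustrative, not part of the repo) makes the resulting layer pattern explicit:

```python
def mlp_layout(dims, activation="Tanh"):
    # Mirrors BasePolicy._mlp: one Linear between consecutive dims,
    # with an activation after every Linear except the final one.
    layers = []
    for i in range(len(dims) - 1):
        layers.append(("Linear", dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append((activation,))
    return layers

# An 11 -> 64 -> 3 network: Linear, Tanh, Linear (no trailing activation)
layout = mlp_layout([11, 64, 3])
```

Leaving the last layer bare lets callers decide whether to squash outputs (e.g. Tanh for actions) or keep them unbounded (e.g. a critic's value estimate).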
policies/flat_mlp_policy.py ADDED
@@ -0,0 +1,50 @@
+ """
+ FlatMLPPolicy — sanity-check baseline.
+
+ Concatenates the full observation (ego + all tickets flattened) and passes
+ it through a standard MLP. No attention, no structure.
+
+ Use this to:
+     1. Verify the reward signal and environment are working
+     2. Establish a performance floor
+     3. Confirm that TicketAttentionPolicy actually improves over this
+
+ If FlatMLPPolicy can't learn Stage 1 survival, the reward or env is broken.
+ """
+
+ from __future__ import annotations
+
+ import torch
+ import torch.nn as nn
+
+ from .base_policy import BasePolicy
+
+
+ class FlatMLPPolicy(BasePolicy):
+     """MLP with three hidden layers over the full flat observation."""
+
+     def __init__(self, obs_dim: int, hidden: int = 256):
+         super().__init__(obs_dim)
+
+         self.actor = nn.Sequential(
+             nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.Tanh(),
+             nn.Linear(hidden, hidden), nn.Tanh(),
+             nn.Linear(hidden, hidden // 2), nn.Tanh(),
+             nn.Linear(hidden // 2, 3), nn.Tanh(),
+         )
+         self.critic = nn.Sequential(
+             nn.Linear(obs_dim, hidden), nn.Tanh(),
+             nn.Linear(hidden, hidden // 2), nn.Tanh(),
+             nn.Linear(hidden // 2, 1),
+         )
+         self._init_weights()
+
+     def _init_weights(self):
+         for m in self.modules():
+             if isinstance(m, nn.Linear):
+                 nn.init.orthogonal_(m.weight, gain=1.0)
+                 nn.init.zeros_(m.bias)
+         nn.init.orthogonal_(self.actor[-2].weight, gain=0.01)
+
+     def forward(self, obs: torch.Tensor):
+         return self.actor(obs), self.critic(obs)
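Because FlatMLPPolicy consumes the flat observation directly, its first layer dominates the weight budget. A back-of-envelope parameter count for the actor, a sketch assuming the default hidden=256 and the 603-dim observation from policy_spec:

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias vector

obs_dim, hidden = 603, 256  # OBS_DIM from policy_spec, default hidden

actor_params = (
    linear_params(obs_dim, hidden)        # 603 -> 256
    + 2 * hidden                          # LayerNorm(256): gamma + beta
    + linear_params(hidden, hidden)       # 256 -> 256
    + linear_params(hidden, hidden // 2)  # 256 -> 128
    + linear_params(hidden // 2, 3)       # 128 -> 3
)
first_layer = linear_params(obs_dim, hidden)
```

The 603-to-256 input layer alone accounts for over half the actor's parameters, which is part of why the attention policy's per-ticket encoder is a more parameter-efficient way to consume the ticket matrix.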
policies/policy_spec.py ADDED
@@ -0,0 +1,409 @@
+ """
+ Policy data input specifications — formal contracts for observation, action, and ticket data.
+
+ This module defines the exact data shapes, normalization ranges, and semantic meaning
+ of every field consumed by OpenENV policies. Use this as the reference when:
+
+     1. Building a new environment that targets these policies
+     2. Writing a bridge/adapter from a different simulator
+     3. Implementing a new policy that must interoperate with the existing set
+
+ All policies share the same raw observation layout (EGO + ticket matrix).
+ Specialized policies (ThreatAvoidance, SystemFailure) select subsets internally.
+
+ Example usage:
+     from openenv.policies.policy_spec import ObsSpec, ActionSpec, validate_obs
+
+     spec = ObsSpec()
+     obs = my_env.get_observation()
+     validate_obs(obs, spec)  # raises ValueError on shape/dtype mismatch
+ """
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import Dict, List, Optional, Tuple
+
+ import numpy as np
+
+
+ # ── Ego state specification ──────────────────────────────────────────────────
+
+ EGO_STATE_DIM = 11
+
+
+ @dataclass(frozen=True)
+ class EgoField:
+     """Description of a single ego state field."""
+     index: int
+     name: str
+     unit: str
+     raw_range: Tuple[float, float]  # physical range before normalization
+     norm_divisor: float             # obs_value = raw_value / norm_divisor
+     description: str
+
+
+ EGO_FIELDS: List[EgoField] = [
+     EgoField(0, "x", "m", (-5000, 5000), 1000.0, "Forward displacement from episode start"),
+     EgoField(1, "y", "m", (-6.0, 6.0), 3.7, "Lateral displacement (0 = lane center, + = left)"),
+     EgoField(2, "z", "m", (-10, 10), 10.0, "Vertical position (flat road = 0)"),
+     EgoField(3, "vx", "m/s", (-20, 20), 20.0, "Forward velocity in world frame"),
+     EgoField(4, "vy", "m/s", (-20, 20), 20.0, "Lateral velocity in world frame"),
+     EgoField(5, "vz", "m/s", (0, 0), 1.0, "Vertical velocity (always 0 on flat road)"),
+     EgoField(6, "heading_sin", "rad", (-1, 1), 1.0, "sin(heading angle), 0 = forward"),
+     EgoField(7, "heading_cos", "rad", (-1, 1), 1.0, "cos(heading angle), 1 = forward"),
+     EgoField(8, "speed", "m/s", (0, 20), 20.0, "Scalar speed = sqrt(vx^2 + vy^2)"),
+     EgoField(9, "steer", "norm", (-1, 1), 1.0, "Current steering command [-1=full left, 1=full right]"),
+     EgoField(10, "net_drive", "norm", (-1, 1), 1.0, "throttle - brake [-1=full brake, 1=full throttle]"),
+ ]
+
+
+ # ── Ticket vector specification ──────────────────────────────────────────────
+
+ TICKET_VECTOR_DIM = 37  # 18 fixed + 14 type one-hot + 5 entity one-hot
+ MAX_TICKETS = 16
+
+ # Ticket types (14 total) — one-hot encoded starting at index 18
+ TICKET_TYPES = [
+     "collision_risk", "sudden_brake", "side_impact", "head_on",
+     "merge_cut", "rear_end_risk",
+     "pedestrian_crossing", "cyclist_lane",
+     "tire_blowout", "brake_fade", "steering_loss", "sensor_occlusion",
+     "road_hazard", "weather_visibility",
+ ]
+
+ # Entity types (5 total) — one-hot encoded after ticket types
+ ENTITY_TYPES = ["vehicle", "pedestrian", "cyclist", "obstacle", "system"]
+
+ # Verify dimension
+ assert 18 + len(TICKET_TYPES) + len(ENTITY_TYPES) == TICKET_VECTOR_DIM, (
+     f"Ticket vector dim mismatch: 18 + {len(TICKET_TYPES)} + {len(ENTITY_TYPES)} "
+     f"!= {TICKET_VECTOR_DIM}"
+ )
+
+
+ @dataclass(frozen=True)
+ class TicketField:
+     """Description of a single ticket vector field."""
+     offset: int  # index within the TICKET_VECTOR_DIM vector
+     length: int  # number of floats
+     name: str
+     unit: str
+     raw_range: Tuple[float, float]
+     norm_divisor: float
+     description: str
+
+
+ TICKET_FIELDS: List[TicketField] = [
+     TicketField(0, 1, "severity_weight", "norm", (0, 1), 1.0, "Severity: 0.25=LOW, 0.5=MED, 0.75=HIGH, 1.0=CRITICAL"),
+     TicketField(1, 1, "ttl_norm", "s", (0, 10), 10.0, "Time-to-live remaining, clamped to [0,1]"),
+     TicketField(2, 1, "pos_x", "m", (-100, 100), 100.0, "Ego-relative X (forward positive)"),
+     TicketField(3, 1, "pos_y", "m", (-50, 50), 50.0, "Ego-relative Y (left positive)"),
+     TicketField(4, 1, "pos_z", "m", (-10, 10), 10.0, "Ego-relative Z (up positive)"),
+     TicketField(5, 1, "vel_x", "m/s", (-30, 30), 30.0, "Entity velocity X in world frame"),
+     TicketField(6, 1, "vel_y", "m/s", (-30, 30), 30.0, "Entity velocity Y in world frame"),
+     TicketField(7, 1, "vel_z", "m/s", (-10, 10), 10.0, "Entity velocity Z in world frame"),
+     TicketField(8, 1, "heading_sin", "rad", (-1, 1), 1.0, "sin(entity heading relative to ego)"),
+     TicketField(9, 1, "heading_cos", "rad", (-1, 1), 1.0, "cos(entity heading relative to ego)"),
+     TicketField(10, 1, "size_length", "m", (0, 10), 10.0, "Entity bounding box length"),
+     TicketField(11, 1, "size_width", "m", (0, 5), 5.0, "Entity bounding box width"),
+     TicketField(12, 1, "size_height", "m", (0, 4), 4.0, "Entity bounding box height"),
+     TicketField(13, 1, "distance_norm", "m", (0, 100), 100.0, "Euclidean distance to ego, clamped to [0,1]"),
+     TicketField(14, 1, "ttc_norm", "s", (0, 30), 30.0, "Time-to-collision, clamped to [0,1]. 1.0 = no collision"),
+     TicketField(15, 1, "bearing_sin", "rad", (-1, 1), 1.0, "sin(bearing angle from ego forward axis)"),
+     TicketField(16, 1, "bearing_cos", "rad", (-1, 1), 1.0, "cos(bearing angle from ego forward axis)"),
+     TicketField(17, 1, "confidence", "norm", (0, 1), 1.0, "Perception confidence [0=unreliable, 1=certain]"),
+     TicketField(18, len(TICKET_TYPES), "type_onehot", "bool", (0, 1), 1.0, "One-hot ticket type"),
+     TicketField(18 + len(TICKET_TYPES), len(ENTITY_TYPES), "entity_onehot", "bool", (0, 1), 1.0, "One-hot entity type"),
+ ]
+
+
+ # ── Full observation specification ───────────────────────────────────────────
+
+ OBS_DIM = EGO_STATE_DIM + MAX_TICKETS * TICKET_VECTOR_DIM  # 11 + 16*37 = 603
+
+
+ @dataclass(frozen=True)
+ class ObsSpec:
+     """Complete observation space specification."""
+     ego_dim: int = EGO_STATE_DIM
+     ticket_dim: int = TICKET_VECTOR_DIM
+     max_tickets: int = MAX_TICKETS
+     total_dim: int = OBS_DIM
+     dtype: str = "float32"
+     value_range: Tuple[float, float] = (-1.0, 1.0)
+
+     # Layout: obs[0:ego_dim] = ego state
+     #         obs[ego_dim:] reshaped to (max_tickets, ticket_dim)
+     # Tickets are sorted by severity desc, distance asc. Zero-padded rows = empty slots.
+
+
+ # ── Action specification ─────────────────────────────────────────────────────
+
+ @dataclass(frozen=True)
+ class ActionField:
+     index: int
+     name: str
+     raw_range: Tuple[float, float]
+     description: str
+
+
+ ACTION_DIM = 3
+
+ ACTION_FIELDS: List[ActionField] = [
+     ActionField(0, "steer", (-1.0, 1.0), "Steering command. -1=full left, +1=full right. Scaled by MAX_STEER=0.6 rad"),
+     ActionField(1, "throttle", (-1.0, 1.0), "Throttle command. Only positive values used (clipped to [0,1]). Scaled by MAX_ACCEL=4.0 m/s^2"),
+     ActionField(2, "brake", (-1.0, 1.0), "Brake command. Only positive values used (clipped to [0,1]). Scaled by MAX_BRAKE=8.0 m/s^2"),
+ ]
+
+
+ @dataclass(frozen=True)
+ class ActionSpec:
+     """Action space specification."""
+     dim: int = ACTION_DIM
+     dtype: str = "float32"
+     value_range: Tuple[float, float] = (-1.0, 1.0)
+
+
+ # ── Policy input requirements ────────────────────────────────────────────────
+
+ @dataclass(frozen=True)
+ class PolicyInputSpec:
+     """Describes what a specific policy reads from the observation."""
+     name: str
+     reads_ego: bool
+     ego_indices: Tuple[int, ...]   # which ego fields are used
+     reads_tickets: bool
+     ticket_filter: Optional[str]   # None = all, or "kinematic" / "failure"
+     max_tickets_used: int          # how many ticket slots the policy actually reads
+     requires_history: bool         # whether GRU/recurrent hidden state is needed
+     description: str
+
+
+ POLICY_SPECS: Dict[str, PolicyInputSpec] = {
+     "SurvivalPolicy": PolicyInputSpec(
+         name="SurvivalPolicy",
+         reads_ego=True,
+         ego_indices=tuple(range(EGO_STATE_DIM)),
+         reads_tickets=False,
+         ticket_filter=None,
+         max_tickets_used=0,
+         requires_history=False,
+         description="Stage 1 baseline. Reads only ego state (first 11 dims). "
+                     "Ticket portion of obs is ignored entirely.",
+     ),
+     "FlatMLPPolicy": PolicyInputSpec(
+         name="FlatMLPPolicy",
+         reads_ego=True,
+         ego_indices=tuple(range(EGO_STATE_DIM)),
+         reads_tickets=True,
+         ticket_filter=None,
+         max_tickets_used=MAX_TICKETS,
+         requires_history=False,
+         description="Sanity-check baseline. Reads full flat observation (ego + all tickets "
+                     "concatenated). No attention or structure.",
+     ),
+     "TicketAttentionPolicy": PolicyInputSpec(
+         name="TicketAttentionPolicy",
+         reads_ego=True,
+         ego_indices=tuple(range(EGO_STATE_DIM)),
+         reads_tickets=True,
+         ticket_filter=None,
+         max_tickets_used=MAX_TICKETS,
+         requires_history=False,
+         description="Main policy (Stage 2+). Cross-attention: ego queries ticket set. "
+                     "Order-invariant over tickets. Padding mask on zero-rows.",
+     ),
+     "ThreatAvoidancePolicy": PolicyInputSpec(
+         name="ThreatAvoidancePolicy",
+         reads_ego=True,
+         ego_indices=tuple(range(EGO_STATE_DIM)),
+         reads_tickets=True,
+         ticket_filter="kinematic",
+         max_tickets_used=1,
+         requires_history=False,
+         description="Specialist for kinematic threats (collision_risk, sudden_brake, "
+                     "side_impact, head_on, merge_cut, rear_end_risk). Extracts the "
+                     "highest-severity kinematic ticket and gates between brake/evade branches.",
+     ),
+     "SystemFailurePolicy": PolicyInputSpec(
+         name="SystemFailurePolicy",
+         reads_ego=True,
+         ego_indices=tuple(range(EGO_STATE_DIM)),
+         reads_tickets=True,
+         ticket_filter="failure",
+         max_tickets_used=1,
+         requires_history=False,
+         description="Specialist for onboard failures (tire_blowout, brake_fade, steering_loss). "
+                     "Mixture-of-experts with one expert per failure type. Initialized with "
+                     "domain-correct response priors.",
+     ),
+     "RecurrentPolicy": PolicyInputSpec(
+         name="RecurrentPolicy",
+         reads_ego=True,
+         ego_indices=tuple(range(EGO_STATE_DIM)),
+         reads_tickets=True,
+         ticket_filter=None,
+         max_tickets_used=MAX_TICKETS,
+         requires_history=True,
+         description="GRU-based policy for partial observability (Stage 4+). Carries hidden "
+                     "state across timesteps. Requires h_prev to be tracked by caller.",
+     ),
+ }
+
+
+ # ── Validation helpers ───────────────────────────────────────────────────────
+
+ def validate_obs(obs: np.ndarray, spec: Optional[ObsSpec] = None) -> None:
+     """
+     Validate an observation array against the spec.
+     Raises ValueError with a descriptive message on any mismatch.
+     """
+     spec = spec or ObsSpec()
+     if obs.ndim != 1:
+         raise ValueError(f"Observation must be 1D, got shape {obs.shape}")
+     if obs.shape[0] != spec.total_dim:
+         raise ValueError(
+             f"Observation dim mismatch: expected {spec.total_dim}, got {obs.shape[0]}. "
+             f"Check ego_dim ({spec.ego_dim}) + max_tickets ({spec.max_tickets}) "
+             f"* ticket_dim ({spec.ticket_dim})"
+         )
+     if obs.dtype != np.float32:
+         raise ValueError(f"Observation dtype must be float32, got {obs.dtype}")
+
+
+ def validate_action(action: np.ndarray) -> None:
+     """Validate an action array."""
+     if action.shape != (ACTION_DIM,):
+         raise ValueError(f"Action shape mismatch: expected ({ACTION_DIM},), got {action.shape}")
+     if np.any(action < -1.0) or np.any(action > 1.0):
+         raise ValueError(f"Action values must be in [-1, 1], got min={action.min()}, max={action.max()}")
+
+
+ def build_obs(
+     ego_x: float, ego_y: float, ego_z: float,
+     ego_vx: float, ego_vy: float,
+     heading: float, speed: float,
+     steer: float, throttle: float, brake: float,
+     ticket_vectors: Optional[np.ndarray] = None,
+     max_tickets: int = MAX_TICKETS,
+ ) -> np.ndarray:
+     """
+     Build a valid observation vector from raw values.
+
+     This is the primary entry point for external environments that want to
+     produce observations compatible with OpenENV policies.
+
+     Parameters
+     ----------
+     ego_x : forward displacement from episode start (metres)
+     ego_y : lateral displacement from lane center (metres, + = left)
+     ego_z : vertical position (metres)
+     ego_vx : forward velocity (m/s)
+     ego_vy : lateral velocity (m/s)
+     heading : heading angle (radians, 0 = forward)
+     speed : scalar speed (m/s)
+     steer : current steering command [-1, 1]
+     throttle : current throttle command [0, 1]
+     brake : current brake command [0, 1]
+     ticket_vectors : (N, TICKET_VECTOR_DIM) array of ticket vectors, or None.
+         Use EventTicket.to_vector() or build_ticket_vector() to create these.
+     max_tickets : number of ticket slots (must match policy expectation, default 16)
+
+     Returns
+     -------
+     obs : np.ndarray of shape (EGO_STATE_DIM + max_tickets * TICKET_VECTOR_DIM,)
+     """
+     import math
+
+     ego = np.array([
+         ego_x / 1000.0,
+         ego_y / 3.7,        # ROAD_HALF_WIDTH
+         ego_z / 10.0,
+         ego_vx / 20.0,      # MAX_SPEED
+         ego_vy / 20.0,
+         0.0,                # vz (flat road)
+         math.sin(heading),
+         math.cos(heading),
+         speed / 20.0,
+         steer,
+         throttle - brake,   # net drive signal
+     ], dtype=np.float32)
+
+     ticket_matrix = np.zeros((max_tickets, TICKET_VECTOR_DIM), dtype=np.float32)
+     if ticket_vectors is not None:
+         n = min(len(ticket_vectors), max_tickets)
+         ticket_matrix[:n] = ticket_vectors[:n]
+
+     return np.concatenate([ego, ticket_matrix.flatten()])
+
+
+ def build_ticket_vector(
+     severity_weight: float,
+     ttl: float,
+     pos_x: float, pos_y: float, pos_z: float,
+     vel_x: float, vel_y: float, vel_z: float,
+     heading: float,
+     size_length: float, size_width: float, size_height: float,
+     distance: float,
+     time_to_collision: Optional[float],
+     bearing: float,
+     ticket_type: str,
+     entity_type: str,
+     confidence: float = 1.0,
+ ) -> np.ndarray:
+     """
+     Build a single ticket vector from raw values without needing the full
+     EventTicket class. Use this when adapting a different simulator.
+
+     Parameters
+     ----------
+     severity_weight : 0.25 (LOW), 0.5 (MEDIUM), 0.75 (HIGH), 1.0 (CRITICAL)
+     ttl : seconds remaining until ticket expires
+     pos_x/y/z : ego-relative position (metres)
+     vel_x/y/z : entity velocity in world frame (m/s)
+     heading : entity heading relative to ego (radians)
+     size_length/width/height : entity bounding box (metres)
+     distance : euclidean distance to ego (metres)
+     time_to_collision : seconds until collision, or None if no collision course
+     bearing : angle from ego forward axis (radians)
+     ticket_type : one of TICKET_TYPES (e.g., "collision_risk")
+     entity_type : one of ENTITY_TYPES (e.g., "vehicle")
+     confidence : perception confidence [0, 1]
+
+     Returns
+     -------
+     vec : np.ndarray of shape (TICKET_VECTOR_DIM,) = (37,)
+     """
+     import math
+
+     ttc_norm = min((time_to_collision if time_to_collision is not None else 30.0) / 30.0, 1.0)
+
+     type_oh = [0.0] * len(TICKET_TYPES)
+     entity_oh = [0.0] * len(ENTITY_TYPES)
+
+     if ticket_type in TICKET_TYPES:
+         type_oh[TICKET_TYPES.index(ticket_type)] = 1.0
+     else:
+         raise ValueError(f"Unknown ticket_type '{ticket_type}'. Must be one of {TICKET_TYPES}")
+
+     if entity_type in ENTITY_TYPES:
+         entity_oh[ENTITY_TYPES.index(entity_type)] = 1.0
+     else:
+         raise ValueError(f"Unknown entity_type '{entity_type}'. Must be one of {ENTITY_TYPES}")
+
+     vec = [
+         severity_weight,
+         min(ttl / 10.0, 1.0),
+         pos_x / 100.0,
+         pos_y / 50.0,
+         pos_z / 10.0,
+         vel_x / 30.0,
+         vel_y / 30.0,
+         vel_z / 10.0,
+         math.sin(heading),
+         math.cos(heading),
+         size_length / 10.0,
+         size_width / 5.0,
+         size_height / 4.0,
+         min(distance / 100.0, 1.0),
+         ttc_norm,
+         math.sin(bearing),
+         math.cos(bearing),
+         confidence,
+         *type_oh,
+         *entity_oh,
+     ]
+     return np.array(vec, dtype=np.float32)
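A minimal end-to-end check of the layout defined above: pack an ego vector plus a ticket matrix into the flat 603-dim observation, then recover the structure the way a policy would. This is a numpy-only sketch with made-up values, not repo code:

```python
import numpy as np

EGO_STATE_DIM, MAX_TICKETS, TICKET_VECTOR_DIM = 11, 16, 37
OBS_DIM = EGO_STATE_DIM + MAX_TICKETS * TICKET_VECTOR_DIM  # 11 + 16*37 = 603

ego = np.zeros(EGO_STATE_DIM, dtype=np.float32)
ego[8] = 12.0 / 20.0  # field 8 "speed", normalized by MAX_SPEED = 20 m/s

tickets = np.zeros((MAX_TICKETS, TICKET_VECTOR_DIM), dtype=np.float32)
tickets[0, 0] = 0.75           # field 0 "severity_weight": one HIGH ticket
tickets[0, 13] = 25.0 / 100.0  # field 13 "distance_norm": 25 m away

obs = np.concatenate([ego, tickets.flatten()])

# A policy recovers the structure by slicing and reshaping:
ego_back = obs[:EGO_STATE_DIM]
tk_back = obs[EGO_STATE_DIM:].reshape(MAX_TICKETS, TICKET_VECTOR_DIM)
is_padding = np.abs(tk_back).sum(axis=-1) == 0  # zero rows = empty slots
```

Note that the zero-row padding convention is what makes the attention policy's padding mask recoverable from the observation alone, with no separate ticket-count channel.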
policies/ticket_attention_policy.py ADDED
@@ -0,0 +1,227 @@
+ """
+ TicketAttentionPolicy — the main policy (Stage 2+).
+
+ Architecture: two-pass "reflective" cross-attention.
+
+     Pass 1: ego queries tickets → raw threat context
+     Pass 2: (ego + raw context) queries tickets again → refined context
+     This forces the policy to "think twice" — first perceive, then plan.
+
+     [ego | refined_context] → steer head  → steer action
+                             → drive head  → throttle, brake
+                             → critic head → value
+
+ Why two-pass:
+     The first pass gathers what threats exist. The second pass re-examines
+     tickets knowing what the overall threat picture looks like. This prevents
+     the impulsive single-shot responses that cause wild oscillation.
+
+ Why separate heads:
+     Steering requires smooth, conservative output (off-road = death).
+     Throttle/brake can be more aggressive. Separate heads + separate
+     noise levels let each dimension learn at its own pace.
+ """
+
+ from __future__ import annotations
+
+ import torch
+ import torch.nn as nn
+
+ from .base_policy import BasePolicy
+
+ # Must stay in sync with policies.policy_spec
+ EGO_STATE_DIM = 11
+ MAX_TICKETS = 16
+ TICKET_VECTOR_DIM = 37
+
+
+ class TicketAttentionPolicy(BasePolicy):
+     """
+     Two-pass reflective attention policy.
+
+     Pass 1: perceive — what threats exist?
+     Pass 2: plan — given what I see, which threats matter most?
+     Output: separate steer head (conservative) + drive head (throttle/brake)
+     """
+
+     def __init__(
+         self,
+         obs_dim: int,
+         ego_embed: int = 64,
+         ticket_embed: int = 64,
+         n_heads: int = 4,
+         hidden: int = 256,
+     ):
+         super().__init__(obs_dim)
+         assert ego_embed % n_heads == 0
+         assert ticket_embed == ego_embed
+
+         self.ego_embed = ego_embed
+         self.max_tickets = MAX_TICKETS
+         self.ticket_dim = TICKET_VECTOR_DIM
+
+         # ── Encoders ──────────────────────────────────────────────────────
+         self.ego_encoder = nn.Sequential(
+             nn.Linear(EGO_STATE_DIM, hidden // 2),
+             nn.LayerNorm(hidden // 2),
+             nn.Tanh(),
+             nn.Linear(hidden // 2, ego_embed),
+             nn.LayerNorm(ego_embed),
+         )
+         self.ticket_encoder = nn.Sequential(
+             nn.Linear(TICKET_VECTOR_DIM, hidden // 2),
+             nn.LayerNorm(hidden // 2),
+             nn.ReLU(),
+             nn.Linear(hidden // 2, ticket_embed),
+             nn.LayerNorm(ticket_embed),
+         )
+
+         # ── Pass 1: perceive (ego queries tickets) ───────────────────────
+         self.attn_pass1 = nn.MultiheadAttention(
+             embed_dim=ego_embed, num_heads=n_heads,
+             dropout=0.0, batch_first=True,
+         )
+         self.norm1 = nn.LayerNorm(ego_embed)
+
+         # ── Reflection gate: fuse ego + pass1 context for second query ───
+         self.reflect_proj = nn.Sequential(
+             nn.Linear(ego_embed * 2, ego_embed),
+             nn.LayerNorm(ego_embed),
+             nn.Tanh(),
+         )
+
+         # ── Pass 2: plan (refined query re-attends to tickets) ───────────
+         self.attn_pass2 = nn.MultiheadAttention(
+             embed_dim=ego_embed, num_heads=n_heads,
+             dropout=0.0, batch_first=True,
+         )
+         self.norm2 = nn.LayerNorm(ego_embed)
+
+         # ── Fused representation ─────────────────────────────────────────
+         fused_dim = ego_embed + ego_embed  # ego + refined context
+
+         # ── Steer head (conservative, smooth output) ─────────────────────
+         self.steer_head = nn.Sequential(
+             nn.Linear(fused_dim, hidden // 2),
+             nn.LayerNorm(hidden // 2),
+             nn.Tanh(),
+             nn.Linear(hidden // 2, hidden // 4),
+             nn.Tanh(),
+             nn.Linear(hidden // 4, 1),
+             nn.Tanh(),
+         )
+
+         # ── Drive head (throttle + brake) ────────────────────────────────
+         self.drive_head = nn.Sequential(
+             nn.Linear(fused_dim, hidden // 2),
+             nn.LayerNorm(hidden // 2),
+             nn.Tanh(),
+             nn.Linear(hidden // 2, hidden // 4),
+             nn.Tanh(),
+             nn.Linear(hidden // 4, 2),
+             nn.Tanh(),
+         )
+
+         # ── Critic head ──────────────────────────────────────────────────
+         self.critic = nn.Sequential(
+             nn.Linear(fused_dim, hidden),
+             nn.LayerNorm(hidden),
+             nn.Tanh(),
+             nn.Linear(hidden, hidden // 2),
+             nn.Tanh(),
+             nn.Linear(hidden // 2, 1),
+         )
+
+         self._init_weights()
+
+     def _init_weights(self):
+         for m in self.modules():
+             if isinstance(m, nn.Linear):
+                 nn.init.orthogonal_(m.weight, gain=1.0)
+                 if m.bias is not None:
+                     nn.init.zeros_(m.bias)
+         # Very small initial actions — start by doing almost nothing
+         nn.init.orthogonal_(self.steer_head[-2].weight, gain=0.01)
+         nn.init.orthogonal_(self.drive_head[-2].weight, gain=0.01)
+         # Critic starts near zero
+         nn.init.orthogonal_(self.critic[-1].weight, gain=0.1)
+
+     def _attend(self, attn_module, norm_module, query, tk_emb, is_padding, all_empty):
+         """Run one attention pass with NaN-safe masking."""
+         B = query.shape[0]
+         q = query if query.dim() == 3 else query.unsqueeze(1)
+
+         if all_empty.all():
+             return torch.zeros(B, self.ego_embed, device=query.device)
+
+         safe_mask = is_padding.clone()
+         safe_mask[all_empty, 0] = False
+         attn_out, _ = attn_module(
+             query=q, key=tk_emb, value=tk_emb,
+             key_padding_mask=safe_mask,
+         )
+         context = attn_out.squeeze(1)
+         context[all_empty] = 0.0
+         return norm_module(context)
+
+     def forward(self, obs: torch.Tensor):
+         B = obs.shape[0]
+
+         # Split observation
+         ego_raw = obs[:, :EGO_STATE_DIM]
+         tk_raw = obs[:, EGO_STATE_DIM:].view(B, self.max_tickets, self.ticket_dim)
+
+         # Encode
+         ego_emb = self.ego_encoder(ego_raw)
+         tk_emb = self.ticket_encoder(tk_raw)
+
+         # Padding mask
+         is_padding = (tk_raw.abs().sum(dim=-1) == 0)
+         all_empty = is_padding.all(dim=-1)
+
+         # ── Pass 1: perceive ─────────────────────────────────────────────
+         ctx1 = self._attend(self.attn_pass1, self.norm1,
+                             ego_emb, tk_emb, is_padding, all_empty)
+
+         # ── Reflect: combine ego + initial context into refined query ────
+         reflected = self.reflect_proj(torch.cat([ego_emb, ctx1], dim=-1))
+
+         # ── Pass 2: plan (re-attend with richer query) ───────────────────
+         ctx2 = self._attend(self.attn_pass2, self.norm2,
+                             reflected, tk_emb, is_padding, all_empty)
+
+         # ── Fuse and decode ──────────────────────────────────────────────
+         fused = torch.cat([ego_emb, ctx2], dim=-1)
+
+         steer = self.steer_head(fused)              # (B, 1)
+         drive = self.drive_head(fused)              # (B, 2)
+         action = torch.cat([steer, drive], dim=-1)  # (B, 3)
+         value = self.critic(fused)                  # (B, 1)
+
+         return action, value
+
+     def get_attention_weights(self, obs: torch.Tensor) -> torch.Tensor:
+         """Returns pass-2 attention weights for interpretability."""
+         B = obs.shape[0]
+         ego_raw = obs[:, :EGO_STATE_DIM]
+         tk_raw = obs[:, EGO_STATE_DIM:].view(B, self.max_tickets, self.ticket_dim)
+         ego_emb = self.ego_encoder(ego_raw)
+         tk_emb = self.ticket_encoder(tk_raw)
+         is_padding = (tk_raw.abs().sum(dim=-1) == 0)
+         all_empty = is_padding.all(dim=-1)
+
+         # Pass 1
+         ctx1 = self._attend(self.attn_pass1, self.norm1,
+                             ego_emb, tk_emb, is_padding, all_empty)
+         reflected = self.reflect_proj(torch.cat([ego_emb, ctx1], dim=-1))
+
+         # Pass 2 — get weights
+         safe_mask = is_padding.clone()
+         safe_mask[all_empty, 0] = False
+         query = reflected.unsqueeze(1)
+         _, weights = self.attn_pass2(
+             query=query, key=tk_emb, value=tk_emb,
+             key_padding_mask=safe_mask,
+             need_weights=True, average_attn_weights=False,
+         )
+         weights[all_empty] = 0.0
+         return weights
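The NaN-safety trick in `_attend` is worth isolating: if every ticket slot in a batch row is padding, a `key_padding_mask` that masks all keys drives softmax to NaN, so one key is temporarily unmasked and the resulting context is zeroed afterwards. A numpy sketch of just the mask logic (the `safe_key_padding_mask` name is illustrative, not part of the repo):

```python
import numpy as np

def safe_key_padding_mask(tickets):
    """tickets: (B, T, D). Returns (mask, all_empty); True in mask = ignore slot."""
    is_padding = np.abs(tickets).sum(axis=-1) == 0  # zero rows are padding
    all_empty = is_padding.all(axis=-1)             # rows with no tickets at all
    safe = is_padding.copy()
    safe[all_empty, 0] = False  # leave one key attendable so softmax stays finite
    return safe, all_empty

tickets = np.zeros((2, 4, 3), dtype=np.float32)
tickets[0, 1, 0] = 1.0  # batch row 0 has one real ticket; row 1 has none
mask, all_empty = safe_key_padding_mask(tickets)
# The policy then overwrites context[all_empty] with zeros after attention,
# discarding the dummy output produced for the artificially unmasked key.
```

Zeroing the context afterwards matters: without it, empty-ticket rows would receive the encoding of a padding row as "context" instead of a clean no-threat signal.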
pyproject.toml ADDED
@@ -0,0 +1,33 @@
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-overflow-env"
+ version = "0.1.0"
+ description = "Overflow Environment for OpenEnv — autonomous vehicle fleet oversight on a 2D road grid"
+ requires-python = ">=3.10"
+ dependencies = [
+     "openenv-core[core]>=0.2.1",
+     "fastapi>=0.115.0",
+     "pydantic>=2.0.0",
+     "uvicorn[standard]>=0.24.0",
+     "requests>=2.31.0",
+     "torch>=2.5.0",
+     "numpy>=1.24.0",
+     "pillow>=10.4.0",
+     "gymnasium>=0.29.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ server = "overflow_env.server.app:main"
+
+ [tool.setuptools]
+ packages = ["overflow_env", "overflow_env.server"]
+ package-dir = { "overflow_env" = ".", "overflow_env.server" = "server" }
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ --extra-index-url https://download.pytorch.org/whl/cpu
+ torch==2.5.1+cpu
+ numpy>=1.24.0
+ pillow==10.4.0
+ matplotlib>=3.8.0
+ pydantic>=2.0.0
+ requests>=2.31.0
+ gymnasium>=0.29.0
server/__init__.py ADDED
File without changes
server/__pycache__/__init__.cpython-314.pyc ADDED
Binary file (167 Bytes)
server/__pycache__/overflow_environment.cpython-314.pyc ADDED
Binary file (23.6 kB)
server/__pycache__/policy_adapter.cpython-314.pyc ADDED
Binary file (4.89 kB)
server/app.py ADDED
@@ -0,0 +1,46 @@
+ """
+ FastAPI application for the Overflow Environment.
+
+ Exposes the OverflowEnvironment over HTTP and WebSocket endpoints.
+
+ Usage:
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+ """
+
+ import inspect
+
+ from openenv.core.env_server.http_server import create_app
+
+ from ..models import OverflowAction, OverflowObservation
+ from .overflow_environment import OverflowEnvironment
+
+
+ def _create_overflow_app():
+     """Build app across create_app variants that may expect a factory or an instance."""
+     try:
+         first_param = next(iter(inspect.signature(create_app).parameters.values()))
+         annotation_text = str(first_param.annotation)
+     except (StopIteration, TypeError, ValueError):
+         annotation_text = "typing.Callable"
+
+     expects_instance = (
+         "Environment" in annotation_text and "Callable" not in annotation_text
+     )
+     env_arg = OverflowEnvironment() if expects_instance else OverflowEnvironment
+     return create_app(
+         env_arg, OverflowAction, OverflowObservation, env_name="overflow_env"
+     )
+
+
+ app = _create_overflow_app()
+
+
+ def main():
+     """Entry point for direct execution via uv run or python -m."""
+     import uvicorn
+
+     uvicorn.run(app, host="0.0.0.0", port=8000)
+
+
+ if __name__ == "__main__":
+     main()
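The signature-introspection trick in `_create_overflow_app` generalizes: inspect the first parameter's annotation to decide whether a builder wants a ready-made instance or the factory itself. A self-contained sketch of the same decision rule (the names `pick_arg` and `create_app_v1`/`create_app_v2` are illustrative, not part of the repo or of openenv-core):

```python
import inspect

def pick_arg(builder, factory):
    # Same rule as _create_overflow_app: an "Environment" annotation (with no
    # "Callable" in it) means the builder wants an instance; anything else,
    # including a missing annotation, means it takes the factory directly.
    try:
        first = next(iter(inspect.signature(builder).parameters.values()))
        annotation_text = str(first.annotation)
    except (StopIteration, TypeError, ValueError):
        annotation_text = "typing.Callable"
    expects_instance = (
        "Environment" in annotation_text and "Callable" not in annotation_text
    )
    return factory() if expects_instance else factory

class Environment:
    pass

def create_app_v1(env: "Environment"):  # variant that takes an instance
    return env

def create_app_v2(env_factory):         # variant that takes a factory
    return env_factory
```

Matching on the annotation text is deliberately loose; it keeps the server importable against multiple openenv-core releases without pinning to one `create_app` signature.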
server/overflow_environment.py ADDED
@@ -0,0 +1,497 @@
+"""
+Overflow Environment Implementation.
+
+A 2D road grid with N cars. One car (Car 0) is the LLM agent, others follow
+scripted rules. An observer checks for collisions each step. The environment
+returns text observations describing the traffic scene and rewards based on safety.
+
+Observations carry both text (for the LLM) and structured data (for the frontend).
+"""
+
+import math
+import random
+import re
+from dataclasses import dataclass, field
+from typing import Any, List, Optional
+from uuid import uuid4
+
+try:
+    from openenv.core.env_server.interfaces import Environment
+    from openenv.core.env_server.types import State
+except ImportError:
+    class Environment:  # stub for training-only mode
+        pass
+    class State:
+        pass
+
+try:
+    from ..models import (
+        CarStateData, LaneOccupancyData, OverflowAction,
+        OverflowObservation, OverflowState, Position, ProximityData,
+    )
+    from ..policies.flat_mlp_policy import FlatMLPPolicy
+    from ..policies.ticket_attention_policy import TicketAttentionPolicy
+    from ..policies.policy_spec import OBS_DIM
+    from .policy_adapter import overflow_obs_to_policy_obs, policy_action_to_decision
+except ImportError:
+    from models import (
+        CarStateData, LaneOccupancyData, OverflowAction,
+        OverflowObservation, OverflowState, Position, ProximityData,
+    )
+    from policies.flat_mlp_policy import FlatMLPPolicy
+    from policies.ticket_attention_policy import TicketAttentionPolicy
+    from policies.policy_spec import OBS_DIM
+    from server.policy_adapter import overflow_obs_to_policy_obs, policy_action_to_decision
+
+# --- Constants ---
+NUM_LANES = 3
+ROAD_LENGTH = 200
+NUM_CARS = 5
+MAX_STEPS = 100
+CRASH_DISTANCE = 5.0
+NEAR_MISS_DISTANCE = 15.0
+LANE_WIDTH = 3.7  # metres — matches frontend's makeCar convention
+
+# Reward values
+REWARD_CRASH = -5.0
+REWARD_NEAR_MISS = -1.0
+REWARD_SAFE_STEP = 0.5
+REWARD_REACHED_GOAL = 3.0
+REWARD_REASONING_MAX = 0.3
+
+# Speed bounds
+MIN_SPEED = 20
+MAX_SPEED = 90
+SPEED_DELTA = 5
+
+
+@dataclass
+class Car:
+    """Represents a car on the road grid."""
+
+    car_id: int
+    lane: int  # 1-indexed: 1, 2, or 3
+    position: float
+    speed: float
+    goal_position: float
+    is_agent: bool = False
+    reached_goal: bool = False
+    prev_speed: float = 0.0  # speed last step, for acceleration calc
+
+    def distance_to(self, other: "Car") -> float:
+        """Euclidean-ish distance considering lane and position."""
+        lane_diff = abs(self.lane - other.lane) * 10.0  # lanes are ~10 units apart
+        pos_diff = abs(self.position - other.position)
+        return math.sqrt(lane_diff**2 + pos_diff**2)
+
+    @property
+    def acceleration(self) -> float:
+        """Speed delta since last step."""
+        return self.speed - self.prev_speed
+
+    def to_state_data(self) -> CarStateData:
+        """Convert to frontend-compatible CarStateData."""
+        return CarStateData(
+            carId=self.car_id,
+            lane=self.lane,
+            position=Position(x=self.position, y=self.lane * LANE_WIDTH),
+            speed=self.speed,
+            acceleration=self.acceleration,
+        )
+
+
+def _parse_decision(action: OverflowAction) -> str:
+    """Extract a valid decision from the action, being forgiving about format."""
+    valid = {"accelerate", "brake", "lane_change_left", "lane_change_right", "maintain"}
+
+    # Try the decision field directly
+    decision = action.decision.strip().lower().replace(" ", "_")
+    if decision in valid:
+        return decision
+
+    # Try to extract from free text (the LLM might wrap it in tags)
+    text = f"{action.decision} {action.reasoning}".lower()
+
+    # Check for <action>...</action> tags
+    match = re.search(r"<action>\s*(\w+)\s*</action>", text)
+    if match:
+        candidate = match.group(1).strip().replace(" ", "_")
+        if candidate in valid:
+            return candidate
+
+    # Check for keywords anywhere (ordered: most specific first to avoid ambiguity)
+    for v in ["lane_change_left", "lane_change_right", "accelerate", "brake", "maintain"]:
+        if v in text:
+            return v
+
+    return "maintain"
+
+
+def _compute_reasoning_bonus(reasoning: str) -> float:
+    """
+    Compute a small reasoning quality bonus (0.0 to 0.3).
+
+    Gives a minor reward for providing structured reasoning, kept low
+    so driving performance remains the dominant training signal.
+    """
+    if not reasoning:
+        return 0.0
+
+    score = 0.0
+    lower = reasoning.lower()
+
+    # Small bonus for providing any reasoning at all
+    if len(reasoning) > 20:
+        score += 0.1
+
+    # Bonus for structured reasoning (not just keyword stuffing)
+    if "<think>" in lower or "because" in lower:
+        score += 0.1
+    if any(word in lower for word in ["therefore", "so i should", "best option", "i will"]):
+        score += 0.1
+
+    return min(score, REWARD_REASONING_MAX)
+
+
+def _scripted_car_action(car: Car, all_cars: List[Car], rng: random.Random) -> str:
+    """
+    Simple scripted AI for non-agent cars.
+
+    Rules:
+    - If car ahead in same lane is close (< 20 units): brake
+    - If speed is low and random chance: accelerate
+    - Otherwise: maintain
+    """
+    # Find nearest car ahead in same lane
+    nearest_ahead_dist = float("inf")
+    for other in all_cars:
+        if other.car_id == car.car_id:
+            continue
+        if other.lane == car.lane and other.position > car.position:
+            dist = other.position - car.position
+            if dist < nearest_ahead_dist:
+                nearest_ahead_dist = dist
+
+    if nearest_ahead_dist < 20:
+        return "brake"
+
+    if car.speed < 60 and rng.random() < 0.1:
+        return "accelerate"
+
+    # Occasionally change lanes to make traffic more dynamic
+    if rng.random() < 0.05:
+        if car.lane > 1 and rng.random() < 0.5:
+            return "lane_change_left"
+        elif car.lane < NUM_LANES:
+            return "lane_change_right"
+
+    return "maintain"
+
+
+def _apply_action(car: Car, decision: str) -> None:
+    """Apply a driving decision to a car, mutating it in place."""
+    if decision == "accelerate":
+        car.speed = min(car.speed + SPEED_DELTA, MAX_SPEED)
+    elif decision == "brake":
+        car.speed = max(car.speed - SPEED_DELTA, MIN_SPEED)
+    elif decision == "lane_change_left":
+        if car.lane > 1:
+            car.lane -= 1
+    elif decision == "lane_change_right":
+        if car.lane < NUM_LANES:
+            car.lane += 1
+    # "maintain" — no change
+
+
+def _generate_scene_description(agent_car: Car, cars: List[Car]) -> str:
+    """Generate a text description of the current traffic scene."""
+    lines = [
+        f"You are Car 0 in lane {agent_car.lane}, position {agent_car.position:.0f}, speed {agent_car.speed:.0f}.",
+        f"Goal: reach position {agent_car.goal_position:.0f}.",
+        "Nearby cars:",
+    ]
+
+    for car in cars:
+        if car.car_id == agent_car.car_id:
+            continue
+
+        detail = f"- Car {car.car_id}: lane {car.lane}, position {car.position:.0f}, speed {car.speed:.0f}"
+
+        # Add context about relative position
+        if car.lane == agent_car.lane:
+            pos_diff = car.position - agent_car.position
+            if pos_diff > 0:
+                detail += f" [AHEAD IN YOUR LANE - {pos_diff:.0f} units away]"
+            else:
+                detail += f" [BEHIND IN YOUR LANE - {abs(pos_diff):.0f} units away]"
+
+        if car.reached_goal:
+            detail += " [REACHED GOAL]"
+
+        lines.append(detail)
+
+    return "\n".join(lines)
+
+
+def _build_structured_data(
+    cars: List[Car],
+    proximity_pairs: List[ProximityData],
+) -> tuple[List[CarStateData], List[LaneOccupancyData]]:
+    """Build structured arrays for the observation."""
+    cars_data = [c.to_state_data() for c in cars]
+
+    # Lane occupancies
+    lane_map: dict[int, list[int]] = {}
+    for car in cars:
+        if not car.reached_goal:
+            lane_map.setdefault(car.lane, []).append(car.car_id)
+    lane_occupancies = [
+        LaneOccupancyData(lane=lane, carIds=ids)
+        for lane, ids in sorted(lane_map.items())
+    ]
+
+    return cars_data, lane_occupancies
+
+
+class OverflowEnvironment(Environment):
+    """
+    Autonomous vehicle fleet oversight environment.
+
+    A 2D road grid with N cars. Car 0 is the LLM agent, others follow
+    scripted rules. The observer detects crashes and near-misses and
+    computes rewards based on safety.
+    """
+
+    def __init__(self):
+        super().__init__()
+        self._state = OverflowState(episode_id=str(uuid4()))
+        self._cars: List[Car] = []
+        self._rng = random.Random()
+        self._done = False
+        self._last_obs: Optional[OverflowObservation] = None
+        self._policies = {
+            "flat_mlp": FlatMLPPolicy(obs_dim=OBS_DIM),
+            "ticket_attention": TicketAttentionPolicy(obs_dim=OBS_DIM),
+        }
+
+    def _build_observation(
+        self,
+        incident_report: str,
+        reward: float,
+        proximities: Optional[List[ProximityData]] = None,
+    ) -> OverflowObservation:
+        """Build a full observation with text + structured data."""
+        agent = self._cars[0]
+        scene = _generate_scene_description(agent, self._cars)
+        prox = proximities or []
+        cars_data, lane_occ = _build_structured_data(self._cars, prox)
+
+        return OverflowObservation(
+            scene_description=scene,
+            incident_report=incident_report,
+            done=self._done,
+            reward=reward,
+            cars=cars_data,
+            proximities=prox,
+            lane_occupancies=lane_occ,
+        )
+
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        **kwargs: Any,
+    ) -> OverflowObservation:
+        """Reset the environment: create road and spawn cars."""
+        if seed is not None:
+            self._rng = random.Random(seed)
+        else:
+            self._rng = random.Random()
+
+        self._state = OverflowState(
+            episode_id=episode_id or str(uuid4()),
+            step_count=0,
+            crash_count=0,
+            near_miss_count=0,
+            cars_reached_goal=0,
+            total_cars=NUM_CARS,
+        )
+        self._done = False
+
+        # Spawn cars with random positions, speeds, lanes, and goals
+        self._cars = []
+
+        for i in range(NUM_CARS):
+            # Ensure no two cars spawn within crash distance
+            for _attempt in range(100):
+                lane = self._rng.randint(1, NUM_LANES)
+                position = float(self._rng.randint(10, 80))
+                too_close = False
+                for existing in self._cars:
+                    lane_diff = abs(lane - existing.lane) * 10.0
+                    pos_diff = abs(position - existing.position)
+                    dist = math.sqrt(lane_diff**2 + pos_diff**2)
+                    if dist < CRASH_DISTANCE * 2:
+                        too_close = True
+                        break
+                if not too_close:
+                    break
+
+            speed = float(self._rng.randint(40, 70))
+            goal = float(self._rng.randint(160, 195))
+
+            self._cars.append(
+                Car(
+                    car_id=i,
+                    lane=lane,
+                    position=position,
+                    speed=speed,
+                    goal_position=goal,
+                    is_agent=(i == 0),
+                    prev_speed=speed,  # no delta on first step
+                )
+            )
+
+        self._last_obs = self._build_observation(incident_report="", reward=0.0)
+        return self._last_obs
+
+    def step(
+        self,
+        action: OverflowAction,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> OverflowObservation:
+        """Execute one simulation step."""
+        if self._done:
+            return self._build_observation(
+                incident_report="Episode is over. Call reset() to start a new one.",
+                reward=0.0,
+            )
+
+        # Policy intercept: decision="policy:flat_mlp" or "policy:ticket_attention"
+        if action.decision.startswith("policy:") and self._last_obs is not None:
+            policy_name = action.decision.split(":", 1)[1].lower()
+            if policy_name in self._policies:
+                obs_vec = overflow_obs_to_policy_obs(self._last_obs)
+                act_vec = self._policies[policy_name].predict(obs_vec)
+                decision, reasoning = policy_action_to_decision(act_vec)
+                action = OverflowAction(
+                    decision=decision,
+                    reasoning=f"[{policy_name}] {reasoning}",
+                )
+
+        self._state.step_count += 1
+        reward = 0.0
+        incidents = []
+
+        # Snapshot previous speeds for acceleration tracking
+        for car in self._cars:
+            car.prev_speed = car.speed
+
+        # 1. Parse and apply the agent's action to Car 0
+        decision = _parse_decision(action)
+        _apply_action(self._cars[0], decision)
+
+        # 2. Compute and apply scripted actions for Cars 1-N
+        for car in self._cars[1:]:
+            if car.reached_goal:
+                continue
+            scripted_decision = _scripted_car_action(car, self._cars, self._rng)
+            _apply_action(car, scripted_decision)
+
+        # 3. Move all cars forward based on speed (speed is in units/step, scaled down)
+        for car in self._cars:
+            if car.reached_goal:
+                continue
+            car.position += car.speed * 0.1  # scale factor for reasonable movement
+
+        # 4. Collision detection (pairwise)
+        agent_crash = False
+        proximity_list: List[ProximityData] = []
+        active_cars = [c for c in self._cars if not c.reached_goal]
+        agent_id = self._cars[0].car_id
+        for i in range(len(active_cars)):
+            for j in range(i + 1, len(active_cars)):
+                dist = active_cars[i].distance_to(active_cars[j])
+                involves_agent = active_cars[i].car_id == agent_id or active_cars[j].car_id == agent_id
+                if dist < CRASH_DISTANCE:
+                    self._state.crash_count += 1
+                    proximity_list.append(
+                        ProximityData(
+                            carA=active_cars[i].car_id,
+                            carB=active_cars[j].car_id,
+                            distance=round(dist, 2),
+                        )
+                    )
+                    incidents.append(
+                        f"CRASH between Car {active_cars[i].car_id} and Car {active_cars[j].car_id}! "
+                        f"(distance: {dist:.1f})"
+                    )
+                    if involves_agent:
+                        agent_crash = True
+                elif dist < NEAR_MISS_DISTANCE:
+                    self._state.near_miss_count += 1
+                    # Only penalize near misses involving the agent
+                    if involves_agent:
+                        reward += REWARD_NEAR_MISS
+                    proximity_list.append(
+                        ProximityData(
+                            carA=active_cars[i].car_id,
+                            carB=active_cars[j].car_id,
+                            distance=round(dist, 2),
+                        )
+                    )
+                    incidents.append(
+                        f"NEAR MISS between Car {active_cars[i].car_id} and Car {active_cars[j].car_id} "
+                        f"(distance: {dist:.1f})"
+                    )
+
+        if agent_crash:
+            reward += REWARD_CRASH
+            self._done = True
+        else:
+            # 5. Goal check for agent car
+            agent = self._cars[0]
+            if agent.position >= agent.goal_position:
+                agent.reached_goal = True
+                self._state.cars_reached_goal += 1
+                reward += REWARD_REACHED_GOAL
+                incidents.append(
+                    f"Car 0 reached its goal at position {agent.goal_position:.0f}!"
+                )
+                self._done = True
+
+        # Check goal for scripted cars too (for state tracking)
+        for car in self._cars[1:]:
+            if not car.reached_goal and car.position >= car.goal_position:
+                car.reached_goal = True
+                self._state.cars_reached_goal += 1
+
+        # 6. Safe step bonus (no crash, agent still active)
+        if not self._done:
+            reward += REWARD_SAFE_STEP
+
+        # 7. Reasoning quality bonus
+        reasoning_bonus = _compute_reasoning_bonus(action.reasoning)
+        reward += reasoning_bonus
+
+        # 8. Max steps check
+        if self._state.step_count >= MAX_STEPS and not self._done:
+            self._done = True
+            incidents.append(f"Maximum steps ({MAX_STEPS}) reached.")
+
+        incident_report = (
+            "\n".join(incidents) if incidents else "Observer: No incidents this step."
+        )
+
+        self._last_obs = self._build_observation(
+            incident_report=incident_report,
+            reward=reward,
+            proximities=proximity_list,
+        )
+        return self._last_obs
+
+    @property
+    def state(self) -> OverflowState:
+        """Get the current environment state."""
+        return self._state
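The observer's crash/near-miss classification reduces to the lane-weighted distance from `Car.distance_to` plus the two thresholds `CRASH_DISTANCE` and `NEAR_MISS_DISTANCE`. A self-contained sketch of just that metric and classification:

```python
import math

CRASH_DISTANCE = 5.0
NEAR_MISS_DISTANCE = 15.0


def pair_distance(lane_a: int, pos_a: float, lane_b: int, pos_b: float) -> float:
    # Same metric as Car.distance_to: lanes count as ~10 units of lateral separation.
    lane_diff = abs(lane_a - lane_b) * 10.0
    pos_diff = abs(pos_a - pos_b)
    return math.sqrt(lane_diff ** 2 + pos_diff ** 2)


def classify(dist: float) -> str:
    # Mirrors the pairwise check in OverflowEnvironment.step.
    if dist < CRASH_DISTANCE:
        return "crash"
    if dist < NEAR_MISS_DISTANCE:
        return "near_miss"
    return "clear"
```

Note the consequence: two cars in adjacent lanes are always exactly 10 units apart laterally, so they can trigger a near miss but never a crash; crashes require same-lane proximity.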
server/policy_adapter.py ADDED
@@ -0,0 +1,80 @@
+"""
+Adapter between OverflowObservation (2D road grid) and the OpenENV policy
+observation format (ego state + ticket matrix).
+
+Nearby cars are converted to collision_risk tickets so TicketAttentionPolicy
+can reason about them using the same mechanism it was designed for.
+"""
+
+from __future__ import annotations
+
+import math
+import numpy as np
+
+try:
+    from ..policies.policy_spec import build_obs, build_ticket_vector, OBS_DIM
+except ImportError:
+    from policies.policy_spec import build_obs, build_ticket_vector, OBS_DIM
+
+
+def overflow_obs_to_policy_obs(obs) -> np.ndarray:
+    """OverflowObservation → 603-dim numpy vector for our policies."""
+    cars = obs.cars
+    if not cars:
+        return np.zeros(OBS_DIM, dtype=np.float32)
+
+    ego = next((c for c in cars if c.carId == 0), cars[0])
+    ego_speed_ms = ego.speed / 4.5  # OverflowEnv speed units → m/s
+    ego_x = ego.position.x
+    ego_y = (ego.lane - 2) * 3.7  # lane → lateral metres
+
+    ticket_vectors = []
+    for car in cars:
+        if car.carId == 0:
+            continue
+        rel_x = car.position.x - ego.position.x
+        rel_y = (car.lane - ego.lane) * 3.7
+        car_spd = car.speed / 4.5
+        distance = math.sqrt(rel_x ** 2 + rel_y ** 2)
+        if distance > 80:
+            continue
+        closing = max(ego_speed_ms - car_spd * math.copysign(1, max(rel_x, 0.01)), 0.1)
+        ttc = min(distance / closing, 30.0)
+        severity = 1.0 if distance < 8 else (0.75 if distance < 15 else 0.5)
+        ticket_vectors.append(build_ticket_vector(
+            severity_weight=severity, ttl=5.0,
+            pos_x=rel_x, pos_y=rel_y, pos_z=0.0,
+            vel_x=car_spd, vel_y=0.0, vel_z=0.0,
+            heading=0.0,
+            size_length=4.0, size_width=2.0, size_height=1.5,
+            distance=distance, time_to_collision=ttc,
+            bearing=math.atan2(rel_y, max(rel_x, 0.01)),
+            ticket_type="collision_risk", entity_type="vehicle", confidence=1.0,
+        ))
+
+    tv = np.array(ticket_vectors, dtype=np.float32) if ticket_vectors else None
+    return build_obs(
+        ego_x=ego_x, ego_y=ego_y, ego_z=0.0,
+        ego_vx=ego_speed_ms, ego_vy=0.0,
+        heading=0.0, speed=ego_speed_ms,
+        steer=0.0, throttle=0.5, brake=0.0,
+        ticket_vectors=tv,
+    )
+
+
+def policy_action_to_decision(action_vec: np.ndarray) -> tuple[str, str]:
+    """Continuous [steer, throttle, brake] → (text decision, reasoning)."""
+    steer, throttle, brake = float(action_vec[0]), float(action_vec[1]), float(action_vec[2])
+    if abs(steer) > 0.35:
+        decision = "lane_change_left" if steer < 0 else "lane_change_right"
+        reasoning = f"steer={steer:.2f}: lateral avoidance"
+    elif brake > 0.25:
+        decision = "brake"
+        reasoning = f"brake={brake:.2f}: closing gap"
+    elif throttle > 0.20:
+        decision = "accelerate"
+        reasoning = f"throttle={throttle:.2f}: clear ahead"
+    else:
+        decision = "maintain"
+        reasoning = f"s={steer:.2f} t={throttle:.2f} b={brake:.2f}: holding course"
+    return decision, reasoning
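The threshold mapping in `policy_action_to_decision` can be exercised standalone. A sketch that copies the same cutoffs (|steer| > 0.35, brake > 0.25, throttle > 0.20, checked in that order) and returns only the decision string (the helper name is ours, for illustration):

```python
def map_action_to_decision(action_vec) -> str:
    # Same precedence as policy_action_to_decision: steering wins,
    # then braking, then throttle; anything mild means "maintain".
    steer, throttle, brake = (
        float(action_vec[0]), float(action_vec[1]), float(action_vec[2])
    )
    if abs(steer) > 0.35:
        return "lane_change_left" if steer < 0 else "lane_change_right"
    if brake > 0.25:
        return "brake"
    if throttle > 0.20:
        return "accelerate"
    return "maintain"
```

Because steering is checked first, a vector like `[-0.6, 0.9, 0.9]` still produces a lane change, which matches the avoidance-first priority of the adapter.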
server/requirements.txt ADDED
@@ -0,0 +1,8 @@
+--extra-index-url https://download.pytorch.org/whl/cpu
+torch>=2.5.0
+gymnasium>=0.29.0
+openenv-core[core]>=0.2.1
+fastapi>=0.115.0
+pydantic>=2.0.0
+uvicorn[standard]>=0.24.0
+requests>=2.31.0
training/__init__.py ADDED
File without changes
training/__pycache__/__init__.cpython-314.pyc ADDED
Binary file (169 Bytes).

training/__pycache__/curriculum.cpython-314.pyc ADDED
Binary file (5.69 kB).

training/__pycache__/overflow_gym_env.cpython-314.pyc ADDED
Binary file (9.75 kB).

training/__pycache__/ppo_trainer.cpython-314.pyc ADDED
Binary file (20.6 kB).

training/__pycache__/reward.cpython-314.pyc ADDED
Binary file (3.47 kB).
 
training/curriculum.py ADDED
@@ -0,0 +1,99 @@
+"""
+CurriculumManager — ported from openenv/training/curriculum.py.
+
+Same 4-stage progression and same reward thresholds. Adapted for
+OverflowEnvironment: no ticket injection (the env has its own scripted
+NPCs), stages instead control training logging and advancement criteria.
+
+Stage 1  No extra pressure.  Goal: learn basic speed + lane keeping.
+Stage 2  Standard traffic.   Goal: survive without crashing.
+Stage 3  Evaluate more.      Goal: consistent goal-reaching.
+Stage 4  Full evaluation.    Goal: high mean reward over long window.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import List
+
+
+@dataclass
+class StageConfig:
+    stage: int
+    name: str
+    description: str
+    advance_threshold: float  # mean episode reward to advance
+    advance_window: int  # consecutive episodes required
+
+
+STAGES: List[StageConfig] = [
+    StageConfig(
+        stage=1, name="Survival",
+        description="Learn basic speed control and lane keeping.",
+        advance_threshold=50.0, advance_window=8,
+    ),
+    StageConfig(
+        stage=2, name="Crash Avoidance",
+        description="Navigate traffic without colliding.",
+        advance_threshold=120.0, advance_window=15,
+    ),
+    StageConfig(
+        stage=3, name="Goal Reaching",
+        description="Consistently reach the goal position.",
+        advance_threshold=200.0, advance_window=15,
+    ),
+    StageConfig(
+        stage=4, name="Mastery",
+        description="High reward, smooth driving, minimal near-misses.",
+        advance_threshold=280.0, advance_window=15,
+    ),
+]
+
+
+class CurriculumManager:
+    """
+    Tracks stage progression based on episode rewards.
+    Same API as openenv CurriculumManager — PPOTrainer calls it unchanged.
+    """
+
+    def __init__(self, seed: int = 0):
+        self._stage_idx = 0
+        self._rewards: List[float] = []
+        self._auto_advance = True
+
+    @property
+    def current_stage(self) -> int:
+        return STAGES[self._stage_idx].stage
+
+    @property
+    def config(self) -> StageConfig:
+        return STAGES[self._stage_idx]
+
+    def step(self, sim_time: float) -> list:
+        """No ticket injection in OverflowEnvironment — always returns []."""
+        return []
+
+    def record_episode_reward(self, reward: float) -> bool:
+        """Record episode reward and advance stage if threshold met."""
+        self._rewards.append(reward)
+        cfg = self.config
+        window = self._rewards[-cfg.advance_window:]
+
+        if (
+            self._auto_advance
+            and len(window) >= cfg.advance_window
+            and sum(window) / len(window) >= cfg.advance_threshold
+            and self._stage_idx < len(STAGES) - 1
+        ):
+            self._stage_idx += 1
+            self._rewards = []
+            print(f"[Curriculum] Advanced to Stage {self.current_stage}: {self.config.name}")
+            return True
+        return False
+
+    def force_stage(self, stage: int) -> None:
+        idx = stage - 1
+        if 0 <= idx < len(STAGES):
+            self._stage_idx = idx
+            self._rewards = []
+            print(f"[Curriculum] Forced to Stage {stage}: {self.config.name}")
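The advancement rule in `record_episode_reward` is: once the last `advance_window` episodes average at or above `advance_threshold`, move up a stage and clear the reward history. A trimmed, self-contained sketch using Stage 1's numbers (threshold 50.0, window 8):

```python
class MiniCurriculum:
    """Reduced sketch of CurriculumManager's advancement rule (illustrative class)."""

    def __init__(self, threshold: float = 50.0, window: int = 8, n_stages: int = 4):
        self.stage = 1
        self.threshold = threshold
        self.window = window
        self.n_stages = n_stages
        self.rewards: list[float] = []

    def record(self, reward: float) -> bool:
        self.rewards.append(reward)
        recent = self.rewards[-self.window:]
        if (
            len(recent) >= self.window
            and sum(recent) / len(recent) >= self.threshold
            and self.stage < self.n_stages
        ):
            self.stage += 1
            self.rewards = []  # history resets, as in record_episode_reward
            return True
        return False
```

Clearing the history on advancement means a stage can never be passed on rewards earned under the previous stage's conditions.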
training/overflow_gym_env.py ADDED
@@ -0,0 +1,170 @@
+"""
+Gymnasium wrapper around OverflowEnvironment.
+
+Bridges the gap between OverflowEnvironment (text actions, structured obs)
+and our PPO trainer (continuous actions, numeric obs vector).
+
+Observation: 603-dim float32 vector (same layout as CarEnv3D — ego state +
+    collision-risk ticket matrix built from nearby cars)
+
+Action: [steer, throttle, brake] all in [-1, 1]
+    → mapped to text decision for OverflowEnvironment
+
+This makes OverflowEnvironment a drop-in replacement for CarEnv3D so that
+FlatMLPPolicy and TicketAttentionPolicy train with the exact same PPO loop.
+"""
+
+from __future__ import annotations
+
+import math
+from typing import Any, Dict, Optional, Tuple
+
+import numpy as np
+import gymnasium as gym
+from gymnasium import spaces
+
+from ..server.overflow_environment import OverflowEnvironment
+from ..models import OverflowAction
+from ..policies.policy_spec import (
+    build_obs, build_ticket_vector, OBS_DIM,
+)
+from .reward import compute_reward
+
+
+# ── Action mapping ────────────────────────────────────────────────────────────
+
+def _action_to_decision(action: np.ndarray) -> str:
+    steer, throttle, brake = float(action[0]), float(action[1]), float(action[2])
+    if abs(steer) > 0.35:
+        return "lane_change_left" if steer < 0 else "lane_change_right"
+    if brake > 0.25:
+        return "brake"
+    if throttle > 0.20:
+        return "accelerate"
+    return "maintain"
+
+
+# ── Observation extraction ────────────────────────────────────────────────────
+
+def _obs_to_vector(overflow_obs) -> np.ndarray:
+    """OverflowObservation → 603-dim numpy vector matching policy_spec layout."""
+    cars = overflow_obs.cars
+    if not cars:
+        return np.zeros(OBS_DIM, dtype=np.float32)
+
+    ego = next((c for c in cars if c.carId == 0), cars[0])
+    ego_speed_ms = ego.speed / 4.5
+    ego_x = ego.position.x
+    ego_y = (ego.lane - 2) * 3.7
+
+    ticket_vectors = []
+    for car in cars:
+        if car.carId == 0:
+            continue
+        rel_x = car.position.x - ego.position.x
+        rel_y = (car.lane - ego.lane) * 3.7
+        car_spd = car.speed / 4.5
+        distance = math.sqrt(rel_x ** 2 + rel_y ** 2)
+        if distance > 80:
+            continue
+        closing = max(ego_speed_ms - car_spd * math.copysign(1, max(rel_x, 0.01)), 0.1)
+        ttc = min(distance / closing, 30.0)
+        severity = 1.0 if distance < 8 else (0.75 if distance < 15 else 0.5)
+        ticket_vectors.append(build_ticket_vector(
+            severity_weight=severity, ttl=5.0,
+            pos_x=rel_x, pos_y=rel_y, pos_z=0.0,
+            vel_x=car_spd, vel_y=0.0, vel_z=0.0,
+            heading=0.0,
+            size_length=4.0, size_width=2.0, size_height=1.5,
+            distance=distance, time_to_collision=ttc,
+            bearing=math.atan2(rel_y, max(rel_x, 0.01)),
+            ticket_type="collision_risk", entity_type="vehicle", confidence=1.0,
+        ))
+
+    tv = np.array(ticket_vectors, dtype=np.float32) if ticket_vectors else None
+    return build_obs(
+        ego_x=ego_x, ego_y=ego_y, ego_z=0.0,
+        ego_vx=ego_speed_ms, ego_vy=0.0,
+        heading=0.0, speed=ego_speed_ms,
+        steer=0.0, throttle=0.5, brake=0.0,
+        ticket_vectors=tv,
+    )
+
+
+# ── Gymnasium wrapper ─────────────────────────────────────────────────────────
+
+class OverflowGymEnv(gym.Env):
+    """
+    Gymnasium-compatible wrapper around OverflowEnvironment.
+
+    Provides the same interface as CarEnv3D so PPOTrainer works unchanged.
+    """
+
+    metadata = {"render_modes": []}
+
+    def __init__(self):
+        super().__init__()
+        self._env = OverflowEnvironment()
+        self._last_overflow_obs = None
+        self._prev_action = np.zeros(3, dtype=np.float32)
+        self._sim_time = 0.0  # incremented each step (mirrors CarEnv3D._sim_time)
+        self._step_dt = 0.1  # seconds per step
+
+        self.observation_space = spaces.Box(
+            low=-1.0, high=1.0, shape=(OBS_DIM,), dtype=np.float32
+        )
+        self.action_space = spaces.Box(
+            low=-1.0, high=1.0, shape=(3,), dtype=np.float32
+        )
+
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        options: Optional[Dict[str, Any]] = None,
+    ) -> Tuple[np.ndarray, Dict]:
+        super().reset(seed=seed)
+        self._last_overflow_obs = self._env.reset(seed=seed)
+        self._prev_action = np.zeros(3, dtype=np.float32)
+        self._sim_time = 0.0
+        return _obs_to_vector(self._last_overflow_obs), {}
+
+    def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, bool, Dict]:
+        decision = _action_to_decision(action)
+        overflow_action = OverflowAction(decision=decision, reasoning="")
+        overflow_obs = self._env.step(overflow_action)
+        self._last_overflow_obs = overflow_obs
+        self._sim_time += self._step_dt
+
+        obs_vec = _obs_to_vector(overflow_obs)
+
+        # Extract signals for reward shaping
+        ego = next((c for c in overflow_obs.cars if c.carId == 0), None)
+        ego_speed_ms = (ego.speed / 4.5) if ego else 0.0
+        ego_y = ((ego.lane - 2) * 3.7) if ego else 0.0
+
+        collision = any("CRASH" in p for p in (overflow_obs.incident_report or "").split("\n")
+                        if "Car 0" in p)
+        goal_reached = overflow_obs.done and not collision
+
+        reward = compute_reward(
+            ego_speed=ego_speed_ms,
+            ego_y=ego_y,
+            action=action,
+            prev_action=self._prev_action,
+            collision=collision,
+            goal_reached=goal_reached,
+            near_miss="NEAR MISS" in (overflow_obs.incident_report or ""),
+            raw_reward=overflow_obs.reward or 0.0,
+        )
+
+        self._prev_action = action.copy()
+
+        terminated = overflow_obs.done
+        truncated = False
+        info: Dict[str, Any] = {
+            "collision": collision,
+            "goal_reached": goal_reached,
+            "incident": overflow_obs.incident_report,
+        }
+
+        return obs_vec, reward, terminated, truncated, info
training/ppo_trainer.py ADDED
@@ -0,0 +1,329 @@
+ """
+ PPO trainer — ported directly from openenv/training/ppo_trainer.py.
+
+ Same algorithm, same hyperparameters, same GAE implementation.
+ Only change: uses OverflowGymEnv instead of CarEnv3D.
+
+ Usage:
+     from overflow_env.training.ppo_trainer import run_training
+     run_training(policy_type="attention", total_steps=2_000_000)
+ """
+
+ from __future__ import annotations
+
+ import time
+ from collections import deque
+ from pathlib import Path
+ from typing import Optional
+
+ import numpy as np
+ import torch
+ import torch.nn as nn
+ import torch.optim as optim
+
+ from .overflow_gym_env import OverflowGymEnv
+ from .curriculum import CurriculumManager
+ from .reward import compute_episode_bonus
+ from ..policies.base_policy import BasePolicy
+ from ..policies.policy_spec import OBS_DIM
+
+
+ # ── Rollout buffer ─────────────────────────────────────────────────────────────
+ # Identical to openenv/training/ppo_trainer.py
+
+ class RolloutBuffer:
+     def __init__(self, n_steps: int, obs_dim: int, device: torch.device):
+         self.n = n_steps
+         self.obs = torch.zeros(n_steps, obs_dim, device=device)
+         self.acts = torch.zeros(n_steps, 3, device=device)
+         self.rew = torch.zeros(n_steps, device=device)
+         self.val = torch.zeros(n_steps, device=device)
+         self.logp = torch.zeros(n_steps, device=device)
+         self.done = torch.zeros(n_steps, device=device)
+         self.ptr = 0
+
+     def add(self, obs, act, rew, val, logp, done):
+         i = self.ptr
+         self.obs[i] = torch.as_tensor(obs, dtype=torch.float32)
+         self.acts[i] = torch.as_tensor(act, dtype=torch.float32)
+         self.rew[i] = float(rew)
+         self.val[i] = float(val)
+         self.logp[i] = float(logp)
+         self.done[i] = float(done)
+         self.ptr += 1
+
+     def full(self) -> bool:
+         return self.ptr >= self.n
+
+     def reset(self):
+         self.ptr = 0
+
+     def compute_returns(self, last_val: float, gamma: float, gae_lambda: float):
+         """Generalized Advantage Estimation — identical to openenv."""
+         adv = torch.zeros_like(self.rew)
+         gae = 0.0
+         for t in reversed(range(self.n)):
+             next_val = last_val if t == self.n - 1 else float(self.val[t + 1])
+             delta = self.rew[t] + gamma * next_val * (1 - self.done[t]) - self.val[t]
+             gae = delta + gamma * gae_lambda * (1 - self.done[t]) * gae
+             adv[t] = gae
+         self.ret = adv + self.val
+
+
+ # ── PPO Trainer ────────────────────────────────────────────────────────────────
+
+ class PPOTrainer:
+     """
+     Identical to openenv PPOTrainer — same hyperparameters, same PPO update.
+     Environment is OverflowGymEnv instead of CarEnv3D.
+     """
+
+     def __init__(
+         self,
+         policy: BasePolicy,
+         env: OverflowGymEnv,
+         curriculum: Optional[CurriculumManager] = None,
+         # PPO hyperparameters — same defaults as openenv
+         lr: float = 3e-4,
+         gamma: float = 0.99,
+         gae_lambda: float = 0.95,
+         clip_range: float = 0.2,
+         clip_range_vf: float = 0.2,
+         ent_coef: float = 0.02,
+         vf_coef: float = 0.5,
+         max_grad_norm: float = 0.5,
+         n_steps: int = 2048,
+         batch_size: int = 256,
+         n_epochs: int = 10,
+         save_dir: str = "checkpoints",
+         log_interval: int = 10,
+         device: str = "auto",
+     ):
+         self.policy = policy
+         self.env = env
+         self.curriculum = curriculum or CurriculumManager()
+         self.gamma = gamma
+         self.gae_lambda = gae_lambda
+         self.clip = clip_range
+         self.clip_vf = clip_range_vf
+         self.ent_coef = ent_coef
+         self.vf_coef = vf_coef
+         self.max_grad = max_grad_norm
+         self.n_steps = n_steps
+         self.batch_size = batch_size
+         self.n_epochs = n_epochs
+         self.log_every = log_interval
+         self.save_dir = Path(save_dir)
+         self.save_dir.mkdir(parents=True, exist_ok=True)
+
+         if device == "auto":
+             device = "cuda" if torch.cuda.is_available() else \
+                 "mps" if torch.backends.mps.is_available() else "cpu"
+         self.device = torch.device(device)
+         self.policy.to(self.device)
+
+         self.optimizer = optim.Adam(policy.parameters(), lr=lr, eps=1e-5)
+         self.scheduler = optim.lr_scheduler.LinearLR(
+             self.optimizer, start_factor=1.0, end_factor=0.1, total_iters=500,
+         )
+
+         self.buffer = RolloutBuffer(n_steps, OBS_DIM, self.device)
+
+         self.ep_rewards = deque(maxlen=100)
+         self.ep_lengths = deque(maxlen=100)
+         self.total_steps = 0
+         self.n_updates = 0
+
+     # ── Main training loop ─────────────────────────────────────────────────────
+
+     def train(self, total_steps: int = 2_000_000) -> None:
+         print(f"\n{'='*70}", flush=True)
+         print(f" OpenENV PPO Training — policy={self.policy.__class__.__name__}", flush=True)
+         print(f" total_steps={total_steps} n_steps={self.n_steps} lr={self.optimizer.param_groups[0]['lr']:.0e}", flush=True)
+         print(f" gamma={self.gamma} gae_lambda={self.gae_lambda} clip={self.clip} ent_coef={self.ent_coef}", flush=True)
+         print(f"{'='*70}\n", flush=True)
+
+         obs, _ = self.env.reset()
+         ep_reward = 0.0
+         ep_steps = 0
+         t0 = time.time()
+
+         while self.total_steps < total_steps:
+             self.buffer.reset()
+             self.policy.eval()
+
+             # ── Collect rollout ──────────────────────────────────────────────
+             for _ in range(self.n_steps):
+                 # Curriculum step (returns [] for OverflowEnv — kept for API compat)
+                 self.curriculum.step(self.env._sim_time)
+
+                 obs_t = torch.as_tensor(obs, dtype=torch.float32, device=self.device)
+                 with torch.no_grad():
+                     act_mean, val = self.policy(obs_t.unsqueeze(0))
+                     act_mean = act_mean.squeeze(0)
+                     val = val.squeeze(0)
+
+                 dist = torch.distributions.Normal(act_mean, torch.ones_like(act_mean) * 0.3)
+                 action = dist.sample().clamp(-1, 1)
+                 logp = dist.log_prob(action).sum()
+
+                 next_obs, reward, term, trunc, info = self.env.step(action.cpu().numpy())
+
+                 self.buffer.add(
+                     obs, action.cpu().numpy(), reward,
+                     float(val), float(logp), float(term or trunc),
+                 )
+
+                 obs = next_obs
+                 ep_reward += reward
+                 ep_steps += 1
+                 self.total_steps += 1
+
+                 if term or trunc:
+                     bonus = compute_episode_bonus(
+                         total_steps=ep_steps,
+                         survived=not info.get("collision", False),
+                     )
+                     ep_reward += bonus
+                     self.ep_rewards.append(ep_reward)
+                     self.ep_lengths.append(ep_steps)
+                     advanced = self.curriculum.record_episode_reward(ep_reward)
+
+                     outcome = "CRASH" if info.get("collision") else ("GOAL" if info.get("goal_reached") else "timeout")
+                     print(
+                         f" ep#{len(self.ep_rewards):>4d} | "
+                         f"steps={ep_steps:>3d} | "
+                         f"reward={ep_reward:>8.2f} | "
+                         f"outcome={outcome:<8} | "
+                         f"stage={self.curriculum.current_stage} | "
+                         f"total_steps={self.total_steps}",
+                         flush=True,
+                     )
+
+                     obs, _ = self.env.reset()
+                     ep_reward = 0.0
+                     ep_steps = 0
+
+             # ── PPO update ───────────────────────────────────────────────────
+             with torch.no_grad():
+                 obs_t = torch.as_tensor(obs, dtype=torch.float32, device=self.device)
+                 _, last_val = self.policy(obs_t.unsqueeze(0))
+             self.buffer.compute_returns(float(last_val), self.gamma, self.gae_lambda)
+
+             self.policy.train()
+             self._ppo_update()
+             self.n_updates += 1
+             self.scheduler.step()
+
+             elapsed = time.time() - t0
+             sps = self.total_steps / max(elapsed, 1)
+             mean_r = np.mean(self.ep_rewards) if self.ep_rewards else 0.0
+             mean_l = np.mean(self.ep_lengths) if self.ep_lengths else 0.0
+             print(
+                 f"\n[PPO update #{self.n_updates}] "
+                 f"step={self.total_steps} "
+                 f"mean_reward={mean_r:.2f} "
+                 f"mean_ep_len={mean_l:.0f} "
+                 f"stage={self.curriculum.current_stage} "
+                 f"sps={sps:.0f}\n",
+                 flush=True,
+             )
+
+             # ── Checkpoint ───────────────────────────────────────────────────
+             if self.n_updates % 50 == 0:
+                 ckpt = self.save_dir / f"policy_step{self.total_steps}_stage{self.curriculum.current_stage}.pt"
+                 torch.save({
+                     "step": self.total_steps,
+                     "stage": self.curriculum.current_stage,
+                     "policy": self.policy.state_dict(),
+                     "optim": self.optimizer.state_dict(),
+                 }, ckpt)
+                 print(f"[PPO] Saved checkpoint → {ckpt}")
+
+     # ── PPO update pass — identical to openenv ─────────────────────────────────
+
+     def _ppo_update(self):
+         obs = self.buffer.obs
+         acts = self.buffer.acts
+         old_logp = self.buffer.logp
+         adv = self.buffer.ret - self.buffer.val
+         adv = (adv - adv.mean()) / (adv.std() + 1e-8)
+         ret = self.buffer.ret
+         old_val = self.buffer.val
+
+         indices = torch.randperm(self.n_steps, device=self.device)
+
+         for _ in range(self.n_epochs):
+             for start in range(0, self.n_steps, self.batch_size):
+                 idx = indices[start: start + self.batch_size]
+
+                 act_mean, val = self.policy(obs[idx])
+                 val = val.squeeze(-1)
+
+                 dist = torch.distributions.Normal(act_mean, torch.ones_like(act_mean) * 0.3)
+                 logp = dist.log_prob(acts[idx]).sum(dim=-1)
+                 entropy = dist.entropy().sum(dim=-1).mean()
+
+                 ratio = torch.exp(logp - old_logp[idx])
+                 pg_loss1 = -adv[idx] * ratio
+                 pg_loss2 = -adv[idx] * ratio.clamp(1 - self.clip, 1 + self.clip)
+                 pg_loss = torch.max(pg_loss1, pg_loss2).mean()
+
+                 val_unclipped = (val - ret[idx]) ** 2
+                 val_clipped = (
+                     old_val[idx]
+                     + (val - old_val[idx]).clamp(-self.clip_vf, self.clip_vf)
+                     - ret[idx]
+                 ) ** 2
+                 vf_loss = 0.5 * torch.max(val_unclipped, val_clipped).mean()
+
+                 loss = pg_loss + self.vf_coef * vf_loss - self.ent_coef * entropy
+
+                 self.optimizer.zero_grad()
+                 loss.backward()
+                 nn.utils.clip_grad_norm_(self.policy.parameters(), self.max_grad)
+                 self.optimizer.step()
+
+
+ # ── Entry point ────────────────────────────────────────────────────────────────
+
+ def run_training(
+     policy_type: str = "attention",
+     total_steps: int = 2_000_000,
+     start_stage: int = 1,
+     checkpoint: Optional[str] = None,
+     device: str = "auto",
+ ) -> None:
+     from ..policies.ticket_attention_policy import TicketAttentionPolicy
+     from ..policies.flat_mlp_policy import FlatMLPPolicy
+
+     policy_map = {
+         "attention": lambda: TicketAttentionPolicy(obs_dim=OBS_DIM),
+         "mlp": lambda: FlatMLPPolicy(obs_dim=OBS_DIM),
+     }
+     policy = policy_map[policy_type]()
+
+     if checkpoint:
+         ckpt = torch.load(checkpoint, map_location="cpu")
+         policy.load_state_dict(ckpt["policy"])
+         print(f"[PPO] Loaded checkpoint from {checkpoint}")
+
+     env = OverflowGymEnv()
+     cm = CurriculumManager()
+     if start_stage > 1:
+         cm.force_stage(start_stage)
+
+     trainer = PPOTrainer(policy=policy, env=env, curriculum=cm, device=device, n_steps=512)
+     trainer.train(total_steps=total_steps)
+
+
+ if __name__ == "__main__":
+     import argparse
+     p = argparse.ArgumentParser()
+     p.add_argument("--policy", default="attention", choices=["attention", "mlp"])
+     p.add_argument("--steps", default=2_000_000, type=int)
+     p.add_argument("--stage", default=1, type=int)
+     p.add_argument("--checkpoint", default=None)
+     p.add_argument("--device", default="auto")
+     args = p.parse_args()
+     run_training(args.policy, args.steps, args.stage, args.checkpoint, args.device)
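`RolloutBuffer.compute_returns` above is a standard GAE recursion. A pure-Python mirror of the same loop is useful for sanity-checking its masking behavior; the numeric inputs below are made up for illustration:

```python
# Pure-Python mirror of RolloutBuffer.compute_returns:
# delta_t = r_t + gamma * V_{t+1} * (1 - done_t) - V_t, accumulated
# backward with the gamma * lambda discount.
def gae(rewards, values, dones, last_val, gamma=0.99, lam=0.95):
    n = len(rewards)
    adv = [0.0] * n
    g = 0.0
    for t in reversed(range(n)):
        next_val = last_val if t == n - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_val * (1 - dones[t]) - values[t]
        g = delta + gamma * lam * (1 - dones[t]) * g
        adv[t] = g
    returns = [a + v for a, v in zip(adv, values)]
    return adv, returns

# done=1 at the last step masks the bootstrap value entirely:
adv, ret = gae([1.0], [0.0], [1.0], last_val=5.0)
print(adv)  # -> [1.0]
```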
training/reward.py ADDED
@@ -0,0 +1,94 @@
+ """
+ Reward shaping for OverflowEnvironment — ported from openenv/training/reward.py.
+
+ Same core principle: BASE + THREAT_RESPONSE with clear gradient direction.
+ Adapted to OverflowEnvironment's signals (no EventTicket objects — uses
+ collision/near-miss flags and raw reward from the environment).
+
+ BASE:           survival + speed + lane  ~+0.4/step
+ COLLISION:      -50 (terminal)
+ NEAR MISS:      -0.8 per event
+ GOAL REACHED:   +5.0 (terminal bonus)
+ SMOOTH DRIVING: small bonus when no threats
+ """
+
+ from __future__ import annotations
+
+ import numpy as np
+
+ # ── Same weights as openenv/training/reward.py ────────────────────────────────
+
+ W_ALIVE = 0.40
+ W_SPEED = 0.10
+ W_LANE = 0.15
+ W_SMOOTH = 0.03
+ TARGET_SPEED = 11.0  # m/s (~40 km/h)
+ TARGET_SPEED_TOL = 3.0
+
+ W_COLLISION = -50.0
+ W_NEAR_MISS = -0.8
+ W_GOAL = 5.0
+ W_SURVIVE_BONUS = 5.0
+
+ ROAD_HALF_WIDTH = 3.7 * 1.5  # 1.5 lane widths (3.7 m each) of lateral tolerance
+
+
+ def compute_reward(
+     ego_speed: float,
+     ego_y: float,
+     action: np.ndarray,
+     prev_action: np.ndarray,
+     collision: bool,
+     goal_reached: bool,
+     near_miss: bool,
+     raw_reward: float,  # OverflowEnvironment's built-in reward (kept for parity; not used in the shaped sum)
+ ) -> float:
+     """
+     Shaped reward. Mirrors openenv reward structure:
+       - collision → large terminal penalty
+       - base survival + speed + lane keeping
+       - near-miss penalty
+       - goal bonus
+       - smooth driving bonus when clear
+     """
+     if collision:
+         return W_COLLISION
+
+     reward = 0.0
+
+     # 1. Survival
+     reward += W_ALIVE
+
+     # 2. Speed maintenance (same formula as openenv)
+     speed_err = abs(ego_speed - TARGET_SPEED)
+     if speed_err < TARGET_SPEED_TOL:
+         reward += W_SPEED * (1.0 - speed_err / TARGET_SPEED_TOL)
+     else:
+         reward -= 0.03 * min(speed_err - TARGET_SPEED_TOL, 5.0)
+
+     # 3. Lane keeping
+     norm_y = abs(ego_y) / ROAD_HALF_WIDTH
+     reward += W_LANE * max(0.0, 1.0 - norm_y ** 2)
+
+     # 4. Near-miss penalty (W_NEAR_MISS is negative)
+     if near_miss:
+         reward += W_NEAR_MISS
+
+     # 5. Goal bonus
+     if goal_reached:
+         reward += W_GOAL
+
+     # 6. Smooth driving
+     action_delta = np.abs(action - prev_action).sum()
+     reward += W_SMOOTH * max(0.0, 1.0 - action_delta * 3.0)
+
+     return float(reward)
+
+
+ def compute_episode_bonus(total_steps: int, survived: bool) -> float:
+     """End-of-episode bonus — same as openenv."""
+     if not survived:
+         return 0.0
+     bonus = W_SURVIVE_BONUS
+     bonus += min(total_steps, 500) * 0.02  # longevity reward
+     return float(bonus)
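As a quick sanity check on the weights above: a non-terminal step at target speed, lane center, no events, and an unchanged action earns the sum of the four base terms, and a surviving long episode additionally earns the capped longevity bonus. Pure arithmetic, with the constants copied from this file:

```python
# Constants copied from reward.py; arithmetic check only.
W_ALIVE, W_SPEED, W_LANE, W_SMOOTH = 0.40, 0.10, 0.15, 0.03
W_SURVIVE_BONUS = 5.0

# Best-case non-terminal step: every base term at its maximum.
best_step = W_ALIVE + W_SPEED + W_LANE + W_SMOOTH  # 0.68

# Episode bonus for surviving 600 steps; longevity reward caps at 500 steps.
episode_bonus = W_SURVIVE_BONUS + min(600, 500) * 0.02  # 15.0

print(best_step, episode_bonus)
```

So the "~+0.4/step" figure in the module docstring is the survival floor; a well-driven step can earn up to 0.68 before any event penalties.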