# Environment

OpenEnv-compatible RL environment wrapping the car racing game. Provides a typed
observation (egocentric headlight image + scalar features), typed action, curriculum
builder, and an Impala CNN encoder ready to plug into a PPO actor-critic.

## Quick Start

```python
from env import CurriculumBuilder, DriveAction

builder = CurriculumBuilder()

# Training loop
env = builder.next_env()
obs = env.reset()

total_reward = 0.0
while not obs.done:
    # obs.image   : (64, 64, 3) uint8 numpy array
    # obs.speed   : float  0..1
    # obs.on_track: float  1.0 / 0.0
    action = DriveAction(accel=1.0, steer=0.0)
    obs = env.step(action)
    total_reward += obs.reward

advanced = builder.record(total_reward)   # auto-advances curriculum when ready
print(builder.status)
```

## File Structure

```
env/
  models.py       DriveAction and RaceObservation (Pydantic, OpenEnv-compatible)
  environment.py  RaceEnvironment — server-side wrapper around game.rl_splits.CarEnv
  client.py       RaceEnvClient  — OpenEnv WebSocket client
  encoder.py      ImpalaCNN + RaceEncoder (PyTorch) for PPO actor-critic
  curriculum.py   CurriculumBuilder — wraps rl_splits TRAIN/VAL/TEST splits
```

## Observation Space

`RaceObservation` has two parts that feed different network branches:

### Image — `obs.image` → CNN encoder
- Shape: `(64, 64, 3)` uint8
- **Egocentric**: the car always faces up, so the rendered track geometry is heading-invariant
- Rendering pipeline per step:
  1. Blit track surface to offscreen canvas
  2. Draw headlight cone (60° spread, 60 px ahead)
  3. Crop 120×120 px square centred on car (grass-padded at borders)
  4. Rotate so car heading maps to UP
  5. Re-crop centre after rotation padding
  6. Scale to 64×64

### Scalars — `obs.speed / on_track / sin_angle / cos_angle` → MLP encoder

| Field | Range | Purpose |
|-------|-------|---------|
| `speed` | 0..1 | Speed / max_speed. Controls braking decisions. |
| `on_track` | 0 or 1 | Reactive penalty signal. |
| `sin_angle` | −1..1 | Sine of the absolute heading angle. |
| `cos_angle` | −1..1 | Cosine of the absolute heading angle; with `sin_angle` it encodes heading without a wrap-around discontinuity. |
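
To make the table concrete, here is a minimal sketch of how the four scalars could be assembled from raw car state. The helper name, its arguments, and the use of radians are assumptions; the actual construction lives in the environment code.

```python
import math


def scalar_features(speed, max_speed, on_track, heading_rad):
    """Return [speed, on_track, sin_angle, cos_angle] as plain floats.

    Hypothetical helper: argument names and units are illustrative.
    """
    return [
        min(max(speed / max_speed, 0.0), 1.0),  # normalised speed in 0..1
        1.0 if on_track else 0.0,               # binary on-track flag
        math.sin(heading_rad),                  # heading, sine component
        math.cos(heading_rad),                  # heading, cosine component
    ]


# Headings just below 2*pi and just above 0 produce nearly identical features,
# whereas the raw angles would differ by almost 2*pi.
a = scalar_features(5.0, 10.0, True, 2 * math.pi - 0.01)
b = scalar_features(5.0, 10.0, True, 0.01)
```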

**Dropped from original CarEnv obs** (would hurt generalisation to unseen tracks):

| Dropped field | Why |
|--------------|-----|
| `x`, `y` | Absolute screen position — track-specific, causes overfitting |
| `gate_side` | Distance to start/finish gate — meaningless on unseen layouts |

## Action Space

`DriveAction(accel, steer)` — continuous, both clamped to `[−1, 1]` inside `CarEnv.step`.

| Field | Value | Effect |
|-------|-------|--------|
| `accel` | +1 | Full throttle |
| `accel` | −1 | Brake |
| `steer` | +1 | Steer right |
| `steer` | −1 | Steer left |
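
The clamping mentioned above can be sketched in a few lines. This illustrates the behaviour only and is not the code in `CarEnv.step`; `clamp_action` is a hypothetical name.

```python
def clamp_action(accel, steer):
    """Clamp both controls to [-1, 1], as the text says CarEnv.step does."""
    def clamp(v):
        return max(-1.0, min(1.0, float(v)))
    return clamp(accel), clamp(steer)


# Out-of-range policy outputs are saturated rather than rejected.
accel, steer = clamp_action(1.7, -3.0)
```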

## Reward Function

Defined in `game/rl_splits.py:CarEnv.step`. Rewards are **not** scaled by
complexity — all values are fixed, keeping episode returns comparable across
tracks in the same rollout buffer. Complexity only scales the curriculum threshold.

| Term | Trigger | Value | Goal |
|------|---------|-------|------|
| Forward pulse | Every step | `+speed/max_speed × 0.01` | Prevent stalling |
| Off-track | Every step off road | `−0.5` | Stay on road |
| Crash event | on→off transition | `−5.0` | Penalise each boundary hit |
| Lap completion | Gate crossed cleanly | `+50 × time_ratio × dist_ratio` | Fast + efficient path |
| Out of bounds | Terminal | `−100` | Don't leave screen |
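
Read together, the per-step rows of the table could be combined as in the sketch below. How the terms stack is an assumption here: the off-track and crash penalties are both applied on the step where the car leaves the road, and the out-of-bounds penalty is added on top. The authoritative logic is in `CarEnv.step`, and the lap bonus is omitted.

```python
def step_reward(speed, max_speed, on_track, was_on_track, out_of_bounds):
    """Sum the non-lap reward terms for one step (illustrative stacking order)."""
    reward = 0.01 * speed / max_speed      # forward pulse, every step
    if not on_track:
        reward -= 0.5                      # off-track penalty, every step off road
        if was_on_track:
            reward -= 5.0                  # crash event on the on->off transition
    if out_of_bounds:
        reward -= 100.0                    # terminal out-of-bounds penalty
    return reward


cruising = step_reward(8.0, 10.0, on_track=True, was_on_track=True, out_of_bounds=False)
crash = step_reward(8.0, 10.0, on_track=False, was_on_track=True, out_of_bounds=False)
```

The forward pulse is two orders of magnitude smaller than a single off-track step, so it only breaks ties against standing still and cannot pay for leaving the road.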

**Lap completion bonus:**

```
time_ratio = clamp(par_time_steps / actual_lap_steps,  0.5, 2.0)
dist_ratio = clamp(optimal_dist   / actual_lap_dist,   0.5, 1.0)
```

`dist_ratio` is capped at **1.0** — no bonus for going shorter than the track
centreline (that implies off-track corner cutting). `lap_dist` only accumulates
while `on_track=True`, closing the exploit where brief grass-cutting reduced
path length and inflated `dist_ratio`.

| Performance | time_ratio | dist_ratio | Lap bonus |
|------------|-----------|-----------|-----------|
| Faster than par, tight line | 2.0 | 1.0 | +100 |
| On-par, centreline path | 1.0 | 1.0 | +50 |
| Slow, meandering | 0.5 | 0.5 | +12.5 |
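
The three rows above follow directly from the clamped formula. A minimal sketch, with `lap_bonus`, its argument names, and the input numbers chosen purely for illustration:

```python
def clamp(value, lo, hi):
    return max(lo, min(hi, value))


def lap_bonus(par_time_steps, actual_lap_steps, optimal_dist, actual_lap_dist):
    """Lap completion bonus: 50 x time_ratio x dist_ratio with the clamps from the text."""
    time_ratio = clamp(par_time_steps / actual_lap_steps, 0.5, 2.0)
    dist_ratio = clamp(optimal_dist / actual_lap_dist, 0.5, 1.0)
    return 50.0 * time_ratio * dist_ratio


fast_tight = lap_bonus(1000, 500, 800.0, 800.0)     # 2.0 x 1.0 -> 100.0
on_par = lap_bonus(1000, 1000, 800.0, 800.0)        # 1.0 x 1.0 -> 50.0
slow_wide = lap_bonus(1000, 2500, 800.0, 2000.0)    # 0.5 x 0.5 -> 12.5
```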

**Complexity scales the curriculum threshold, not the reward:**

```
effective_threshold = base_threshold × track.complexity
```

Track 16 (C=3.45) requires a window mean of `30 × 3.45 = 103.5` to advance,
which takes consistently good laps, while Track 1 (C=1.0) only needs 30.
Because rewards themselves are unscaled, value-function targets stay in the
same range regardless of which track the agent is currently on.
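
A minimal sketch of the advancement check described above, assuming the curriculum keeps a fixed-size rolling window of episode returns and compares its mean against the complexity-scaled threshold. The class and attribute names are hypothetical, and requiring a full window before advancing is an assumption; `CurriculumBuilder` may differ in detail.

```python
from collections import deque


class AdvanceGate:
    """Rolling-window gate: ready when the window mean reaches base x complexity."""

    def __init__(self, base_threshold, complexity, window):
        self.effective_threshold = base_threshold * complexity
        self.returns = deque(maxlen=window)  # oldest returns fall off automatically

    def record(self, episode_return):
        """Store one episode return and report whether the track counts as mastered."""
        self.returns.append(episode_return)
        window_full = len(self.returns) == self.returns.maxlen
        mean = sum(self.returns) / len(self.returns)
        return window_full and mean >= self.effective_threshold


# Tiny window for demonstration; only the last three returns average above 103.5.
gate = AdvanceGate(base_threshold=30.0, complexity=3.45, window=3)
results = [gate.record(r) for r in (120.0, 90.0, 100.0, 110.0, 105.0)]
```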

## Encoder

`RaceEncoder` fuses both observation branches into a single feature vector for PPO:

```
image (64×64×3)
  └─► ImpalaCNN  →  256-d
                         ├─► cat  →  288-d  →  Actor / Critic heads
scalars (4,)              │
  └─► MLP 4→32→32  →  32-d
```

```python
import torch
from env import RaceEncoder

encoder = RaceEncoder()           # out_features = 288
img     = torch.zeros(4, 3, 64, 64)   # batch of 4, normalised 0..1
scalars = torch.zeros(4, 4)
features = encoder(img, scalars)  # (4, 288)
```
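
`obs.image` arrives as a `(64, 64, 3)` uint8 array, while the encoder call above takes a channels-first float batch in 0..1. A minimal sketch of that conversion with NumPy follows; the helper name is hypothetical, and the package may already ship its own preprocessing.

```python
import numpy as np


def images_to_batch(images):
    """Stack HWC uint8 images into a float32 NCHW batch scaled to 0..1."""
    batch = np.stack(images).astype(np.float32) / 255.0       # (N, 64, 64, 3)
    return np.ascontiguousarray(batch.transpose(0, 3, 1, 2))  # (N, 3, 64, 64)


frames = [np.full((64, 64, 3), 255, dtype=np.uint8) for _ in range(4)]
batch = images_to_batch(frames)
```

The result can be wrapped with `torch.from_numpy(batch)` before it is passed to the encoder.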

### ImpalaCNN vs Nature CNN

| | Nature CNN (DQN) | ImpalaCNN (IMPALA) |
|---|---|---|
| Architecture | 3 plain conv layers | 3 blocks × (Conv + MaxPool + 2 ResBlocks) |
| Skip connections | None | Yes — `x = x + residual(x)` in each block |
| Gradient flow | Vanishes in early layers | Direct path back through shortcuts |
| Sample efficiency | Baseline | Roughly 3–5× better on some visual RL tasks; gains vary by task |
| Inference cost | Fast | Higher (15 conv layers vs 3), though still light at 64×64 input |
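
For reference, the block structure in the right-hand column can be sketched in PyTorch as below. This follows the commonly used IMPALA layout with channel widths (16, 32, 32) and a 256-d output; those widths, the class names, and the layer details are assumptions and are not taken from `encoder.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    """Two 3x3 convs with a skip connection: x + f(x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = self.conv1(F.relu(x))
        out = self.conv2(F.relu(out))
        return x + out


class ImpalaBlock(nn.Module):
    """Conv -> MaxPool (halves resolution) -> 2 ResBlocks."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res1 = ResBlock(out_ch)
        self.res2 = ResBlock(out_ch)

    def forward(self, x):
        return self.res2(self.res1(self.pool(self.conv(x))))


class ImpalaSketch(nn.Module):
    """Three blocks, then flatten and project to a 256-d feature vector."""

    def __init__(self, widths=(16, 32, 32), out_features=256):
        super().__init__()
        chans = (3,) + tuple(widths)
        self.blocks = nn.Sequential(
            *[ImpalaBlock(i, o) for i, o in zip(chans[:-1], chans[1:])]
        )
        # A 64x64 input is halved three times, giving an 8x8 spatial map.
        self.fc = nn.Linear(widths[-1] * 8 * 8, out_features)

    def forward(self, x):
        x = F.relu(self.blocks(x))
        return F.relu(self.fc(torch.flatten(x, start_dim=1)))


features = ImpalaSketch()(torch.zeros(4, 3, 64, 64))  # shape (4, 256)
```

Each block holds one plain conv plus two residual blocks of two convs each, so three blocks give 15 conv layers in total.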

## Curriculum Builder

Based on the 16-track split in `game/rl_splits.py`:

| Split | Tracks | Purpose |
|-------|--------|---------|
| TRAIN | 1,2, 5,6, 9,10, 13,14 | 2 per tier, curriculum ordered easy→hard |
| VAL | 3, 7, 11, 15 | 1 per tier — performance gating, never trained on |
| TEST | 4, 8, 12, 16 | 1 per tier — held-out, final evaluation only |
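
The table follows a regular pattern: tracks are numbered in tiers of four, with the first two of each tier in TRAIN, the third in VAL, and the fourth in TEST. A short sketch that reproduces it (the function name is illustrative; the real lists are defined in `game/rl_splits.py`):

```python
def split_tracks(n_tracks=16, tier_size=4):
    """Assign track numbers 1..n_tracks to TRAIN/VAL/TEST by position within each tier."""
    splits = {"TRAIN": [], "VAL": [], "TEST": []}
    for track in range(1, n_tracks + 1):
        pos = (track - 1) % tier_size      # 0..3 within the tier
        if pos < 2:
            splits["TRAIN"].append(track)
        elif pos == 2:
            splits["VAL"].append(track)
        else:
            splits["TEST"].append(track)
    return splits


splits = split_tracks()
```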

```python
from env import CurriculumBuilder

builder = CurriculumBuilder(
    threshold=30.0,  # base mean reward needed to advance; scaled per track by complexity
    window=50,       # rolling window size: advance once the mean over the last 50 episodes
                     # reaches the effective threshold
                     # too small (e.g. 5)  → advances on lucky streaks, policy not stable yet
                     # too large (e.g. 500) → stays on mastered track too long, slows curriculum
    replay_frac=0.3, # 30% of episodes replay mastered tracks (prevents forgetting)
    use_image=True,  # set False to skip image rendering (fast unit tests / ablations)
)

env = builder.next_env()          # samples frontier (or replay) track
episode_reward = 0.0              # placeholder: sum obs.reward over the episode (see Quick Start)
builder.record(episode_reward)    # auto-advances when threshold met

for env in builder.val_envs():    # evaluate on held-out VAL tracks
    ...

print(builder.status)             # "Frontier: track 2 'Standard Oval' [2/8] ..."
print(builder.is_complete)        # True when all TRAIN tracks mastered
```
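
The effect of `replay_frac` can be illustrated with a small sampler: with probability `replay_frac` it picks a previously mastered track, otherwise the current frontier track. This sketches the idea under the assumption that replay tracks are drawn uniformly; the function and variable names are hypothetical, and `CurriculumBuilder.next_env` may sample differently.

```python
import random


def sample_track(frontier, mastered, replay_frac, rng):
    """Return a track id: a mastered one with probability replay_frac, else the frontier."""
    if mastered and rng.random() < replay_frac:
        return rng.choice(mastered)   # uniform replay (assumption)
    return frontier


rng = random.Random(0)
picks = [sample_track(5, [1, 2], 0.3, rng) for _ in range(10_000)]
replay_share = sum(p != 5 for p in picks) / len(picks)   # close to 0.3
```

With nothing mastered yet, the sampler always returns the frontier track, so the first curriculum stage is unaffected by `replay_frac`.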

## OpenEnv Client (Remote Server)

To run the environment as a server and connect from a remote training process:

```python
# server: start with `openenv serve env.environment:RaceEnvironment`
# client
import asyncio

from env import RaceEnvClient, DriveAction


async def main():
    # `async with` is only valid inside a coroutine, so wrap it in main()
    async with RaceEnvClient(base_url="http://localhost:8000") as client:
        result = await client.reset()
        result = await client.step(DriveAction(accel=1.0, steer=0.0))


asyncio.run(main())

# or synchronously
with RaceEnvClient(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    result = client.step(DriveAction(accel=1.0, steer=0.0))
```

## Headless Mode (parallel training)

Set these env vars before importing pygame to run without a display:

```python
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"
os.environ["SDL_AUDIODRIVER"] = "dummy"
```

`RaceEnvironment` renders entirely to offscreen `pygame.Surface` objects, so no
display is needed at any point.
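
When many worker processes are spawned, it can help to centralise this setup and catch the case where pygame was imported too early. A minimal sketch using only the standard library; `configure_headless` is a hypothetical helper, not part of the package.

```python
import os
import sys
import warnings


def configure_headless():
    """Point SDL at its dummy drivers; warn if pygame is already imported."""
    if "pygame" in sys.modules:
        warnings.warn("pygame was imported before headless setup; "
                      "the dummy drivers may not take effect")
    os.environ["SDL_VIDEODRIVER"] = "dummy"
    os.environ["SDL_AUDIODRIVER"] = "dummy"


configure_headless()  # call at the top of each worker, before importing the env
```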