# Environment

OpenEnv-compatible RL environment wrapping the car racing game. Provides a typed
observation (egocentric headlight image + scalar features), typed action, curriculum
builder, and an Impala CNN encoder ready to plug into a PPO actor-critic.

## Quick Start

```python
from env import CurriculumBuilder, DriveAction

builder = CurriculumBuilder()

# Training loop
env = builder.next_env()
obs = env.reset()

total_reward = 0.0
while not obs.done:
    # obs.image   : (64, 64, 3) uint8 numpy array
    # obs.speed   : float  0..1
    # obs.on_track: float  1.0 / 0.0
    action = DriveAction(accel=1.0, steer=0.0)
    obs = env.step(action)
    total_reward += obs.reward

advanced = builder.record(total_reward)   # auto-advances curriculum when ready
print(builder.status)
```

## File Structure

```
env/
  models.py       DriveAction and RaceObservation (Pydantic, OpenEnv-compatible)
  environment.py  RaceEnvironment - server-side wrapper around game.rl_splits.CarEnv
  client.py       RaceEnvClient - OpenEnv WebSocket client
  encoder.py      ImpalaCNN + RaceEncoder (PyTorch) for PPO actor-critic
  curriculum.py   CurriculumBuilder - wraps rl_splits TRAIN/VAL/TEST splits
```

## Observation Space

`RaceObservation` has two parts that feed different network branches:

### Image: `obs.image` → CNN encoder
- Shape: `(64, 64, 3)` uint8
- **Egocentric**: the car always faces up, so track geometry is heading-invariant
- Rendering pipeline per step:
  1. Blit track surface to offscreen canvas
  2. Draw headlight cone (60° spread, 60 px ahead)
  3. Crop 120×120 px square centred on car (grass-padded at borders)
  4. Rotate so car heading maps to UP
  5. Re-crop centre after rotation padding
  6. Scale to 64×64
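
The crop, rotate, re-crop and scale steps can be approximated in a single sampling pass. The sketch below is not the pygame pipeline used by `RaceEnvironment`: it is a NumPy stand-in that maps every output pixel back to the canvas with nearest-neighbour lookup and omits the headlight cone. The function name, the `[y, x]` array layout, the heading convention (radians, forward = `(cos, sin)` in screen coordinates with y pointing down) and the fill colour are all assumptions made for illustration.

```python
import numpy as np


def egocentric_view(canvas, car_xy, heading_rad, crop=120, out=64, fill=(0, 128, 0)):
    """Sample an out x out view around the car with its heading pointing up.

    canvas      : (H, W, 3) uint8 array indexed [y, x] (assumed layout)
    car_xy      : (x, y) car position in canvas pixels
    heading_rad : heading in radians; forward = (cos, sin), y down (assumed)
    fill        : colour for pixels outside the canvas (stand-in for grass padding)
    """
    height, width = canvas.shape[:2]
    fx, fy = np.cos(heading_rad), np.sin(heading_rad)  # forward unit vector
    rx, ry = -fy, fx                                    # car's right-hand unit vector
    scale = crop / out                                  # source px per output px
    centre = (out - 1) / 2.0

    rows, cols = np.mgrid[0:out, 0:out]
    right = (cols - centre) * scale   # offset to the car's right
    ahead = (centre - rows) * scale   # offset ahead of the car (up in the output)

    # Inverse mapping: where on the canvas does each output pixel come from?
    src_x = np.rint(car_xy[0] + right * rx + ahead * fx).astype(int)
    src_y = np.rint(car_xy[1] + right * ry + ahead * fy).astype(int)
    inside = (src_x >= 0) & (src_x < width) & (src_y >= 0) & (src_y < height)

    view = np.empty((out, out, 3), dtype=np.uint8)
    view[:] = fill
    view[inside] = canvas[src_y[inside], src_x[inside]]
    return view


# Car in the middle of a black canvas, facing +x, with a white patch 25 px ahead.
canvas = np.zeros((200, 200, 3), dtype=np.uint8)
canvas[95:106, 120:131] = 255
view = egocentric_view(canvas, car_xy=(100, 100), heading_rad=0.0)
print(view.shape, view.dtype)
```

Because the patch lies ahead of the car, it shows up in the upper half of `view` regardless of the heading passed in, which is the property the egocentric rendering is meant to provide.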

### Scalars: `obs.speed / on_track / sin_angle / cos_angle` → MLP encoder

| Field | Range | Purpose |
|-------|-------|---------|
| `speed` | 0..1 | Speed / max_speed. Controls braking decisions. |
| `on_track` | 0 or 1 | Reactive penalty signal. |
| `sin_angle` | −1..1 | Absolute heading orientation. |
| `cos_angle` | −1..1 | Absolute heading orientation. |
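
A minimal sketch of how the four scalars in the table above might be assembled, assuming the simulator exposes a raw speed, a maximum speed, an on-track flag and a heading angle in radians. `scalar_features` and its parameter names are hypothetical and are not part of `env/models.py`.

```python
import math


def scalar_features(speed, max_speed, on_track, heading_rad):
    """Return [speed, on_track, sin_angle, cos_angle] as floats (hypothetical helper)."""
    norm_speed = min(max(speed / max_speed, 0.0), 1.0)  # clamp to 0..1
    return [
        norm_speed,
        1.0 if on_track else 0.0,
        math.sin(heading_rad),
        math.cos(heading_rad),
    ]


print(scalar_features(speed=3.0, max_speed=6.0, on_track=True, heading_rad=0.0))
# [0.5, 1.0, 0.0, 1.0]
```

Encoding the heading as a sine/cosine pair avoids the jump a raw angle makes when it wraps around, so nearby headings always map to nearby feature values.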

**Dropped from original CarEnv obs** (would hurt generalisation to unseen tracks):

| Dropped field | Why |
|--------------|-----|
| `x`, `y` | Absolute screen position: track-specific, causes overfitting |
| `gate_side` | Distance to start/finish gate: meaningless on unseen layouts |

## Action Space

`DriveAction(accel, steer)`: continuous, both clamped to `[−1, 1]` inside `CarEnv.step`.

| Field | Value | Effect |
|-------|-------|--------|
| `accel` | +1 | Full throttle |
| `accel` | −1 | Brake |
| `steer` | +1 | Steer right |
| `steer` | −1 | Steer left |
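
The clamping itself happens inside `CarEnv.step`, whose code is not shown here. As a rough illustration of the behaviour described above, a hypothetical `clamp_action` helper could look like this:

```python
def clamp_action(accel, steer, low=-1.0, high=1.0):
    """Clamp both action components to [low, high] (hypothetical helper)."""
    return (min(max(accel, low), high), min(max(steer, low), high))


print(clamp_action(1.7, -0.2))   # (1.0, -0.2)
print(clamp_action(-3.0, 2.5))   # (-1.0, 1.0)
```

Out-of-range policy outputs are therefore safe to pass through, but they saturate: any `accel` above 1 behaves like full throttle.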

## Reward Function

Defined in `game/rl_splits.py:CarEnv.step`. Rewards are **not** scaled by
complexity: all values are fixed, keeping episode returns comparable across
tracks in the same rollout buffer. Complexity only scales the curriculum threshold.

| Term | Trigger | Value | Goal |
|------|---------|-------|------|
| Forward pulse | Every step | `+speed/max_speed × 0.01` | Prevent stalling |
| Off-track | Every step off road | `−0.5` | Stay on road |
| Crash event | on→off transition | `−5.0` | Penalise each boundary hit |
| Lap completion | Gate crossed cleanly | `+50 × time_ratio × dist_ratio` | Fast + efficient path |
| Out of bounds | Terminal | `−100` | Don't leave screen |
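
The per-step terms in the table can be combined as in the sketch below. This is an illustration, not the code in `CarEnv.step`: it assumes the terms are additive, that the off-track and crash penalties both apply on the step where the car leaves the road, and that an out-of-bounds step returns only the terminal penalty. The lap bonus is left out. `step_reward` and its arguments are hypothetical names.

```python
def step_reward(speed, max_speed, on_track, was_on_track, out_of_bounds):
    """Per-step reward from the table above, excluding the lap bonus (sketch)."""
    if out_of_bounds:
        return -100.0                       # terminal penalty (assumed to stand alone)
    reward = (speed / max_speed) * 0.01     # forward pulse
    if not on_track:
        reward -= 0.5                       # every step spent off the road
        if was_on_track:
            reward -= 5.0                   # on -> off transition counts as a crash
    return reward


print(step_reward(6.0, 6.0, on_track=True, was_on_track=True, out_of_bounds=False))    # 0.01
print(step_reward(0.0, 6.0, on_track=False, was_on_track=True, out_of_bounds=False))   # -5.5
print(step_reward(0.0, 6.0, on_track=False, was_on_track=False, out_of_bounds=False))  # -0.5
```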

**Lap completion bonus:**

```
time_ratio = clamp(par_time_steps / actual_lap_steps,  0.5, 2.0)
dist_ratio = clamp(optimal_dist   / actual_lap_dist,   0.5, 1.0)
```

`dist_ratio` is capped at **1.0**: there is no bonus for going shorter than the
track centreline (that implies off-track corner cutting). `lap_dist` only accumulates
while `on_track=True`, closing the exploit where brief grass-cutting reduced
path length and inflated `dist_ratio`.

| Performance | time_ratio | dist_ratio | Lap bonus |
|------------|-----------|-----------|-----------|
| Faster than par, tight line | 2.0 | 1.0 | +100 |
| On-par, centreline path | 1.0 | 1.0 | +50 |
| Slow, meandering | 0.5 | 0.5 | +12.5 |
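
The three rows above can be reproduced with a few lines of Python. The function below follows the two clamp formulas and the `+50 × time_ratio × dist_ratio` term directly; the step counts and distances passed in are made-up inputs chosen to hit each row, and `lap_bonus` is an illustrative name rather than a function in `game/rl_splits.py`.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def lap_bonus(par_time_steps, actual_lap_steps, optimal_dist, actual_lap_dist):
    """Lap completion bonus: 50 x time_ratio x dist_ratio."""
    time_ratio = clamp(par_time_steps / actual_lap_steps, 0.5, 2.0)
    dist_ratio = clamp(optimal_dist / actual_lap_dist, 0.5, 1.0)
    return 50.0 * time_ratio * dist_ratio


print(lap_bonus(600, 300, 1000.0, 1000.0))   # 100.0  faster than par, tight line
print(lap_bonus(600, 600, 1000.0, 1000.0))   # 50.0   on par, centreline path
print(lap_bonus(600, 1500, 1000.0, 2500.0))  # 12.5   slow, meandering (both ratios clamp to 0.5)
```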

**Complexity scales the curriculum threshold, not the reward:**

```
effective_threshold = base_threshold × track.complexity
```

Track 16 (C=3.45) requires a window mean of `30 × 3.45 = 103.5` to advance,
which means consistently good laps, while Track 1 (C=1.0) only needs 30.
Because rewards themselves are unscaled, value-function targets stay in the
same range regardless of which track the agent is currently on.
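
A small sketch of the advancement check, assuming the builder compares the mean of the most recent `window` episode returns against the scaled threshold and waits until the window is full. The function names, the `>=` comparison and the full-window requirement are assumptions, not the code in `curriculum.py`.

```python
from collections import deque


def effective_threshold(base_threshold, complexity):
    """Scale the base threshold by track complexity."""
    return base_threshold * complexity


def ready_to_advance(returns, base_threshold, complexity, window):
    """True when the last `window` returns average at least the scaled threshold."""
    recent = deque(returns, maxlen=window)  # keeps only the newest `window` items
    if len(recent) < window:
        return False                        # not enough episodes yet (assumed rule)
    return sum(recent) / window >= effective_threshold(base_threshold, complexity)


returns = [40.0] * 50
print(ready_to_advance(returns, 30.0, complexity=1.0, window=50))   # True
print(ready_to_advance(returns, 30.0, complexity=3.45, window=50))  # False
```

The same run of returns clears Track 1 but not Track 16, even though the rewards themselves were never rescaled.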

## Encoder

`RaceEncoder` fuses both observation branches into a single feature vector for PPO:

```
image (64×64×3)
  └─► ImpalaCNN  →  256-d
                         ├─► cat  →  288-d  →  Actor / Critic heads
scalars (4,)              │
  └─► MLP 4→32→32  →  32-d
```

```python
import torch
from env import RaceEncoder

encoder = RaceEncoder()           # out_features = 288
img     = torch.zeros(4, 3, 64, 64)   # batch of 4, normalised 0..1
scalars = torch.zeros(4, 4)
features = encoder(img, scalars)  # (4, 288)
```

### ImpalaCNN vs Nature CNN

| | Nature CNN (DQN) | ImpalaCNN (IMPALA) |
|---|---|---|
| Architecture | 3 plain conv layers | 3 blocks × (Conv + MaxPool + 2 ResBlocks) |
| Skip connections | None | Yes: `x = x + residual(x)` in each block |
| Gradient flow | Vanishes in early layers | Direct path back through shortcuts |
| Sample efficiency | Baseline | ~3–5× better on visual RL tasks |
| Inference cost | Fast | Higher (more conv layers per forward pass) |
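
To make the block structure in the table concrete, here is a compact PyTorch sketch of an IMPALA-style encoder fused with the scalar MLP. It is a stand-in, not the contents of `env/encoder.py`: the class names, the channel widths `(16, 32, 32)` and the placement of the ReLUs are assumptions. Only the 256-d, 32-d and 288-d sizes come from the diagram above.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Two 3x3 convs with a skip connection: x + residual(x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv1(torch.relu(x))
        out = self.conv2(torch.relu(out))
        return x + out


class ImpalaBlock(nn.Module):
    """Conv + MaxPool (halves H and W) + 2 residual blocks."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res1 = ResidualBlock(out_channels)
        self.res2 = ResidualBlock(out_channels)

    def forward(self, x):
        return self.res2(self.res1(self.pool(self.conv(x))))


class SketchEncoder(nn.Module):
    """Hypothetical stand-in for RaceEncoder: image -> 256-d, scalars -> 32-d, cat -> 288-d."""

    def __init__(self, channels=(16, 32, 32)):  # channel widths are an assumption
        super().__init__()
        blocks, in_ch = [], 3
        for out_ch in channels:
            blocks.append(ImpalaBlock(in_ch, out_ch))
            in_ch = out_ch
        self.cnn = nn.Sequential(*blocks, nn.ReLU(), nn.Flatten())
        # Three stride-2 pools shrink 64 -> 32 -> 16 -> 8.
        self.fc = nn.Sequential(nn.Linear(in_ch * 8 * 8, 256), nn.ReLU())
        self.mlp = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())

    def forward(self, image, scalars):
        return torch.cat([self.fc(self.cnn(image)), self.mlp(scalars)], dim=1)


features = SketchEncoder()(torch.zeros(4, 3, 64, 64), torch.zeros(4, 4))
print(features.shape)  # torch.Size([4, 288])
```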

## Curriculum Builder

Based on the 16-track split in `game/rl_splits.py`:

| Split | Tracks | Purpose |
|-------|--------|---------|
| TRAIN | 1,2, 5,6, 9,10, 13,14 | 2 per tier, curriculum ordered easy→hard |
| VAL | 3, 7, 11, 15 | 1 per tier: performance gating, never trained on |
| TEST | 4, 8, 12, 16 | 1 per tier: held-out, final evaluation only |

```python
from env import CurriculumBuilder

builder = CurriculumBuilder(
    threshold=30.0,  # base mean reward needed to advance; scaled per track by complexity
    window=50,       # rolling window size: advance once the mean over the last 50 episodes reaches the threshold
                     # too small (e.g. 5)  → advances on lucky streaks, policy not stable yet
                     # too large (e.g. 500) → stays on mastered track too long, slows curriculum
    replay_frac=0.3, # 30% of episodes replay mastered tracks (prevents forgetting)
    use_image=True,  # set False to skip image rendering (fast unit tests / ablations)
)

env = builder.next_env()          # samples frontier (or replay) track
builder.record(episode_reward)    # auto-advances when threshold met

for env in builder.val_envs():    # evaluate on held-out VAL tracks
    ...

print(builder.status)             # "Frontier: track 2 'Standard Oval' [2/8] ..."
print(builder.is_complete)        # True when all TRAIN tracks mastered
```
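
The frontier-versus-replay choice controlled by `replay_frac` can be pictured with the sketch below. It assumes a single random draw per episode and a uniform choice among mastered tracks. `pick_track` is a hypothetical helper, and the real sampling inside `CurriculumBuilder.next_env` may differ.

```python
import random


def pick_track(frontier, mastered, replay_frac=0.3, rng=random):
    """Return a mastered track with probability replay_frac, else the frontier track."""
    if mastered and rng.random() < replay_frac:
        return rng.choice(mastered)
    return frontier


rng = random.Random(0)  # seeded for repeatability
picks = [pick_track(5, [1, 2], replay_frac=0.3, rng=rng) for _ in range(10_000)]
replay_share = sum(p != 5 for p in picks) / len(picks)
print(f"replayed mastered tracks in {replay_share:.1%} of episodes")
```

With no mastered tracks yet (the very first TRAIN track), the helper always returns the frontier, so replay only starts once something has been learned.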

## OpenEnv Client (Remote Server)

To run the environment as a server and connect from a remote training process:

```python
# server: start with `openenv serve env.environment:RaceEnvironment`
# client
import asyncio

from env import RaceEnvClient, DriveAction


async def main():
    async with RaceEnvClient(base_url="http://localhost:8000") as client:
        result = await client.reset()
        result = await client.step(DriveAction(accel=1.0, steer=0.0))

asyncio.run(main())

# or synchronously
with RaceEnvClient(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    result = client.step(DriveAction(accel=1.0, steer=0.0))
```

## Headless Mode (parallel training)

Set these env vars before importing pygame to run without a display:

```python
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"
os.environ["SDL_AUDIODRIVER"] = "dummy"
```

`RaceEnvironment` renders entirely to offscreen `pygame.Surface` objects, so no
display is needed at any point.