# Environment

OpenEnv-compatible RL environment wrapping the car racing game. Provides a typed
observation (egocentric headlight image + scalar features), typed action, curriculum
builder, and an Impala CNN encoder ready to plug into a PPO actor-critic.
## Quick Start

```python
from env import CurriculumBuilder, DriveAction

builder = CurriculumBuilder()

# Training loop (one episode shown)
env = builder.next_env()
obs = env.reset()
total_reward = 0.0
while not obs.done:
    # obs.image   : (64, 64, 3) uint8 numpy array
    # obs.speed   : float 0..1
    # obs.on_track: float 1.0 / 0.0
    action = DriveAction(accel=1.0, steer=0.0)
    obs = env.step(action)
    total_reward += obs.reward

advanced = builder.record(total_reward)  # auto-advances curriculum when ready
print(builder.status)
```
## File Structure

```
env/
  models.py       DriveAction and RaceObservation (Pydantic, OpenEnv-compatible)
  environment.py  RaceEnvironment: server-side wrapper around game.rl_splits.CarEnv
  client.py       RaceEnvClient: OpenEnv WebSocket client
  encoder.py      ImpalaCNN + RaceEncoder (PyTorch) for PPO actor-critic
  curriculum.py   CurriculumBuilder: wraps rl_splits TRAIN/VAL/TEST splits
```
## Observation Space

`RaceObservation` has two parts that feed different network branches:

### Image: `obs.image` → CNN encoder

- Shape: `(64, 64, 3)` uint8
- **Egocentric**: car always faces up, track geometry is heading-invariant
- Rendering pipeline per step:
  1. Blit track surface to offscreen canvas
  2. Draw headlight cone (60° spread, 60 px ahead)
  3. Crop 120×120 px square centred on car (grass-padded at borders)
  4. Rotate so car heading maps to UP
  5. Re-crop centre after rotation padding
  6. Scale to 64×64
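
The egocentric property can be illustrated with plain coordinate geometry, without any rendering. The sketch below maps a world point into the 64×64 egocentric frame. The helper name `world_to_ego_pixel`, the screen convention (x right, y down) and the heading convention (forward vector `(cos θ, sin θ)`) are assumptions made for illustration and are not taken from `CarEnv`.

```python
import math

CROP_PX = 120   # side of the square cropped around the car (pipeline step 3)
OUT_PX = 64     # side of the final observation image (pipeline step 6)


def world_to_ego_pixel(car_xy, heading_rad, point_xy):
    """Map a world point to (u, v) in the egocentric image.

    Assumed conventions (hypothetical, not from CarEnv): screen x points
    right, y points down, and the car's forward vector is
    (cos(heading), sin(heading)).
    """
    dx = point_xy[0] - car_xy[0]
    dy = point_xy[1] - car_xy[1]
    cos_h, sin_h = math.cos(heading_rad), math.sin(heading_rad)
    forward = dx * cos_h + dy * sin_h     # distance ahead of the car
    right = -dx * sin_h + dy * cos_h      # distance to the car's right
    scale = OUT_PX / CROP_PX
    centre = OUT_PX / 2
    # Forward maps to UP, i.e. towards smaller v in image coordinates.
    return centre + right * scale, centre - forward * scale
```

Under these assumptions, a point 60 px straight ahead of the car lands at the top-centre of the image whatever the heading, which is what makes the observed track geometry heading-invariant.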
### Scalars: `obs.speed / on_track / sin_angle / cos_angle` → MLP encoder

| Field | Range | Purpose |
|-------|-------|---------|
| `speed` | 0..1 | Speed / max_speed. Controls braking decisions. |
| `on_track` | 0 or 1 | Reactive penalty signal. |
| `sin_angle` | -1..1 | Absolute heading orientation. |
| `cos_angle` | -1..1 | Absolute heading orientation. |

**Dropped from original CarEnv obs** (would hurt generalisation to unseen tracks):

| Dropped field | Why |
|--------------|-----|
| `x`, `y` | Absolute screen position: track-specific, causes overfitting |
| `gate_side` | Distance to start/finish gate: meaningless on unseen layouts |

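
Encoding the heading as a sine/cosine pair avoids the discontinuity a raw angle has at the 0°/360° wrap. A minimal sketch, assuming the heading is available in degrees; `encode_heading` and `feature_gap` are hypothetical helper names, not part of the `env` package.

```python
import math


def encode_heading(angle_deg):
    """Return (sin_angle, cos_angle) for a heading given in degrees."""
    rad = math.radians(angle_deg)
    return math.sin(rad), math.cos(rad)


def feature_gap(a_deg, b_deg):
    """Euclidean distance between the encoded features of two headings."""
    sa, ca = encode_heading(a_deg)
    sb, cb = encode_heading(b_deg)
    return math.hypot(sa - sb, ca - cb)


# 359 and 1 degrees are 2 degrees apart on the circle, and their encodings
# are correspondingly close even though the raw angles differ by 358.
near_wrap = feature_gap(359.0, 1.0)
# Opposite headings sit at the maximum possible distance on the unit circle.
opposite = feature_gap(0.0, 180.0)
```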
## Action Space

`DriveAction(accel, steer)`: continuous, both clamped to `[-1, 1]` inside `CarEnv.step`.

| Field | Value | Effect |
|-------|-------|--------|
| `accel` | +1 | Full throttle |
| `accel` | -1 | Brake |
| `steer` | +1 | Steer right |
| `steer` | -1 | Steer left |

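
The clamping described above amounts to the following. This is a sketch with a stand-in dataclass, not the Pydantic `DriveAction` from `env/models.py`, and `clamp_action` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Stand-in for DriveAction, for illustration only."""
    accel: float
    steer: float


def clamp_action(action, low=-1.0, high=1.0):
    """Return a copy of the action with both fields limited to [low, high]."""
    return Action(
        accel=max(low, min(high, action.accel)),
        steer=max(low, min(high, action.steer)),
    )
```

Because the limits are applied inside the environment, a policy that emits unbounded values (for example a Gaussian head without `tanh`) still produces valid actions, although everything beyond ±1 behaves identically.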
## Reward Function

Defined in `game/rl_splits.py:CarEnv.step`. Rewards are **not** scaled by
complexity: all values are fixed, keeping episode returns comparable across
tracks in the same rollout buffer. Complexity only scales the curriculum threshold.

| Term | Trigger | Value | Goal |
|------|---------|-------|------|
| Forward pulse | Every step | `+speed/max_speed × 0.01` | Prevent stalling |
| Off-track | Every step off road | `-0.5` | Stay on road |
| Crash event | on→off transition | `-5.0` | Penalise each boundary hit |
| Lap completion | Gate crossed cleanly | `+50 × time_ratio × dist_ratio` | Fast + efficient path |
| Out of bounds | Terminal | `-100` | Don't leave screen |

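
Read as code, the per-step terms in the table could combine as below. This is a sketch under assumptions: the function `step_reward` and its arguments are hypothetical, the terms are assumed to be additive, the out-of-bounds penalty is assumed to replace the other terms, and the lap bonus is left out. The authoritative logic lives in `CarEnv.step`.

```python
def step_reward(speed, max_speed, on_track, was_on_track, out_of_bounds):
    """Sum the per-step reward terms from the table (lap bonus excluded)."""
    if out_of_bounds:
        return -100.0                      # terminal penalty (assumed to replace other terms)
    reward = (speed / max_speed) * 0.01    # forward pulse, every step
    if not on_track:
        reward -= 0.5                      # off-track penalty, every step off road
        if was_on_track:
            reward -= 5.0                  # crash event on the on-to-off transition
    return reward
```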

**Lap completion bonus:**

```
time_ratio = clamp(par_time_steps / actual_lap_steps, 0.5, 2.0)
dist_ratio = clamp(optimal_dist / actual_lap_dist, 0.5, 1.0)
```

`dist_ratio` is capped at **1.0**: there is no bonus for going shorter than the track
centreline (that implies off-track corner cutting). `lap_dist` only accumulates
while `on_track=True`, closing the exploit where brief grass-cutting reduced
path length and inflated `dist_ratio`.

| Performance | time_ratio | dist_ratio | Lap bonus |
|------------|-----------|-----------|-----------|
| Faster than par, tight line | 2.0 | 1.0 | +100 |
| On-par, centreline path | 1.0 | 1.0 | +50 |
| Slow, meandering | 0.5 | 0.5 | +12.5 |

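
The bonus formula and the three table rows can be checked with a few lines of Python. The function names `clamp` and `lap_bonus` are illustrative, and the step and distance figures in the usage note are made-up inputs chosen to hit the clamp limits.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def lap_bonus(par_time_steps, actual_lap_steps, optimal_dist, actual_lap_dist):
    """Lap completion bonus: 50 x time_ratio x dist_ratio, both ratios clamped."""
    time_ratio = clamp(par_time_steps / actual_lap_steps, 0.5, 2.0)
    dist_ratio = clamp(optimal_dist / actual_lap_dist, 0.5, 1.0)
    return 50.0 * time_ratio * dist_ratio
```

With a par of 1000 steps and an optimal distance of 800, a lap of 500 steps over 800 units gives `lap_bonus(1000, 500, 800, 800) == 100.0`, an on-par centreline lap gives 50.0, and a lap of 4000 steps over 2000 units bottoms out at 12.5.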

**Complexity scales the curriculum threshold, not the reward:**

```
effective_threshold = base_threshold × track.complexity
```

Track 16 (C=3.45) requires a window mean of `30 × 3.45 = 103.5` to advance,
which means consistently good laps, while Track 1 (C=1.0) only needs 30.
Because rewards themselves are unscaled, value-function targets stay in the
same range regardless of which track the agent is currently on.
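
A rolling-window gate with a complexity-scaled threshold might look like the sketch below. The class `AdvanceGate` is hypothetical and is not the `CurriculumBuilder` implementation; requiring a full window before advancing and using `>=` for the comparison are both assumptions.

```python
from collections import deque


class AdvanceGate:
    """Rolling-window check against a complexity-scaled threshold."""

    def __init__(self, base_threshold, complexity, window):
        self.effective_threshold = base_threshold * complexity
        self.returns = deque(maxlen=window)   # keeps only the last `window` returns

    def record(self, episode_return):
        """Store one episode return; report whether the gate is now open."""
        self.returns.append(episode_return)
        if len(self.returns) < self.returns.maxlen:
            return False                      # window not yet full (assumption)
        mean = sum(self.returns) / len(self.returns)
        return mean >= self.effective_threshold
```

For Track 16, `AdvanceGate(30.0, 3.45, window=50)` has an effective threshold of 103.5, while the same base value on Track 1 yields 30.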
## Encoder

`RaceEncoder` fuses both observation branches into a single feature vector for PPO:

```
image (64×64×3)
  └──► ImpalaCNN → 256-d ─────┐
                              ├──► cat → 288-d → Actor / Critic heads
scalars (4,)                  │
  └──► MLP 4→32→32 → 32-d ────┘
```

```python
import torch
from env import RaceEncoder

encoder = RaceEncoder()              # out_features = 288
img = torch.zeros(4, 3, 64, 64)      # batch of 4, normalised 0..1
scalars = torch.zeros(4, 4)
features = encoder(img, scalars)     # (4, 288)
```
### ImpalaCNN vs Nature CNN

| | Nature CNN (DQN) | ImpalaCNN (IMPALA) |
|---|---|---|
| Architecture | 3 plain conv layers | 3 blocks × (Conv + MaxPool + 2 ResBlocks) |
| Skip connections | None | Yes: `x = x + residual(x)` in each block |
| Gradient flow | Vanishes in early layers | Direct path back through shortcuts |
| Sample efficiency | Baseline | ~3-5× better on visual RL tasks |
| Inference cost | Fast | Same (equivalent depth) |
## Curriculum Builder

Based on the 16-track split in `game/rl_splits.py`:

| Split | Tracks | Purpose |
|-------|--------|---------|
| TRAIN | 1, 2, 5, 6, 9, 10, 13, 14 | 2 per tier, curriculum ordered easy→hard |
| VAL | 3, 7, 11, 15 | 1 per tier: performance gating, never trained on |
| TEST | 4, 8, 12, 16 | 1 per tier: held-out, final evaluation only |

```python
from env import CurriculumBuilder

builder = CurriculumBuilder(
    threshold=30.0,   # base mean reward needed to advance (one value works for all tracks thanks to complexity scaling)
    window=50,        # rolling window size: advance once the mean over the last 50 episodes reaches the threshold
                      #   too small (e.g. 5)   -> advances on lucky streaks, policy not stable yet
                      #   too large (e.g. 500) -> stays on a mastered track too long, slows the curriculum
    replay_frac=0.3,  # 30% of episodes replay mastered tracks (prevents forgetting)
    use_image=True,   # set False to skip image rendering (fast unit tests / ablations)
)

env = builder.next_env()          # samples the frontier (or a replay) track
episode_reward = 0.0              # placeholder: sum of obs.reward over one episode
builder.record(episode_reward)    # auto-advances when threshold met

for env in builder.val_envs():    # evaluate on held-out VAL tracks
    ...

print(builder.status)             # "Frontier: track 2 'Standard Oval' [2/8] ..."
print(builder.is_complete)        # True when all TRAIN tracks mastered
```
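
The `replay_frac` behaviour can be pictured as a coin flip on each call to `next_env()`. The sketch below is an assumption about how such sampling might work, not the actual `CurriculumBuilder` code; `pick_track` and its arguments are hypothetical names.

```python
import random


def pick_track(mastered, frontier, replay_frac, rng=random):
    """Choose the next track id: replay a mastered track or train the frontier."""
    if mastered and rng.random() < replay_frac:
        return rng.choice(mastered)   # revisit an already-mastered track
    return frontier                   # otherwise keep training on the frontier
```

With `replay_frac=0.3` roughly 30% of episodes go to mastered tracks once any exist; before the first track is mastered, every episode runs on the frontier.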
## OpenEnv Client (Remote Server)

To run the environment as a server and connect from a remote training process:

```python
# Server: start with `openenv serve env.environment:RaceEnvironment`

# Client (async): `async with` must run inside a coroutine
import asyncio

from env import RaceEnvClient, DriveAction


async def main():
    async with RaceEnvClient(base_url="http://localhost:8000") as client:
        result = await client.reset()
        result = await client.step(DriveAction(accel=1.0, steer=0.0))


asyncio.run(main())

# Or synchronously
with RaceEnvClient(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    result = client.step(DriveAction(accel=1.0, steer=0.0))
```
## Headless Mode (parallel training)

Set these env vars before importing pygame to run without a display:

```python
import os

os.environ["SDL_VIDEODRIVER"] = "dummy"
os.environ["SDL_AUDIODRIVER"] = "dummy"
```

`RaceEnvironment` renders entirely to offscreen `pygame.Surface` objects, so no
display is needed at any point.