# Environment
OpenEnv-compatible RL environment wrapping the car racing game. Provides a typed
observation (egocentric headlight image + scalar features), typed action, curriculum
builder, and an Impala CNN encoder ready to plug into a PPO actor-critic.
## Quick Start
```python
from env import CurriculumBuilder, DriveAction
builder = CurriculumBuilder()
# Training loop
env = builder.next_env()
obs = env.reset()
total_reward = 0.0
while not obs.done:
    # obs.image   : (64, 64, 3) uint8 numpy array
    # obs.speed   : float 0..1
    # obs.on_track: float 1.0 / 0.0
    action = DriveAction(accel=1.0, steer=0.0)
    obs = env.step(action)
    total_reward += obs.reward
advanced = builder.record(total_reward) # auto-advances curriculum when ready
print(builder.status)
```
## File Structure
```
env/
  models.py       DriveAction and RaceObservation (Pydantic, OpenEnv-compatible)
  environment.py  RaceEnvironment: server-side wrapper around game.rl_splits.CarEnv
  client.py       RaceEnvClient: OpenEnv WebSocket client
  encoder.py      ImpalaCNN + RaceEncoder (PyTorch) for PPO actor-critic
  curriculum.py   CurriculumBuilder: wraps rl_splits TRAIN/VAL/TEST splits
```
## Observation Space
`RaceObservation` has two parts that feed different network branches:
### Image: `obs.image` → CNN encoder
- Shape: `(64, 64, 3)` uint8
- **Egocentric**: car always faces up, track geometry is heading-invariant
- Rendering pipeline per step:
  1. Blit track surface to offscreen canvas
  2. Draw headlight cone (60° spread, 60 px ahead)
  3. Crop 120×120 px square centred on car (grass-padded at borders)
  4. Rotate so car heading maps to UP
  5. Re-crop centre after rotation padding
  6. Scale to 64×64
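As a rough illustration of steps 3 to 6, the sketch below collapses crop, rotate, re-crop and scale into a single inverse-mapped nearest-neighbour lookup in NumPy. It is not the pygame code used by `RaceEnvironment`: the function name, the grass colour, and the heading convention (0 rad = facing up, positive = clockwise on screen) are assumptions made for the example, and the headlight cone is left out.

```python
import numpy as np

GRASS = np.array([34, 139, 34], dtype=np.uint8)  # assumed padding colour


def egocentric_view(frame, cx, cy, heading_rad, crop=120, out=64):
    """Return an (out, out, 3) patch centred on the car with its heading pointing up.

    frame       : (H, W, 3) uint8 image of the whole track
    cx, cy      : car position in pixels (x right, y down)
    heading_rad : assumed convention, 0 = facing up, positive = clockwise
    """
    h, w, _ = frame.shape
    # Offsets of each output pixel centre from the car, covering `crop` pixels.
    idx = (np.arange(out) + 0.5) * (crop / out) - crop / 2.0
    u, v = np.meshgrid(idx, idx)  # u: right of the car, v: behind the car
    c, s = np.cos(heading_rad), np.sin(heading_rad)
    # Rotate car-frame offsets back into screen coordinates (inverse mapping).
    wx = cx + u * c - v * s
    wy = cy + u * s + v * c
    xi = np.floor(wx).astype(int)
    yi = np.floor(wy).astype(int)
    inside = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    view = np.empty((out, out, 3), dtype=np.uint8)
    view[:] = GRASS  # anything sampled outside the frame is grass
    view[inside] = frame[yi[inside], xi[inside]]
    return view


# A car facing right with a red patch ahead of it: the patch shows up
# above the centre of the egocentric view.
frame = np.zeros((200, 200, 3), dtype=np.uint8)
frame[95:105, 130:140] = (255, 0, 0)
view = egocentric_view(frame, cx=100, cy=100, heading_rad=np.pi / 2)
```

Sampling once avoids the intermediate rotation padding, at the cost of nearest-neighbour aliasing; a smoothed scale step would be closer to what a pygame pipeline typically produces.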
### Scalars: `obs.speed / on_track / sin_angle / cos_angle` → MLP encoder
| Field | Range | Purpose |
|-------|-------|---------|
| `speed` | 0..1 | Speed / max_speed. Controls braking decisions. |
| `on_track` | 0 or 1 | Reactive penalty signal. |
| `sin_angle` | −1..1 | Absolute heading orientation. |
| `cos_angle` | −1..1 | Absolute heading orientation. |
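The four scalars in the table can be packed as in the sketch below. The helper name and its arguments are hypothetical, not part of `RaceObservation`. It also shows one likely reason for storing heading as a sine/cosine pair: headings just either side of the ±π wrap-around stay close together in feature space, whereas the raw angles would differ by almost 2π.

```python
import math


def scalar_features(speed, max_speed, on_track, heading_rad):
    """Pack the four scalar observation fields into a list of floats."""
    return [
        speed / max_speed,          # normalised to 0..1
        1.0 if on_track else 0.0,   # binary on-track flag
        math.sin(heading_rad),
        math.cos(heading_rad),
    ]


# Two headings that straddle the wrap-around point.
a = scalar_features(3.0, 6.0, True, math.pi - 0.01)
b = scalar_features(3.0, 6.0, True, -math.pi + 0.01)
```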
**Dropped from original CarEnv obs** (would hurt generalisation to unseen tracks):
| Dropped field | Why |
|--------------|-----|
| `x`, `y` | Absolute screen position: track-specific, causes overfitting |
| `gate_side` | Distance to start/finish gate: meaningless on unseen layouts |
## Action Space
`DriveAction(accel, steer)`: continuous, both fields clamped to `[−1, 1]` inside `CarEnv.step`.
| Field | Value | Effect |
|-------|-------|--------|
| `accel` | +1 | Full throttle |
| `accel` | −1 | Brake |
| `steer` | +1 | Steer right |
| `steer` | −1 | Steer left |
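A minimal sketch of the clamping behaviour, assuming plain saturation at the bounds. The helper names are hypothetical and stand in for whatever `CarEnv.step` does internally.

```python
def clamp(value, lo=-1.0, hi=1.0):
    """Saturate value into the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def clamp_action(accel, steer):
    """Return (accel, steer) with both components limited to [-1, 1]."""
    return clamp(accel), clamp(steer)


print(clamp_action(2.5, -3.0))  # (1.0, -1.0)
```

Out-of-range values are saturated rather than rejected, so an action such as `accel=2.5` behaves like full throttle.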
## Reward Function
Defined in `game/rl_splits.py:CarEnv.step`. Rewards are **not** scaled by
complexity: all values are fixed, keeping episode returns comparable across
tracks in the same rollout buffer. Complexity only scales the curriculum threshold.
| Term | Trigger | Value | Goal |
|------|---------|-------|------|
| Forward pulse | Every step | `+speed/max_speed × 0.01` | Prevent stalling |
| Off-track | Every step off road | `−0.5` | Stay on road |
| Crash event | on→off transition | `−5.0` | Penalise each boundary hit |
| Lap completion | Gate crossed cleanly | `+50 × time_ratio × dist_ratio` | Fast + efficient path |
| Out of bounds | Terminal | `−100` | Don't leave screen |
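The per-step terms in the table can be combined as in the sketch below. Two things here are assumptions rather than facts from `CarEnv.step`: that the terms are simply summed (so the step on which the car leaves the road costs both the crash penalty and the off-track penalty), and that the out-of-bounds penalty replaces the other terms on the terminal step. The lap bonus is left out.

```python
def step_reward(speed, max_speed, on_track, was_on_track, out_of_bounds):
    """Per-step reward built from the table's fixed values (lap bonus excluded)."""
    if out_of_bounds:
        return -100.0                       # terminal penalty (assumed exclusive)
    reward = speed / max_speed * 0.01       # forward pulse
    if not on_track:
        reward -= 0.5                       # charged on every off-road step
        if was_on_track:
            reward -= 5.0                   # crash event: on -> off transition
    return reward


print(step_reward(0.0, 6.0, on_track=False, was_on_track=True, out_of_bounds=False))  # -5.5
```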
**Lap completion bonus:**
```
time_ratio = clamp(par_time_steps / actual_lap_steps, 0.5, 2.0)
dist_ratio = clamp(optimal_dist / actual_lap_dist, 0.5, 1.0)
```
`dist_ratio` is capped at **1.0**: there is no bonus for going shorter than the track
centreline (that implies off-track corner cutting). `lap_dist` only accumulates
while `on_track=True`, closing the exploit where brief grass-cutting reduced
path length and inflated `dist_ratio`.
| Performance | time_ratio | dist_ratio | Lap bonus |
|------------|-----------|-----------|-----------|
| Faster than par, tight line | 2.0 | 1.0 | +100 |
| On-par, centreline path | 1.0 | 1.0 | +50 |
| Slow, meandering | 0.5 | 0.5 | +12.5 |
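The rows above follow directly from the two clamped ratios. The sketch below reproduces them; the function name and the sample lap figures are made up for illustration, while the constants are the ones given in this section.

```python
def clamp(value, lo, hi):
    """Saturate value into the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def lap_bonus(par_time_steps, actual_lap_steps, optimal_dist, actual_lap_dist):
    """Lap completion bonus: 50 x time_ratio x dist_ratio."""
    time_ratio = clamp(par_time_steps / actual_lap_steps, 0.5, 2.0)
    dist_ratio = clamp(optimal_dist / actual_lap_dist, 0.5, 1.0)
    return 50.0 * time_ratio * dist_ratio


print(lap_bonus(1000, 400, 2000.0, 1900.0))   # 100.0 (both ratios hit their caps)
print(lap_bonus(1000, 1000, 2000.0, 2000.0))  # 50.0
print(lap_bonus(1000, 3000, 2000.0, 5000.0))  # 12.5 (both ratios hit their floors)
```

Because both ratios are clamped, the bonus is always between 12.5 and 100 for any completed lap.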
**Complexity scales the curriculum threshold, not the reward:**
```
effective_threshold = base_threshold × track.complexity
```
Track 16 (C=3.45) requires a window mean of `30 × 3.45 = 103.5` to advance,
meaning consistently good laps, while Track 1 (C=1.0) only needs 30.
Because rewards themselves are unscaled, value-function targets stay in the
same range regardless of which track the agent is currently on.
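A minimal sketch of this gating rule, assuming the window must be full before a track can be passed and that meeting the threshold exactly counts. The class name is hypothetical and this is not the `CurriculumBuilder` implementation.

```python
from collections import deque


class WindowGate:
    """Rolling-mean gate with a complexity-scaled threshold."""

    def __init__(self, base_threshold=30.0, window=50):
        self.base_threshold = base_threshold
        self.returns = deque(maxlen=window)  # keeps only the last `window` returns

    def record(self, episode_return, complexity):
        """Store one episode return; report whether the track is now passed."""
        self.returns.append(episode_return)
        full = len(self.returns) == self.returns.maxlen
        mean = sum(self.returns) / len(self.returns)
        return full and mean >= self.base_threshold * complexity


gate = WindowGate(base_threshold=30.0, window=3)
results = [gate.record(100.0, complexity=3.45) for _ in range(3)]
# Mean return 100 is below 30 x 3.45 = 103.5, so the gate stays closed.
```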
## Encoder
`RaceEncoder` fuses both observation branches into a single feature vector for PPO:
```
image (64×64×3)
  └─► ImpalaCNN → 256-d
                           ├─► cat → 288-d → Actor / Critic heads
scalars (4,)               │
  └─► MLP 4→32→32 → 32-d
```
```python
import torch
from env import RaceEncoder
encoder = RaceEncoder() # out_features = 288
img = torch.zeros(4, 3, 64, 64) # batch of 4, normalised 0..1
scalars = torch.zeros(4, 4)
features = encoder(img, scalars) # (4, 288)
```
### ImpalaCNN vs Nature CNN
| | Nature CNN (DQN) | ImpalaCNN (IMPALA) |
|---|---|---|
| Architecture | 3 plain conv layers | 3 blocks × (Conv + MaxPool + 2 ResBlocks) |
| Skip connections | None | Yes: `x = x + residual(x)` in each block |
| Gradient flow | Vanishes in early layers | Direct path back through shortcuts |
| Sample efficiency | Baseline | ~3–5× better on visual RL tasks |
| Inference cost | Fast | Same (equivalent depth) |
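To see how a 64×64 image shrinks through the three blocks, the sketch below traces spatial sizes in plain Python. The channel widths (16, 32, 32), the 3×3 convolutions with padding 1, and the 3×3 max-pool with stride 2 and padding 1 are the commonly used IMPALA settings and are assumed here; `encoder.py` may use different values. Only the 256-d and 32-d branch widths come from this README.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv or pool layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def impala_feature_shape(size=64, channels=(16, 32, 32)):
    """Return (channels, height, width) after the three assumed IMPALA blocks."""
    for ch in channels:
        size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv keeps the size
        size = conv_out(size, kernel=3, stride=2, padding=1)  # max-pool halves it
        # The two residual blocks use padded 3x3 convs, so the size is unchanged.
    return ch, size, size


c, h, w = impala_feature_shape()
flat = c * h * w          # length of the flattened feature map
fused = 256 + 32          # image branch + scalar branch, as in the diagram
print((c, h, w), flat, fused)  # (32, 8, 8) 2048 288
```

Under these assumptions the flattened map has 2048 values, which a final linear layer would project down to the 256-d image feature.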
## Curriculum Builder
Based on the 16-track split in `game/rl_splits.py`:
| Split | Tracks | Purpose |
|-------|--------|---------|
| TRAIN | 1,2, 5,6, 9,10, 13,14 | 2 per tier, curriculum ordered easy→hard |
| VAL | 3, 7, 11, 15 | 1 per tier: performance gating, never trained on |
| TEST | 4, 8, 12, 16 | 1 per tier: held-out, final evaluation only |
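The table is consistent with tiers being consecutive groups of four tracks, where the first two go to TRAIN, the third to VAL and the fourth to TEST. That grouping is inferred from the track numbers rather than taken from `game/rl_splits.py`. The sketch below rebuilds the split under that assumption; `split_of` is a hypothetical helper.

```python
def split_of(track):
    """Map a 1-based track number to its split, assuming tiers of four tracks."""
    pos = (track - 1) % 4  # position within the tier: 0, 1, 2 or 3
    if pos < 2:
        return "TRAIN"
    return "VAL" if pos == 2 else "TEST"


splits = {"TRAIN": [], "VAL": [], "TEST": []}
for track in range(1, 17):
    splits[split_of(track)].append(track)

print(splits["TRAIN"])  # [1, 2, 5, 6, 9, 10, 13, 14]
```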
```python
from env import CurriculumBuilder
builder = CurriculumBuilder(
    threshold=30.0,   # base mean reward needed to advance; scaled per track by complexity, so one value works for all tracks
    window=50,        # rolling window size: advance once the mean over the last 50 episodes meets the threshold
                      #   too small (e.g. 5)   → advances on lucky streaks, policy not stable yet
                      #   too large (e.g. 500) → stays on mastered track too long, slows curriculum
    replay_frac=0.3,  # 30% of episodes replay mastered tracks (prevents forgetting)
    use_image=True,   # set False to skip image rendering (fast unit tests / ablations)
)
env = builder.next_env()            # samples frontier (or replay) track
builder.record(episode_reward)      # episode_reward = summed reward of the finished episode; auto-advances when threshold met
for env in builder.val_envs():      # evaluate on held-out VAL tracks
    ...
print(builder.status)               # "Frontier: track 2 'Standard Oval' [2/8] ..."
print(builder.is_complete)          # True when all TRAIN tracks mastered
```
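The frontier-or-replay choice made by `next_env()` could look like the sketch below. It assumes replay tracks are drawn uniformly from the mastered set and that the frontier is always used while nothing has been mastered yet. `pick_track` is a hypothetical helper, not part of `CurriculumBuilder`.

```python
import random


def pick_track(frontier, mastered, replay_frac=0.3, rng=random):
    """Return a mastered track with probability replay_frac, else the frontier."""
    if mastered and rng.random() < replay_frac:
        return rng.choice(mastered)  # uniform replay (an assumption)
    return frontier


rng = random.Random(0)  # seeded so the run is repeatable
picks = [pick_track(5, [1, 2], replay_frac=0.3, rng=rng) for _ in range(1000)]
replayed = sum(p != 5 for p in picks) / len(picks)  # roughly 0.3
```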
## OpenEnv Client (Remote Server)
To run the environment as a server and connect from a remote training process:
```python
# server: start with `openenv serve env.environment:RaceEnvironment`
# client
import asyncio
from env import RaceEnvClient, DriveAction

async def main():
    async with RaceEnvClient(base_url="http://localhost:8000") as client:
        result = await client.reset()
        result = await client.step(DriveAction(accel=1.0, steer=0.0))

asyncio.run(main())  # `async with` must run inside a coroutine

# or synchronously
with RaceEnvClient(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    result = client.step(DriveAction(accel=1.0, steer=0.0))
```
## Headless Mode (parallel training)
Set these env vars before importing pygame to run without a display:
```python
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"
os.environ["SDL_AUDIODRIVER"] = "dummy"
```
`RaceEnvironment` renders entirely to offscreen `pygame.Surface` objects, so no
display is needed at any point.