Environment

OpenEnv-compatible RL environment wrapping the car racing game. Provides a typed observation (egocentric headlight image + scalar features), typed action, curriculum builder, and an Impala CNN encoder ready to plug into a PPO actor-critic.

Quick Start

from env import CurriculumBuilder, DriveAction

builder = CurriculumBuilder()

# Training loop
env = builder.next_env()
obs = env.reset()

total_reward = 0.0
while not obs.done:
    # obs.image   : (64, 64, 3) uint8 numpy array
    # obs.speed   : float  0..1
    # obs.on_track: float  1.0 / 0.0
    action = DriveAction(accel=1.0, steer=0.0)
    obs = env.step(action)
    total_reward += obs.reward

advanced = builder.record(total_reward)   # auto-advances curriculum when ready
print(builder.status)

File Structure

env/
  models.py       DriveAction and RaceObservation (Pydantic, OpenEnv-compatible)
  environment.py  RaceEnvironment: server-side wrapper around game.rl_splits.CarEnv
  client.py       RaceEnvClient: OpenEnv WebSocket client
  encoder.py      ImpalaCNN + RaceEncoder (PyTorch) for PPO actor-critic
  curriculum.py   CurriculumBuilder: wraps rl_splits TRAIN/VAL/TEST splits

Observation Space

RaceObservation has two parts that feed different network branches:

Image: obs.image → CNN encoder

  • Shape: (64, 64, 3) uint8
  • Egocentric: car always faces up, track geometry is heading-invariant
  • Rendering pipeline per step:
    1. Blit track surface to offscreen canvas
    2. Draw headlight cone (60° spread, 60 px ahead)
    3. Crop 120×120 px square centred on car (grass-padded at borders)
    4. Rotate so car heading maps to UP
    5. Re-crop centre after rotation padding
    6. Scale to 64×64
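
Steps 3 to 6 of the pipeline above can be approximated on a plain NumPy array, which is handy for unit tests that should not depend on pygame. The sketch below is not the project's renderer: it collapses crop, rotate, re-crop and scale into one inverse-mapped nearest-neighbour lookup. The function name `egocentric_view`, the `GRASS` colour, and the heading convention (radians, 0 = pointing right on screen, y axis pointing down) are all assumptions made for illustration.

```python
import numpy as np

GRASS = np.array([34, 139, 34], dtype=np.uint8)  # assumed padding colour


def egocentric_view(canvas, car_xy, heading_rad, crop_px=120, out_px=64):
    """Return an (out_px, out_px, 3) uint8 view centred on the car, heading up.

    canvas      : (H, W, 3) uint8 array with track and headlight already drawn
    car_xy      : (x, y) car position in canvas pixels
    heading_rad : forward direction, 0 = +x (right), y axis points down (assumed)
    """
    h, w = canvas.shape[:2]
    cx, cy = car_xy

    # Offsets of each output pixel from the centre, expressed in source pixels.
    scale = crop_px / out_px
    offsets = (np.arange(out_px) - (out_px - 1) / 2.0) * scale
    u, v = np.meshgrid(offsets, offsets)  # u: to the right, v: downwards

    # Forward unit vector in canvas coordinates.
    fx, fy = np.cos(heading_rad), np.sin(heading_rad)

    # Inverse mapping: output "up" follows the forward vector,
    # output "right" follows the car's right-hand side.
    src_x = np.rint(cx - u * fy - v * fx).astype(int)
    src_y = np.rint(cy + u * fx - v * fy).astype(int)

    # Anything sampled from outside the canvas is painted as grass.
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.empty((out_px, out_px, 3), dtype=np.uint8)
    out[:] = GRASS
    out[inside] = canvas[src_y[inside], src_x[inside]]
    return out
```

With a car at (100, 100) facing right, a marker drawn 20 px to its right on the canvas shows up above the centre of the returned view, which is the heading-invariance property the encoder relies on.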

Scalars: obs.speed / on_track / sin_angle / cos_angle → MLP encoder

| Field | Range | Purpose |
|---|---|---|
| speed | 0..1 | Speed / max_speed. Informs braking decisions. |
| on_track | 0 or 1 | Reactive penalty signal. |
| sin_angle | −1..1 | Absolute heading orientation (sine component). |
| cos_angle | −1..1 | Absolute heading orientation (cosine component). |

Dropped from the original CarEnv observation (these would hurt generalisation to unseen tracks):

| Dropped field | Why |
|---|---|
| x, y | Absolute screen position: track-specific, causes overfitting |
| gate_side | Distance to start/finish gate: meaningless on unseen layouts |

Action Space

DriveAction(accel, steer): continuous; both fields are clamped to [−1, 1] inside CarEnv.step.

| Field | Value | Effect |
|---|---|---|
| accel | +1 | Full throttle |
| accel | −1 | Brake |
| steer | +1 | Steer right |
| steer | −1 | Steer left |
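
Because the clamp happens inside CarEnv.step, a policy that samples from an unbounded distribution (a Gaussian head, for instance) can pass raw values straight through. The sketch below shows the equivalent clamp on the agent side, useful when logging the action that was actually applied. `Action` is a hypothetical stand-in for DriveAction, and `clamp_action` is not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical stand-in for DriveAction, with the same two fields."""
    accel: float
    steer: float


def clamp_action(action, low=-1.0, high=1.0):
    """Clip both fields to [low, high], mirroring the documented clamp."""
    return Action(
        accel=min(max(action.accel, low), high),
        steer=min(max(action.steer, low), high),
    )
```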

Reward Function

Defined in game/rl_splits.py:CarEnv.step. Rewards are not scaled by complexity: all values are fixed, which keeps episode returns comparable across tracks in the same rollout buffer. Complexity only scales the curriculum threshold.

| Term | Trigger | Value | Goal |
|---|---|---|---|
| Forward pulse | Every step | +speed/max_speed × 0.01 | Prevent stalling |
| Off-track | Every step off road | −0.5 | Stay on road |
| Crash event | on→off transition | −5.0 | Penalise each boundary hit |
| Lap completion | Gate crossed cleanly | +50 × time_ratio × dist_ratio | Fast + efficient path |
| Out of bounds | Terminal | −100 | Don't leave screen |
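
As a reading aid, the per-step terms in the table can be combined as in the sketch below. It is not the code in CarEnv.step: the function name and arguments are hypothetical, the lap bonus is left out, and two things are assumed rather than documented, namely that the terms simply add up and that an out-of-bounds step returns −100 on its own.

```python
def step_reward(speed_ratio, on_track, was_on_track, out_of_bounds):
    """Return (reward, done) for one step, using the values from the table.

    speed_ratio  : speed / max_speed, in 0..1
    on_track     : True if the car is on the road after this step
    was_on_track : True if the car was on the road before this step
    """
    if out_of_bounds:
        # Terminal penalty; assumed to replace the other terms.
        return -100.0, True

    reward = 0.01 * speed_ratio      # forward pulse, every step
    if not on_track:
        reward -= 0.5                # off-track penalty, every step off road
        if was_on_track:
            reward -= 5.0            # crash event on the on->off transition
    return reward, False
```

Under these assumptions, the first off-road step at zero speed costs −5.5 and every following off-road step costs −0.5.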

Lap completion bonus:

time_ratio = clamp(par_time_steps / actual_lap_steps,  0.5, 2.0)
dist_ratio = clamp(optimal_dist   / actual_lap_dist,   0.5, 1.0)

dist_ratio is capped at 1.0: there is no bonus for a path shorter than the track centreline, since that implies off-track corner cutting. lap_dist only accumulates while on_track=True, which closes the exploit where brief grass-cutting reduced path length and inflated dist_ratio.

| Performance | time_ratio | dist_ratio | Lap bonus |
|---|---|---|---|
| Faster than par, tight line | 2.0 | 1.0 | +100 |
| On-par, centreline path | 1.0 | 1.0 | +50 |
| Slow, meandering | 0.5 | 0.5 | +12.5 |
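
The three rows above follow directly from the two clamped ratios. A minimal sketch that reproduces them is shown here; `lap_bonus` and its argument names mirror the formula but are not taken from the source code.

```python
def clamp(x, lo, hi):
    """Limit x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def lap_bonus(par_time_steps, actual_lap_steps, optimal_dist, actual_lap_dist, base=50.0):
    """Lap completion bonus: base * time_ratio * dist_ratio."""
    time_ratio = clamp(par_time_steps / actual_lap_steps, 0.5, 2.0)
    dist_ratio = clamp(optimal_dist / actual_lap_dist, 0.5, 1.0)
    return base * time_ratio * dist_ratio


# Twice as fast as par on the centreline, on par, and half speed on a doubled path.
print(lap_bonus(1000, 500, 800, 800))    # 100.0
print(lap_bonus(1000, 1000, 800, 800))   # 50.0
print(lap_bonus(1000, 2000, 800, 1600))  # 12.5
```

Because dist_ratio is capped at 1.0, halving the measured lap distance does not raise the bonus beyond +100.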

Complexity scales the curriculum threshold, not the reward:

effective_threshold = base_threshold × track.complexity

Track 16 (C=3.45) requires a window mean of 30 × 3.45 = 103.5 to advance, which means consistently good laps, while Track 1 (C=1.0) only needs 30. Because rewards themselves are unscaled, value-function targets stay in the same range regardless of which track the agent is currently on.
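
The advance rule can be pictured as a rolling mean compared against the scaled threshold. The class below is a hypothetical sketch, not CurriculumBuilder itself; it assumes the window must be full before advancing and that reaching the threshold exactly counts as a pass.

```python
from collections import deque


class AdvanceGate:
    """Hypothetical rolling-window gate for a single frontier track."""

    def __init__(self, base_threshold=30.0, window=50):
        self.base_threshold = base_threshold
        self.returns = deque(maxlen=window)  # keeps only the latest `window` returns

    def record(self, episode_return, complexity):
        """Store one episode return; report whether the track counts as mastered."""
        self.returns.append(episode_return)
        window_full = len(self.returns) == self.returns.maxlen
        window_mean = sum(self.returns) / len(self.returns)
        effective_threshold = self.base_threshold * complexity
        return window_full and window_mean >= effective_threshold
```

With `complexity=3.45` the effective threshold is 103.5, so a window of returns around 110 passes while the same returns mixed with a single crashed episode may not.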

Encoder

RaceEncoder fuses both observation branches into a single feature vector for PPO:

image (64×64×3)
  └─► ImpalaCNN  →  256-d
                         ├─► cat  →  288-d  →  Actor / Critic heads
scalars (4,)              │
  └─► MLP 4→32→32  →  32-d

import torch
from env import RaceEncoder

encoder = RaceEncoder()           # out_features = 288
img     = torch.zeros(4, 3, 64, 64)   # batch of 4, normalised 0..1
scalars = torch.zeros(4, 4)
features = encoder(img, scalars)  # (4, 288)
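
The observation arrives as a (64, 64, 3) uint8 image plus four floats, while the encoder example above takes a channels-first batch normalised to 0..1. A minimal conversion sketch in NumPy follows; the helper name and the scalar ordering (speed, on_track, sin_angle, cos_angle) are assumptions, so check them against encoder.py before relying on it.

```python
import numpy as np


def to_encoder_inputs(image, speed, on_track, sin_angle, cos_angle):
    """Convert one observation into batched, channels-first float32 arrays.

    Returns (img, scalars) with shapes (1, 3, H, W) and (1, 4).
    """
    # HWC uint8 -> CHW float32 in 0..1
    img = np.ascontiguousarray(image.transpose(2, 0, 1), dtype=np.float32) / 255.0
    # Scalar order is assumed, not confirmed by the source.
    scalars = np.array([speed, on_track, sin_angle, cos_angle], dtype=np.float32)
    # Add a leading batch dimension of 1.
    return img[None], scalars[None]
```

Both arrays can then be wrapped with `torch.from_numpy` and passed to the encoder.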

ImpalaCNN vs Nature CNN

| | Nature CNN (DQN) | ImpalaCNN (IMPALA) |
|---|---|---|
| Architecture | 3 plain conv layers | 3 blocks × (Conv + MaxPool + 2 ResBlocks) |
| Skip connections | None | Yes: x = x + residual(x) in each block |
| Gradient flow | Vanishes in early layers | Direct path back through shortcuts |
| Sample efficiency | Baseline | ~3–5× better on visual RL tasks |
| Inference cost | Fast | Higher (more conv layers per forward pass) |

Curriculum Builder

Based on the 16-track split in game/rl_splits.py:

| Split | Tracks | Purpose |
|---|---|---|
| TRAIN | 1, 2, 5, 6, 9, 10, 13, 14 | 2 per tier, curriculum ordered easy→hard |
| VAL | 3, 7, 11, 15 | 1 per tier: performance gating, never trained on |
| TEST | 4, 8, 12, 16 | 1 per tier: held-out, final evaluation only |
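
The track numbers in the table follow a regular pattern: within each group of four, the first two go to TRAIN, the third to VAL and the fourth to TEST. The sketch below regenerates the split under the assumption, inferred from the table rather than stated in rl_splits.py, that a tier is a run of four consecutive track numbers. `build_splits` is a hypothetical helper.

```python
TRACKS_PER_TIER = 4  # assumed: tiers are consecutive groups of four tracks


def build_splits(n_tracks=16):
    """Assign tracks 1..n_tracks to TRAIN / VAL / TEST by position within a tier."""
    splits = {"TRAIN": [], "VAL": [], "TEST": []}
    for track in range(1, n_tracks + 1):
        position = (track - 1) % TRACKS_PER_TIER
        if position < 2:
            splits["TRAIN"].append(track)   # first two tracks of each tier
        elif position == 2:
            splits["VAL"].append(track)     # third track of each tier
        else:
            splits["TEST"].append(track)    # fourth track of each tier
    return splits
```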

from env import CurriculumBuilder

builder = CurriculumBuilder(
    threshold=30.0,  # mean reward needed to advance (the same value works for all tracks due to complexity scaling)
    window=50,       # rolling window size: advance only once the mean over the last 50 episodes reaches the threshold
                     # too small (e.g. 5): advances on lucky streaks, policy not stable yet
                     # too large (e.g. 500): stays on a mastered track too long, slows the curriculum
    replay_frac=0.3, # 30% of episodes replay mastered tracks (prevents forgetting)
    use_image=True,  # set False to skip image rendering (fast unit tests / ablations)
)

env = builder.next_env()          # samples frontier (or replay) track
builder.record(episode_reward)    # auto-advances when threshold met

for env in builder.val_envs():    # evaluate on held-out VAL tracks
    ...

print(builder.status)             # "Frontier: track 2 'Standard Oval' [2/8] ..."
print(builder.is_complete)        # True when all TRAIN tracks mastered
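
The effect of replay_frac can be illustrated with a small sampler. This is a sketch of one plausible rule, not the logic inside builder.next_env(): it assumes replay tracks are drawn uniformly from the mastered set and that the frontier track is always used while nothing has been mastered yet. `pick_track` is a hypothetical name.

```python
import random


def pick_track(frontier, mastered, replay_frac=0.3, rng=random):
    """Return the track to run next: usually the frontier, sometimes a replay.

    frontier : track currently being learned
    mastered : list of tracks already passed (may be empty)
    rng      : object with random() and choice(), e.g. random.Random(seed)
    """
    if mastered and rng.random() < replay_frac:
        return rng.choice(mastered)   # revisit an old track to limit forgetting
    return frontier
```

Over many episodes roughly 30% of the picks land on mastered tracks, which matches the comment on replay_frac above.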

OpenEnv Client (Remote Server)

To run the environment as a server and connect from a remote training process:

# Server: start with: openenv serve env.environment:RaceEnvironment
# Client (async):
import asyncio
from env import RaceEnvClient, DriveAction

async def main():
    async with RaceEnvClient(base_url="http://localhost:8000") as client:
        result = await client.reset()
        result = await client.step(DriveAction(accel=1.0, steer=0.0))

asyncio.run(main())

# Or synchronously:
with RaceEnvClient(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    result = client.step(DriveAction(accel=1.0, steer=0.0))

Headless Mode (parallel training)

Set these env vars before importing pygame to run without a display:

import os
os.environ["SDL_VIDEODRIVER"] = "dummy"
os.environ["SDL_AUDIODRIVER"] = "dummy"

RaceEnvironment renders entirely to offscreen pygame.Surface objects, so no display is needed at any point.