title: Planetary Rover Navigation Simulator
emoji: πͺ
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: OpenEnv RL environment β Meta PyTorch Hackathon
Planetary Rover Navigation Simulator
π Official Hackathon Submission Links
- π Live OpenEnv Simulator: Interactive Dashboard (HF Space)
- π§ GRPO Training Run: View on Google Colab
- π» Source Code: GitHub Repository
- π Technical Write-up: Hugging Face Blog Post
π Project Overview
The Planetary Rover Navigation Simulator is a Dockerized OpenEnv microservice β a standards-compliant HTTP API that completely separates the physics World from the AI Brain. The physics engine (FastAPI + Pydantic + Euler integration) runs inside a Docker container and exposes six REST endpoints. Any agent β a hardcoded heuristic, a Llama 3.2 1B fine-tuned with GRPO, or your own PyTorch policy β connects over HTTP and never touches the simulation internals. This clean separation means you can swap the AI brain without restarting the world, and swap the world without retraining the agent.
The environment is a fully self-contained HTTP microservice exposing the standard OpenEnv API: /reset, /step, /state, /tasks, /baseline, and /grader.
βοΈ Engineering Highlights β Theme #5: Wild Card
1 Β· Solving the Stationary Exploit with Reward Shaping
Traditional sparse rewards (only rewarding upon waypoint arrival) provide no gradient signal for intermediate steps, while our original dense distance penalty (+max(0, (100 - dist) * 0.001)) inadvertently trained the rover to stand still (the Stationary Exploit). A stationary rover accumulates a small, consistent negative reward across all GRPO group samples β the group advantage is always near zero, the policy never updates, and the rover learns that doing nothing is the optimal strategy.
We fixed this with two cooperating shaping techniques from the deep RL literature:
Potential-Based Reward Shaping (Flat Terrain) Grounded in Ng et al. (1999): the shaping signal is the exact potential difference between consecutive states, guaranteeing policy invariance while providing a dense gradient.
Ξ¦(s) = βdistance_to_waypoint
shaping = Ξ¦(sβ²) β Ξ¦(s) = d_prev β d_curr # = PBRS_SCALE Γ (d_prev β d_curr)
- A stationary rover gets exactly zero shaping. Combined with the step penalty (
β0.01) and battery drain, every idle step is strictly net-negative β the exploit is closed by construction. - Moving closer β positive. Moving away β negative. The gradient is always informative.
Vector-Field Reward Shaping (Crater Avoidance Zone)
Activated within 10 m of an obstacle, replacing the flat β5.0 collision penalty with a continuous directional signal:
repulsive = unit vector away from nearest obstacle centre
attractive = unit vector toward goal waypoint
tangent = 90Β° CCW rotation of repulsive vector (goal-directed)
blend = GOAL_BLEND Γ attractive + REP_BLEND Γ tangent
reward = VF_SCALE Γ cosine_similarity(rover_heading, blend) Γ proximity_weight
The reward peaks at +VF_SCALE when the rover's heading perfectly aligns with the blended safe-path tangent, and reaches βVF_SCALE when heading directly into the obstacle. The proximity weight (1 β d/VF_RADIUS) concentrates the signal close to the danger zone. The rover learns to arc around craters rather than stop before them.
2 Β· The Format Gatekeeper β Pydantic as a Training Reward
LLMs fine-tuned for structured output routinely collapse to producing prose ("I think the rover should move forward...") because prose is always grammatically valid, while JSON can fail in many ways. Standard GRPO would assign a reward purely from the environment outcome β but if the action can't be parsed, no environment step fires at all, and the episode silently terminates with a zero reward, giving the policy no gradient signal.
We address this with a two-tier reward function inside the GRPO training loop:
| Tier | Signal | Value |
|---|---|---|
| Format reward | Pydantic-validated JSON with all 4 required fields (thrust, steering, brake, vertical_thruster) |
+0.2 |
| Correctness reward | thrust β₯ 0.5 and brake == 0 (moving, not stalling) |
+0.3 |
| Field alignment bonus | abs(steering) β€ 0.8 (not spinning in place) |
+0.1 |
| Episode score | /grader endpoint response [0.0, 1.0] |
passed via dataset |
A hallucinated prose response gets 0.0 β a strict mathematical punishment. A correctly formatted, physically reasonable action gets up to 0.6 before the environment score is even consulted. The Llama 3.2 1B model learns that JSON compliance is a prerequisite, not a suggestion.
3 Β· Sim-to-Real Readiness via Physics Randomisation
Over-fitting to a deterministic simulation is the primary failure mode in sim-to-real transfer. Three features in the physics engine prevent this:
| Feature | Implementation |
|---|---|
| Domain Randomisation | Terrain type, height, and obstacle positions are fully re-seeded every episode from a configurable RNG. Friction variance is implicit in the terrain-slope drag calculation: drag = 1 β clamp(slope_proj Γ 0.3, β0.3, 0.3). Each episode presents a different friction profile. |
| Action Smoothing (servo limits) | The yaw-rate model couples steering authority to forward thrust: yaw_rate = steering Γ MAX_YAW_RATE Γ (thrust + 0.1). At low speeds the rover can barely turn, mirroring real servo dynamics. The rover cannot spin in place at full steering with zero thrust. |
| Sensor Noise (implicit) | The obstacle sensor returns the 8 nearest contacts normalised to [β1, 1] and padded with dist_norm = 1.0 for absent obstacles. The finite 50 m sensor range and discrete 8-slot representation force the policy to reason under partial observability rather than treating the obstacle map as a complete world model. |
These three features ensure the trained policy generalises across episode seeds rather than memorising a single fixed layout.
4 Β· Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Docker Container β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Physics World (main.py) β β
β β β β
β β TerrainGrid ββ RoverSim ββ ObstacleField β β
β β β β β β β
β β Euler Kinematics Battery Collision FSM β β
β β β β β β β
β β ββββββββββββββββββββββββββ β β
β β β Reward Engine β β β
β β β PBRS + Vector-Field β β β
β β ββββββββββββββββββββββββββ β β
β β β β β
β β FastAPI (port 7860) β β
β β /reset /step /state /tasks /baseline β β
β β /grader β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββ
β HTTP (JSON)
ββββββββββββββββ΄ββββββββββββββββ
β β
ββββββββΌββββββββ ββββββββββΌβββββββββ
β AI Brain β β GRPO Trainer β
β inference.py β β train.py β
β β β β
β Llama 3.2 1B β β Unsloth 4-bit β
β AsyncOpenAI β β TRL GRPOTrainer β
β aiohttp β β 24GB Cloud GPU β
βββββββββββββββββ βββββββββββββββββββ
By migrating our final training pipeline to a 24GB Cloud GPU, we scaled our GRPO rollouts to run multiple environment trajectories simultaneously, maximizing throughput and VRAM utilization.
π Evidence of Training & Convergence
Sub-Section 1: Training Results & Analysis
Figure 1: Policy Update Magnitude (Loss). The curve exhibits a definitive 'Discovery Spike' at Step 160, marking the transition from random exploration to the policy identifying structured reward patterns.
Figure 2: Reward Breakdown. Top-Left (Format Reward): Shows the Pydantic Gatekeeper successfully training the model to a 1.0 plateau (100% compliance). Top-Right (Environment Reward): Shows the subsequent upward trend in navigation proficiency.
Figure 3: Real-time System Integration Logs. Verifying the internal communication between the Llama 3.2 policy and the FastAPI physics engine, confirming zero-error action parsing during the final training iterations.
Sub-Section 2: The Learning Journey
Our agent followed a strict two-phase learning curriculum required for continuous physics environments. First, it mastered 'Communication'βlearning to strictly adhere to the Pydantic JSON schema. Once format-perfect, it mastered 'Navigation'βbalancing battery efficiency with waypoint proximity. Since W&B tracking was maintained privately, these committed local images serve as our official, verifiable proof of performance improvement.
Sub-Section 3: Performance Comparison
| Metric | Baseline (Untrained) | GRPO-Trained Agent |
|---|---|---|
| Action Formatting | < 10% valid JSON | 100% (Strict Pydantic) |
| Obstacle Handling | High collision rate | Maintains safety buffer |
| Reward Trend | Flat / Stochastic | Consistently Upward |
Quick Start
1. Install dependencies
pip install -r requirements.txt
# or, using uv:
uv sync
2. Configure environment variables
Create a .env file in the project root (see .env.example):
HF_TOKEN=hf_your_token_here
MODEL_NAME=meta-llama/Llama-3.2-1B-Instruct
API_BASE_URL=https://api-inference.huggingface.co/v1
3. Run the environment server
# Terminal 1 β start the simulation server on port 7860
export $(grep -v '^#' .env | xargs) && uv run uvicorn main:app --host 0.0.0.0 --port 7860
4. Run the LLM inference agent
# Terminal 2 β requires the server to be running first
export $(grep -v '^#' .env | xargs) && uv run python inference.py
Exit code 0 = all three tasks scored above 0.0. Exit code 1 = at least one task failed.
Run with Docker
docker build -t rover-env .
docker run -p 7860:7860 rover-env
# Then run the inference agent against the container
export $(grep -v '^#' .env | xargs) && uv run python inference.py
Interactive API docs
Once running, visit http://localhost:7860/docs for the full Swagger UI with live endpoint testing.
Environment Overview
| Property | Value |
|---|---|
| World size | 1000 Γ 1000 m (rover bounded to Β±500 m) |
| Timestep | 1 second per step() call |
| Max speed | 5.0 m/s at full thrust |
| Waypoint radius | 2.0 m (arrival threshold) |
| Collision radius | 0.5 m (obstacle contact) |
| Sensor range | 50.0 m (obstacle detection) |
| Terrain grid | 20 m Γ 20 m cells, lazy generation |
| Coordinate system | Cartesian, right-hand, Z = terrain height |
Physics are computed in pure Python using Euler integration β no external simulation library required.
Observation Space
Returned as a JSON object by /reset, /state, and the obs field of /step.
| Field | Type | Shape | Bounds | Description |
|---|---|---|---|---|
rover_position |
Box |
[3] |
[-500, 500]Β³ |
[x, y, z] absolute position in metres |
rover_heading |
Box |
[1] |
[-Ο, Ο] |
Yaw angle in radians (east = 0) |
rover_velocity |
Box |
[3] |
[-5, 5]Β³ |
[vx, vy, vz] velocity in m/s |
target_position |
Box |
[3] |
[-500, 500]Β³ |
Active waypoint absolute position |
target_relative |
Box |
[3] |
[-1000, 1000]Β³ |
Vector from rover to waypoint β use this for goal-conditioned policies |
target_distance |
Box |
[1] |
[0, 1414] |
Euclidean distance to active waypoint in metres |
waypoints_remaining |
Discrete |
β | {0, 1, 2, 3} |
Unvisited waypoints left this episode |
obstacle_map |
Box |
[8, 3] |
[-1, 1] |
8 nearest obstacles as [dx_norm, dy_norm, dist_norm]; padded with dist_norm=1.0 when fewer than 8 in range |
obstacle_count |
Discrete |
β | {0 β¦ 8} |
Number of obstacles within 50 m sensor range |
nearest_obstacle_distance |
Box |
[1] |
[0, 50] |
Raw distance to closest obstacle in metres |
battery_level |
Box |
[1] |
[0, 1] |
Normalised remaining battery (0 = dead, 1 = full) |
battery_drain_rate |
Box |
[1] |
[0, 1] |
Current drain per step as fraction of total capacity |
terrain_type |
Discrete |
β | {0, 1, 2, 3} |
Tile under rover: 0=flat, 1=rocky, 2=crater_floor, 3=crater_rim |
terrain_slope |
Box |
[2] |
[-1, 1] |
[slope_x, slope_y] surface normal projections |
steps_taken |
Box |
[1] |
[0, 500] |
Steps elapsed this episode |
steps_remaining_norm |
Box |
[1] |
[0, 1] |
Remaining step budget normalised to [0, 1] |
Policy tip: target_relative gives you the direct (dx, dy) vector every step. Compute atan2(dy, dx) to get the heading you need, then steer toward it.
Action Space
Sent as a JSON body to POST /step?episode_id=<uuid>.
| Field | Type | Bounds | Description |
|---|---|---|---|
thrust |
Box float32 |
[0.0, 1.0] |
Forward drive intensity. 0.0 = stopped, 1.0 = full throttle |
steering |
Box float32 |
[-1.0, 1.0] |
Yaw rate command. -1.0 = hard left, 0.0 = straight, 1.0 = hard right. Effective yaw rate scales with current thrust |
brake |
Discrete int32 |
{0, 1} |
Binary regen-braking flag. 1 = halve speed and recover a small amount of battery |
vertical_thruster |
Box float32 |
[-0.2, 0.2] |
Vertical adjustment for crater terrain. Has no effect and incurs no cost on flat terrain |
Example action (beeline at full throttle):
{
"thrust": 1.0,
"steering": 0.0,
"brake": 0,
"vertical_thruster": 0.0
}
Tasks
All three tasks have exactly one waypoint. The rover always spawns at (0, 0) heading east.
Task 1 β Easy: Flat Plains Transit
| Parameter | Value |
|---|---|
task_id |
"easy" |
| Difficulty | β |
| Max steps | 200 |
| Starting battery | 100% |
| Drain multiplier | Γ1.0 |
| Obstacles | None |
| Terrain | Flat |
| Scoring formula | proximity Γ 0.85 + step_efficiency Γ 0.15 |
Navigate to a single waypoint on flat, open terrain with no obstacles and a full battery. The only challenge is correctly steering toward target_relative.
Task 2 β Medium: Crater Avoidance
| Parameter | Value |
|---|---|
task_id |
"medium" |
| Difficulty | ββ |
| Max steps | 300 |
| Starting battery | 100% |
| Drain multiplier | Γ1.0 |
| Obstacles | 1 deterministic crater ring (22 posts, 2 gaps) |
| Terrain | Flat |
| Scoring formula | proximity Γ 0.75 + step_efficiency Γ 0.25 β min(collisions Γ 0.06, 0.40) |
A ring of 22 obstacle posts is placed at the midpoint of the roverβwaypoint line, blocking the direct path. Two 48Β° gaps are cut perpendicular to the approach direction. Each collision subtracts 0.06 from the score (capped at β0.40).
Key observation fields for this task: obstacle_map, obstacle_count, nearest_obstacle_distance.
Task 3 β Hard: Battery Sprint
| Parameter | Value |
|---|---|
task_id |
"hard" |
| Difficulty | βββ |
| Max steps | 100 |
| Starting battery | 35% |
| Drain multiplier | Γ4.0 |
| Obstacles | None |
| Terrain | Flat |
| Scoring formula | proximity Γ 0.65 + battery_efficiency Γ 0.35 |
The rover starts with only 35% battery. Combined with a Γ4 drain multiplier, a full-throttle beeline exhausts the battery in approximately 8 steps β barely enough to reach the waypoint. Any detour is fatal.
battery_efficiency = battery_remaining / 0.35 (normalised against starting charge).
API Reference
All endpoints return JSON. The base URL for a running server is http://localhost:7860.
GET /tasks
Returns metadata for all three tasks including the full action schema, scoring formula, and policy hints.
curl http://localhost:7860/tasks
POST /reset
Starts a new episode. Returns the initial observation and an episode_id required by all subsequent calls.
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "easy", "seed": 42}'
Response fields: obs (full Observation), episode_id (UUID string), task_id.
GET /state
Returns the current observation without advancing the simulation.
curl "http://localhost:7860/state?episode_id=<uuid>"
POST /step
Applies one action and advances the simulation by one timestep (dt = 1 s).
curl -X POST "http://localhost:7860/step?episode_id=<uuid>" \
-H "Content-Type: application/json" \
-d '{"thrust": 1.0, "steering": 0.0, "brake": 0, "vertical_thruster": 0.0}'
Response fields: obs, reward (float), done (bool), truncated (bool), info (dict).
The info dict contains grader telemetry ready to pass directly to /grader:
{
"termination_reason": "waypoint_reached | battery_dead | max_steps | unknown",
"initial_distance": 94.6,
"min_distance": 0.14,
"collision_count": 0,
"waypoints_hit": 1,
"total_waypoints": 1,
"steps": 20,
"max_steps": 200,
"battery": 0.800
}
GET /baseline
Returns the machine-readable environment identity card (name, version, full observation and action space declarations, task list). Used by the OpenEnv registry and auto-validators.
curl http://localhost:7860/baseline
POST /grader
Scores a completed episode. Returns a float in [0.0, 1.0].
All fields can be read directly from the final step() info dict β no client-side bookkeeping required.
curl -X POST http://localhost:7860/grader \
-H "Content-Type: application/json" \
-d '{
"episode_id": "<uuid>",
"task_id": "easy",
"termination_reason": "waypoint_reached",
"initial_distance": 94.6,
"min_distance_achieved": 0.14,
"waypoints_reached": 1,
"total_waypoints": 1,
"steps_taken": 20,
"max_steps": 200,
"battery_remaining": 0.800,
"collision_count": 0
}'
Response fields:
| Field | Type | Description |
|---|---|---|
score |
float | Final score in [0.0, 1.0] |
verdict |
string | WIN, WIN_WITH_COLLISIONS, PARTIAL_PROGRESS, COLLISION_LOSS, BATTERY_DEAD, or TIMEOUT |
proximity_progress |
float | Raw linear proximity metric. Exactly 0.70 when the rover closed 70% of the gap |
score_rationale |
string | One-sentence explanation of the outcome |
breakdown |
dict | Per-component scores (keys vary by task) |
Grading
Scoring formulas
Easy β Flat Plains Transit
score = proximity Γ 0.85 + step_efficiency Γ 0.15
Medium β Crater Avoidance
collision_penalty = min(collision_count Γ 0.06, 0.40)
score = proximity Γ 0.75 + step_efficiency Γ 0.25 β collision_penalty
Hard β Battery Sprint
battery_efficiency = battery_remaining / 0.35
score = proximity Γ 0.65 + battery_efficiency Γ 0.35
Shared metrics
proximity is a strictly linear metric:
proximity = 1.0 β (min_distance_achieved / initial_distance)
This is exactly 0.70 when the rover closed 70% of the spawnβwaypoint gap, 0.0 if it never moved, and overridden to 1.0 on confirmed arrival.
step_efficiency:
step_efficiency = 1.0 β (steps_taken / max_steps)
Score examples
| Scenario | Score |
|---|---|
| Easy: beeline arrival using 50% of budget | 0.85 + 0.075 = 0.925 |
| Easy: arrival using full budget | 0.85 + 0.000 = 0.850 |
| Easy: 70% progress, no arrival | ~0.595β0.700 |
| Medium: arrival, zero collisions | 0.75 + 0.25 = 1.000 |
| Medium: arrival, 3 collisions | 1.00 β 0.18 = 0.820 |
| Medium: stuck in ring, 8+ collisions | β€ 0.000 |
| Hard: arrival, 50% starting battery left | 0.65 + 0.175 = 0.825 |
| Hard: arrival, battery = 0 on landing | 0.65 + 0.000 = 0.650 |
| Hard: battery dead at 70% progress | 0.455 + 0.000 = 0.455 |
Reward Signal
The step reward returned by /step is used for online RL training. It is separate from the grader score.
Note β reward system overhauled in Phase 4. The original static penalties caused the stationary exploit (see Engineering Highlights above). The values below reflect the current
_compute_rewardimplementation.
| Event | Reward | Notes |
|---|---|---|
| Every step | β0.01 | Constant time-pressure; ensures idle steps are always net-negative |
| Battery drain | βdrain Γ 1.0 | Proportional efficiency cost (coefficient reduced from 2.0 to 1.0 β PBRS now carries the main navigation signal) |
| Waypoint reached | +100.0 | Asymmetric terminal bonus; episode returns immediately β prevents early policy collapse |
| Battery depleted | β20.0 | Terminal penalty |
| Potential-based shaping | PBRS_SCALE Γ (d_prev β d_curr) where PBRS_SCALE = 0.5 |
Exactly 0 when stationary; positive when closing gap; negative when moving away |
| Vector-field shaping | VF_SCALE Γ cos_sim Γ proximity_weight (VF_SCALE = 1.5) |
Active within 10 m of obstacles; proximity_weight = 1 β d / 10; ranges from β1.5 (heading into obstacle) to +1.5 (aligned with safe tangent) |
File Structure
planetary-rover-env/
βββ openenv.yaml # Typed observation + action space declarations
βββ main.py # FastAPI server β physics engine + all routes (1632 lines)
βββ inference.py # LLM-driven inference agent (HF Inference API)
βββ train.py # GRPO training script (Unsloth 4-bit + TRL GRPOTrainer)
βββ requirements.txt # Pinned runtime dependencies
βββ Dockerfile # Two-stage optimised build, port 7860, non-root user
βββ README.md # This file
Dependencies
| Package | Version | Role |
|---|---|---|
fastapi |
0.115.6 | ASGI web framework |
uvicorn[standard] |
0.32.1 | ASGI server (uvloop + httptools) |
pydantic |
2.10.3 | Request/response validation |
aiohttp |
β | Async HTTP client in inference.py |
openai |
β | OpenAI-compatible LLM client in inference.py |
The simulation engine itself uses only Python stdlib (math, random, uuid, dataclasses, enum).
Inference Agent Results
Running the LLM inference agent against a local server:
export $(grep -v '^#' .env | xargs) && uv run python inference.py
Reference scores (with the strategies embedded in the system prompt):
| Task | Agent strategy | Typical score | Verdict |
|---|---|---|---|
| easy | Beeline: atan2(dy, dx) heading lock, thrust=1.0 |
0.92β0.98 | WIN |
| medium | Two-phase detour: approach β perpendicular β approach | 0.85β0.92 | WIN |
| hard | Heading lock on step 1, never steer again | 0.45β0.65 | WIN / BATTERY_DEAD |
These scores represent LLM-driven P-controller navigation. A trained RL policy should significantly exceed them on all three tasks.
Building Your Own Agent
The minimal loop to run an episode:
import requests, math
BASE = "http://localhost:7860"
# 1. Discover the task
tasks = requests.get(f"{BASE}/tasks").json()
# 2. Reset
resp = requests.post(f"{BASE}/reset", json={"task_id": "easy", "seed": 42}).json()
episode_id = resp["episode_id"]
obs = resp["obs"]
# 3. Step loop
while True:
dx = obs["target_relative"]["x"]
dy = obs["target_relative"]["y"]
heading_error = math.atan2(dy, dx) - obs["rover_heading"]
action = {
"thrust": 1.0,
"steering": max(-1.0, min(1.0, heading_error * 2.5)),
"brake": 0,
"vertical_thruster": 0.0,
}
step = requests.post(f"{BASE}/step", json=action,
params={"episode_id": episode_id}).json()
obs = step["obs"]
if step["done"] or step["truncated"]:
info = step["info"]
break
# 4. Grade
grade = requests.post(f"{BASE}/grader", json={
"episode_id": episode_id,
"task_id": "easy",
"termination_reason": info["termination_reason"],
"initial_distance": info["initial_distance"],
"min_distance_achieved": info["min_distance"],
"waypoints_reached": info["waypoints_hit"],
"total_waypoints": info["total_waypoints"],
"steps_taken": info["steps"],
"max_steps": info["max_steps"],
"battery_remaining": info["battery"],
"collision_count": info["collision_count"],
}).json()
print(f"Score: {grade['score']} Verdict: {grade['verdict']}")
print(f"Rationale: {grade['score_rationale']}")
License
MIT