---
title: Planetary Rover Navigation Simulator
emoji: 🪐
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
license: mit
short_description: OpenEnv RL environment - Meta PyTorch Hackathon
---

# Planetary Rover Navigation Simulator


### 📋 Official Hackathon Submission Links

* **🌍 Live OpenEnv Simulator:** [Interactive Dashboard (HF Space)](https://huggingface.co/spaces/atomic24/planetary-rover-navigation)
* **🧠 GRPO Training Run:** [View on Google Colab](https://colab.research.google.com/github/Gurram-Bhaskar/planetary-rover-navigation/blob/main/train_colab.ipynb)
* **💻 Source Code:** [GitHub Repository](https://github.com/Gurram-Bhaskar/planetary-rover-navigation)
* **📖 Technical Write-up:** [Hugging Face Blog Post](https://huggingface.co/spaces/atomic24/planetary-rover-navigation/blob/main/BLOG.md)


## 🚀 Project Overview

The **Planetary Rover Navigation Simulator** is a Dockerized **OpenEnv microservice**: a standards-compliant HTTP API that separates the physics *World* from the AI *Brain*. The physics engine (FastAPI + Pydantic + Euler integration) runs inside a Docker container and exposes six REST endpoints. Any agent (a hardcoded heuristic, a Llama 3.2 1B fine-tuned with GRPO, or your own PyTorch policy) connects over HTTP and never touches the simulation internals. This separation means you can swap the AI brain without restarting the world, and swap the world without retraining the agent.

The six endpoints follow the standard OpenEnv API: `/reset`, `/step`, `/state`, `/tasks`, `/baseline`, and `/grader`.

---

## ⚙️ Engineering Highlights - Theme #5: Wild Card

### 1 · Solving the Stationary Exploit with Reward Shaping

Traditional sparse rewards (only rewarding upon waypoint arrival) provide no gradient signal for intermediate steps, while our original dense distance term (`+max(0, (100 - dist) * 0.001)`) inadvertently trained the rover to **stand still** (the Stationary Exploit). A stationary rover accumulates a small, *consistent* reward that is nearly identical across all GRPO group samples, so the group advantage is always near zero, the policy never updates, and the rover learns that doing nothing is the optimal strategy.

We fixed this with two cooperating shaping techniques from the deep RL literature:

**Potential-Based Reward Shaping (Flat Terrain)**
Grounded in Ng et al. (1999): the shaping signal is the difference in potential between consecutive states (the undiscounted form, γ = 1), which leaves the optimal policy unchanged while providing a dense gradient.

```
Φ(s)    = −distance_to_waypoint
shaping = PBRS_SCALE × (Φ(s′) − Φ(s)) = PBRS_SCALE × (d_prev − d_curr)
```

- A stationary rover gets **exactly zero** shaping. Combined with the step penalty (`−0.01`) and battery drain, every idle step is strictly net-negative, so the exploit is closed by construction.
- Moving closer → positive. Moving away → negative. Any change in distance to the waypoint is reflected in the reward.
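
As a minimal sketch of the shaping term, assuming the `PBRS_SCALE = 0.5` and `−0.01` step penalty quoted elsewhere in this README (the function names `pbrs_shaping` and `idle_step_reward` are hypothetical and are not taken from `main.py`):

```python
PBRS_SCALE = 0.5       # value quoted in the Reward Signal table
STEP_PENALTY = -0.01   # constant per-step time pressure


def pbrs_shaping(d_prev: float, d_curr: float, scale: float = PBRS_SCALE) -> float:
    """Potential difference for Phi(s) = -distance_to_waypoint (gamma = 1)."""
    phi_prev, phi_curr = -d_prev, -d_curr
    return scale * (phi_curr - phi_prev)


def idle_step_reward(drain: float) -> float:
    """Reward for a step in which the distance to the waypoint did not change."""
    return STEP_PENALTY - drain + pbrs_shaping(10.0, 10.0)


print(pbrs_shaping(10.0, 8.0))   # closing 2 m of distance
print(pbrs_shaping(8.0, 10.0))   # retreating 2 m
print(idle_step_reward(0.0))     # standing still with no drain
```

Because the shaping term is a pure difference, an idle step contributes nothing and only the step penalty and drain remain.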

**Vector-Field Reward Shaping (Crater Avoidance Zone)**
Activated within **10 m** of an obstacle, replacing the flat `−5.0` collision penalty with a continuous directional signal:

```
repulsive  = unit vector away from nearest obstacle centre
attractive = unit vector toward goal waypoint
tangent    = 90° CCW rotation of repulsive vector (goal-directed)
blend      = GOAL_BLEND × attractive + REP_BLEND × tangent
reward     = VF_SCALE × cosine_similarity(rover_heading, blend) × proximity_weight
```

The cosine similarity is `+1` when the rover's heading aligns with the blended safe-path vector and `−1` when it points the opposite way, so the term is bounded by `±VF_SCALE`. The proximity weight `(1 − d/VF_RADIUS)`, with `VF_RADIUS` = 10 m, grows from 0 at the edge of the zone toward 1 at the obstacle centre, concentrating the signal close to the danger zone. The rover learns to arc around craters rather than stop before them.
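
The pseudocode above could be realised roughly as follows. This is a sketch under stated assumptions: the values of `GOAL_BLEND` and `REP_BLEND` are not given in this README and are set to 0.5 purely for illustration, "goal-directed" is interpreted as flipping the tangent to the side that faces the goal, and the function name `vector_field_reward` is hypothetical.

```python
import math

VF_SCALE = 1.5     # from the Reward Signal table
VF_RADIUS = 10.0   # activation distance in metres
GOAL_BLEND = 0.5   # assumed weight, not stated in this README
REP_BLEND = 0.5    # assumed weight, not stated in this README


def _unit(x: float, y: float) -> tuple:
    """Normalise a 2-D vector; the zero vector stays zero."""
    n = math.hypot(x, y)
    return (x / n, y / n) if n > 0 else (0.0, 0.0)


def vector_field_reward(rover, heading, obstacle, goal) -> float:
    """Directional shaping term; zero outside VF_RADIUS."""
    d = math.hypot(rover[0] - obstacle[0], rover[1] - obstacle[1])
    if d >= VF_RADIUS:
        return 0.0
    rep = _unit(rover[0] - obstacle[0], rover[1] - obstacle[1])
    att = _unit(goal[0] - rover[0], goal[1] - rover[1])
    tan = (-rep[1], rep[0])                       # 90 degrees CCW of the repulsive vector
    if tan[0] * att[0] + tan[1] * att[1] < 0:     # assumed: pick the side facing the goal
        tan = (-tan[0], -tan[1])
    blend = _unit(GOAL_BLEND * att[0] + REP_BLEND * tan[0],
                  GOAL_BLEND * att[1] + REP_BLEND * tan[1])
    cos_sim = math.cos(heading) * blend[0] + math.sin(heading) * blend[1]
    return VF_SCALE * cos_sim * (1.0 - d / VF_RADIUS)


# Obstacle 5 m to the south, goal due east: heading east follows the safe path.
print(vector_field_reward((0.0, 0.0), 0.0, (0.0, -5.0), (10.0, 0.0)))
```

With the obstacle 5 m away the proximity weight is 0.5, so the term can reach at most half of `VF_SCALE` in this configuration.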

---

### 2 · The Format Gatekeeper: Pydantic as a Training Reward

LLMs fine-tuned for structured output routinely collapse to producing prose ("I think the rover should move forward...") because prose is always grammatically valid, while JSON can fail in many ways. Standard GRPO would assign a reward purely from the environment outcome, but if the action can't be parsed, no environment step fires at all and the episode silently terminates with a zero reward, giving the policy no gradient signal.

We address this with a **tiered reward function** inside the GRPO training loop:

| Tier | Signal | Value |
|---|---|---|
| **Format reward** | Pydantic-validated JSON with all 4 required fields (`thrust`, `steering`, `brake`, `vertical_thruster`) | +0.2 |
| **Correctness reward** | `thrust ≥ 0.5` and `brake == 0` (moving, not stalling) | +0.3 |
| **Field alignment bonus** | `abs(steering) ≤ 0.8` (not spinning in place) | +0.1 |
| **Episode score** | `/grader` endpoint response in `[0.0, 1.0]` | passed via dataset |

A hallucinated prose response gets **0.0**, the floor of this reward. A correctly formatted, physically reasonable action gets up to **0.6** (0.2 + 0.3 + 0.1) before the environment score is even consulted. The Llama 3.2 1B model learns that JSON compliance is a prerequisite, not a suggestion.
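
The first three tiers can be sketched with the standard library alone. This is not the project's `train.py`: it uses a hand-written stand-in for the Pydantic model so the example has no dependencies, it assumes the validator also enforces the bounds listed under Action Space, and the names `parse_action` and `shaped_reward` are hypothetical.

```python
import json

REQUIRED = ("thrust", "steering", "brake", "vertical_thruster")


def parse_action(text):
    """Return the action dict if `text` is valid JSON with in-range fields, else None."""
    try:
        action = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(action, dict) or any(k not in action for k in REQUIRED):
        return None
    # Reject non-numeric values (bool is excluded even though it subclasses int).
    for key in REQUIRED:
        value = action[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
    in_range = (
        0.0 <= action["thrust"] <= 1.0
        and -1.0 <= action["steering"] <= 1.0
        and action["brake"] in (0, 1)
        and -0.2 <= action["vertical_thruster"] <= 0.2
    )
    return action if in_range else None


def shaped_reward(completion: str) -> float:
    """Format + correctness + alignment tiers; the episode score is added elsewhere."""
    action = parse_action(completion)
    if action is None:
        return 0.0                      # prose or malformed JSON
    reward = 0.2                        # format reward
    if action["thrust"] >= 0.5 and action["brake"] == 0:
        reward += 0.3                   # correctness reward
    if abs(action["steering"]) <= 0.8:
        reward += 0.1                   # field alignment bonus
    return reward


print(shaped_reward("I think the rover should move forward..."))
print(shaped_reward('{"thrust": 1.0, "steering": 0.0, "brake": 0, "vertical_thruster": 0.0}'))
```

Keeping the parse step separate from the scoring step makes it easy to swap the stand-in for a real Pydantic model later.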

---

### 3 · Sim-to-Real Readiness via Physics Randomisation

Over-fitting to a deterministic simulation is a common failure mode in sim-to-real transfer. Three features in the physics engine guard against it:

| Feature | Implementation |
|---|---|
| **Domain Randomisation** | Terrain type, height, and obstacle positions are fully re-seeded every episode from a configurable RNG. Friction variance is implicit in the terrain-slope drag calculation: `drag = 1 − clamp(slope_proj × 0.3, −0.3, 0.3)`. Each episode presents a different friction profile. |
| **Action Smoothing (servo limits)** | The yaw-rate model couples steering authority to forward thrust: `yaw_rate = steering × MAX_YAW_RATE × (thrust + 0.1)`. At low speeds the rover can barely turn, mirroring real servo dynamics: at zero thrust only 10% of `MAX_YAW_RATE` is available, so the rover cannot spin quickly in place. |
| **Sensor Noise (implicit)** | The obstacle sensor returns the 8 nearest contacts normalised to `[−1, 1]` and padded with `dist_norm = 1.0` for absent obstacles. The finite 50 m sensor range and discrete 8-slot representation force the policy to reason under partial observability rather than treating the obstacle map as a complete world model. |

These three features help the trained policy generalise across episode seeds rather than memorise a single fixed layout.
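
The two closed-form expressions in the table translate directly into code. In this sketch `max_yaw_rate` is left as a parameter because its value is not stated in this README, and the function names are hypothetical.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def yaw_rate(steering: float, thrust: float, max_yaw_rate: float) -> float:
    """Steering authority grows with thrust; 10% remains at zero thrust."""
    return steering * max_yaw_rate * (thrust + 0.1)


def drag_factor(slope_proj: float) -> float:
    """Speed multiplier from terrain slope, limited to the range [0.7, 1.3]."""
    return 1.0 - clamp(slope_proj * 0.3, -0.3, 0.3)


print(yaw_rate(1.0, 0.0, 1.0))   # full steering, no thrust
print(yaw_rate(1.0, 1.0, 1.0))   # full steering, full thrust
print(drag_factor(0.5))          # moderate uphill slope
```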

---

### 4 · Architecture Diagram

```
┌─────────────────────────────────────────────────────────┐
│                   Docker Container                      │
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │              Physics World (main.py)             │   │
│  │                                                  │   │
│  │  TerrainGrid  ←→  RoverSim  ←→  ObstacleField    │   │
│  │       ↕              ↕               ↕           │   │
│  │  Euler Kinematics  Battery      Collision FSM    │   │
│  │       ↕              ↕               ↕           │   │
│  │         ┌────────────────────────┐               │   │
│  │         │   Reward Engine        │               │   │
│  │         │   PBRS + Vector-Field  │               │   │
│  │         └────────────────────────┘               │   │
│  │                     ↕                            │   │
│  │           FastAPI  (port 7860)                   │   │
│  │   /reset  /step  /state  /tasks  /baseline       │   │
│  │                   /grader                        │   │
│  └──────────────────────────────────────────────────┘   │
└────────────────────────┬────────────────────────────────┘
                         │ HTTP (JSON)
          ┌──────────────┴───────────────┐
          │                              │
   ┌──────▼────────┐             ┌───────▼─────────┐
   │  AI Brain     │             │  GRPO Trainer   │
   │ inference.py  │             │   train.py      │
   │               │             │                 │
   │ Llama 3.2 1B  │             │ Unsloth 4-bit   │
   │ AsyncOpenAI   │             │ TRL GRPOTrainer │
   │ aiohttp       │             │ 24GB Cloud GPU  │
   └───────────────┘             └─────────────────┘
```


By migrating our final training pipeline to a 24GB Cloud GPU, we scaled our GRPO rollouts to run multiple environment trajectories simultaneously, maximizing throughput and VRAM utilization.

---

## 📈 Evidence of Training & Convergence

### Sub-Section 1: Training Results & Analysis

![Training Loss](./images/loss.png)
*Figure 1: Policy Update Magnitude (Loss). The curve exhibits a definitive 'Discovery Spike' at Step 160, marking the transition from random exploration to the policy identifying structured reward patterns.*

![Reward Evolution](./images/reward.png)
*Figure 2: Reward Breakdown. Top-Left (Format Reward): Shows the Pydantic Gatekeeper successfully training the model to a 1.0 plateau (100% compliance). Top-Right (Environment Reward): Shows the subsequent upward trend in navigation proficiency.*

![System Logs 1](./images/training_logs1.png)
![System Logs 2](./images/training_logs2.png)
![System Logs 3](./images/training_logs3.png)
*Figure 3: Real-time System Integration Logs. Verifying the internal communication between the Llama 3.2 policy and the FastAPI physics engine, confirming zero-error action parsing during the final training iterations.*

### Sub-Section 2: The Learning Journey

Our agent followed a strict two-phase learning progression. First, it mastered 'Communication': learning to adhere strictly to the Pydantic JSON schema. Once its output was format-perfect, it mastered 'Navigation': balancing battery efficiency with waypoint proximity. Since W&B tracking was maintained privately, these committed local images serve as our official record of performance improvement.

### Sub-Section 3: Performance Comparison

| Metric | Baseline (Untrained) | GRPO-Trained Agent |
| :--- | :--- | :--- |
| Action Formatting | < 10% valid JSON | 100% (Strict Pydantic) |
| Obstacle Handling | High collision rate | Maintains safety buffer |
| Reward Trend | Flat / Stochastic | Consistently Upward |

---

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
# or, using uv:
uv sync
```

### 2. Configure environment variables

Create a `.env` file in the project root (see `.env.example`):

```env
HF_TOKEN=hf_your_token_here
MODEL_NAME=meta-llama/Llama-3.2-1B-Instruct
API_BASE_URL=https://api-inference.huggingface.co/v1
```

### 3. Run the environment server

```bash
# Terminal 1: start the simulation server on port 7860
export $(grep -v '^#' .env | xargs) && uv run uvicorn main:app --host 0.0.0.0 --port 7860
```

### 4. Run the LLM inference agent

```bash
# Terminal 2: requires the server to be running first
export $(grep -v '^#' .env | xargs) && uv run python inference.py
```

Exit code `0` = all three tasks scored above `0.0`. Exit code `1` = at least one task failed.

### Run with Docker

```bash
docker build -t rover-env .
docker run -p 7860:7860 rover-env

# Then run the inference agent against the container
export $(grep -v '^#' .env | xargs) && uv run python inference.py
```

### Interactive API docs

Once running, visit `http://localhost:7860/docs` for the full Swagger UI with live endpoint testing.

---

## Environment Overview

| Property | Value |
|---|---|
| World size | 1000 × 1000 m (rover bounded to ±500 m) |
| Timestep | 1 second per `step()` call |
| Max speed | 5.0 m/s at full thrust |
| Waypoint radius | 2.0 m (arrival threshold) |
| Collision radius | 0.5 m (obstacle contact) |
| Sensor range | 50.0 m (obstacle detection) |
| Terrain grid | 20 m × 20 m cells, lazy generation |
| Coordinate system | Cartesian, right-hand, Z = terrain height |

Physics are computed in pure Python using Euler integration, so no external simulation library is required.

---

## Observation Space

Returned as a JSON object by `/reset`, `/state`, and the `obs` field of `/step`.

| Field | Type | Shape | Bounds | Description |
|---|---|---|---|---|
| `rover_position` | `Box` | `[3]` | `[-500, 500]³` | `[x, y, z]` absolute position in metres |
| `rover_heading` | `Box` | `[1]` | `[-π, π]` | Yaw angle in radians (east = 0) |
| `rover_velocity` | `Box` | `[3]` | `[-5, 5]³` | `[vx, vy, vz]` velocity in m/s |
| `target_position` | `Box` | `[3]` | `[-500, 500]³` | Active waypoint absolute position |
| `target_relative` | `Box` | `[3]` | `[-1000, 1000]³` | Vector from rover to waypoint; use this for goal-conditioned policies |
| `target_distance` | `Box` | `[1]` | `[0, 1414]` | Euclidean distance to active waypoint in metres |
| `waypoints_remaining` | `Discrete` | n/a | `{0, 1, 2, 3}` | Unvisited waypoints left this episode |
| `obstacle_map` | `Box` | `[8, 3]` | `[-1, 1]` | 8 nearest obstacles as `[dx_norm, dy_norm, dist_norm]`; padded with `dist_norm=1.0` when fewer than 8 in range |
| `obstacle_count` | `Discrete` | n/a | `{0 … 8}` | Number of obstacles within 50 m sensor range |
| `nearest_obstacle_distance` | `Box` | `[1]` | `[0, 50]` | Raw distance to closest obstacle in metres |
| `battery_level` | `Box` | `[1]` | `[0, 1]` | Normalised remaining battery (0 = dead, 1 = full) |
| `battery_drain_rate` | `Box` | `[1]` | `[0, 1]` | Current drain per step as fraction of total capacity |
| `terrain_type` | `Discrete` | n/a | `{0, 1, 2, 3}` | Tile under rover: 0=flat, 1=rocky, 2=crater\_floor, 3=crater\_rim |
| `terrain_slope` | `Box` | `[2]` | `[-1, 1]` | `[slope_x, slope_y]` surface normal projections |
| `steps_taken` | `Box` | `[1]` | `[0, 500]` | Steps elapsed this episode |
| `steps_remaining_norm` | `Box` | `[1]` | `[0, 1]` | Remaining step budget normalised to `[0, 1]` |

**Policy tip:** `target_relative` gives you the direct `(dx, dy)` vector every step. Compute `atan2(dy, dx)` to get the heading you need, then steer toward it.
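
One detail the tip leaves implicit is angle wrap-around: subtracting two angles can give a result outside `[-π, π]`, which would make the rover turn the long way round. A small sketch, assuming `rover_heading` uses the same convention as `atan2` (radians, east = 0); the helper names are hypothetical:

```python
import math


def wrap_angle(angle: float) -> float:
    """Map any angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def heading_error(dx: float, dy: float, rover_heading: float) -> float:
    """Signed shortest rotation from the current heading to the target bearing."""
    return wrap_angle(math.atan2(dy, dx) - rover_heading)


# Rover faces almost due west (3.0 rad); the target bearing is about -3.04 rad.
# The raw difference is about -6.04 rad, but the short rotation is about +0.24 rad.
print(heading_error(-1.0, -0.1, 3.0))
```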

---

## Action Space

Sent as a JSON body to `POST /step?episode_id=<uuid>`.

| Field | Type | Bounds | Description |
|---|---|---|---|
| `thrust` | `Box` float32 | `[0.0, 1.0]` | Forward drive intensity. `0.0` = stopped, `1.0` = full throttle |
| `steering` | `Box` float32 | `[-1.0, 1.0]` | Yaw rate command. `-1.0` = hard left, `0.0` = straight, `1.0` = hard right. Effective yaw rate scales with current thrust |
| `brake` | `Discrete` int32 | `{0, 1}` | Binary regen-braking flag. `1` = halve speed and recover a small amount of battery |
| `vertical_thruster` | `Box` float32 | `[-0.2, 0.2]` | Vertical adjustment for crater terrain. Has no effect and incurs no cost on flat terrain |

**Example action (beeline at full throttle):**

```json
{
  "thrust": 1.0,
  "steering": 0.0,
  "brake": 0,
  "vertical_thruster": 0.0
}
```

---

## Tasks

All three tasks have exactly one waypoint. The rover always spawns at `(0, 0)` heading east.

### Task 1 - Easy: Flat Plains Transit

| Parameter | Value |
|---|---|
| `task_id` | `"easy"` |
| Difficulty | ⭐ |
| Max steps | 200 |
| Starting battery | 100% |
| Drain multiplier | ×1.0 |
| Obstacles | None |
| Terrain | Flat |
| Scoring formula | `proximity × 0.85 + step_efficiency × 0.15` |

Navigate to a single waypoint on flat, open terrain with no obstacles and a full battery. The only challenge is correctly steering toward `target_relative`.

### Task 2 - Medium: Crater Avoidance

| Parameter | Value |
|---|---|
| `task_id` | `"medium"` |
| Difficulty | ⭐⭐ |
| Max steps | 300 |
| Starting battery | 100% |
| Drain multiplier | ×1.0 |
| Obstacles | 1 deterministic crater ring (22 posts, 2 gaps) |
| Terrain | Flat |
| Scoring formula | `proximity × 0.75 + step_efficiency × 0.25 − min(collisions × 0.06, 0.40)` |

A ring of 22 obstacle posts is placed at the midpoint of the rover→waypoint line, blocking the direct path. Two 48° gaps are cut perpendicular to the approach direction. Each collision subtracts 0.06 from the score, with the total penalty capped at 0.40.

**Key observation fields for this task:** `obstacle_map`, `obstacle_count`, `nearest_obstacle_distance`.

### Task 3 - Hard: Battery Sprint

| Parameter | Value |
|---|---|
| `task_id` | `"hard"` |
| Difficulty | ⭐⭐⭐ |
| Max steps | 100 |
| Starting battery | **35%** |
| Drain multiplier | **×4.0** |
| Obstacles | None |
| Terrain | Flat |
| Scoring formula | `proximity × 0.65 + battery_efficiency × 0.35` |

The rover starts with only 35% battery. Combined with a ×4 drain multiplier, a full-throttle beeline exhausts the battery in approximately 8 steps, barely enough to reach the waypoint. Any detour is fatal.

`battery_efficiency = battery_remaining / 0.35` (normalised against starting charge).
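
The "approximately 8 steps" figure can be reproduced with a back-of-the-envelope calculation. The base drain rate is not stated in this README; the sketch below assumes 0.01 of capacity per full-throttle step, a value inferred from the sample `info` dict (battery `0.800` after 20 steps on the easy task), and the function names are hypothetical.

```python
MAX_SPEED = 5.0              # m/s at full thrust, from the Environment Overview table
BASE_DRAIN_PER_STEP = 0.01   # assumed, inferred from the sample info dict


def steps_until_empty(start_battery: float, drain_multiplier: float,
                      base_drain: float = BASE_DRAIN_PER_STEP) -> float:
    """Full-throttle steps available before the battery reaches zero."""
    return start_battery / (base_drain * drain_multiplier)


def full_throttle_range(start_battery: float, drain_multiplier: float) -> float:
    """Distance in metres covered in that time, assuming constant top speed and dt = 1 s."""
    return MAX_SPEED * steps_until_empty(start_battery, drain_multiplier)


print(steps_until_empty(0.35, 4.0))     # hard task budget in steps
print(full_throttle_range(0.35, 4.0))   # hard task budget in metres
```

Under these assumptions the hard task leaves a little under nine steps of travel, which is why the heading has to be right from the first step.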

---

## API Reference

All endpoints return JSON. The base URL for a running server is `http://localhost:7860`.

### `GET /tasks`

Returns metadata for all three tasks including the full action schema, scoring formula, and policy hints.

```bash
curl http://localhost:7860/tasks
```

### `POST /reset`

Starts a new episode. Returns the initial observation and an `episode_id` required by all subsequent calls.

```bash
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy", "seed": 42}'
```

**Response fields:** `obs` (full Observation), `episode_id` (UUID string), `task_id`.

### `GET /state`

Returns the current observation without advancing the simulation.

```bash
curl "http://localhost:7860/state?episode_id=<uuid>"
```

### `POST /step`

Applies one action and advances the simulation by one timestep (dt = 1 s).

```bash
curl -X POST "http://localhost:7860/step?episode_id=<uuid>" \
  -H "Content-Type: application/json" \
  -d '{"thrust": 1.0, "steering": 0.0, "brake": 0, "vertical_thruster": 0.0}'
```

**Response fields:** `obs`, `reward` (float), `done` (bool), `truncated` (bool), `info` (dict).

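The `info` dict contains the grader telemetry. Its values map one-to-one onto the `/grader` request fields, although several keys are named differently (for example, `min_distance` becomes `min_distance_achieved`):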

```json
{
  "termination_reason": "waypoint_reached | battery_dead | max_steps | unknown",
  "initial_distance": 94.6,
  "min_distance": 0.14,
  "collision_count": 0,
  "waypoints_hit": 1,
  "total_waypoints": 1,
  "steps": 20,
  "max_steps": 200,
  "battery": 0.800
}
```

### `GET /baseline`

Returns the machine-readable environment identity card (name, version, full observation and action space declarations, task list). Used by the OpenEnv registry and auto-validators.

```bash
curl http://localhost:7860/baseline
```

### `POST /grader`

Scores a completed episode. Returns a float in `[0.0, 1.0]`.

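All values can be read from the `info` dict of the final `/step` response, so no client-side bookkeeping is required. Four keys are renamed in the request body: `min_distance` → `min_distance_achieved`, `waypoints_hit` → `waypoints_reached`, `steps` → `steps_taken`, and `battery` → `battery_remaining`.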

```bash
curl -X POST http://localhost:7860/grader \
  -H "Content-Type: application/json" \
  -d '{
    "episode_id":            "<uuid>",
    "task_id":               "easy",
    "termination_reason":    "waypoint_reached",
    "initial_distance":      94.6,
    "min_distance_achieved": 0.14,
    "waypoints_reached":     1,
    "total_waypoints":       1,
    "steps_taken":           20,
    "max_steps":             200,
    "battery_remaining":     0.800,
    "collision_count":       0
  }'
```
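
The translation from the `/step` `info` dict to this request body can be captured in one small helper. The key names come from the two JSON examples in this README; the names `INFO_TO_GRADER` and `grader_payload` are hypothetical.

```python
# Maps each key in the /step info dict to the matching /grader request field.
INFO_TO_GRADER = {
    "termination_reason": "termination_reason",
    "initial_distance": "initial_distance",
    "min_distance": "min_distance_achieved",
    "waypoints_hit": "waypoints_reached",
    "total_waypoints": "total_waypoints",
    "steps": "steps_taken",
    "max_steps": "max_steps",
    "battery": "battery_remaining",
    "collision_count": "collision_count",
}


def grader_payload(episode_id: str, task_id: str, info: dict) -> dict:
    """Build the /grader request body from the final step's info dict."""
    payload = {"episode_id": episode_id, "task_id": task_id}
    for info_key, grader_key in INFO_TO_GRADER.items():
        payload[grader_key] = info[info_key]
    return payload


sample_info = {
    "termination_reason": "waypoint_reached",
    "initial_distance": 94.6,
    "min_distance": 0.14,
    "collision_count": 0,
    "waypoints_hit": 1,
    "total_waypoints": 1,
    "steps": 20,
    "max_steps": 200,
    "battery": 0.800,
}
print(grader_payload("<uuid>", "easy", sample_info))
```

A missing key raises `KeyError` immediately, which is usually preferable to sending an incomplete request to the grader.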

**Response fields:**

| Field | Type | Description |
|---|---|---|
| `score` | float | Final score in `[0.0, 1.0]` |
| `verdict` | string | `WIN`, `WIN_WITH_COLLISIONS`, `PARTIAL_PROGRESS`, `COLLISION_LOSS`, `BATTERY_DEAD`, or `TIMEOUT` |
| `proximity_progress` | float | Raw linear proximity metric. Exactly `0.70` when the rover closed 70% of the gap |
| `score_rationale` | string | One-sentence explanation of the outcome |
| `breakdown` | dict | Per-component scores (keys vary by task) |

---

## Grading

### Scoring formulas

**Easy: Flat Plains Transit**
```
score = proximity × 0.85 + step_efficiency × 0.15
```

**Medium: Crater Avoidance**
```
collision_penalty = min(collision_count × 0.06, 0.40)
score = proximity × 0.75 + step_efficiency × 0.25 − collision_penalty
```

**Hard: Battery Sprint**
```
battery_efficiency = battery_remaining / 0.35
score = proximity × 0.65 + battery_efficiency × 0.35
```

### Shared metrics

**`proximity`** is a strictly linear metric:

```
proximity = 1.0 − (min_distance_achieved / initial_distance)
```

This is exactly `0.70` when the rover closed 70% of the spawn→waypoint gap, `0.0` if it never moved, and overridden to `1.0` on confirmed arrival.

**`step_efficiency`**:
```
step_efficiency = 1.0 − (steps_taken / max_steps)
```
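
For checking scores by hand, the formulas above can be written out as plain functions. This is a sketch rather than the grader's actual code: the function names are hypothetical, and the final clamp to `[0.0, 1.0]` is an assumption based on the stated range of the `/grader` response.

```python
def clamp01(x: float) -> float:
    """Assumed final clamp, matching the documented [0.0, 1.0] score range."""
    return max(0.0, min(1.0, x))


def proximity(min_distance: float, initial_distance: float, arrived: bool) -> float:
    """Linear progress metric, overridden to 1.0 on confirmed arrival."""
    return 1.0 if arrived else 1.0 - min_distance / initial_distance


def step_efficiency(steps_taken: int, max_steps: int) -> float:
    return 1.0 - steps_taken / max_steps


def score_easy(prox: float, step_eff: float) -> float:
    return clamp01(prox * 0.85 + step_eff * 0.15)


def score_medium(prox: float, step_eff: float, collisions: int) -> float:
    penalty = min(collisions * 0.06, 0.40)
    return clamp01(prox * 0.75 + step_eff * 0.25 - penalty)


def score_hard(prox: float, battery_remaining: float) -> float:
    battery_efficiency = battery_remaining / 0.35
    return clamp01(prox * 0.65 + battery_efficiency * 0.35)


# Easy task: arrival after using half of the 200-step budget.
print(score_easy(proximity(0.14, 94.6, True), step_efficiency(100, 200)))
# Hard task: battery dead after closing 70% of the gap.
print(score_hard(0.7, 0.0))
```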

### Score examples

| Scenario | Score |
|---|---|
| Easy: beeline arrival using 50% of budget | 0.85 + 0.075 = **0.925** |
| Easy: arrival using full budget | 0.85 + 0.000 = **0.850** |
| Easy: 70% progress, no arrival | ~0.595–0.700 |
| Medium: arrival, zero collisions | 0.75 + 0.25 = **1.000** (upper bound, as step_efficiency approaches 1) |
| Medium: arrival, 3 collisions | 1.00 − 0.18 = **0.820** (same upper bound) |
| Medium: stuck in ring, 8+ collisions | ≤ 0 before the grader's `[0.0, 1.0]` range is applied, so **0.000** |
| Hard: arrival, 50% starting battery left | 0.65 + 0.175 = **0.825** |
| Hard: arrival, battery = 0 on landing | 0.65 + 0.000 = **0.650** |
| Hard: battery dead at 70% progress | 0.455 + 0.000 = **0.455** |

---

## Reward Signal

The step reward returned by `/step` is used for online RL training. It is separate from the grader score.

> **Note: reward system overhauled in Phase 4.** The original static penalties caused the *stationary exploit* (see Engineering Highlights above). The values below reflect the current `_compute_reward` implementation.

| Event | Reward | Notes |
|---|---|---|
| Every step | −0.01 | Constant time pressure; ensures idle steps are always net-negative |
| Battery drain | −drain × 1.0 | Proportional efficiency cost (coefficient reduced from 2.0 to 1.0; PBRS now carries the main navigation signal) |
| **Waypoint reached** | **+100.0** | Asymmetric terminal bonus; the episode returns immediately, which prevents early policy collapse |
| Battery depleted | −20.0 | Terminal penalty |
| **Potential-based shaping** | `PBRS_SCALE × (d_prev − d_curr)` where `PBRS_SCALE = 0.5` | Exactly **0** when stationary; positive when closing gap; negative when moving away |
| **Vector-field shaping** | `VF_SCALE × cos_sim × proximity_weight` (`VF_SCALE = 1.5`) | Active within 10 m of obstacles; `proximity_weight = 1 − d / 10`; bounded between −1.5 (heading opposed to the blended safe-path vector) and +1.5 (aligned with it) |

---

## File Structure

```
planetary-rover-env/
├── openenv.yaml      # Typed observation + action space declarations
├── main.py           # FastAPI server: physics engine + all routes (1632 lines)
├── inference.py      # LLM-driven inference agent (HF Inference API)
├── train.py          # GRPO training script (Unsloth 4-bit + TRL GRPOTrainer)
├── requirements.txt  # Pinned runtime dependencies
├── Dockerfile        # Two-stage optimised build, port 7860, non-root user
└── README.md         # This file
```

---

## Dependencies

| Package | Version | Role |
|---|---|---|
| `fastapi` | 0.115.6 | ASGI web framework |
| `uvicorn[standard]` | 0.32.1 | ASGI server (uvloop + httptools) |
| `pydantic` | 2.10.3 | Request/response validation |
| `aiohttp` | not listed | Async HTTP client in `inference.py` |
| `openai` | not listed | OpenAI-compatible LLM client in `inference.py` |

The simulation engine itself uses only Python stdlib (`math`, `random`, `uuid`, `dataclasses`, `enum`).

---

## Inference Agent Results

Running the LLM inference agent against a local server:

```bash
export $(grep -v '^#' .env | xargs) && uv run python inference.py
```

Reference scores (with the strategies embedded in the system prompt):

| Task | Agent strategy | Typical score | Verdict |
|---|---|---|---|
| easy | Beeline: `atan2(dy, dx)` heading lock, `thrust=1.0` | 0.92–0.98 | WIN |
| medium | Two-phase detour: approach → perpendicular → approach | 0.85–0.92 | WIN |
| hard | Heading lock on step 1, never steer again | 0.45–0.65 | WIN / BATTERY_DEAD |

These scores represent LLM-driven P-controller navigation. A trained RL policy should be able to match or exceed them, with the most headroom on the hard task.

---

## Building Your Own Agent

The minimal loop to run an episode:

```python
import math

import requests

BASE = "http://localhost:7860"

# 1. Discover the task
tasks = requests.get(f"{BASE}/tasks").json()

# 2. Reset
resp = requests.post(f"{BASE}/reset", json={"task_id": "easy", "seed": 42}).json()
episode_id = resp["episode_id"]
obs = resp["obs"]

# 3. Step loop
while True:
    dx = obs["target_relative"]["x"]
    dy = obs["target_relative"]["y"]
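    # Wrap the difference to [-pi, pi) so the rover always turns the short way round
    heading_error = (math.atan2(dy, dx) - obs["rover_heading"] + math.pi) % (2 * math.pi) - math.pi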

    action = {
        "thrust":            1.0,
        "steering":          max(-1.0, min(1.0, heading_error * 2.5)),
        "brake":             0,
        "vertical_thruster": 0.0,
    }

    step = requests.post(f"{BASE}/step", json=action,
                         params={"episode_id": episode_id}).json()
    obs = step["obs"]

    if step["done"] or step["truncated"]:
        info = step["info"]
        break

# 4. Grade
grade = requests.post(f"{BASE}/grader", json={
    "episode_id":            episode_id,
    "task_id":               "easy",
    "termination_reason":    info["termination_reason"],
    "initial_distance":      info["initial_distance"],
    "min_distance_achieved": info["min_distance"],
    "waypoints_reached":     info["waypoints_hit"],
    "total_waypoints":       info["total_waypoints"],
    "steps_taken":           info["steps"],
    "max_steps":             info["max_steps"],
    "battery_remaining":     info["battery"],
    "collision_count":       info["collision_count"],
}).json()

print(f"Score: {grade['score']}  Verdict: {grade['verdict']}")
print(f"Rationale: {grade['score_rationale']}")
```

---

## License

MIT