# Agent & RL Training Documentation

## Autonomous Driving Multi-Agent OpenEnv

---

## 1. What Are the Agents?

This project has **three vehicles** in the environment, each with a different policy:

| Agent | Symbol | Type | Policy | Learns? |
|-------|--------|------|--------|---------|
| **Ego Vehicle** | E | LLM-controlled | GRPO fine-tuned | Yes |
| **Blocker Vehicle** | B | Rule-based | Tries to match ego's lane | No |
| **Traffic Vehicle** | T | Stochastic | Random lane drift | No |

---
## 2. How the Ego Agent Thinks

Every step, the LLM agent receives:

```
SYSTEM PROMPT (instructions + action space)
        +
USER PROMPT:
├── Current road render (ASCII grid)
├── Lidar sensor readings
├── Collision prediction
├── Recent negotiation log
└── Memory (last 3 steps)
```
And must output structured JSON:

```json
{
  "thinking": "Blocker is 3 steps ahead in center lane. I should negotiate first, then change to the left lane.",
  "negotiate": "blocker|Please yield, I need to pass safely",
  "action": 2
}
```

The `thinking` field is the **chain-of-thought**: it is rewarded for being present and meaningful, which encourages the LLM to reason before acting.
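A minimal sketch of how such a response could be parsed and scored. The function name and return shape are illustrative assumptions; the reward values mirror the table in Section 6.

```python
import json

def parse_response(text):
    """Parse one LLM response; return (action, negotiate_msg, format_reward)."""
    reward = 0.0
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None, None, reward          # invalid JSON: no structural reward
    reward += 0.2                          # valid JSON format
    if len(data.get("thinking", "")) > 15:
        reward += 0.2                      # chain-of-thought present and non-trivial
    if "negotiate" in data:
        reward += 0.1                      # tool awareness
    action = data.get("action")
    if action not in (0, 1, 2, 3):
        reward -= 0.1                      # invalid action int
        action = None
    return action, data.get("negotiate"), reward

good = ('{"thinking": "Blocker ahead in my lane, go left.", '
        '"negotiate": "blocker|please yield", "action": 2}')
action, msg, r = parse_response(good)      # action=2, r=0.5
```

Keeping the parser tolerant (returning `None` rather than raising) lets a malformed response simply earn zero format reward instead of crashing the rollout.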
---

## 3. Action Space

| ID | Name | Effect |
|----|------|--------|
| 0 | `accelerate` | Ego moves +2 positions forward |
| 1 | `brake` | Ego moves +1 position (slower, safer) |
| 2 | `lane_left` | Ego shifts one lane left |
| 3 | `lane_right` | Ego shifts one lane right |
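The table above can be sketched as a state-transition function. This is an illustrative model, not the environment's actual code; the 3-lane assumption and `valid` flag are mine (lane changes into a wall map to the −0.2 "invalid move" penalty in Section 6).

```python
ACTIONS = {0: "accelerate", 1: "brake", 2: "lane_left", 3: "lane_right"}
NUM_LANES = 3  # assumption: lanes 0..2, lane 0 leftmost

def apply_action(lane, position, action):
    """Return (new_lane, new_position, valid) after one action."""
    if action == 0:                        # accelerate: +2 positions
        return lane, position + 2, True
    if action == 1:                        # brake: +1 position, slower/safer
        return lane, position + 1, True
    if action == 2:                        # lane_left, invalid at left wall
        return (lane - 1, position, True) if lane > 0 else (lane, position, False)
    if action == 3:                        # lane_right, invalid at right wall
        return (lane + 1, position, True) if lane < NUM_LANES - 1 else (lane, position, False)
    return lane, position, False           # unknown action id
```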
---

## 4. Sensor Tools

The agent can call these tools to observe the world:

### `lidar_scan()` → dict

```json
{
  "blocker_distance": 3,
  "blocker_lane": 1,
  "traffic_distance": 6,
  "traffic_lane": 2,
  "ego_lane": 1,
  "ego_position": 4,
  "goal_distance": 15
}
```
### `predict_collision()` → dict

```json
{
  "blocker_threat": true,
  "traffic_threat": false,
  "immediate_collision": false
}
```
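One plausible way `predict_collision()` could be derived from the lidar reading: a vehicle is a threat when it shares the ego's lane and is close. The distance thresholds (3 and 1) are assumptions for illustration, chosen so the two example payloads above are consistent.

```python
def predict_collision(scan):
    """Derive threat flags from a lidar_scan() dict (threshold values assumed)."""
    blocker_threat = (scan["blocker_lane"] == scan["ego_lane"]
                      and scan["blocker_distance"] <= 3)
    traffic_threat = (scan["traffic_lane"] == scan["ego_lane"]
                      and scan["traffic_distance"] <= 3)
    immediate = ((blocker_threat and scan["blocker_distance"] <= 1)
                 or (traffic_threat and scan["traffic_distance"] <= 1))
    return {"blocker_threat": blocker_threat,
            "traffic_threat": traffic_threat,
            "immediate_collision": immediate}

# The lidar_scan() example above yields exactly the predict_collision() example:
scan = {"blocker_distance": 3, "blocker_lane": 1, "traffic_distance": 6,
        "traffic_lane": 2, "ego_lane": 1, "ego_position": 4, "goal_distance": 15}
```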
---

## 5. Negotiation System

The ego agent can send **natural language messages** to other vehicles:

```python
# In the environment
response = env.negotiate("blocker", "Please yield, I need to pass safely")
# → "Yielding lane - proceed safely."
```

**Blocker yielding logic:**

- The blocker yields if the ego is ≤4 steps away **AND** the message contains polite words
  (`request`, `please`, `yield`, `allow`, `safe`)
- If the blocker yields, it moves out of the ego's lane and the ego gets a +0.3 reward bonus

**Why this matters for RL:** The LLM must learn *when* to negotiate vs. when to just act. Negotiating costs a step but can unlock reward. This creates a multi-step reasoning challenge.
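The yielding rule above, sketched directly; the word list and distance threshold come from this section, while the function name and signature are illustrative.

```python
POLITE_WORDS = ("request", "please", "yield", "allow", "safe")

def blocker_yields(ego_distance, message):
    """Blocker yields iff the ego is <= 4 steps away AND the message is polite."""
    polite = any(word in message.lower() for word in POLITE_WORDS)
    return ego_distance <= 4 and polite
```

Note that "safely" matches via the substring `safe`, so the example message satisfies the rule on three counts (`please`, `yield`, `safe`).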
---

## 6. Reward Structure

| Event | Reward | Why |
|-------|--------|-----|
| Reach goal (position 19) | **+10.0** | Primary objective |
| Collision | **−10.0** | Safety constraint |
| Successful lane change past blocker | **+0.5** | Progress reward |
| Negotiation causes blocker to yield | **+0.3** | Tool use reward |
| Per step (time penalty) | **−0.05** | Encourages efficiency |
| Invalid move (wall) | **−0.2** | Constraint violation |
| Valid JSON format | **+0.2** | Structural reward |
| Has `thinking` field (>15 chars) | **+0.2** | Reasoning reward |
| Has `negotiate` field | **+0.1** | Tool awareness |
| Invalid action int | **−0.1** | Format penalty |

**Terminal propagation:** After each episode, a win bonus (+1.0) or loss penalty (−1.0) is added to **all steps** of that episode. This gives the policy a clear signal about whether its overall strategy was good.
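Terminal propagation is a one-liner in practice. A minimal sketch (function name assumed):

```python
def propagate_terminal(step_rewards, won):
    """Add the episode-level win bonus (+1.0) or loss penalty (-1.0) to every step."""
    bonus = 1.0 if won else -1.0
    return [r + bonus for r in step_rewards]

# A winning 3-step episode: every step, including early ones, is credited.
adjusted = propagate_terminal([0.2, -0.05, 10.0], won=True)
```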
---

## 7. How RL Training Works (GRPO)

```
┌────────────────────────────────────────────────────────────┐
│                     GRPO Training Loop                     │
│                                                            │
│  1. ROLLOUT COLLECTION                                     │
│     ├── Play N games with current LLM policy               │
│     ├── Each step: LLM generates response (vLLM for speed) │
│     └── Collect (prompt, response, reward) tuples          │
│                                                            │
│  2. REWARD COMPUTATION                                     │
│     ├── Environment reward (collision, goal, shaping)      │
│     ├── Format reward (JSON structure)                     │
│     └── Terminal propagation (win/loss to all steps)       │
│                                                            │
│  3. GRPO UPDATE                                            │
│     ├── num_generations=4: sample 4 responses per prompt   │
│     ├── Compute relative advantage: r_i - mean(r)          │
│     ├── Policy gradient loss with KL penalty vs base model │
│     └── LoRA adapter weights updated                       │
│                                                            │
│  4. ONLINE RL (closed loop)                                │
│     └── Repeat: play with updated policy → collect → update│
└────────────────────────────────────────────────────────────┘
```
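Step 1 of the loop can be sketched as a plain Python skeleton. Everything here is a stub: `stub_policy` stands in for the vLLM-served LLM and `stub_env_step` for the OpenEnv server; only the shape of the collected tuples reflects the loop above.

```python
import random

def stub_policy(prompt):
    """Placeholder for the LLM: returns a random structured action."""
    return '{"thinking": "...", "action": %d}' % random.randint(0, 3)

def stub_env_step(response):
    """Placeholder for the environment: (next_prompt, reward, done)."""
    return "next observation", -0.05, False   # per-step time penalty only

def collect_rollout(max_steps=5):
    """Play one episode, collecting (prompt, response, reward) tuples."""
    prompt, tuples = "initial observation", []
    for _ in range(max_steps):
        response = stub_policy(prompt)
        next_prompt, reward, done = stub_env_step(response)
        tuples.append((prompt, response, reward))
        prompt = next_prompt
        if done:
            break
    return tuples
```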
### Why GRPO (not PPO)?

GRPO (Group Relative Policy Optimization) is used because:

- No separate value/critic network is needed, so it is simpler
- It works well with LLMs generating text sequences
- The "group" of N responses per prompt provides a natural baseline
- Reward is relative: responses better than average get positive advantage
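The group baseline is just the mean reward of the `num_generations=4` samples for the same prompt:

```python
def group_advantages(rewards):
    """Relative advantage for each response: r_i - mean(r) over its group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four sampled responses to one prompt; group mean is 0.5, so advantages
# are centered: better-than-average responses get positive advantage.
adv = group_advantages([1.0, 0.5, -0.5, 1.0])
```

(Some GRPO implementations additionally divide by the group's reward standard deviation; the mean-subtracted form shown here is the core idea.)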
### LoRA Fine-tuning

We use LoRA (Low-Rank Adaptation) so we only train ~1% of parameters:

```
Base model weights:  FROZEN  (Qwen3-4B or gpt-oss-20B)
LoRA matrices:       TRAINED (q_proj, k_proj, v_proj, o_proj, gate, up, down)
```

This means the model retains general language ability while learning the driving task.
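Back-of-envelope arithmetic for the "~1%" claim: a rank-r adapter on a d_in × d_out weight matrix adds r·(d_in + d_out) trainable parameters against d_in·d_out frozen ones. The dimensions and rank below are illustrative, not the exact Qwen3-4B shapes.

```python
def lora_fraction(d_in, d_out, r):
    """Fraction of trainable LoRA parameters relative to the frozen matrix."""
    return r * (d_in + d_out) / (d_in * d_out)

# One square 4096x4096 projection with rank-16 adapters: ~0.78% trainable.
frac = lora_fraction(4096, 4096, 16)
```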
---

## 8. What the Model Learns Over Training

| Early training | Late training |
|----------------|---------------|
| Random actions | Strategic lane changes |
| No negotiation | Negotiates when blocker is close |
| Invalid JSON | Consistent structured output |
| Collides frequently | Avoids collisions |
| Doesn't use sensors | References lidar in reasoning |
| ## 9. W&B Metrics to Track | |
| | Metric | Meaning | | |
| |--------|---------| | |
| | `win_rate` | % episodes reaching goal | | |
| | `reward/mean` | Average reward per step | | |
| | `kl_divergence` | How far policy has drifted from base | | |
| | `format_reward` | % responses with valid JSON | | |
| | `policy/entropy` | Exploration (high) vs exploitation (low) | | |
| | `negotiation_rate` | % steps with negotiation attempt | | |
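A sketch of how a few of these metrics could be computed from episode records before logging (e.g. via `wandb.log`); the record field names are assumptions, not the project's actual schema.

```python
def summarize(episodes):
    """Aggregate per-episode records into the W&B metrics above."""
    n = len(episodes)
    total_steps = sum(e["steps"] for e in episodes)
    return {
        "win_rate": sum(e["won"] for e in episodes) / n,
        "reward/mean": sum(e["total_reward"] for e in episodes) / total_steps,
        "negotiation_rate": sum(e["negotiated_steps"] for e in episodes) / total_steps,
    }

stats = summarize([
    {"won": True,  "total_reward": 9.1,   "steps": 10, "negotiated_steps": 2},
    {"won": False, "total_reward": -11.0, "steps": 8,  "negotiated_steps": 0},
])
```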
---

## 10. File Structure

```
final_project/
├── env/
│   └── negotiation_env.py        ← Environment logic, sensors, reward
├── agents/
│   ├── negotiation_agent.py      ← LLM agent, prompt, tool calls
│   └── memory.py                 ← Episode memory for in-context use
├── server/
│   ├── server.py                 ← FastAPI OpenEnv server
│   └── requirements.txt          ← Required for HF Spaces Docker build
├── training/
│   └── train_grpo_colab.ipynb    ← Full GRPO training notebook (H100)
├── ui/
│   └── app.py                    ← Gradio simulator UI
├── docs/
│   └── DOCUMENTATION.md          ← This file
├── Dockerfile                    ← HF Spaces deployment
└── README.md
```
---

## 11. Quick Start

```bash
# Install
pip install -r server/requirements.txt

# Run environment server
uvicorn server.server:app --reload --port 7860

# Run UI
python ui/app.py

# Test environment
python -c "
from env.negotiation_env import NegotiationDrivingEnv
env = NegotiationDrivingEnv()
obs, _ = env.reset()
print(env.render())
print(env.lidar_scan())
r = env.negotiate('blocker', 'Please yield, I need to pass safely')
print('Blocker says:', r)
obs, reward, done, _, info = env.step(2)  # lane_left
print('Reward:', reward, '| Info:', info)
"
```

---
## 12. Deployment to HF Spaces

```bash
# Login
huggingface-cli login

# Push (from project root)
# The Dockerfile copies server/requirements.txt, so it must exist!
git init && git add . && git commit -m "initial"
huggingface-cli repo create autonomous-driving-env --type space --sdk docker
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-driving-env
git push hf main
```