# 📚 Agent & RL Training Documentation

## Autonomous Driving Multi-Agent OpenEnv

---

## 1. What Are the Agents?

This project has **three vehicles** in the environment, each with a different policy:

| Agent | Symbol | Type | Policy | Learns? |
|-------|--------|------|--------|---------|
| **Ego Vehicle** | 🚗 E | LLM-controlled | GRPO fine-tuned | ✅ Yes |
| **Blocker Vehicle** | 🚧 B | Rule-based | Tries to match ego's lane | ❌ No |
| **Traffic Vehicle** | 🚕 T | Stochastic | Random lane drift | ❌ No |

---

## 2. How the Ego Agent Thinks

Every step, the LLM agent receives:

```
SYSTEM PROMPT (instructions + action space)
+
USER PROMPT:
├── Current road render (ASCII grid)
├── Lidar sensor readings
├── Collision prediction
├── Recent negotiation log
└── Memory (last 3 steps)
```

And must output structured JSON:

```json
{
  "thinking": "Blocker is 3 steps ahead in center lane. I should negotiate first, then change to the left lane.",
  "negotiate": "blocker|Please yield, I need to pass safely",
  "action": 2
}
```

The `thinking` field is the **chain-of-thought** — rewarded for being present and meaningful. This encourages the LLM to reason before acting.

---

## 3. Action Space

| ID | Name | Effect |
|----|------|--------|
| 0 | `accelerate` | Ego moves +2 positions forward |
| 1 | `brake` | Ego moves +1 position (slower, safer) |
| 2 | `lane_left` | Ego shifts one lane left |
| 3 | `lane_right` | Ego shifts one lane right |

---

## 4. Sensor Tools

The agent can call these tools to observe the world:

### `lidar_scan()` → dict

```json
{
  "blocker_distance": 3,
  "blocker_lane": 1,
  "traffic_distance": 6,
  "traffic_lane": 2,
  "ego_lane": 1,
  "ego_position": 4,
  "goal_distance": 15
}
```

### `predict_collision()` → dict

```json
{
  "blocker_threat": true,
  "traffic_threat": false,
  "immediate_collision": false
}
```

---
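The structured-output contract above can be sketched as a small validator. This is a minimal illustration, not code from the repo: the helper name `parse_agent_response` is hypothetical, and the bonus values are taken from this document's reward table (valid JSON +0.2, meaningful `thinking` +0.2, `negotiate` present +0.1, invalid action −0.1).

```python
import json

# Hypothetical helper (not part of the repo): validates the agent's JSON
# reply and accumulates the format-related rewards from the reward table.
def parse_agent_response(raw):
    """Return (parsed_response_or_None, format_reward)."""
    bonus = 0.0
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None, bonus              # no structural reward for invalid JSON
    bonus += 0.2                        # valid JSON format
    if len(parsed.get("thinking", "")) > 15:
        bonus += 0.2                    # chain-of-thought present and non-trivial
    if "negotiate" in parsed:
        bonus += 0.1                    # tool awareness
    action = parsed.get("action")
    if not isinstance(action, int) or action not in (0, 1, 2, 3):
        bonus -= 0.1                    # invalid action integer
    return parsed, bonus
```

A response like the JSON example above would parse to `action == 2` with a format reward of +0.5, while unparseable text earns nothing.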
## 5. Negotiation System

The ego agent can send **natural language messages** to other vehicles:

```python
# In the environment
response = env.negotiate("blocker", "Please yield, I need to pass safely")
# → "Yielding lane — proceed safely."
```

**Blocker yielding logic:**

- The blocker yields if the ego is ≤4 steps away **AND** the message contains a polite word (`request`, `please`, `yield`, `allow`, `safe`)
- If the blocker yields, it moves out of the ego's lane and the ego earns a +0.3 reward bonus

**Why this matters for RL:** The LLM must learn *when* to negotiate versus when to simply act. Negotiating costs a step but can unlock reward, which creates a multi-step reasoning challenge.

---

## 6. Reward Structure

| Event | Reward | Why |
|-------|--------|-----|
| Reach goal (position 19) | **+10.0** | Primary objective |
| Collision | **−10.0** | Safety constraint |
| Successful lane change past blocker | **+0.5** | Progress reward |
| Negotiation causes blocker to yield | **+0.3** | Tool-use reward |
| Per step (time penalty) | **−0.05** | Encourages efficiency |
| Invalid move (wall) | **−0.2** | Constraint violation |
| Valid JSON format | **+0.2** | Structural reward |
| Has `thinking` field (>15 chars) | **+0.2** | Reasoning reward |
| Has `negotiate` field | **+0.1** | Tool awareness |
| Invalid action integer | **−0.1** | Format penalty |

**Terminal propagation:** After each episode, a win bonus (+1.0) or loss penalty (−1.0) is added to **all steps** of that episode. This gives the policy a clear signal about whether its overall strategy was good.

---

## 7. How RL Training Works (GRPO)

```
┌──────────────────────────────────────────────────────────────┐
│                      GRPO Training Loop                      │
│                                                              │
│  1. ROLLOUT COLLECTION                                       │
│     ├── Play N games with current LLM policy                 │
│     ├── Each step: LLM generates response (vLLM for speed)   │
│     └── Collect (prompt, response, reward) tuples            │
│                                                              │
│  2. REWARD COMPUTATION                                       │
│     ├── Environment reward (collision, goal, shaping)        │
│     ├── Format reward (JSON structure)                       │
│     └── Terminal propagation (win/loss to all steps)         │
│                                                              │
│  3. GRPO UPDATE                                              │
│     ├── num_generations=4: sample 4 responses per prompt     │
│     ├── Compute relative advantage: r_i - mean(r)            │
│     ├── Policy gradient loss with KL penalty vs base model   │
│     └── LoRA adapter weights updated                         │
│                                                              │
│  4. ONLINE RL (closed loop)                                  │
│     └── Repeat: play with updated policy → collect → update  │
└──────────────────────────────────────────────────────────────┘
```
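The group-relative advantage at the heart of step 3 can be sketched in a few lines. This is a simplified illustration, not the trainer's actual implementation; note that some GRPO implementations also normalize each group by its standard deviation.

```python
def group_relative_advantages(rewards):
    # GRPO baseline: the mean reward over the N responses sampled
    # for the same prompt replaces a learned critic network.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# With num_generations=4, only above-average responses in the group
# receive positive advantage:
advantages = group_relative_advantages([9.7, -10.3, 0.4, 2.2])
```

Because the baseline is the group mean, the advantages always sum to zero within a group: the policy is pushed toward whichever of its own samples did best, with no absolute reward scale required.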
### Why GRPO (not PPO)?

GRPO (Group Relative Policy Optimization) is used because:

- No separate value/critic network is needed, which keeps training simpler
- It works well with LLMs generating text sequences
- The "group" of N responses per prompt provides a natural baseline
- Reward is relative: responses better than the group average receive positive advantage

### LoRA Fine-tuning

We use LoRA (Low-Rank Adaptation) so we only train ~1% of parameters:

```
Base model weights: FROZEN  (Qwen3-4B or gpt-oss-20B)
LoRA matrices:      TRAINED (q_proj, k_proj, v_proj, o_proj, gate, up, down)
```

This means the model retains general language ability while learning the driving task.

---

## 8. What the Model Learns Over Training

| Early training | Late training |
|----------------|---------------|
| Random actions | Strategic lane changes |
| No negotiation | Negotiates when blocker is close |
| Invalid JSON | Consistent structured output |
| Collides frequently | Avoids collisions |
| Doesn't use sensors | References lidar in reasoning |

---

## 9. W&B Metrics to Track

| Metric | Meaning |
|--------|---------|
| `win_rate` | % episodes reaching goal |
| `reward/mean` | Average reward per step |
| `kl_divergence` | How far policy has drifted from base |
| `format_reward` | % responses with valid JSON |
| `policy/entropy` | Exploration (high) vs exploitation (low) |
| `negotiation_rate` | % steps with negotiation attempt |

---
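The terminal propagation described in the reward structure can be sketched as a one-liner. A minimal illustration with a hypothetical helper name, not the repo's actual code:

```python
def propagate_terminal(step_rewards, won):
    # After an episode ends, add the win bonus (+1.0) or loss
    # penalty (-1.0) to every step, so the whole trajectory
    # carries the outcome signal during the GRPO update.
    bonus = 1.0 if won else -1.0
    return [r + bonus for r in step_rewards]
```

For example, a winning episode with per-step rewards `[0.2, -0.05, 10.0]` becomes `[1.2, 0.95, 11.0]` before the rollouts are handed to the trainer.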
## 10. File Structure

```
final_project/
├── env/
│   └── negotiation_env.py       ← Environment logic, sensors, reward
├── agents/
│   ├── negotiation_agent.py     ← LLM agent, prompt, tool calls
│   └── memory.py                ← Episode memory for in-context use
├── server/
│   ├── server.py                ← FastAPI OpenEnv server
│   └── requirements.txt         ← ✅ Required for HF Spaces Docker build
├── training/
│   └── train_grpo_colab.ipynb   ← Full GRPO training notebook (H100)
├── ui/
│   └── app.py                   ← Gradio simulator UI
├── docs/
│   └── DOCUMENTATION.md         ← This file
├── Dockerfile                   ← HF Spaces deployment
└── README.md
```

---

## 11. Quick Start

```bash
# Install
pip install -r server/requirements.txt

# Run environment server
uvicorn server.server:app --reload --port 7860

# Run UI
python ui/app.py

# Test environment
python -c "
from env.negotiation_env import NegotiationDrivingEnv
env = NegotiationDrivingEnv()
obs, _ = env.reset()
print(env.render())
print(env.lidar_scan())
r = env.negotiate('blocker', 'Please yield, I need to pass safely')
print('Blocker says:', r)
obs, reward, done, _, info = env.step(2)  # lane_left
print('Reward:', reward, '| Info:', info)
"
```

---

## 12. Deployment to HF Spaces

```bash
# Login
huggingface-cli login

# Push (from project root)
# The Dockerfile copies server/requirements.txt — this must exist!
git init && git add . && git commit -m "initial"
huggingface-cli repo create autonomous-driving-env --type space --space_sdk docker
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-driving-env
git push hf main
```