
# 📚 Agent & RL Training Documentation

**Autonomous Driving Multi-Agent OpenEnv**


## 1. What Are the Agents?

This project has three vehicles in the environment, each with a different policy:

| Agent | Symbol | Type | Policy | Learns? |
|---|---|---|---|---|
| Ego Vehicle 🚗 | E | LLM-controlled | GRPO fine-tuned | ✅ Yes |
| Blocker Vehicle 🚧 | B | Rule-based | Tries to match ego's lane | ❌ No |
| Traffic Vehicle 🚕 | T | Stochastic | Random lane drift | ❌ No |

## 2. How the Ego Agent Thinks

Every step, the LLM agent receives:

```
SYSTEM PROMPT (instructions + action space)
         +
USER PROMPT:
  ├── Current road render (ASCII grid)
  ├── Lidar sensor readings
  ├── Collision prediction
  ├── Recent negotiation log
  └── Memory (last 3 steps)
```

And must output structured JSON:

```json
{
  "thinking": "Blocker is 3 steps ahead in center lane. I should negotiate first, then change to the left lane.",
  "negotiate": "blocker|Please yield, I need to pass safely",
  "action": 2
}
```

The `thinking` field is the chain-of-thought: it is rewarded for being present and meaningful, which encourages the LLM to reason before acting.
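As a sketch of what consuming this output might look like on the environment side (pure Python; the function name and validation rules are illustrative, not the project's actual parser):

```python
import json
from typing import Optional

VALID_ACTIONS = {0, 1, 2, 3}

def parse_agent_output(text: str) -> Optional[dict]:
    """Parse the LLM's structured reply; return None if it is malformed.

    A malformed reply would forfeit the format reward described in section 6.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if "thinking" not in data or "action" not in data:
        return None
    if data["action"] not in VALID_ACTIONS:
        return None
    return data

reply = ('{"thinking": "Blocker is close; negotiate, then move left.", '
         '"negotiate": "blocker|Please yield", "action": 2}')
parsed = parse_agent_output(reply)  # parsed["action"] == 2
```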


## 3. Action Space

| ID | Name | Effect |
|---|---|---|
| 0 | `accelerate` | Ego moves +2 positions forward |
| 1 | `brake` | Ego moves +1 position (slower, safer) |
| 2 | `lane_left` | Ego shifts one lane left |
| 3 | `lane_right` | Ego shifts one lane right |
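The table above can be sketched as a state update (pure Python; the three-lane road width and the clamping at the road edges are assumptions, since the real environment penalizes wall moves instead):

```python
def apply_action(position: int, lane: int, action: int, num_lanes: int = 3):
    """Return the ego's (position, lane) after one action ID."""
    if action == 0:          # accelerate: +2 forward
        return position + 2, lane
    if action == 1:          # brake: +1 forward, slower but safer
        return position + 1, lane
    if action == 2:          # lane_left: shift one lane left (clamped here)
        return position, max(0, lane - 1)
    if action == 3:          # lane_right: shift one lane right (clamped here)
        return position, min(num_lanes - 1, lane + 1)
    raise ValueError(f"invalid action id: {action}")
```

In the actual environment a lane change into a wall would instead earn the −0.2 invalid-move penalty from section 6; the clamp here just keeps the sketch total.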

## 4. Sensor Tools

The agent can call these tools to observe the world:

`lidar_scan()` → dict

```json
{
  "blocker_distance": 3,
  "blocker_lane": 1,
  "traffic_distance": 6,
  "traffic_lane": 2,
  "ego_lane": 1,
  "ego_position": 4,
  "goal_distance": 15
}
```

`predict_collision()` → dict

```json
{
  "blocker_threat": true,
  "traffic_threat": false,
  "immediate_collision": false
}
```
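A sketch of how an agent-side policy might combine the two tool outputs before picking an action (pure Python; the function and the 4-step threshold are illustrative assumptions):

```python
def should_negotiate(lidar: dict, prediction: dict) -> bool:
    """Negotiate first when the blocker is a close, same-lane threat."""
    same_lane = lidar["blocker_lane"] == lidar["ego_lane"]
    close = lidar["blocker_distance"] <= 4
    return prediction["blocker_threat"] and same_lane and close

# The example readings from the tool outputs above:
lidar = {"blocker_distance": 3, "blocker_lane": 1, "traffic_distance": 6,
         "traffic_lane": 2, "ego_lane": 1, "ego_position": 4, "goal_distance": 15}
prediction = {"blocker_threat": True, "traffic_threat": False,
              "immediate_collision": False}
should_negotiate(lidar, prediction)  # → True for this reading
```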

## 5. Negotiation System

The ego agent can send natural language messages to other vehicles:

```python
# In the environment
response = env.negotiate("blocker", "Please yield, I need to pass safely")
# → "Yielding lane — proceed safely."
```

Blocker yielding logic:

- The blocker yields if the ego is ≤4 steps away AND the message contains polite words (`request`, `please`, `yield`, `allow`, `safe`)
- If the blocker yields, it moves out of the ego's lane and the ego earns a +0.3 reward bonus

Why this matters for RL: The LLM must learn when to negotiate vs. when to just act. Negotiating costs a step but can unlock reward. This creates a multi-step reasoning challenge.
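The yielding rule above can be sketched directly (pure Python; the distance threshold and polite-word list come from the bullets, the rest is an assumed implementation):

```python
POLITE_WORDS = ("request", "please", "yield", "allow", "safe")

def blocker_yields(ego_distance: int, message: str) -> bool:
    """Blocker yields only if the ego is close AND the message is polite."""
    return ego_distance <= 4 and any(w in message.lower() for w in POLITE_WORDS)
```

So `blocker_yields(3, "Please yield, I need to pass safely")` succeeds, while a rude message or a far-away ego gets no concession.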


## 6. Reward Structure

| Event | Reward | Why |
|---|---|---|
| Reach goal (position 19) | +10.0 | Primary objective |
| Collision | −10.0 | Safety constraint |
| Successful lane change past blocker | +0.5 | Progress reward |
| Negotiation causes blocker to yield | +0.3 | Tool use reward |
| Per step (time penalty) | −0.05 | Encourages efficiency |
| Invalid move (wall) | −0.2 | Constraint violation |
| Valid JSON format | +0.2 | Structural reward |
| Has `thinking` field (>15 chars) | +0.2 | Reasoning reward |
| Has `negotiate` field | +0.1 | Tool awareness |
| Invalid action int | −0.1 | Format penalty |

**Terminal propagation:** After each episode, a win bonus (+1.0) or loss penalty (−1.0) is added to every step of that episode. This gives the policy a clear signal about whether its overall strategy was good.
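The terminal-propagation step can be sketched in a couple of lines (pure Python; the ±1.0 values come from the paragraph above, the data layout is assumed):

```python
def propagate_terminal(step_rewards, reached_goal, bonus=1.0):
    """Add the episode's win bonus (or loss penalty) to every step reward."""
    terminal = bonus if reached_goal else -bonus
    return [r + terminal for r in step_rewards]

# A 2-step winning episode: both steps are credited with the +1.0 win bonus
propagate_terminal([0.5, 10.0], reached_goal=True)   # → [1.5, 11.0]
```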


## 7. How RL Training Works (GRPO)

```
┌─────────────────────────────────────────────────────────────┐
│                    GRPO Training Loop                       │
│                                                             │
│  1. ROLLOUT COLLECTION                                      │
│     ├── Play N games with current LLM policy                │
│     ├── Each step: LLM generates response (+ vLLM fast)     │
│     └── Collect (prompt, response, reward) tuples           │
│                                                             │
│  2. REWARD COMPUTATION                                      │
│     ├── Environment reward (collision, goal, shaping)       │
│     ├── Format reward (JSON structure)                      │
│     └── Terminal propagation (win/loss to all steps)        │
│                                                             │
│  3. GRPO UPDATE                                             │
│     ├── num_generations=4: sample 4 responses per prompt    │
│     ├── Compute relative advantage: r_i - mean(r)           │
│     ├── Policy gradient loss with KL penalty vs base model  │
│     └── LoRA adapter weights updated                        │
│                                                             │
│  4. ONLINE RL (closed loop)                                 │
│     └── Repeat: play with updated policy → collect → update │
└─────────────────────────────────────────────────────────────┘
```

**Why GRPO (not PPO)?**

GRPO (Group Relative Policy Optimization) is used because:

  • No separate value/critic network needed β€” simpler
  • Works well with LLMs generating text sequences
  • The "group" of N responses per prompt provides a natural baseline
  • Reward is relative: responses better than average get positive advantage

**LoRA Fine-tuning**

We use LoRA (Low-Rank Adaptation) so we only train ~1% of parameters:

```
Base model weights:  FROZEN  (Qwen3-4B or gpt-oss-20B)
LoRA matrices:       TRAINED  (q_proj, k_proj, v_proj, o_proj, gate, up, down)
```

This means the model retains general language ability while learning the driving task.
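A back-of-the-envelope sketch of the "~1% of parameters" claim (pure Python; the hidden size, layer count, and square-projection simplification below are illustrative assumptions, not the exact model shapes):

```python
def lora_param_fraction(d_model: int, n_layers: int, n_proj: int, rank: int) -> float:
    """Fraction of weights that are trainable when rank-r LoRA adapters
    are attached to n_proj (assumed d_model x d_model) projections per layer."""
    full = n_layers * n_proj * d_model * d_model   # frozen base weights
    lora = n_layers * n_proj * 2 * d_model * rank  # A (d x r) + B (r x d)
    return lora / full                             # simplifies to 2*rank/d_model

# e.g. a hypothetical 36-layer model with d_model=2560 and rank-16 adapters
lora_param_fraction(2560, 36, 7, 16)   # → 0.0125, on the order of 1%
```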


## 8. What the Model Learns Over Training

| Early training | Late training |
|---|---|
| Random actions | Strategic lane changes |
| No negotiation | Negotiates when blocker is close |
| Invalid JSON | Consistent structured output |
| Collides frequently | Avoids collisions |
| Doesn't use sensors | References lidar in reasoning |

## 9. W&B Metrics to Track

| Metric | Meaning |
|---|---|
| `win_rate` | % of episodes reaching the goal |
| `reward/mean` | Average reward per step |
| `kl_divergence` | How far the policy has drifted from the base model |
| `format_reward` | % of responses with valid JSON |
| `policy/entropy` | Exploration (high) vs. exploitation (low) |
| `negotiation_rate` | % of steps with a negotiation attempt |
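A sketch of how a few of these could be aggregated from collected rollouts before logging (pure Python; the per-episode record fields are hypothetical, and a `wandb.log(metrics)` call would then plot them over training):

```python
def rollout_metrics(episodes):
    """Aggregate per-episode rollout records into W&B-style metrics.

    Each episode record is assumed to be a dict like:
      {"won": bool, "rewards": [...], "valid_json_steps": int,
       "negotiation_steps": int, "num_steps": int}
    """
    total_steps = sum(ep["num_steps"] for ep in episodes)
    return {
        "win_rate": sum(ep["won"] for ep in episodes) / len(episodes),
        "reward/mean": sum(sum(ep["rewards"]) for ep in episodes) / total_steps,
        "format_reward": sum(ep["valid_json_steps"] for ep in episodes) / total_steps,
        "negotiation_rate": sum(ep["negotiation_steps"] for ep in episodes) / total_steps,
    }

episodes = [
    {"won": True,  "rewards": [1.0, 3.0], "valid_json_steps": 2,
     "negotiation_steps": 1, "num_steps": 2},
    {"won": False, "rewards": [0.0, 0.0], "valid_json_steps": 1,
     "negotiation_steps": 0, "num_steps": 2},
]
metrics = rollout_metrics(episodes)   # e.g. metrics["win_rate"] == 0.5
```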

## 10. File Structure

```
final_project/
├── env/
│   └── negotiation_env.py     ← Environment logic, sensors, reward
├── agents/
│   ├── negotiation_agent.py   ← LLM agent, prompt, tool calls
│   └── memory.py              ← Episode memory for in-context use
├── server/
│   ├── server.py              ← FastAPI OpenEnv server
│   └── requirements.txt       ← ✅ Required for HF Spaces Docker build
├── training/
│   └── train_grpo_colab.ipynb ← Full GRPO training notebook (H100)
├── ui/
│   └── app.py                 ← Gradio simulator UI
├── docs/
│   └── DOCUMENTATION.md       ← This file
├── Dockerfile                 ← HF Spaces deployment
└── README.md
```

## 11. Quick Start

```bash
# Install
pip install -r server/requirements.txt

# Run environment server
uvicorn server.server:app --reload --port 7860

# Run UI
python ui/app.py

# Test environment
python -c "
from env.negotiation_env import NegotiationDrivingEnv
env = NegotiationDrivingEnv()
obs, _ = env.reset()
print(env.render())
print(env.lidar_scan())
r = env.negotiate('blocker', 'Please yield, I need to pass safely')
print('Blocker says:', r)
obs, reward, done, _, info = env.step(2)  # lane_left
print('Reward:', reward, '| Info:', info)
"
```

## 12. Deployment to HF Spaces

```bash
# Login
huggingface-cli login

# Push (from project root)
# The Dockerfile copies server/requirements.txt; this must exist!
git init && git add . && git commit -m "initial"
huggingface-cli repo create autonomous-driving-env --type space --sdk docker
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-driving-env
git push hf main
```