# 📚 Agent & RL Training Documentation
## Autonomous Driving Multi-Agent OpenEnv
---
## 1. What Are the Agents?
This project has **three vehicles** in the environment, each with a different policy:
| Agent | Symbol | Type | Policy | Learns? |
|-------|--------|------|--------|---------|
| **Ego Vehicle** | 🚗 E | LLM-controlled | GRPO fine-tuned | ✅ Yes |
| **Blocker Vehicle** | 🚧 B | Rule-based | Tries to match ego's lane | ❌ No |
| **Traffic Vehicle** | 🚕 T | Stochastic | Random lane drift | ❌ No |
---
## 2. How the Ego Agent Thinks
Every step, the LLM agent receives:
```
SYSTEM PROMPT (instructions + action space)
+
USER PROMPT:
├── Current road render (ASCII grid)
├── Lidar sensor readings
├── Collision prediction
├── Recent negotiation log
└── Memory (last 3 steps)
```
And must output structured JSON:
```json
{
"thinking": "Blocker is 3 steps ahead in center lane. I should negotiate first, then change to the left lane.",
"negotiate": "blocker|Please yield, I need to pass safely",
"action": 2
}
```
The `thinking` field is the model's **chain-of-thought**; it is rewarded for being present and meaningful, which encourages the LLM to reason before acting.
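The format rewards described here (see Section 6 for the exact values) can be sketched as a small validator. This is an illustrative helper, not the project's actual parsing code; the field names come from the example above.

```python
import json

# Hypothetical validator mirroring the structured-output rules described
# in this document; reward values are taken from the table in Section 6.
def score_response_format(raw: str) -> float:
    """Return the format-related reward for one raw LLM response."""
    reward = 0.0
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return reward                      # malformed JSON earns nothing
    reward += 0.2                          # valid JSON format
    if len(data.get("thinking", "")) > 15:
        reward += 0.2                      # meaningful chain-of-thought
    if "negotiate" in data:
        reward += 0.1                      # tool awareness
    if data.get("action") not in (0, 1, 2, 3):
        reward -= 0.1                      # invalid action id
    return reward

example = ('{"thinking": "Blocker is 3 steps ahead; negotiate then move left.", '
           '"negotiate": "blocker|Please yield", "action": 2}')
print(round(score_response_format(example), 2))  # 0.5
```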
---
## 3. Action Space
| ID | Name | Effect |
|----|------|--------|
| 0 | `accelerate` | Ego moves +2 positions forward |
| 1 | `brake` | Ego moves +1 position (slower, safer) |
| 2 | `lane_left` | Ego shifts one lane left |
| 3 | `lane_right` | Ego shifts one lane right |
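The table above can be encoded as a simple lookup. This is a sketch under the assumption that lane changes do not advance the ego and that off-road lane changes count as the invalid "wall" move; the real environment may encode this differently.

```python
# Hypothetical mapping of action id -> (name, forward delta, lane delta),
# matching the table above.
ACTIONS = {
    0: ("accelerate", +2,  0),
    1: ("brake",      +1,  0),
    2: ("lane_left",   0, -1),
    3: ("lane_right",  0, +1),
}

def apply_action(position: int, lane: int, action: int, num_lanes: int = 3):
    """Apply an action, clamping lane changes at the road edges."""
    name, dpos, dlane = ACTIONS[action]
    new_lane = min(max(lane + dlane, 0), num_lanes - 1)
    invalid = (new_lane == lane and dlane != 0)  # pushed into a wall
    return position + dpos, new_lane, invalid

print(apply_action(4, 0, 2))  # (4, 0, True): lane_left from the leftmost lane
```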
---
## 4. Sensor Tools
The agent can call these tools to observe the world:
### `lidar_scan()` β†’ dict
```json
{
"blocker_distance": 3,
"blocker_lane": 1,
"traffic_distance": 6,
"traffic_lane": 2,
"ego_lane": 1,
"ego_position": 4,
"goal_distance": 15
}
```
### `predict_collision()` β†’ dict
```json
{
"blocker_threat": true,
"traffic_threat": false,
"immediate_collision": false
}
```
---
## 5. Negotiation System
The ego agent can send **natural language messages** to other vehicles:
```python
# In the environment
response = env.negotiate("blocker", "Please yield, I need to pass safely")
# → "Yielding lane — proceed safely."
```
**Blocker yielding logic:**
- Blocker yields if: ego is ≤4 steps away **AND** the message contains polite words
  (`request`, `please`, `yield`, `allow`, `safe`)
- If the blocker yields: it moves out of ego's lane → ego gets a +0.3 reward bonus
**Why this matters for RL:** The LLM must learn *when* to negotiate vs. when to just act. Negotiating costs a step but can unlock reward. This creates a multi-step reasoning challenge.
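The yielding rule above can be sketched in a few lines. The real check lives in `env/negotiation_env.py` and may differ in detail; this is the rule as described, not the actual implementation.

```python
# Sketch of the blocker's yielding rule as documented above.
POLITE_WORDS = ("request", "please", "yield", "allow", "safe")

def blocker_yields(ego_distance: int, message: str) -> bool:
    """Blocker yields iff the ego is close AND the message is polite."""
    close = ego_distance <= 4
    polite = any(word in message.lower() for word in POLITE_WORDS)
    return close and polite

print(blocker_yields(3, "Please yield, I need to pass safely"))  # True
print(blocker_yields(6, "Please yield"))                         # False: too far away
```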
---
## 6. Reward Structure
| Event | Reward | Why |
|-------|--------|-----|
| Reach goal (position 19) | **+10.0** | Primary objective |
| Collision | **−10.0** | Safety constraint |
| Successful lane change past blocker | **+0.5** | Progress reward |
| Negotiation causes blocker to yield | **+0.3** | Tool use reward |
| Per step (time penalty) | **−0.05** | Encourages efficiency |
| Invalid move (wall) | **−0.2** | Constraint violation |
| Valid JSON format | **+0.2** | Structural reward |
| Has `thinking` field (>15 chars) | **+0.2** | Reasoning reward |
| Has `negotiate` field | **+0.1** | Tool awareness |
| Invalid action int | **−0.1** | Format penalty |
**Terminal propagation:** After each episode, a win bonus (+1.0) or loss penalty (−1.0) is added to **all steps** of that episode. This gives the policy a clear signal about whether its overall strategy was good.
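Terminal propagation amounts to one pass over the episode's per-step rewards. A minimal sketch, assuming the bonus values stated above:

```python
# Add the episode's win/loss bonus to every step reward, as described above.
def propagate_terminal(step_rewards: list[float], won: bool) -> list[float]:
    bonus = 1.0 if won else -1.0
    return [r + bonus for r in step_rewards]

# A short winning episode: a neutral step, a lane change, then the goal.
print(propagate_terminal([0.0, 0.5, 10.0], won=True))  # [1.0, 1.5, 11.0]
```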
---
## 7. How RL Training Works (GRPO)
```
GRPO Training Loop

1. ROLLOUT COLLECTION
   ├── Play N games with the current LLM policy
   ├── Each step: the LLM generates a response (vLLM for fast inference)
   └── Collect (prompt, response, reward) tuples

2. REWARD COMPUTATION
   ├── Environment reward (collision, goal, shaping)
   ├── Format reward (JSON structure)
   └── Terminal propagation (win/loss to all steps)

3. GRPO UPDATE
   ├── num_generations=4: sample 4 responses per prompt
   ├── Compute relative advantage: r_i - mean(r)
   ├── Policy-gradient loss with KL penalty vs. the base model
   └── LoRA adapter weights updated

4. ONLINE RL (closed loop)
   └── Repeat: play with updated policy → collect → update
```
### Why GRPO (not PPO)?
GRPO (Group Relative Policy Optimization) is used because:
- No separate value/critic network needed β€” simpler
- Works well with LLMs generating text sequences
- The "group" of N responses per prompt provides a natural baseline
- Reward is relative: responses better than average get positive advantage
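The group baseline is just the mean over the sampled responses. A sketch in the simple mean-baseline form (GRPO implementations such as TRL's typically also normalize by the group's standard deviation):

```python
# Group-relative advantage: score each of the N sampled responses for one
# prompt against the group mean.
def group_advantages(rewards: list[float]) -> list[float]:
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Four sampled responses for one prompt (num_generations=4); mean = 1.0.
print(group_advantages([10.0, -10.0, 0.5, 3.5]))  # [9.0, -11.0, -0.5, 2.5]
```

Responses better than the group average get a positive advantage and are reinforced; the rest are pushed down, with no critic network involved.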
### LoRA Fine-tuning
We use LoRA (Low-Rank Adaptation) so we only train ~1% of parameters:
```
Base model weights:  FROZEN  (Qwen3-4B or gpt-oss-20B)
LoRA matrices:       TRAINED (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
```
This means the model retains general language ability while learning the driving task.
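The "~1%" figure follows from simple arithmetic: each targeted module gains two low-rank matrices of shape `d_out × r` and `r × d_in`. The dimensions below are illustrative round numbers, not the real Qwen3-4B config (which uses grouped-query attention with smaller k/v projections).

```python
# Back-of-the-envelope check of the "~1% trainable" claim, using
# ILLUSTRATIVE transformer dimensions (not the real Qwen3-4B config).
hidden, ffn, layers, r = 2560, 9728, 36, 16

def lora_params(d_in: int, d_out: int) -> int:
    # Two adapter matrices per module: (d_out x r) + (r x d_in).
    return r * (d_in + d_out)

per_layer = (
    4 * lora_params(hidden, hidden)   # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, ffn)    # gate_proj, up_proj
    + lora_params(ffn, hidden)        # down_proj
)
trainable = layers * per_layer
print(f"trainable LoRA params: {trainable:,}")      # ~33M
print(f"fraction of a 4B model: {trainable / 4e9:.2%}")
```

At rank 16 this lands under 1% of a 4B-parameter base model, which is why the general language ability of the frozen weights is preserved.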
---
## 8. What the Model Learns Over Training
| Early training | Late training |
|----------------|---------------|
| Random actions | Strategic lane changes |
| No negotiation | Negotiates when blocker is close |
| Invalid JSON | Consistent structured output |
| Collides frequently | Avoids collisions |
| Doesn't use sensors | References lidar in reasoning |
---
## 9. W&B Metrics to Track
| Metric | Meaning |
|--------|---------|
| `win_rate` | % of episodes that reach the goal |
| `reward/mean` | Average reward per step |
| `kl_divergence` | How far the policy has drifted from the base model |
| `format_reward` | % of responses with valid JSON |
| `policy/entropy` | Exploration (high) vs. exploitation (low) |
| `negotiation_rate` | % of steps with a negotiation attempt |
---
## 10. File Structure
```
final_project/
├── env/
│   └── negotiation_env.py       ← Environment logic, sensors, reward
├── agents/
│   ├── negotiation_agent.py     ← LLM agent, prompt, tool calls
│   └── memory.py                ← Episode memory for in-context use
├── server/
│   ├── server.py                ← FastAPI OpenEnv server
│   └── requirements.txt         ← ✅ Required for HF Spaces Docker build
├── training/
│   └── train_grpo_colab.ipynb   ← Full GRPO training notebook (H100)
├── ui/
│   └── app.py                   ← Gradio simulator UI
├── docs/
│   └── DOCUMENTATION.md         ← This file
├── Dockerfile                   ← HF Spaces deployment
└── README.md
```
---
## 11. Quick Start
```bash
# Install
pip install -r server/requirements.txt
# Run environment server
uvicorn server.server:app --reload --port 7860
# Run UI
python ui/app.py
# Test environment
python -c "
from env.negotiation_env import NegotiationDrivingEnv
env = NegotiationDrivingEnv()
obs, _ = env.reset()
print(env.render())
print(env.lidar_scan())
r = env.negotiate('blocker', 'Please yield, I need to pass safely')
print('Blocker says:', r)
obs, reward, done, _, info = env.step(2) # lane_left
print('Reward:', reward, '| Info:', info)
"
```
---
## 12. Deployment to HF Spaces
```bash
# Login
huggingface-cli login
# Push (from project root)
# The Dockerfile copies server/requirements.txt β€” this must exist!
git init && git add . && git commit -m "initial"
huggingface-cli repo create autonomous-driving-env --type space --space_sdk docker
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-driving-env
git push hf main
```