# 📚 Agent & RL Training Documentation

## Autonomous Driving Multi-Agent OpenEnv

---

## 1. What Are the Agents?

This project has **three vehicles** in the environment, each with a different policy:

| Agent | Symbol | Type | Policy | Learns? |
|-------|--------|------|--------|---------|
| **Ego Vehicle** | 🚗 E | LLM-controlled | GRPO fine-tuned | ✅ Yes |
| **Blocker Vehicle** | 🚧 B | Rule-based | Tries to match ego's lane | ❌ No |
| **Traffic Vehicle** | 🚕 T | Stochastic | Random lane drift | ❌ No |

---

## 2. How the Ego Agent Thinks

Every step, the LLM agent receives:

```
SYSTEM PROMPT (instructions + action space)
+
USER PROMPT:
├── Current road render (ASCII grid)
├── Lidar sensor readings
├── Collision prediction
├── Recent negotiation log
└── Memory (last 3 steps)
```

And must output structured JSON:

```json
{
  "thinking": "Blocker is 3 steps ahead in center lane. I should negotiate first, then change to the left lane.",
  "negotiate": "blocker|Please yield, I need to pass safely",
  "action": 2
}
```

The `thinking` field is the **chain-of-thought** — rewarded for being present and meaningful. This encourages the LLM to reason before acting.

---

## 3. Action Space

| ID | Name | Effect |
|----|------|--------|
| 0 | `accelerate` | Ego moves +2 positions forward |
| 1 | `brake` | Ego moves +1 position (slower, safer) |
| 2 | `lane_left` | Ego shifts one lane left |
| 3 | `lane_right` | Ego shifts one lane right |

---

## 4. Sensor Tools

The agent can call these tools to observe the world:

### `lidar_scan()` → dict

```json
{
  "blocker_distance": 3,
  "blocker_lane": 1,
  "traffic_distance": 6,
  "traffic_lane": 2,
  "ego_lane": 1,
  "ego_position": 4,
  "goal_distance": 15
}
```

### `predict_collision()` → dict

```json
{
  "blocker_threat": true,
  "traffic_threat": false,
  "immediate_collision": false
}
```

---
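The structured-output contract above can be sketched as a small validator. This is a minimal illustration, not code from the repo: the helper name `parse_agent_response` is hypothetical, and the bonus values are taken from this document's reward table (valid JSON +0.2, meaningful `thinking` +0.2, `negotiate` present +0.1, invalid action −0.1).

```python
import json

# Hypothetical helper (not part of the repo): validates the agent's JSON
# reply and accumulates the format-related rewards from the reward table.
def parse_agent_response(raw):
    """Return (parsed_response_or_None, format_reward)."""
    bonus = 0.0
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None, bonus              # no structural reward for invalid JSON
    bonus += 0.2                        # valid JSON format
    if len(parsed.get("thinking", "")) > 15:
        bonus += 0.2                    # chain-of-thought present and non-trivial
    if "negotiate" in parsed:
        bonus += 0.1                    # tool awareness
    action = parsed.get("action")
    if not isinstance(action, int) or action not in (0, 1, 2, 3):
        bonus -= 0.1                    # invalid action integer
    return parsed, bonus
```

A response like the JSON example above would parse to `action == 2` with a format reward of +0.5, while unparseable text earns nothing.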
## 5. Negotiation System

The ego agent can send **natural language messages** to other vehicles:

```python
# In the environment
response = env.negotiate("blocker", "Please yield, I need to pass safely")
# → "Yielding lane — proceed safely."
```

**Blocker yielding logic:**

- The blocker yields if the ego is ≤4 steps away **AND** the message contains a polite word (`request`, `please`, `yield`, `allow`, `safe`)
- If the blocker yields, it moves out of the ego's lane and the ego earns a +0.3 reward bonus

**Why this matters for RL:** The LLM must learn *when* to negotiate versus when to simply act. Negotiating costs a step but can unlock reward, which creates a multi-step reasoning challenge.

---

## 6. Reward Structure

| Event | Reward | Why |
|-------|--------|-----|
| Reach goal (position 19) | **+10.0** | Primary objective |
| Collision | **−10.0** | Safety constraint |
| Successful lane change past blocker | **+0.5** | Progress reward |
| Negotiation causes blocker to yield | **+0.3** | Tool-use reward |
| Per step (time penalty) | **−0.05** | Encourages efficiency |
| Invalid move (wall) | **−0.2** | Constraint violation |
| Valid JSON format | **+0.2** | Structural reward |
| Has `thinking` field (>15 chars) | **+0.2** | Reasoning reward |
| Has `negotiate` field | **+0.1** | Tool awareness |
| Invalid action integer | **−0.1** | Format penalty |

**Terminal propagation:** After each episode, a win bonus (+1.0) or loss penalty (−1.0) is added to **all steps** of that episode. This gives the policy a clear signal about whether its overall strategy was good.

---

## 7. How RL Training Works (GRPO)

```
┌──────────────────────────────────────────────────────────────┐
│                      GRPO Training Loop                      │
│                                                              │
│  1. ROLLOUT COLLECTION                                       │
│     ├── Play N games with current LLM policy                 │
│     ├── Each step: LLM generates response (vLLM for speed)   │
│     └── Collect (prompt, response, reward) tuples            │
│                                                              │
│  2. REWARD COMPUTATION                                       │
│     ├── Environment reward (collision, goal, shaping)        │
│     ├── Format reward (JSON structure)                       │
│     └── Terminal propagation (win/loss to all steps)         │
│                                                              │
│  3. GRPO UPDATE                                              │
│     ├── num_generations=4: sample 4 responses per prompt     │
│     ├── Compute relative advantage: r_i - mean(r)            │
│     ├── Policy gradient loss with KL penalty vs base model   │
│     └── LoRA adapter weights updated                         │
│                                                              │
│  4. ONLINE RL (closed loop)                                  │
│     └── Repeat: play with updated policy → collect → update  │
└──────────────────────────────────────────────────────────────┘
```
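The group-relative advantage at the heart of step 3 can be sketched in a few lines. This is a simplified illustration, not the trainer's actual implementation; note that some GRPO implementations also normalize each group by its standard deviation.

```python
def group_relative_advantages(rewards):
    # GRPO baseline: the mean reward over the N responses sampled
    # for the same prompt replaces a learned critic network.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# With num_generations=4, only above-average responses in the group
# receive positive advantage:
advantages = group_relative_advantages([9.7, -10.3, 0.4, 2.2])
```

Because the baseline is the group mean, the advantages always sum to zero within a group: the policy is pushed toward whichever of its own samples did best, with no absolute reward scale required.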
### Why GRPO (not PPO)?

GRPO (Group Relative Policy Optimization) is used because:

- No separate value/critic network is needed, which keeps training simpler
- It works well with LLMs generating text sequences
- The "group" of N responses per prompt provides a natural baseline
- Reward is relative: responses better than the group average receive positive advantage

### LoRA Fine-tuning

We use LoRA (Low-Rank Adaptation) so we only train ~1% of parameters:

```
Base model weights: FROZEN  (Qwen3-4B or gpt-oss-20B)
LoRA matrices:      TRAINED (q_proj, k_proj, v_proj, o_proj, gate, up, down)
```

This means the model retains general language ability while learning the driving task.

---

## 8. What the Model Learns Over Training

| Early training | Late training |
|----------------|---------------|
| Random actions | Strategic lane changes |
| No negotiation | Negotiates when blocker is close |
| Invalid JSON | Consistent structured output |
| Collides frequently | Avoids collisions |
| Doesn't use sensors | References lidar in reasoning |

---

## 9. W&B Metrics to Track

| Metric | Meaning |
|--------|---------|
| `win_rate` | % episodes reaching goal |
| `reward/mean` | Average reward per step |
| `kl_divergence` | How far policy has drifted from base |
| `format_reward` | % responses with valid JSON |
| `policy/entropy` | Exploration (high) vs exploitation (low) |
| `negotiation_rate` | % steps with negotiation attempt |

---
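The terminal propagation described in the reward structure can be sketched as a one-liner. A minimal illustration with a hypothetical helper name, not the repo's actual code:

```python
def propagate_terminal(step_rewards, won):
    # After an episode ends, add the win bonus (+1.0) or loss
    # penalty (-1.0) to every step, so the whole trajectory
    # carries the outcome signal during the GRPO update.
    bonus = 1.0 if won else -1.0
    return [r + bonus for r in step_rewards]
```

For example, a winning episode with per-step rewards `[0.2, -0.05, 10.0]` becomes `[1.2, 0.95, 11.0]` before the rollouts are handed to the trainer.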
## 10. File Structure

```
final_project/
├── env/
│   └── negotiation_env.py       ← Environment logic, sensors, reward
├── agents/
│   ├── negotiation_agent.py     ← LLM agent, prompt, tool calls
│   └── memory.py                ← Episode memory for in-context use
├── server/
│   ├── server.py                ← FastAPI OpenEnv server
│   └── requirements.txt         ← ✅ Required for HF Spaces Docker build
├── training/
│   └── train_grpo_colab.ipynb   ← Full GRPO training notebook (H100)
├── ui/
│   └── app.py                   ← Gradio simulator UI
├── docs/
│   └── DOCUMENTATION.md         ← This file
├── Dockerfile                   ← HF Spaces deployment
└── README.md
```

---

## 11. Quick Start

```bash
# Install
pip install -r server/requirements.txt

# Run environment server
uvicorn server.server:app --reload --port 7860

# Run UI
python ui/app.py

# Test environment
python -c "
from env.negotiation_env import NegotiationDrivingEnv
env = NegotiationDrivingEnv()
obs, _ = env.reset()
print(env.render())
print(env.lidar_scan())
r = env.negotiate('blocker', 'Please yield, I need to pass safely')
print('Blocker says:', r)
obs, reward, done, _, info = env.step(2)  # lane_left
print('Reward:', reward, '| Info:', info)
"
```

---

## 12. Deployment to HF Spaces

```bash
# Login
huggingface-cli login

# Push (from project root)
# The Dockerfile copies server/requirements.txt — this must exist!
git init && git add . && git commit -m "initial"
huggingface-cli repo create autonomous-driving-env --type space --space_sdk docker
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-driving-env
git push hf main
```