# 📚 Agent & RL Training Documentation
## Autonomous Driving Multi-Agent OpenEnv
---
## 1. What Are the Agents?
This project has **three vehicles** in the environment, each with a different policy:
| Agent | Symbol | Type | Policy | Learns? |
|-------|--------|------|--------|---------|
| **Ego Vehicle** | 🚗 E | LLM-controlled | GRPO fine-tuned | ✅ Yes |
| **Blocker Vehicle** | 🚧 B | Rule-based | Tries to match ego's lane | ❌ No |
| **Traffic Vehicle** | 🚕 T | Stochastic | Random lane drift | ❌ No |
---
## 2. How the Ego Agent Thinks
Every step, the LLM agent receives:
```
SYSTEM PROMPT (instructions + action space)
+
USER PROMPT:
├── Current road render (ASCII grid)
├── Lidar sensor readings
├── Collision prediction
├── Recent negotiation log
└── Memory (last 3 steps)
```
And must output structured JSON:
```json
{
"thinking": "Blocker is 3 steps ahead in center lane. I should negotiate first, then change to the left lane.",
"negotiate": "blocker|Please yield, I need to pass safely",
"action": 2
}
```
The `thinking` field is the model's **chain-of-thought**; it is rewarded for being present and meaningful, which encourages the LLM to reason before acting.
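The format rewards described here (see Section 6 for the exact values) can be sketched as a small validator. This is an illustrative helper, not the project's actual parsing code; the field names come from the example above.

```python
import json

# Hypothetical validator mirroring the structured-output rules described
# in this document; reward values are taken from the table in Section 6.
def score_response_format(raw: str) -> float:
    """Return the format-related reward for one raw LLM response."""
    reward = 0.0
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return reward                      # malformed JSON earns nothing
    reward += 0.2                          # valid JSON format
    if len(data.get("thinking", "")) > 15:
        reward += 0.2                      # meaningful chain-of-thought
    if "negotiate" in data:
        reward += 0.1                      # tool awareness
    if data.get("action") not in (0, 1, 2, 3):
        reward -= 0.1                      # invalid action id
    return reward

example = ('{"thinking": "Blocker is 3 steps ahead; negotiate then move left.", '
           '"negotiate": "blocker|Please yield", "action": 2}')
print(round(score_response_format(example), 2))  # 0.5
```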
---
## 3. Action Space
| ID | Name | Effect |
|----|------|--------|
| 0 | `accelerate` | Ego moves +2 positions forward |
| 1 | `brake` | Ego moves +1 position (slower, safer) |
| 2 | `lane_left` | Ego shifts one lane left |
| 3 | `lane_right` | Ego shifts one lane right |
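The table above can be encoded as a simple lookup. This is a sketch under the assumption that lane changes do not advance the ego and that off-road lane changes count as the invalid "wall" move; the real environment may encode this differently.

```python
# Hypothetical mapping of action id -> (name, forward delta, lane delta),
# matching the table above.
ACTIONS = {
    0: ("accelerate", +2,  0),
    1: ("brake",      +1,  0),
    2: ("lane_left",   0, -1),
    3: ("lane_right",  0, +1),
}

def apply_action(position: int, lane: int, action: int, num_lanes: int = 3):
    """Apply an action, clamping lane changes at the road edges."""
    name, dpos, dlane = ACTIONS[action]
    new_lane = min(max(lane + dlane, 0), num_lanes - 1)
    invalid = (new_lane == lane and dlane != 0)  # pushed into a wall
    return position + dpos, new_lane, invalid

print(apply_action(4, 0, 2))  # (4, 0, True): lane_left from the leftmost lane
```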
---
## 4. Sensor Tools
The agent can call these tools to observe the world:
### `lidar_scan()` β†’ dict
```json
{
"blocker_distance": 3,
"blocker_lane": 1,
"traffic_distance": 6,
"traffic_lane": 2,
"ego_lane": 1,
"ego_position": 4,
"goal_distance": 15
}
```
### `predict_collision()` β†’ dict
```json
{
"blocker_threat": true,
"traffic_threat": false,
"immediate_collision": false
}
```
---
## 5. Negotiation System
The ego agent can send **natural language messages** to other vehicles:
```python
# In the environment
response = env.negotiate("blocker", "Please yield, I need to pass safely")
# → "Yielding lane — proceed safely."
```
**Blocker yielding logic:**
- Blocker yields if: ego is ≤4 steps away **AND** the message contains polite words
  (`request`, `please`, `yield`, `allow`, `safe`)
- If the blocker yields: it moves out of ego's lane → ego gets a +0.3 reward bonus
**Why this matters for RL:** The LLM must learn *when* to negotiate vs. when to just act. Negotiating costs a step but can unlock reward. This creates a multi-step reasoning challenge.
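The yielding rule above can be sketched in a few lines. The real check lives in `env/negotiation_env.py` and may differ in detail; this is the rule as described, not the actual implementation.

```python
# Sketch of the blocker's yielding rule as documented above.
POLITE_WORDS = ("request", "please", "yield", "allow", "safe")

def blocker_yields(ego_distance: int, message: str) -> bool:
    """Blocker yields iff the ego is close AND the message is polite."""
    close = ego_distance <= 4
    polite = any(word in message.lower() for word in POLITE_WORDS)
    return close and polite

print(blocker_yields(3, "Please yield, I need to pass safely"))  # True
print(blocker_yields(6, "Please yield"))                         # False: too far away
```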
---
## 6. Reward Structure
| Event | Reward | Why |
|-------|--------|-----|
| Reach goal (position 19) | **+10.0** | Primary objective |
| Collision | **−10.0** | Safety constraint |
| Successful lane change past blocker | **+0.5** | Progress reward |
| Negotiation causes blocker to yield | **+0.3** | Tool use reward |
| Per step (time penalty) | **−0.05** | Encourages efficiency |
| Invalid move (wall) | **−0.2** | Constraint violation |
| Valid JSON format | **+0.2** | Structural reward |
| Has `thinking` field (>15 chars) | **+0.2** | Reasoning reward |
| Has `negotiate` field | **+0.1** | Tool awareness |
| Invalid action int | **−0.1** | Format penalty |
**Terminal propagation:** After each episode, a win bonus (+1.0) or loss penalty (−1.0) is added to **all steps** of that episode. This gives the policy a clear signal about whether its overall strategy was good.
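Terminal propagation amounts to one pass over the episode's per-step rewards. A minimal sketch, assuming the bonus values stated above:

```python
# Add the episode's win/loss bonus to every step reward, as described above.
def propagate_terminal(step_rewards: list[float], won: bool) -> list[float]:
    bonus = 1.0 if won else -1.0
    return [r + bonus for r in step_rewards]

# A short winning episode: a neutral step, a lane change, then the goal.
print(propagate_terminal([0.0, 0.5, 10.0], won=True))  # [1.0, 1.5, 11.0]
```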
---
## 7. How RL Training Works (GRPO)
```
GRPO Training Loop

1. ROLLOUT COLLECTION
   ├── Play N games with the current LLM policy
   ├── Each step: the LLM generates a response (vLLM for fast inference)
   └── Collect (prompt, response, reward) tuples

2. REWARD COMPUTATION
   ├── Environment reward (collision, goal, shaping)
   ├── Format reward (JSON structure)
   └── Terminal propagation (win/loss to all steps)

3. GRPO UPDATE
   ├── num_generations=4: sample 4 responses per prompt
   ├── Compute relative advantage: r_i - mean(r)
   ├── Policy-gradient loss with KL penalty vs. the base model
   └── LoRA adapter weights updated

4. ONLINE RL (closed loop)
   └── Repeat: play with updated policy → collect → update
```
### Why GRPO (not PPO)?
GRPO (Group Relative Policy Optimization) is used because:
- No separate value/critic network needed β€” simpler
- Works well with LLMs generating text sequences
- The "group" of N responses per prompt provides a natural baseline
- Reward is relative: responses better than average get positive advantage
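The group baseline is just the mean over the sampled responses. A sketch in the simple mean-baseline form (GRPO implementations such as TRL's typically also normalize by the group's standard deviation):

```python
# Group-relative advantage: score each of the N sampled responses for one
# prompt against the group mean.
def group_advantages(rewards: list[float]) -> list[float]:
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Four sampled responses for one prompt (num_generations=4); mean = 1.0.
print(group_advantages([10.0, -10.0, 0.5, 3.5]))  # [9.0, -11.0, -0.5, 2.5]
```

Responses better than the group average get a positive advantage and are reinforced; the rest are pushed down, with no critic network involved.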
### LoRA Fine-tuning
We use LoRA (Low-Rank Adaptation) so we only train ~1% of parameters:
```
Base model weights:  FROZEN  (Qwen3-4B or gpt-oss-20B)
LoRA matrices:       TRAINED (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj)
```
This means the model retains general language ability while learning the driving task.
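The "~1%" figure follows from simple arithmetic: each targeted module gains two low-rank matrices of shape `d_out × r` and `r × d_in`. The dimensions below are illustrative round numbers, not the real Qwen3-4B config (which uses grouped-query attention with smaller k/v projections).

```python
# Back-of-the-envelope check of the "~1% trainable" claim, using
# ILLUSTRATIVE transformer dimensions (not the real Qwen3-4B config).
hidden, ffn, layers, r = 2560, 9728, 36, 16

def lora_params(d_in: int, d_out: int) -> int:
    # Two adapter matrices per module: (d_out x r) + (r x d_in).
    return r * (d_in + d_out)

per_layer = (
    4 * lora_params(hidden, hidden)   # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, ffn)    # gate_proj, up_proj
    + lora_params(ffn, hidden)        # down_proj
)
trainable = layers * per_layer
print(f"trainable LoRA params: {trainable:,}")      # ~33M
print(f"fraction of a 4B model: {trainable / 4e9:.2%}")
```

At rank 16 this lands under 1% of a 4B-parameter base model, which is why the general language ability of the frozen weights is preserved.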
---
## 8. What the Model Learns Over Training
| Early training | Late training |
|----------------|---------------|
| Random actions | Strategic lane changes |
| No negotiation | Negotiates when blocker is close |
| Invalid JSON | Consistent structured output |
| Collides frequently | Avoids collisions |
| Doesn't use sensors | References lidar in reasoning |
---
## 9. W&B Metrics to Track
| Metric | Meaning |
|--------|---------|
| `win_rate` | % of episodes that reach the goal |
| `reward/mean` | Average reward per step |
| `kl_divergence` | How far the policy has drifted from the base model |
| `format_reward` | % of responses with valid JSON |
| `policy/entropy` | Exploration (high) vs. exploitation (low) |
| `negotiation_rate` | % of steps with a negotiation attempt |
---
## 10. File Structure
```
final_project/
├── env/
│   └── negotiation_env.py       ← Environment logic, sensors, reward
├── agents/
│   ├── negotiation_agent.py     ← LLM agent, prompt, tool calls
│   └── memory.py                ← Episode memory for in-context use
├── server/
│   ├── server.py                ← FastAPI OpenEnv server
│   └── requirements.txt         ← ✅ Required for HF Spaces Docker build
├── training/
│   └── train_grpo_colab.ipynb   ← Full GRPO training notebook (H100)
├── ui/
│   └── app.py                   ← Gradio simulator UI
├── docs/
│   └── DOCUMENTATION.md         ← This file
├── Dockerfile                   ← HF Spaces deployment
└── README.md
```
---
## 11. Quick Start
```bash
# Install
pip install -r server/requirements.txt
# Run environment server
uvicorn server.server:app --reload --port 7860
# Run UI
python ui/app.py
# Test environment
python -c "
from env.negotiation_env import NegotiationDrivingEnv
env = NegotiationDrivingEnv()
obs, _ = env.reset()
print(env.render())
print(env.lidar_scan())
r = env.negotiate('blocker', 'Please yield, I need to pass safely')
print('Blocker says:', r)
obs, reward, done, _, info = env.step(2) # lane_left
print('Reward:', reward, '| Info:', info)
"
```
---
## 12. Deployment to HF Spaces
```bash
# Login
huggingface-cli login
# Push (from project root)
# The Dockerfile copies server/requirements.txt β€” this must exist!
git init && git add . && git commit -m "initial"
huggingface-cli repo create autonomous-driving-env --type space --space_sdk docker
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-driving-env
git push hf main
```