Agent & RL Training Documentation
Autonomous Driving Multi-Agent OpenEnv
1. What Are the Agents?
This project has three vehicles in the environment, each with a different policy:
| Agent | Symbol | Type | Policy | Learns? |
|---|---|---|---|---|
| Ego Vehicle | E | LLM-controlled | GRPO fine-tuned | Yes |
| Blocker Vehicle | B | Rule-based | Tries to match ego's lane | No |
| Traffic Vehicle | T | Stochastic | Random lane drift | No |
2. How the Ego Agent Thinks
Every step, the LLM agent receives:
SYSTEM PROMPT (instructions + action space)
+
USER PROMPT:
├── Current road render (ASCII grid)
├── Lidar sensor readings
├── Collision prediction
├── Recent negotiation log
└── Memory (last 3 steps)
And must output structured JSON:
{
  "thinking": "Blocker is 3 steps ahead in center lane. I should negotiate first, then change to the left lane.",
  "negotiate": "blocker|Please yield, I need to pass safely",
  "action": 2
}
The thinking field is the chain of thought: it is rewarded for being present and meaningful. This encourages the LLM to reason before acting.
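A response like this can be parsed and scored for structure. The sketch below mirrors the format rewards listed in the reward table (+0.2 for valid JSON, +0.2 for a thinking field longer than 15 characters, +0.1 for a negotiate field); the function name and exact logic are illustrative, not the project's actual reward code:

```python
import json

def format_reward(response: str) -> float:
    """Score the structural quality of an agent response (sketch).

    Values follow this document's reward table; the real
    implementation may differ in detail.
    """
    reward = 0.0
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return reward  # unparseable output earns nothing
    reward += 0.2  # valid JSON format
    if len(str(data.get("thinking", ""))) > 15:
        reward += 0.2  # meaningful chain of thought present
    if "negotiate" in data:
        reward += 0.1  # tool awareness
    return reward
```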
3. Action Space
| ID | Name | Effect |
|---|---|---|
| 0 | accelerate | Ego moves +2 positions forward |
| 1 | brake | Ego moves +1 position (slower, safer) |
| 2 | lane_left | Ego shifts one lane left |
| 3 | lane_right | Ego shifts one lane right |
4. Sensor Tools
The agent can call these tools to observe the world:
lidar_scan() → dict
{
  "blocker_distance": 3,
  "blocker_lane": 1,
  "traffic_distance": 6,
  "traffic_lane": 2,
  "ego_lane": 1,
  "ego_position": 4,
  "goal_distance": 15
}
predict_collision() → dict
{
  "blocker_threat": true,
  "traffic_threat": false,
  "immediate_collision": false
}
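To illustrate how these readings could drive a decision, here is a rule-based heuristic over the two sensor dicts shown above. This is a sketch for intuition only; the trained LLM policy reasons over the same information in text rather than with hand-written rules:

```python
def choose_action(lidar: dict, threat: dict) -> int:
    """Pick an action ID from sensor readings (illustrative heuristic).

    Assumes the lidar_scan() and predict_collision() dict shapes
    documented above.
    """
    if threat["immediate_collision"]:
        return 1  # brake
    if threat["blocker_threat"] and lidar["blocker_lane"] == lidar["ego_lane"]:
        # Blocker shares our lane: sidestep toward a free lane
        return 2 if lidar["ego_lane"] > 0 else 3  # lane_left else lane_right
    return 0  # accelerate toward the goal
```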
5. Negotiation System
The ego agent can send natural language messages to other vehicles:
# In the environment
response = env.negotiate("blocker", "Please yield, I need to pass safely")
# → "Yielding lane - proceed safely."
Blocker yielding logic:
- Blocker yields if: ego is ≤4 steps away AND the message contains a polite word (request, please, yield, allow, safe)
- If blocker yields: it moves out of ego's lane and ego gets a +0.3 reward bonus
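The yielding rule above can be sketched as a simple predicate (illustrative; the environment's actual check may differ in detail):

```python
POLITE_WORDS = {"request", "please", "yield", "allow", "safe"}

def blocker_yields(ego_distance: int, message: str) -> bool:
    """Rule-based yielding check: the blocker yields only when the
    ego is close (<=4 steps) AND the message contains a polite word."""
    close_enough = ego_distance <= 4
    polite = any(word in message.lower() for word in POLITE_WORDS)
    return close_enough and polite
```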
Why this matters for RL: The LLM must learn when to negotiate vs. when to just act. Negotiating costs a step but can unlock reward. This creates a multi-step reasoning challenge.
6. Reward Structure
| Event | Reward | Why |
|---|---|---|
| Reach goal (position 19) | +10.0 | Primary objective |
| Collision | -10.0 | Safety constraint |
| Successful lane change past blocker | +0.5 | Progress reward |
| Negotiation causes blocker to yield | +0.3 | Tool use reward |
| Per step (time penalty) | -0.05 | Encourages efficiency |
| Invalid move (wall) | -0.2 | Constraint violation |
| Valid JSON format | +0.2 | Structural reward |
| Has thinking field (>15 chars) | +0.2 | Reasoning reward |
| Has negotiate field | +0.1 | Tool awareness |
| Invalid action int | -0.1 | Format penalty |
Terminal propagation: After each episode, a win bonus (+1.0) or loss penalty (-1.0) is added to all steps of that episode. This gives the policy a clear signal about whether its overall strategy was good.
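Terminal propagation is a one-liner when per-step rewards are stored as a list; the sketch below uses illustrative names, not the project's actual code:

```python
def propagate_terminal(step_rewards: list[float], won: bool) -> list[float]:
    """Add the episode outcome (+1.0 on win, -1.0 on loss) to every
    step's reward, so all steps share credit for the final result."""
    bonus = 1.0 if won else -1.0
    return [r + bonus for r in step_rewards]
```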
7. How RL Training Works (GRPO)
┌─────────────────────────────────────────────────────────────┐
│                     GRPO Training Loop                      │
│                                                             │
│  1. ROLLOUT COLLECTION                                      │
│     ├── Play N games with current LLM policy                │
│     ├── Each step: LLM generates response (vLLM for speed)  │
│     └── Collect (prompt, response, reward) tuples           │
│                                                             │
│  2. REWARD COMPUTATION                                      │
│     ├── Environment reward (collision, goal, shaping)       │
│     ├── Format reward (JSON structure)                      │
│     └── Terminal propagation (win/loss to all steps)        │
│                                                             │
│  3. GRPO UPDATE                                             │
│     ├── num_generations=4: sample 4 responses per prompt    │
│     ├── Compute relative advantage: r_i - mean(r)           │
│     ├── Policy gradient loss with KL penalty vs base model  │
│     └── LoRA adapter weights updated                        │
│                                                             │
│  4. ONLINE RL (closed loop)                                 │
│     └── Repeat: play with updated policy → collect → update │
└─────────────────────────────────────────────────────────────┘
Why GRPO (not PPO)?
GRPO (Group Relative Policy Optimization) is used because:
- No separate value/critic network needed β simpler
- Works well with LLMs generating text sequences
- The "group" of N responses per prompt provides a natural baseline
- Reward is relative: responses better than average get positive advantage
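The group baseline in the last bullet is cheap to compute. A minimal sketch (some GRPO implementations also normalize by the group's standard deviation; that is omitted here for clarity):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style baseline: each response's advantage is its reward
    minus the mean reward of its group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

With num_generations=4, the four responses to one prompt form one group, so a response only gets positive advantage by beating its siblings.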
LoRA Fine-tuning
We use LoRA (Low-Rank Adaptation) so we only train ~1% of parameters:
Base model weights: FROZEN (Qwen3-4B or gpt-oss-20B)
LoRA matrices: TRAINED (q_proj, k_proj, v_proj, o_proj, gate, up, down)
This means the model retains general language ability while learning the driving task.
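A back-of-envelope estimate shows why LoRA trains only ~1% of the weights. Every number below is an illustrative assumption (and MLP projections like gate/up/down are not square in practice), so treat this as an order-of-magnitude sketch only:

```python
def lora_param_fraction(d_model: int, n_layers: int, n_targets: int,
                        rank: int, total_params: float) -> float:
    """Rough fraction of trainable parameters under LoRA.

    Each adapted d_model x d_model projection gains two low-rank
    matrices, A (d x r) and B (r x d), so 2*d*r extra parameters.
    """
    per_projection = 2 * d_model * rank
    lora_params = per_projection * n_targets * n_layers
    return lora_params / total_params

# Hypothetical 4B model: d_model=2560, 36 layers, 7 target projections,
# rank 16 -> roughly half a percent of the weights are trainable.
fraction = lora_param_fraction(2560, 36, 7, 16, 4e9)
```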
8. What the Model Learns Over Training
| Early training | Late training |
|---|---|
| Random actions | Strategic lane changes |
| No negotiation | Negotiates when blocker is close |
| Invalid JSON | Consistent structured output |
| Collides frequently | Avoids collisions |
| Doesn't use sensors | References lidar in reasoning |
9. W&B Metrics to Track
| Metric | Meaning |
|---|---|
| win_rate | % of episodes reaching the goal |
| reward/mean | Average reward per step |
| kl_divergence | How far the policy has drifted from the base model |
| format_reward | % of responses with valid JSON |
| policy/entropy | Exploration (high) vs. exploitation (low) |
| negotiation_rate | % of steps with a negotiation attempt |
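Several of these metrics can be aggregated from per-episode records before logging. The record schema below is an assumption for illustration, not the project's actual format; the returned dict is the kind of payload you would pass to wandb.log(...):

```python
def training_metrics(episodes: list[dict]) -> dict:
    """Aggregate episode records into loggable metrics.

    Assumed record shape (illustrative):
    {"won": bool, "rewards": [float, ...], "negotiated_steps": int, "steps": int}
    """
    n = len(episodes)
    total_steps = sum(e["steps"] for e in episodes)
    return {
        "win_rate": sum(e["won"] for e in episodes) / n,
        "reward/mean": sum(sum(e["rewards"]) for e in episodes) / total_steps,
        "negotiation_rate": sum(e["negotiated_steps"] for e in episodes) / total_steps,
    }
```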
10. File Structure
final_project/
├── env/
│   └── negotiation_env.py       ← Environment logic, sensors, reward
├── agents/
│   ├── negotiation_agent.py     ← LLM agent, prompt, tool calls
│   └── memory.py                ← Episode memory for in-context use
├── server/
│   ├── server.py                ← FastAPI OpenEnv server
│   └── requirements.txt         ← Required for HF Spaces Docker build
├── training/
│   └── train_grpo_colab.ipynb   ← Full GRPO training notebook (H100)
├── ui/
│   └── app.py                   ← Gradio simulator UI
├── docs/
│   └── DOCUMENTATION.md         ← This file
├── Dockerfile                   ← HF Spaces deployment
└── README.md
11. Quick Start
# Install
pip install -r server/requirements.txt
# Run environment server
uvicorn server.server:app --reload --port 7860
# Run UI
python ui/app.py
# Test environment
python -c "
from env.negotiation_env import NegotiationDrivingEnv
env = NegotiationDrivingEnv()
obs, _ = env.reset()
print(env.render())
print(env.lidar_scan())
r = env.negotiate('blocker', 'Please yield, I need to pass safely')
print('Blocker says:', r)
obs, reward, done, _, info = env.step(2) # lane_left
print('Reward:', reward, '| Info:', info)
"
12. Deployment to HF Spaces
# Login
huggingface-cli login
# Push (from project root)
# The Dockerfile copies server/requirements.txt, so it must exist!
git init && git add . && git commit -m "initial"
huggingface-cli repo create autonomous-driving-env --type space --sdk docker
git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/autonomous-driving-env
git push hf main