🚗 Meta OpenEnv Hackathon — Connect4 Multi-Agent Autonomous Driving
Complete Delivery Guide
🏗️ Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ TRAINING LOOP (Colab H100) │
│ │
│ ┌──────────────┐ prompts ┌─────────────────────────────┐ │
│ │ Unsloth │◄────────────►│ LLM (Qwen3-4B / gpt-oss) │ │
│ │ GRPO/TRL │ completions │ + LoRA Adapter │ │
│ └──────┬───────┘ └─────────────────────────────┘ │
│ │ rewards │
│ ┌──────▼───────┐ W&B │
│ │ Reward Fns │───────────► Experiment Tracking │
│ └──────┬───────┘ │
└─────────┼───────────────────────────────────────────────────────┘
│ step() / reset()
│ WebSocket
┌─────────▼───────────────────────────────────────────────────────┐
│ HF SPACES (OpenEnv Environment Server) │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Connect4Environment (FastAPI + OpenEnv v0.2.1) │ │
│ │ • 6×7 board = intersection grid │ │
│ │ • Player 1 (X) = Ego Vehicle (LLM) │ │
│ │ • Player 2 (O) = Rule-based opponent │ │
│ │ • Shaped rewards: win/loss/block/3-in-row/format │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
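In code terms, the training-loop arrows reduce to two calls. A minimal sketch, assuming the Connect4Env client in connect4_env/client.py (see the file structure below) accepts a base_url, Connect4Action exposes a column field, and reset()/step() return a result with observation, reward, and done fields:
from connect4_env.client import Connect4Env
from connect4_env.models import Connect4Action

env = Connect4Env(base_url="https://YOUR_HF_USERNAME-connect4-env.hf.space")
result = env.reset()                                  # new game, empty 6×7 board
while not result.done:
    column = 3                                        # in training, the LLM picks this;
    result = env.step(Connect4Action(column=column))  # a fixed move suffices as a demo
print("terminal reward:", result.reward)              # ±10.0 per the reward table below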
📁 File Structure
connect4_env/ ← HF Spaces repo (deploy this)
├── __init__.py
├── models.py ← Pydantic Action/Observation/State
├── client.py ← Connect4Env(EnvClient)
├── openenv.yaml ← Manifest
├── pyproject.toml
├── Dockerfile ← HF Spaces Docker SDK
├── README.md ← HF Space card
└── server/
├── app.py ← FastAPI entry point
├── connect4_environment.py ← Game logic + reward shaping
└── requirements.txt
connect4_grpo_training.ipynb ← Colab training notebook (H100)
🚀 Step-by-Step Deployment
Step 1 — Deploy Environment to HF Spaces
# Install OpenEnv CLI
pip install openenv-core==0.2.1
# Login to HF
huggingface-cli login
# From inside connect4_env/ directory:
cd connect4_env
openenv push --repo-id YOUR_HF_USERNAME/connect4-env
# OR manually:
# 1. Create new Space at https://huggingface.co/new-space
# 2. Set SDK = Docker, hardware = CPU Basic
# 3. Push this folder as the repo
After deployment, your env is live at:
https://YOUR_HF_USERNAME-connect4-env.hf.space
Test it:
# Install the client library (or pip install from your HF Space):
pip install openenv-core==0.2.1
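Then run a minimal smoke test, assuming client.py's Connect4Env (which subclasses openenv.core.env_client.EnvClient) accepts a base_url:
from connect4_env.client import Connect4Env

env = Connect4Env(base_url="https://YOUR_HF_USERNAME-connect4-env.hf.space")
result = env.reset()
print(result.observation)   # expect an empty 6×7 board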
Step 2 — Run Training on Northflank / Colab
Option A: Google Colab (recommended for hackathon)
- Open connect4_grpo_training.ipynb in Colab
- Set Runtime → H100 GPU
- Update the HF_SPACE_URL and HF_MODEL_REPO variables
- Run all cells
Option B: Northflank Jupyter PyTorch
- Go to https://app.northflank.com/t/openenv-hack-112/project/hackathon/services/jupyter-pytorch
- Upload the notebook
- The environment has PyTorch + CUDA pre-installed
- Install Unsloth:
uv pip install unsloth vllm --torch-backend=auto
Step 3 — vLLM GRPO Fix (if issues)
Per the hackathon notes, if GRPO runs with vLLM fail, rebuild in a clean virtual environment:
python -m venv unsloth_env
source unsloth_env/bin/activate
pip install --upgrade pip && pip install uv
uv pip install unsloth vllm --torch-backend=auto
# Always update Unsloth:
pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo
🔬 Training Pipeline Detail
Pre-training → SFT → RLHF → RL+Envs
1. BASE MODEL (Qwen3-4B or gpt-oss-20B)
Pre-trained on large text corpus
2. IMPLICIT SFT
No separate fine-tuning stage; prompt engineering enforces the output format:
{"thinking": "...", "column": N}
3. GRPO (RL without explicit reward model)
- num_generations=4 rollouts per prompt
- KL divergence penalty vs reference policy
- Format reward (JSON structure)
- Environment reward (win/loss/block)
4. CLOSED-LOOP ONLINE RL
- Play N games with current policy
- Collect (prompt, response, reward) tuples
- Update policy with GRPO
- Repeat → self-improvement
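A minimal sketch of one closed-loop iteration, under assumptions not in the source: generate() wraps the current LoRA policy, grpo_update() wraps the TRL/Unsloth trainer step, and responses are valid JSON (the format reward below pushes the model toward that):
import json
from connect4_env.client import Connect4Env
from connect4_env.models import Connect4Action

def online_rl_iteration(env: Connect4Env, generate, grpo_update, n_games: int = 16):
    """Play n_games with the current policy, collect (prompt, response, reward)
    tuples, propagate the terminal outcome, then run one GRPO update."""
    batch = []
    for _ in range(n_games):
        result = env.reset()
        moves = []
        while not result.done:
            prompt = str(result.observation)              # board rendered as text
            response = generate(prompt)                   # JSON move from the LLM
            column = json.loads(response)["column"]       # assumes valid JSON
            result = env.step(Connect4Action(column=column))
            moves.append([prompt, response, result.reward])
        for move in moves[:-1]:                           # final move already carries
            move[2] += result.reward                      # the terminal outcome
        batch.extend(map(tuple, moves))
    grpo_update(batch)                                    # one update, then repeat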
Reward Design
The reward function has 3 components:
| Component | Source | Value |
|---|---|---|
| Outcome | Environment (terminal) | ±10.0 |
| Shaping | Environment (per-step) | ±0.5, +0.2, -0.1 |
| Format | Local function | +0.3 |
The terminal outcome (±10.0 from the environment, scaled down for the update) is propagated back to all moves of a game: +1.0 for a win, -1.0 for a loss, +0.1 for a draw.
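The format component is simple enough to sketch in full; the +0.3 value and the JSON schema ({"thinking": "...", "column": N}) both come from this guide:
import json

def format_reward(completion: str) -> float:
    """Return +0.3 for valid JSON with a string 'thinking' and an
    integer 'column' in [0, 6]; return 0.0 otherwise."""
    try:
        move = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if (isinstance(move, dict)
            and isinstance(move.get("thinking"), str)
            and isinstance(move.get("column"), int)
            and 0 <= move["column"] <= 6):
        return 0.3
    return 0.0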
📊 W&B Metrics to Track
| Metric | What it shows |
|---|---|
| win_rate | % of games the LLM wins vs the rule-based opponent |
| reward/mean | Average per-step reward |
| kl_divergence | Policy drift from the base model |
| format_reward | % of responses with valid JSON |
| policy/entropy | Exploration vs exploitation balance |
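TRL logs its own training metrics (loss, KL) when report_to="wandb" is set; game-level metrics like win_rate need a manual wandb.log call. A sketch with placeholder tallies:
import wandb

wandb.init(project="connect4-grpo")      # placeholder project name
wins, games = 11, 20                     # example tallies from one eval batch
wandb.log({
    "win_rate": wins / games,
    "format_reward": 18 / 20,            # fraction of valid-JSON responses
})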
🔧 Environment Customization
The Connect4 environment can be extended for more realistic autonomous driving:
# Add to Connect4Action:
speed: float = Field(1.0, ge=0.0, le=3.0) # vehicle speed
lane_change: int = Field(0, ge=-1, le=1) # lane change direction
# Add to reward shaping:
def _safety_reward(self) -> float:
# Penalize high-speed moves near opponent
...
# Add multi-agent (>2 vehicles):
AGENT3 = 3 # second LLM agent
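Put together, a minimal sketch of the extended action model, assuming models.py defines Connect4Action as a Pydantic model with a column field matching the JSON schema above (speed and lane_change are the hypothetical extensions):
from pydantic import BaseModel, Field

class Connect4Action(BaseModel):
    column: int = Field(..., ge=0, le=6)         # which column to drop into
    speed: float = Field(1.0, ge=0.0, le=3.0)    # hypothetical vehicle speed
    lane_change: int = Field(0, ge=-1, le=1)     # hypothetical lane-change direction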
📎 Key Links
- OpenEnv repo: https://github.com/meta-pytorch/OpenEnv
- Unsloth GRPO notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game_BF16.ipynb
- Qwen3 GRPO (faster): https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
- TRL OpenEnv docs: https://huggingface.co/docs/trl/openenv
- Northflank Jupyter: https://northflank.notion.site/Jupyter-Notebook-with-PyTorch-2036d14c7851802abb7ccb4a7c5c96be
✅ Hackathon Checklist
- OpenEnv v0.2.1 environment built
- Connect4 game logic with shaped rewards
- Multi-agent (LLM + rule-based opponent)
- Deploy to HF Spaces via openenv push
- Unsloth GRPO training notebook (H100 BF16)
- W&B experiment tracking
- Closed-loop online RL loop
- Format reward for JSON CoT reasoning
- Evaluation tournament
- Push trained model to HF Hub ← fill in after training