| # 🚗 Meta OpenEnv Hackathon — Connect4 Multi-Agent Autonomous Driving | |
| ## Complete Delivery Guide | |
| --- | |
| ## 🏗️ Architecture Overview | |
| ``` | |
| ┌─────────────────────────────────────────────────────────────────┐ | |
| │                   TRAINING LOOP (Colab H100)                    │ | |
| │                                                                 │ | |
| │  ┌──────────────┐   prompts    ┌─────────────────────────────┐  │ | |
| │  │   Unsloth    │◄────────────►│  LLM (Qwen3-4B / gpt-oss)   │  │ | |
| │  │   GRPO/TRL   │ completions  │  + LoRA Adapter             │  │ | |
| │  └──────┬───────┘              └─────────────────────────────┘  │ | |
| │         │ rewards                                               │ | |
| │  ┌──────▼───────┐    W&B                                        │ | |
| │  │  Reward Fns  │───────────► Experiment Tracking               │ | |
| │  └──────┬───────┘                                               │ | |
| └─────────┼───────────────────────────────────────────────────────┘ | |
|           │ step() / reset() | |
|           │ WebSocket | |
| ┌─────────▼───────────────────────────────────────────────────────┐ | |
| │             HF SPACES (OpenEnv Environment Server)              │ | |
| │                                                                 │ | |
| │   ┌─────────────────────────────────────────────────────────┐   │ | |
| │   │  Connect4Environment (FastAPI + OpenEnv v0.2.1)         │   │ | |
| │   │  • 6×7 board = intersection grid                        │   │ | |
| │   │  • Player 1 (X) = Ego Vehicle (LLM)                     │   │ | |
| │   │  • Player 2 (O) = Rule-based opponent                   │   │ | |
| │   │  • Shaped rewards: win/loss/block/3-in-row/format       │   │ | |
| │   └─────────────────────────────────────────────────────────┘   │ | |
| └─────────────────────────────────────────────────────────────────┘ | |
| ``` | |
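To make the environment box in the diagram concrete, here is a minimal sketch of the board mechanics it implies: a 6×7 grid, gravity-based drops, and a four-in-a-row check. The names `new_board`, `drop`, and `is_win` are illustrative assumptions and are not taken from `connect4_environment.py`, which may be organised differently.

```python
ROWS, COLS = 6, 7
EMPTY, EGO, OPPONENT = 0, 1, 2  # 1 = ego vehicle (LLM), 2 = rule-based opponent


def new_board():
    """Return an empty 6x7 grid; row 0 is the top, row 5 the bottom."""
    return [[EMPTY] * COLS for _ in range(ROWS)]


def drop(board, col, player):
    """Drop a piece into `col`. Return the landing row, or None if the move is illegal."""
    if not 0 <= col < COLS:
        return None
    for row in range(ROWS - 1, -1, -1):  # scan from the bottom row upward
        if board[row][col] == EMPTY:
            board[row][col] = player
            return row
    return None  # column is full


def is_win(board, row, col):
    """Check whether the piece at (row, col) is part of four in a row."""
    player = board[row][col]
    if player == EMPTY:
        return False
    # Horizontal, vertical, and the two diagonals.
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):  # walk both ways along the line
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < ROWS and 0 <= c < COLS and board[r][c] == player:
                count += 1
                r, c = r + sign * dr, c + sign * dc
        if count >= 4:
            return True
    return False
```

A server-side `step()` would typically call something like `drop` for the LLM move, check `is_win`, then let the rule-based opponent reply.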
| --- | |
| ## 📁 File Structure | |
| ``` | |
| connect4_env/                       ← HF Spaces repo (deploy this) | |
| ├── __init__.py | |
| ├── models.py                       ← Pydantic Action/Observation/State | |
| ├── client.py                       ← Connect4Env(EnvClient) | |
| ├── openenv.yaml                    ← Manifest | |
| ├── pyproject.toml | |
| ├── Dockerfile                      ← HF Spaces Docker SDK | |
| ├── README.md                       ← HF Space card | |
| └── server/ | |
|     ├── app.py                      ← FastAPI entry point | |
|     ├── connect4_environment.py     ← Game logic + reward shaping | |
|     └── requirements.txt | |
| connect4_grpo_training.ipynb        ← Colab training notebook (H100) | |
| ``` | |
| --- | |
| ## 🚀 Step-by-Step Deployment | |
| ### Step 1 — Deploy Environment to HF Spaces | |
| ```bash | |
| # Install OpenEnv CLI | |
| pip install openenv-core==0.2.1 | |
| # Login to HF | |
| huggingface-cli login | |
| # From inside connect4_env/ directory: | |
| cd connect4_env | |
| openenv push --repo-id YOUR_HF_USERNAME/connect4-env | |
| # OR manually: | |
| # 1. Create new Space at https://huggingface.co/new-space | |
| # 2. Set SDK = Docker, hardware = CPU Basic | |
| # 3. Push this folder as the repo | |
| ``` | |
| After deployment, your env is live at: | |
| `https://YOUR_HF_USERNAME-connect4-env.hf.space` | |
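The URL pattern above can be derived from the repo id passed to `openenv push`. The helper below is a small sketch that follows only that pattern; `space_url` is a hypothetical name, and it does not attempt any normalisation that Hugging Face may apply to unusual user or Space names.

```python
def space_url(repo_id: str) -> str:
    """Build the Space URL from a 'username/space-name' repo id."""
    user, sep, name = repo_id.partition("/")
    if not (user and sep and name) or "/" in name:
        raise ValueError(f"expected 'username/space-name', got {repo_id!r}")
    # Pattern taken from this guide: https://<username>-<space-name>.hf.space
    return f"https://{user}-{name}.hf.space"


# Value to paste into the notebook's HF_SPACE_URL variable.
HF_SPACE_URL = space_url("YOUR_HF_USERNAME/connect4-env")
```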
| Test it: | |
| ```python | |
| # Run in a shell first: pip install openenv-core==0.2.1 | |
| from openenv.core.env_client import EnvClient | |
| # ... or pip install the client package from your HF Space | |
| ``` | |
| --- | |
| ### Step 2 — Run Training on Northflank / Colab | |
| **Option A: Google Colab (recommended for hackathon)** | |
| 1. Open `connect4_grpo_training.ipynb` in Colab | |
| 2. Set Runtime → H100 GPU | |
| 3. Update `HF_SPACE_URL` and `HF_MODEL_REPO` variables | |
| 4. Run all cells | |
| **Option B: Northflank Jupyter PyTorch** | |
| 1. Go to https://app.northflank.com/t/openenv-hack-112/project/hackathon/services/jupyter-pytorch | |
| 2. Upload the notebook | |
| 3. The environment has PyTorch + CUDA pre-installed | |
| 4. Install Unsloth: `uv pip install unsloth vllm --torch-backend=auto` | |
| --- | |
| ### Step 3 — vLLM GRPO Fix (if you hit issues) | |
| Per the hackathon notes, if GRPO runs with vLLM fail, recreate a clean virtual environment: | |
| ```bash | |
| python -m venv unsloth_env | |
| source unsloth_env/bin/activate | |
| pip install --upgrade pip && pip install uv | |
| uv pip install unsloth vllm --torch-backend=auto | |
| # Always update Unsloth: | |
| pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo | |
| ``` | |
| --- | |
| ## 🔬 Training Pipeline Detail | |
| ### Pre-training → SFT → RLHF → RL+Envs | |
| ``` | |
| 1. BASE MODEL (Qwen3-4B or gpt-oss-20B) | |
|    Pre-trained on a large text corpus | |
| 2. IMPLICIT SFT (via prompting) | |
|    Prompt engineering steers the output format: | |
|    {"thinking": "...", "column": N} | |
| 3. GRPO (RL without an explicit reward model) | |
|    - num_generations=4 rollouts per prompt | |
|    - KL divergence penalty vs reference policy | |
|    - Format reward (JSON structure) | |
|    - Environment reward (win/loss/block) | |
| 4. CLOSED-LOOP ONLINE RL | |
|    - Play N games with current policy | |
|    - Collect (prompt, response, reward) tuples | |
|    - Update policy with GRPO | |
|    - Repeat → self-improvement | |
| ``` | |
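Step 3 works without a learned reward model because GRPO scores each completion relative to the other rollouts for the same prompt. The sketch below shows that group-relative normalisation for one group of `num_generations=4` rewards. It illustrates the idea only: `group_relative_advantages` and `eps` are assumed names, and the trainer's exact formula (for example its standard-deviation estimator) may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalise rewards within one prompt's group of rollouts.

    Completions that beat the group mean get a positive advantage,
    those below it a negative one. `eps` avoids division by zero
    when every rollout received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four rollouts (num_generations=4), illustrative rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, -1.0])
```

When all four rollouts earn the same reward the advantages are all zero, so that prompt contributes no gradient signal. This is one reason shaped and format rewards help early in training.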
| ### Reward Design | |
| The reward function has 3 components: | |
| | Component | Source | Value | | |
| |-----------|--------|-------| | |
| | **Outcome** | Environment (terminal) | ±10.0 | | |
| | **Shaping** | Environment (per-step) | ±0.5, +0.2, -0.1 | | |
| | **Format** | Local function | +0.3 | | |
| The game outcome is propagated back to every move of that game (+1.0 for a win, -1.0 for a loss, +0.1 for a draw). | |
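A minimal sketch of the two locally computed pieces, the format reward and the outcome propagation, follows. The values +0.3, +1.0, -1.0 and +0.1 come from this section. Everything else is an assumption: the function names, 0-based column indices, a reward of 0.0 for malformed output, and adding the outcome bonus to each per-step reward. How the propagated values relate to the ±10.0 terminal reward in the table is not specified here.

```python
import json

FORMAT_REWARD = 0.3  # from the table above
OUTCOME_BONUS = {"win": 1.0, "loss": -1.0, "draw": 0.1}  # from the sentence above


def format_reward(completion: str, num_columns: int = 7) -> float:
    """Return +0.3 if the completion is JSON like {"thinking": "...", "column": N}."""
    try:
        data = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # assumption: malformed output earns nothing rather than a penalty
    if not isinstance(data, dict):
        return 0.0
    column = data.get("column")
    valid = (
        isinstance(data.get("thinking"), str)
        and isinstance(column, int)
        and not isinstance(column, bool)  # bool is a subclass of int in Python
        and 0 <= column < num_columns     # assumption: columns are 0-indexed
    )
    return FORMAT_REWARD if valid else 0.0


def propagate_outcome(step_rewards, outcome):
    """Add the game-level outcome bonus to every move of one game (assumed additive)."""
    bonus = OUTCOME_BONUS[outcome]
    return [r + bonus for r in step_rewards]
```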
| --- | |
| ## 📊 W&B Metrics to Track | |
| | Metric | What it shows | | |
| |--------|---------------| | |
| | `win_rate` | % of games the LLM wins against the rule-based opponent | |
| | `reward/mean` | Average per-step reward | |
| | `kl_divergence` | Policy drift from base model | |
| | `format_reward` | % of responses with valid JSON | |
| | `policy/entropy` | Exploration vs exploitation | | |
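The two percentage metrics in the table can be computed from plain Python lists before logging. This is a sketch under assumed inputs: `summarize_games` is a hypothetical helper, outcomes are the strings `"win"`, `"loss"`, `"draw"`, and `format_flags` holds one boolean per model response.

```python
def summarize_games(outcomes, format_flags):
    """Aggregate per-game outcomes and per-response format checks into percentages."""
    n_games = len(outcomes)
    n_responses = len(format_flags)
    return {
        # Share of games the LLM won, in percent.
        "win_rate": 100.0 * outcomes.count("win") / n_games if n_games else 0.0,
        # Share of responses that parsed as valid JSON, in percent.
        "format_reward": 100.0 * sum(format_flags) / n_responses if n_responses else 0.0,
    }


metrics = summarize_games(
    ["win", "loss", "win", "draw"],
    [True, True, False, True],
)
```

The resulting dictionary can be passed to `wandb.log(metrics)` inside an active W&B run.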
| --- | |
| ## 🔧 Environment Customization | |
| The Connect4 environment can be extended for more realistic autonomous driving: | |
| ```python | |
| # Add to Connect4Action (requires `from pydantic import Field`): | |
| speed: float = Field(1.0, ge=0.0, le=3.0)   # vehicle speed | |
| lane_change: int = Field(0, ge=-1, le=1)    # lane change direction: -1, 0, or +1 | |
| # Add to reward shaping (method on the environment class): | |
| def _safety_reward(self) -> float: | |
|     # Penalize high-speed moves near opponent | |
|     ... | |
| # Add multi-agent (>2 vehicles): | |
| AGENT3 = 3  # second LLM agent | |
| ``` | |
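One possible shape for the safety term is sketched below as a standalone function. It is entirely hypothetical: "near opponent" is taken to mean within one column of the opponent's last move, the penalty scales linearly with speed up to the 3.0 maximum from the `speed` field, and the 0.5 cap is borrowed from the shaping magnitudes in the reward table.

```python
from typing import Optional

MAX_SPEED = 3.0  # upper bound of the proposed `speed` field


def safety_reward(
    speed: float,
    column: int,
    opponent_column: Optional[int],
    max_penalty: float = 0.5,
) -> float:
    """Penalise fast moves played next to the opponent's last move.

    Returns 0.0 when the opponent has not moved yet or is more than
    one column away; otherwise a penalty in [-max_penalty, 0].
    """
    if opponent_column is None or abs(column - opponent_column) > 1:
        return 0.0
    return -max_penalty * (speed / MAX_SPEED)
```

Inside the environment, a method such as `_safety_reward` could call this with the current action's `speed` and `column` and add the result to the per-step shaping reward.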
| --- | |
| ## 📎 Key Links | |
| - **OpenEnv repo**: https://github.com/meta-pytorch/OpenEnv | |
| - **Unsloth GRPO notebook**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game_BF16.ipynb | |
| - **Qwen3 GRPO (faster)**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb | |
| - **TRL OpenEnv docs**: https://huggingface.co/docs/trl/openenv | |
| - **Northflank Jupyter**: https://northflank.notion.site/Jupyter-Notebook-with-PyTorch-2036d14c7851802abb7ccb4a7c5c96be | |
| --- | |
| ## ✅ Hackathon Checklist | |
| - [x] OpenEnv v0.2.1 environment built | |
| - [x] Connect4 game logic with shaped rewards | |
| - [x] Multi-agent (LLM + rule-based opponent) | |
| - [x] Deploy to HF Spaces via `openenv push` | |
| - [x] Unsloth GRPO training notebook (H100 BF16) | |
| - [x] W&B experiment tracking | |
| - [x] Closed-loop online RL loop | |
| - [x] Format reward for JSON CoT reasoning | |
| - [x] Evaluation tournament | |
| - [ ] Push trained model to HF Hub ← fill in after training | |