# 🚗 Meta OpenEnv Hackathon — Connect4 Multi-Agent Autonomous Driving

## Complete Delivery Guide

---

## 🏗️ Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                   TRAINING LOOP (Colab H100)                    │
│                                                                 │
│  ┌──────────────┐  prompts     ┌─────────────────────────────┐  │
│  │   Unsloth    │◄────────────►│  LLM (Qwen3-4B / gpt-oss)   │  │
│  │   GRPO/TRL   │  completions │  + LoRA Adapter             │  │
│  └──────┬───────┘              └─────────────────────────────┘  │
│         │ rewards                                               │
│  ┌──────▼───────┐    W&B                                        │
│  │  Reward Fns  │───────────► Experiment Tracking               │
│  └──────┬───────┘                                               │
└─────────┼───────────────────────────────────────────────────────┘
          │ step() / reset()
          │ WebSocket
┌─────────▼───────────────────────────────────────────────────────┐
│              HF SPACES (OpenEnv Environment Server)             │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Connect4Environment (FastAPI + OpenEnv v0.2.1)         │    │
│  │  • 6×7 board = intersection grid                        │    │
│  │  • Player 1 (X) = Ego Vehicle (LLM)                     │    │
│  │  • Player 2 (O) = Rule-based opponent                   │    │
│  │  • Shaped rewards: win/loss/block/3-in-row/format       │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
```

---

## 📁 File Structure

```
connect4_env/                      ← HF Spaces repo (deploy this)
├── __init__.py
├── models.py                      ← Pydantic Action/Observation/State
├── client.py                      ← Connect4Env(EnvClient)
├── openenv.yaml                   ← Manifest
├── pyproject.toml
├── Dockerfile                     ← HF Spaces Docker SDK
├── README.md                      ← HF Space card
└── server/
    ├── app.py                     ← FastAPI entry point
    ├── connect4_environment.py    ← Game logic + reward shaping
    └── requirements.txt

connect4_grpo_training.ipynb       ← Colab training notebook (H100)
```

---

## 🚀 Step-by-Step Deployment

### Step 1 — Deploy Environment to HF Spaces

```bash
# Install the OpenEnv CLI
pip install openenv-core==0.2.1

# Login to HF
huggingface-cli login

# From inside the connect4_env/ directory:
cd connect4_env
openenv push --repo-id YOUR_HF_USERNAME/connect4-env

# OR manually:
# 1. Create a new Space at https://huggingface.co/new-space
# 2. Set SDK = Docker, hardware = CPU Basic
# 3. Push this folder as the repo
```

After deployment, your env is live at:
`https://YOUR_HF_USERNAME-connect4-env.hf.space`

Test it:

```python
# First: pip install openenv-core==0.2.1 (or install the client from your HF Space)
from openenv.core.env_client import EnvClient
# ...
```

---

### Step 2 — Run Training on Northflank / Colab

**Option A: Google Colab (recommended for the hackathon)**

1. Open `connect4_grpo_training.ipynb` in Colab
2. Set Runtime → H100 GPU
3. Update the `HF_SPACE_URL` and `HF_MODEL_REPO` variables
4. Run all cells

**Option B: Northflank Jupyter PyTorch**

1. Go to https://app.northflank.com/t/openenv-hack-112/project/hackathon/services/jupyter-pytorch
2. Upload the notebook
3. The environment has PyTorch + CUDA pre-installed
4. Install Unsloth: `uv pip install unsloth vllm --torch-backend=auto`

---

### Step 3 — vLLM GRPO Fix (if issues)

Per the hackathon notes, if GRPO vLLM runs fail:

```bash
python -m venv unsloth_env
source unsloth_env/bin/activate
pip install --upgrade pip && pip install uv
uv pip install unsloth vllm --torch-backend=auto

# Always update Unsloth:
pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo
```

---

## 🔬 Training Pipeline Detail

### Pre-training → SFT → RLHF → RL+Envs

```
1. BASE MODEL (Qwen3-4B or gpt-oss-20B)
   Pre-trained on a large text corpus

2. IMPLICIT SFT
   Prompt engineering guides the output format:
   {"thinking": "...", "column": N}

3. GRPO (RL without an explicit reward model)
   - num_generations=4 rollouts per prompt
   - KL divergence penalty vs the reference policy
   - Format reward (JSON structure)
   - Environment reward (win/loss/block)

4. CLOSED-LOOP ONLINE RL
   - Play N games with the current policy
   - Collect (prompt, response, reward) tuples
   - Update the policy with GRPO
   - Repeat → self-improvement
```

### Reward Design

The reward function has three components:

| Component   | Source                 | Value            |
|-------------|------------------------|------------------|
| **Outcome** | Environment (terminal) | ±10.0            |
| **Shaping** | Environment (per-step) | ±0.5, +0.2, -0.1 |
| **Format**  | Local function         | +0.3             |

The terminal outcome is also propagated back to every move of the finished game as a per-move bonus (+1.0 win, -1.0 loss, +0.1 draw).

---

## 📊 W&B Metrics to Track

| Metric | What it shows |
|--------|---------------|
| `win_rate` | % of games the LLM wins vs the rule-based opponent |
| `reward/mean` | Average per-step reward |
| `kl_divergence` | Policy drift from the base model |
| `format_reward` | % of responses with valid JSON |
| `policy/entropy` | Exploration vs exploitation |

---

## 🔧 Environment Customization

The Connect4 environment can be extended for more realistic autonomous driving:

```python
# Add to Connect4Action (Field comes from pydantic, as in models.py):
speed: float = Field(1.0, ge=0.0, le=3.0)  # vehicle speed
lane_change: int = Field(0, ge=-1, le=1)   # lane change direction

# Add to reward shaping:
def _safety_reward(self) -> float:
    # Penalize high-speed moves near the opponent
    ...
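# One possible concrete version of _safety_reward, sketched as a standalone
# helper so it can run outside the class. The inputs are hypothetical: the
# environment above does not define a per-move speed or an opponent-distance
# lookup, so treat this purely as an illustration of the shaping idea.
def safety_reward(speed: float, opponent_distance: int) -> float:
    """Penalize fast moves made close to the opponent (hypothetical shaping)."""
    if opponent_distance <= 1 and speed > 2.0:
        return -0.5  # fast move directly next to the opponent
    if opponent_distance <= 2 and speed > 1.5:
        return -0.2  # moderately fast move in the caution zone
    return 0.0       # otherwise no safety penalty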
# Add multi-agent support (>2 vehicles):
AGENT3 = 3  # second LLM agent
```

---

## 📎 Key Links

- **OpenEnv repo**: https://github.com/meta-pytorch/OpenEnv
- **Unsloth GRPO notebook**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game_BF16.ipynb
- **Qwen3 GRPO (faster)**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
- **TRL OpenEnv docs**: https://huggingface.co/docs/trl/openenv
- **Northflank Jupyter**: https://northflank.notion.site/Jupyter-Notebook-with-PyTorch-2036d14c7851802abb7ccb4a7c5c96be

---

## ✅ Hackathon Checklist

- [x] OpenEnv v0.2.1 environment built
- [x] Connect4 game logic with shaped rewards
- [x] Multi-agent (LLM + rule-based opponent)
- [x] Deploy to HF Spaces via `openenv push`
- [x] Unsloth GRPO training notebook (H100 BF16)
- [x] W&B experiment tracking
- [x] Closed-loop online RL loop
- [x] Format reward for JSON CoT reasoning
- [x] Evaluation tournament
- [ ] Push trained model to HF Hub ← fill in after training
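The per-move outcome propagation described under Reward Design can be sketched as a small helper. This is a minimal illustration: the bonus values (+1.0 win, -1.0 loss, +0.1 draw) come from the guide, while the `gamma` discount knob is a hypothetical extra that the guide does not specify.

```python
def propagate_outcome(per_step_rewards, outcome, gamma=1.0):
    """Add the terminal outcome bonus to every move of a finished game.

    With gamma=1.0 every move shares the outcome equally; gamma<1.0 would
    weight moves near the end of the game more heavily (hypothetical option).
    """
    bonus = {"win": 1.0, "loss": -1.0, "draw": 0.1}[outcome]
    n = len(per_step_rewards)
    return [r + bonus * gamma ** (n - 1 - i) for i, r in enumerate(per_step_rewards)]

# Example: shaping rewards for a 3-move win
# propagate_outcome([0.2, -0.1, 0.5], "win") -> [1.2, 0.9, 1.5]
```

Each resulting per-move total is what a (prompt, response, reward) tuple would carry into the GRPO update in the closed-loop online RL loop.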