# 🚗 Meta OpenEnv Hackathon — Connect4 Multi-Agent Autonomous Driving
## Complete Delivery Guide
---
## 🏗️ Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ TRAINING LOOP (Colab H100) │
│ │
│ ┌──────────────┐ prompts ┌─────────────────────────────┐ │
│ │ Unsloth │◄────────────►│ LLM (Qwen3-4B / gpt-oss) │ │
│ │ GRPO/TRL │ completions │ + LoRA Adapter │ │
│ └──────┬───────┘ └─────────────────────────────┘ │
│ │ rewards │
│ ┌──────▼───────┐ W&B │
│ │ Reward Fns │───────────► Experiment Tracking │
│ └──────┬───────┘ │
└─────────┼───────────────────────────────────────────────────────┘
│ step() / reset()
│ WebSocket
┌─────────▼───────────────────────────────────────────────────────┐
│ HF SPACES (OpenEnv Environment Server) │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Connect4Environment (FastAPI + OpenEnv v0.2.1) │ │
│ │ • 6×7 board = intersection grid │ │
│ │ • Player 1 (X) = Ego Vehicle (LLM) │ │
│ │ • Player 2 (O) = Rule-based opponent │ │
│ │ • Shaped rewards: win/loss/block/3-in-row/format │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
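The training loop drives the environment purely through `reset()` and `step()` calls over that WebSocket boundary. A minimal, runnable sketch of the episode loop (the `Connect4Env` client from `client.py` and its observation field names are assumptions here, so a stub stands in for the live HF Space):

```python
# Illustrative episode loop mirroring the OpenEnv step()/reset() contract.
# StubConnect4Env is a stand-in for Connect4Env(EnvClient) pointed at the
# deployed HF Space; field names ("board", "done", "reward") are illustrative.

class StubConnect4Env:
    """Stand-in for the real client so the loop below runs offline."""
    def __init__(self):
        self.moves = 0

    def reset(self):
        self.moves = 0
        return {"board": "." * 42, "done": False, "reward": 0.0}

    def step(self, column: int):
        self.moves += 1
        done = self.moves >= 6          # pretend the game ends after 6 plies
        reward = 10.0 if done else 0.2  # terminal outcome vs per-step shaping
        return {"board": "." * 42, "done": done, "reward": reward}

def play_episode(env, policy):
    """Run one game; policy maps an observation to a column index (0-6)."""
    obs = env.reset()
    total = 0.0
    while not obs["done"]:
        obs = env.step(policy(obs))
        total += obs["reward"]
    return total

total = play_episode(StubConnect4Env(), lambda obs: 3)  # always play center
```

In training, `policy` is the LLM sampling a column from the board prompt, and `total` feeds the GRPO reward functions.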
---
## 📁 File Structure
```
connect4_env/ ← HF Spaces repo (deploy this)
├── __init__.py
├── models.py ← Pydantic Action/Observation/State
├── client.py ← Connect4Env(EnvClient)
├── openenv.yaml ← Manifest
├── pyproject.toml
├── Dockerfile ← HF Spaces Docker SDK
├── README.md ← HF Space card
└── server/
├── app.py ← FastAPI entry point
├── connect4_environment.py ← Game logic + reward shaping
└── requirements.txt
connect4_grpo_training.ipynb ← Colab training notebook (H100)
```
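`server/connect4_environment.py` owns the game logic; its core is a four-in-a-row scan over the 6×7 grid. A minimal sketch of such a check (independent of the actual implementation, which may differ):

```python
# Four-in-a-row detection on a 6x7 Connect4 grid.
# board: list of ROWS lists of COLS cells (0 = empty, 1 / 2 = players).
ROWS, COLS = 6, 7

def has_four(board, player):
    """Scan every cell along the four line directions: right, down,
    down-right, and up-right (the other four are mirror images)."""
    directions = [(0, 1), (1, 0), (1, 1), (-1, 1)]
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in directions:
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(
                    0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == player
                    for rr, cc in cells
                ):
                    return True
    return False
```

The same scan, run with a window of 3 instead of 4, is the natural basis for the 3-in-a-row and block shaping rewards.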
---
## 🚀 Step-by-Step Deployment
### Step 1 — Deploy Environment to HF Spaces
```bash
# Install OpenEnv CLI
pip install openenv-core==0.2.1
# Login to HF
huggingface-cli login
# From inside connect4_env/ directory:
cd connect4_env
openenv push --repo-id YOUR_HF_USERNAME/connect4-env
# OR manually:
# 1. Create new Space at https://huggingface.co/new-space
# 2. Set SDK = Docker, hardware = CPU Basic
# 3. Push this folder as the repo
```
After deployment, your env is live at:
`https://YOUR_HF_USERNAME-connect4-env.hf.space`
Test it:
```python
# First install the client library: pip install openenv-core==0.2.1
from openenv.core.env_client import EnvClient
# ... or pip install the client package from your HF Space
```
---
### Step 2 — Run Training on Northflank / Colab
**Option A: Google Colab (recommended for hackathon)**
1. Open `connect4_grpo_training.ipynb` in Colab
2. Set Runtime → H100 GPU
3. Update `HF_SPACE_URL` and `HF_MODEL_REPO` variables
4. Run all cells
**Option B: Northflank Jupyter PyTorch**
1. Go to https://app.northflank.com/t/openenv-hack-112/project/hackathon/services/jupyter-pytorch
2. Upload the notebook
3. The environment has PyTorch + CUDA pre-installed
4. Install Unsloth: `uv pip install unsloth vllm --torch-backend=auto`
---
### Step 3 — vLLM GRPO Fix (if issues)
Per hackathon notes, if GRPO vLLM runs fail:
```bash
python -m venv unsloth_env
source unsloth_env/bin/activate
pip install --upgrade pip && pip install uv
uv pip install unsloth vllm --torch-backend=auto
# Always update Unsloth:
pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo
```
---
## 🔬 Training Pipeline Detail
### Pre-training → SFT → RLHF → RL+Envs
```
1. BASE MODEL (Qwen3-4B or gpt-oss-20B)
Pre-trained on large text corpus
2. IMPLICIT SFT
   No separate fine-tuning pass; prompt engineering enforces the format:
{"thinking": "...", "column": N}
3. GRPO (RL without explicit reward model)
- num_generations=4 rollouts per prompt
- KL divergence penalty vs reference policy
- Format reward (JSON structure)
- Environment reward (win/loss/block)
4. CLOSED-LOOP ONLINE RL
- Play N games with current policy
- Collect (prompt, response, reward) tuples
- Update policy with GRPO
- Repeat → self-improvement
```
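GRPO's defining trick is that advantages are computed group-relative: each prompt gets `num_generations` rollouts, and each rollout's reward is normalized against its own group's mean and standard deviation, so no learned value model is needed. A standalone sketch of that normalization (not the TRL internals):

```python
# Group-relative advantage, the "G" in GRPO:
# advantage = (reward - group mean) / (group std + eps).
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """rewards: scores of one prompt's num_generations rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of 4 rollouts: a win, two mid games, a loss.
adv = group_advantages([10.0, 0.2, -10.0, 0.2])
```

By construction the advantages of a group sum to zero: the policy is pushed toward its better-than-average completions and away from its worse-than-average ones, with the KL penalty keeping it near the reference model.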
### Reward Design
The reward function has 3 components:
| Component | Source | Value |
|-----------|--------|-------|
| **Outcome** | Environment (terminal) | ±10.0 |
| **Shaping** | Environment (per-step) | ±0.5, +0.2, -0.1 |
| **Format** | Local function | +0.3 |
In addition, a per-move outcome bonus (+1.0 win, -1.0 loss, +0.1 draw) is propagated back to every move of the finished game, so early moves share credit for the final result.
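The two locally computed pieces can be sketched as follows (function names and the exact JSON schema check are illustrative, not the repo's actual code):

```python
# Local reward pieces: format reward for valid JSON CoT output, plus
# back-propagation of the per-move outcome bonus over a finished game.
import json

def format_reward(completion: str) -> float:
    """+0.3 if the completion is valid JSON with "thinking" and an int "column"."""
    try:
        obj = json.loads(completion)
        return 0.3 if isinstance(obj.get("column"), int) and "thinking" in obj else 0.0
    except (json.JSONDecodeError, AttributeError):
        return 0.0

def propagate_outcome(step_rewards, outcome):
    """Add the per-move outcome bonus (+1.0 win / -1.0 loss / +0.1 draw)
    to every move of the finished game."""
    bonus = {"win": 1.0, "loss": -1.0, "draw": 0.1}[outcome]
    return [r + bonus for r in step_rewards]

rewards = propagate_outcome([0.2, -0.1, 0.5], "win")
ok = format_reward('{"thinking": "block column 4", "column": 4}')
```

The environment's terminal ±10.0 and per-step shaping arrive through `step()`; these local pieces are added on the training side before GRPO normalizes the group.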
---
## 📊 W&B Metrics to Track
| Metric | What it shows |
|--------|---------------|
| `win_rate` | % games LLM wins vs rule-based |
| `reward/mean` | Average per-step reward |
| `kl_divergence` | Policy drift from base model |
| `format_reward` | % responses with valid JSON |
| `policy/entropy` | Exploration vs exploitation |
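Most of these can be computed directly from a batch of game records before calling `wandb.log` (the record field names below are illustrative):

```python
# Compute batch-level metrics from rollout records; the resulting dict
# uses the metric names from the table above and can go straight to
# wandb.log(metrics). The "won"/"rewards"/"completions" fields are
# assumptions about how game records are stored.
import json

def _is_valid_json(completion: str) -> bool:
    try:
        json.loads(completion)
        return True
    except json.JSONDecodeError:
        return False

def batch_metrics(games):
    """games: list of dicts with 'won' (bool), 'rewards' (per-step floats),
    and 'completions' (raw model outputs)."""
    n_steps = sum(len(g["rewards"]) for g in games)
    n_comp = sum(len(g["completions"]) for g in games)
    valid = sum(1 for g in games for c in g["completions"] if _is_valid_json(c))
    return {
        "win_rate": sum(g["won"] for g in games) / len(games),
        "reward/mean": sum(sum(g["rewards"]) for g in games) / n_steps,
        "format_reward": valid / n_comp,
    }

metrics = batch_metrics([
    {"won": True, "rewards": [0.2, 10.0], "completions": ['{"column": 3}', 'oops']},
    {"won": False, "rewards": [-0.1, -10.0], "completions": ['{"column": 0}', '{"column": 6}']},
])
# then: wandb.log(metrics)
```

`kl_divergence` and `policy/entropy` come from the GRPO trainer itself rather than from game records.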
---
## 🔧 Environment Customization
The Connect4 environment can be extended for more realistic autonomous driving:
```python
# Add to Connect4Action:
speed: float = Field(1.0, ge=0.0, le=3.0) # vehicle speed
lane_change: int = Field(0, ge=-1, le=1) # lane change direction
# Add to reward shaping:
def _safety_reward(self) -> float:
# Penalize high-speed moves near opponent
...
# Add multi-agent (>2 vehicles):
AGENT3 = 3 # second LLM agent
```
---
## 📎 Key Links
- **OpenEnv repo**: https://github.com/meta-pytorch/OpenEnv
- **Unsloth GRPO notebook**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game_BF16.ipynb
- **Qwen3 GRPO (faster)**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
- **TRL OpenEnv docs**: https://huggingface.co/docs/trl/openenv
- **Northflank Jupyter**: https://northflank.notion.site/Jupyter-Notebook-with-PyTorch-2036d14c7851802abb7ccb4a7c5c96be
---
## ✅ Hackathon Checklist
- [x] OpenEnv v0.2.1 environment built
- [x] Connect4 game logic with shaped rewards
- [x] Multi-agent (LLM + rule-based opponent)
- [x] Deploy to HF Spaces via `openenv push`
- [x] Unsloth GRPO training notebook (H100 BF16)
- [x] W&B experiment tracking
- [x] Closed-loop online RL loop
- [x] Format reward for JSON CoT reasoning
- [x] Evaluation tournament
- [ ] Push trained model to HF Hub ← fill in after training