
	# 🚗 Meta OpenEnv Hackathon — Connect4 Multi-Agent Autonomous Driving

	## Complete Delivery Guide

	---

	## 🏗️ Architecture Overview

	```
	┌─────────────────────────────────────────────────────────────────┐
	│                   TRAINING LOOP (Colab H100)                    │
	│                                                                 │
	│  ┌──────────────┐   prompts    ┌─────────────────────────────┐  │
	│  │  Unsloth     │◄────────────►│  LLM (Qwen3-4B / gpt-oss)   │  │
	│  │  GRPO/TRL    │ completions  │  + LoRA Adapter             │  │
	│  └──────┬───────┘              └─────────────────────────────┘  │
	│         │ rewards                                               │
	│  ┌──────▼───────┐    W&B                                        │
	│  │  Reward Fns  │───────────► Experiment Tracking               │
	│  └──────┬───────┘                                               │
	└─────────┼───────────────────────────────────────────────────────┘
	          │  step() / reset()
	          │  WebSocket
	┌─────────▼───────────────────────────────────────────────────────┐
	│             HF SPACES (OpenEnv Environment Server)              │
	│                                                                 │
	│  ┌─────────────────────────────────────────────────────────┐    │
	│  │  Connect4Environment (FastAPI + OpenEnv v0.2.1)         │    │
	│  │  • 6×7 board = intersection grid                        │    │
	│  │  • Player 1 (X) = Ego Vehicle (LLM)                     │    │
	│  │  • Player 2 (O) = Rule-based opponent                   │    │
	│  │  • Shaped rewards: win/loss/block/3-in-row/format       │    │
	│  └─────────────────────────────────────────────────────────┘    │
	└─────────────────────────────────────────────────────────────────┘
	```
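
The environment box above describes a 6×7 board with the LLM as Player 1 (X) and a rule-based opponent as Player 2 (O). The following is a minimal sketch of such a game core in plain Python. The function names (`new_board`, `drop`, `has_won`, `rule_based_move`) and the opponent's win, then block, then prefer-center policy are assumptions for illustration; they are not taken from `connect4_environment.py`.

```python
ROWS, COLS = 6, 7
EMPTY, EGO, OPPONENT = 0, 1, 2  # 1 = X (LLM), 2 = O (rule-based)


def new_board():
    """Return an empty 6x7 board; row 0 is the top row."""
    return [[EMPTY] * COLS for _ in range(ROWS)]


def legal_columns(board):
    """Columns whose top cell is still empty."""
    return [c for c in range(COLS) if board[0][c] == EMPTY]


def drop(board, col, player):
    """Drop a piece into `col` in place; return the row it lands in."""
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == EMPTY:
            board[row][col] = player
            return row
    raise ValueError(f"column {col} is full")


def has_won(board, player):
    """True if `player` has four in a row in any direction."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(
                    0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == player
                    for rr, cc in cells
                ):
                    return True
    return False


def rule_based_move(board, me=OPPONENT, other=EGO):
    """Win if possible, else block the other player, else prefer the center."""
    legal = legal_columns(board)
    for target in (me, other):
        for col in legal:
            trial = [row[:] for row in board]  # copy so the real board is untouched
            drop(trial, col, target)
            if has_won(trial, target):
                return col
    return min(legal, key=lambda c: abs(c - COLS // 2))
```

`rule_based_move` assumes at least one legal column remains, so a caller would check for a full board (a draw) before asking for a move.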

	---

	## 📁 File Structure

	```
	connect4_env/ ← HF Spaces repo (deploy this)
	├── __init__.py
	├── models.py ← Pydantic Action/Observation/State
	├── client.py ← Connect4Env(EnvClient)
	├── openenv.yaml ← Manifest
	├── pyproject.toml
	├── Dockerfile ← HF Spaces Docker SDK
	├── README.md ← HF Space card
	└── server/
	    ├── app.py ← FastAPI entry point
	    ├── connect4_environment.py ← Game logic + reward shaping
	    └── requirements.txt

	connect4_grpo_training.ipynb ← Colab training notebook (H100)
	```

	---

	## 🚀 Step-by-Step Deployment

	### Step 1 — Deploy Environment to HF Spaces

	```bash
	# Install OpenEnv CLI
	pip install openenv-core==0.2.1

	# Login to HF
	huggingface-cli login

	# From inside connect4_env/ directory:
	cd connect4_env
	openenv push --repo-id YOUR_HF_USERNAME/connect4-env

	# OR manually:
	# 1. Create new Space at https://huggingface.co/new-space
	# 2. Set SDK = Docker, hardware = CPU Basic
	# 3. Push this folder as the repo
	```

	After deployment, your env is live at:
	`https://YOUR_HF_USERNAME-connect4-env.hf.space`
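
The Space URL follows from the repo id passed to `openenv push`. A small helper, sketched under the assumption that the subdomain is the owner and Space name joined by a hyphen, lowercased, with other characters replaced by `-`, can derive it for the notebook's `HF_SPACE_URL` variable. The helper name `space_url` and the example id `alice/connect4-env` are hypothetical.

```python
import re


def space_url(repo_id: str) -> str:
    """Build the *.hf.space URL for a Space repo id like 'owner/name'."""
    owner, _, name = repo_id.partition("/")
    if not owner or not name:
        raise ValueError("repo_id must look like 'owner/space-name'")
    # Assumption: the subdomain is 'owner-name', lowercased, with any
    # character outside [a-z0-9-] replaced by '-'.
    subdomain = re.sub(r"[^a-z0-9-]", "-", f"{owner}-{name}".lower())
    return f"https://{subdomain}.hf.space"


HF_SPACE_URL = space_url("alice/connect4-env")  # hypothetical repo id
```

If the computed URL does not resolve, the address shown on the Space's own page is the one to trust.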

	Test it:
	```python
	# Install the client first (in a shell, or with a leading `!` in a notebook):
	#   pip install openenv-core==0.2.1
	# ... or pip install the client package from your HF Space
	from openenv.core.env_client import EnvClient
	```

	---

	### Step 2 — Run Training on Northflank / Colab

	Option A: Google Colab (recommended for hackathon)
	1. Open `connect4_grpo_training.ipynb` in Colab
	2. Set Runtime → H100 GPU
	3. Update `HF_SPACE_URL` and `HF_MODEL_REPO` variables
	4. Run all cells

	Option B: Northflank Jupyter PyTorch
	1. Go to https://app.northflank.com/t/openenv-hack-112/project/hackathon/services/jupyter-pytorch
	2. Upload the notebook
	3. The environment has PyTorch + CUDA pre-installed
	4. Install Unsloth: `uv pip install unsloth vllm --torch-backend=auto`

	---

	### Step 3 — vLLM GRPO Fix (if issues)

	If GRPO runs with vLLM fail, the hackathon notes suggest reinstalling into a fresh virtual environment:
	```bash
	python -m venv unsloth_env
	source unsloth_env/bin/activate
	pip install --upgrade pip && pip install uv
	uv pip install unsloth vllm --torch-backend=auto
	# Always update Unsloth:
	pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo
	```

	---

	## 🔬 Training Pipeline Detail

	### Pre-training → SFT → RLHF → RL+Envs

	```
	1. BASE MODEL (Qwen3-4B or gpt-oss-20B)
	   Pre-trained on a large text corpus

	2. IMPLICIT SFT
	   Prompt engineering guides the output format:
	   {"thinking": "...", "column": N}

	3. GRPO (RL without an explicit reward model)
	   - num_generations=4 rollouts per prompt
	   - KL divergence penalty vs. reference policy
	   - Format reward (JSON structure)
	   - Environment reward (win/loss/block)

	4. CLOSED-LOOP ONLINE RL
	   - Play N games with the current policy
	   - Collect (prompt, response, reward) tuples
	   - Update the policy with GRPO
	   - Repeat → self-improvement
	```
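
Step 3 relies on GRPO scoring each rollout relative to the other rollouts for the same prompt, which is why no separate reward model is needed. The sketch below shows that group normalization for one prompt with `num_generations=4`. The function name `group_advantages`, the `eps` value, the use of the population standard deviation, and the reward numbers are illustrative assumptions; a trainer library may differ in these details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one board position (num_generations=4).
# The reward values are made up for illustration.
rollout_rewards = [0.3, -0.1, 0.8, 0.3]
advantages = group_advantages(rollout_rewards)
```

Rollouts that score above the group mean get a positive advantage and are reinforced; those below get a negative one. When all four rewards are equal, every advantage is zero and that prompt contributes no learning signal.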

	### Reward Design

	The reward function has 3 components:

	| Component | Source | Value |
	|-----------|--------|-------|
	| Outcome | Environment (terminal) | ±10.0 |
	| Shaping | Environment (per-step) | ±0.5, +0.2, -0.1 |
	| Format | Local function | +0.3 |

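	The game outcome is also propagated back to every move of that game, as +1.0 for a win, -1.0 for a loss, and +0.1 for a draw. These per-move values are on a different scale from the environment's ±10.0 terminal reward in the table.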
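
A minimal sketch of the two locally computed pieces: the format reward and the propagation of a game outcome to every move. It assumes the completion is exactly one JSON object of the form `{"thinking": "...", "column": N}` and that columns are numbered 0 to 6; the function names and the zero reward for malformed output are illustrative choices, not the notebook's code.

```python
import json

FORMAT_BONUS = 0.3  # "Format" row in the table above
NUM_COLUMNS = 7     # assumption: columns are 0-indexed, 0..6


def format_reward(completion: str) -> float:
    """Return +0.3 if the completion is JSON with a valid column, else 0.0."""
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict) or "thinking" not in data:
        return 0.0
    column = data.get("column")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(column, bool) or not isinstance(column, int):
        return 0.0
    return FORMAT_BONUS if 0 <= column < NUM_COLUMNS else 0.0


def propagate_outcome(step_rewards, outcome):
    """Add the game outcome to every per-step reward of that game."""
    return [r + outcome for r in step_rewards]
```

For example, `propagate_outcome([0.5, 0.0], 1.0)` credits both moves of a won game with the +1.0 outcome. Model output that wraps the JSON in extra text would score 0.0 here, so a more forgiving parser may be worth considering.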

	---

	## 📊 W&B Metrics to Track

	| Metric | What it shows |
	|--------|---------------|
	| `win_rate` | % games LLM wins vs rule-based |
	| `reward/mean` | Average per-step reward |
	| `kl_divergence` | Policy drift from base model |
	| `format_reward` | % responses with valid JSON |
	| `policy/entropy` | Exploration vs exploitation |
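
The game-level metrics in this table can be computed from per-game records before logging. The sketch below assumes a hypothetical record layout (`result`, `step_rewards`, `valid_json_steps`) and reports the rates as fractions between 0 and 1; `kl_divergence` and `policy/entropy` are left out because they would come from the trainer rather than from game records.

```python
def summarize_games(games):
    """Aggregate per-game records into win rate, mean reward, and format rate."""
    total_steps = sum(len(g["step_rewards"]) for g in games)
    total_reward = sum(sum(g["step_rewards"]) for g in games)
    valid_steps = sum(g["valid_json_steps"] for g in games)
    return {
        "win_rate": sum(g["result"] == "win" for g in games) / len(games),
        "reward/mean": total_reward / total_steps,
        "format_reward": valid_steps / total_steps,
    }


# Hypothetical records for two games.
games = [
    {"result": "win", "step_rewards": [0.5, 0.5, 1.0], "valid_json_steps": 3},
    {"result": "loss", "step_rewards": [0.0, -1.0], "valid_json_steps": 1},
]
metrics = summarize_games(games)
```

The resulting dictionary can be passed to `wandb.log(metrics)` inside an active W&B run.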

	---

	## 🔧 Environment Customization

	The Connect4 environment can be extended for more realistic autonomous driving:

	```python
	from pydantic import Field

	# Add to Connect4Action (fields shown standalone here):
	speed: float = Field(1.0, ge=0.0, le=3.0)  # vehicle speed
	lane_change: int = Field(0, ge=-1, le=1)  # lane change direction: -1, 0, or +1

	# Add to reward shaping (method of the environment class):
	def _safety_reward(self) -> float:
	    # Penalize high-speed moves near opponent
	    ...

	# Add multi-agent (>2 vehicles):
	AGENT3 = 3  # second LLM agent
	```
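
One possible shape for a safety term, written as a standalone function rather than the `_safety_reward` method above. Everything here is a hypothetical sketch: the `gap` measure (distance in columns to the nearest opponent piece), the thresholds, and the 0.5 maximum penalty are illustrative choices, not values from the environment.

```python
def safety_reward(speed: float, gap: int, max_penalty: float = 0.5) -> float:
    """Penalize high speed when the nearest opponent piece is close.

    speed: value in [0.0, 3.0], matching the `speed` field above.
    gap:   hypothetical distance in columns to the nearest opponent piece.
    """
    if gap >= 3 or speed <= 1.0:
        return 0.0  # far enough away, or slow enough: no penalty
    closeness = (3 - gap) / 3     # 1.0 at gap 0, shrinking to 0.0 at gap 3
    excess = (speed - 1.0) / 2.0  # 0.0 at speed 1.0, 1.0 at speed 3.0
    return -max_penalty * closeness * excess
```

The penalty scales with both closeness and excess speed, so it stays small compared with the outcome reward and acts as shaping rather than dominating the signal.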

	---

	## 📎 Key Links

	- OpenEnv repo: https://github.com/meta-pytorch/OpenEnv
	- Unsloth GRPO notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game_BF16.ipynb
	- Qwen3 GRPO (faster): https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
	- TRL OpenEnv docs: https://huggingface.co/docs/trl/openenv
	- Northflank Jupyter: https://northflank.notion.site/Jupyter-Notebook-with-PyTorch-2036d14c7851802abb7ccb4a7c5c96be

	---

	## ✅ Hackathon Checklist

	- [x] OpenEnv v0.2.1 environment built
	- [x] Connect4 game logic with shaped rewards
	- [x] Multi-agent (LLM + rule-based opponent)
	- [x] Deploy to HF Spaces via `openenv push`
	- [x] Unsloth GRPO training notebook (H100 BF16)
	- [x] W&B experiment tracking
	- [x] Closed-loop online RL loop
	- [x] Format reward for JSON CoT reasoning
	- [x] Evaluation tournament
	- [ ] Push trained model to HF Hub ← fill in after training