🚗 Meta OpenEnv Hackathon — Connect4 Multi-Agent Autonomous Driving
Complete Delivery Guide
🏗️ Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ TRAINING LOOP (Colab H100) │
│ │
│ ┌──────────────┐ prompts ┌─────────────────────────────┐ │
│ │ Unsloth │◄────────────►│ LLM (Qwen3-4B / gpt-oss) │ │
│ │ GRPO/TRL │ completions │ + LoRA Adapter │ │
│ └──────┬───────┘ └─────────────────────────────┘ │
│ │ rewards │
│ ┌──────▼───────┐ W&B │
│ │ Reward Fns │───────────► Experiment Tracking │
│ └──────┬───────┘ │
└─────────┼───────────────────────────────────────────────────────┘
│ step() / reset()
│ WebSocket
┌─────────▼───────────────────────────────────────────────────────┐
│ HF SPACES (OpenEnv Environment Server) │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Connect4Environment (FastAPI + OpenEnv v0.2.1) │ │
│ │ • 6×7 board = intersection grid │ │
│ │ • Player 1 (X) = Ego Vehicle (LLM) │ │
│ │ • Player 2 (O) = Rule-based opponent │ │
│ │ • Shaped rewards: win/loss/block/3-in-row/format │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
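In code terms, the training-loop arrows reduce to two calls. A minimal sketch, assuming the Connect4Env client in connect4_env/client.py (see the file structure below) accepts a base_url, Connect4Action exposes a column field, and reset()/step() return a result with observation, reward, and done fields:
from connect4_env.client import Connect4Env
from connect4_env.models import Connect4Action

env = Connect4Env(base_url="https://YOUR_HF_USERNAME-connect4-env.hf.space")
result = env.reset()                                  # new game, empty 6×7 board
while not result.done:
    column = 3                                        # in training, the LLM picks this;
    result = env.step(Connect4Action(column=column))  # a fixed move suffices as a demo
print("terminal reward:", result.reward)              # ±10.0 per the reward table below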
📁 File Structure
connect4_env/ ← HF Spaces repo (deploy this)
├── __init__.py
├── models.py ← Pydantic Action/Observation/State
├── client.py ← Connect4Env(EnvClient)
├── openenv.yaml ← Manifest
├── pyproject.toml
├── Dockerfile ← HF Spaces Docker SDK
├── README.md ← HF Space card
└── server/
├── app.py ← FastAPI entry point
├── connect4_environment.py ← Game logic + reward shaping
└── requirements.txt
connect4_grpo_training.ipynb ← Colab training notebook (H100)
🚀 Step-by-Step Deployment
Step 1 — Deploy Environment to HF Spaces
# Install OpenEnv CLI
pip install openenv-core==0.2.1
# Login to HF
huggingface-cli login
# From inside connect4_env/ directory:
cd connect4_env
openenv push --repo-id YOUR_HF_USERNAME/connect4-env
# OR manually:
# 1. Create new Space at https://huggingface.co/new-space
# 2. Set SDK = Docker, hardware = CPU Basic
# 3. Push this folder as the repo
After deployment, your env is live at:
https://YOUR_HF_USERNAME-connect4-env.hf.space
Test it:
# Install the client library (or pip install from your HF Space):
pip install openenv-core==0.2.1
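Then run a minimal smoke test, assuming client.py's Connect4Env (which subclasses openenv.core.env_client.EnvClient) accepts a base_url:
from connect4_env.client import Connect4Env

env = Connect4Env(base_url="https://YOUR_HF_USERNAME-connect4-env.hf.space")
result = env.reset()
print(result.observation)   # expect an empty 6×7 board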
Step 2 — Run Training on Northflank / Colab
Option A: Google Colab (recommended for hackathon)
- Open connect4_grpo_training.ipynb in Colab
- Set Runtime → H100 GPU
- Update the HF_SPACE_URL and HF_MODEL_REPO variables
- Run all cells
Option B: Northflank Jupyter PyTorch
- Go to https://app.northflank.com/t/openenv-hack-112/project/hackathon/services/jupyter-pytorch
- Upload the notebook
- The environment has PyTorch + CUDA pre-installed
- Install Unsloth:
uv pip install unsloth vllm --torch-backend=auto
Step 3 — vLLM GRPO Fix (if issues)
Per the hackathon notes, if GRPO runs with vLLM fail, rebuild in a clean virtual environment:
python -m venv unsloth_env
source unsloth_env/bin/activate
pip install --upgrade pip && pip install uv
uv pip install unsloth vllm --torch-backend=auto
# Always update Unsloth:
pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo
🔬 Training Pipeline Detail
Pre-training → SFT → RLHF → RL+Envs
1. BASE MODEL (Qwen3-4B or gpt-oss-20B)
Pre-trained on large text corpus
2. IMPLICIT SFT
No separate fine-tuning stage; prompt engineering enforces the output format:
{"thinking": "...", "column": N}
3. GRPO (RL without explicit reward model)
- num_generations=4 rollouts per prompt
- KL divergence penalty vs reference policy
- Format reward (JSON structure)
- Environment reward (win/loss/block)
4. CLOSED-LOOP ONLINE RL
- Play N games with current policy
- Collect (prompt, response, reward) tuples
- Update policy with GRPO
- Repeat → self-improvement
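A minimal sketch of one closed-loop iteration, under assumptions not in the source: generate() wraps the current LoRA policy, grpo_update() wraps the TRL/Unsloth trainer step, and responses are valid JSON (the format reward below pushes the model toward that):
import json
from connect4_env.client import Connect4Env
from connect4_env.models import Connect4Action

def online_rl_iteration(env: Connect4Env, generate, grpo_update, n_games: int = 16):
    """Play n_games with the current policy, collect (prompt, response, reward)
    tuples, propagate the terminal outcome, then run one GRPO update."""
    batch = []
    for _ in range(n_games):
        result = env.reset()
        moves = []
        while not result.done:
            prompt = str(result.observation)              # board rendered as text
            response = generate(prompt)                   # JSON move from the LLM
            column = json.loads(response)["column"]       # assumes valid JSON
            result = env.step(Connect4Action(column=column))
            moves.append([prompt, response, result.reward])
        for move in moves[:-1]:                           # final move already carries
            move[2] += result.reward                      # the terminal outcome
        batch.extend(map(tuple, moves))
    grpo_update(batch)                                    # one update, then repeat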
Reward Design
The reward function has 3 components:
| Component | Source | Value |
|---|---|---|
| Outcome | Environment (terminal) | ±10.0 |
| Shaping | Environment (per-step) | ±0.5, +0.2, -0.1 |
| Format | Local function | +0.3 |
The terminal outcome (±10.0 from the environment, scaled down for the update) is propagated back to all moves of a game: +1.0 for a win, -1.0 for a loss, +0.1 for a draw.
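The format component is simple enough to sketch in full; the +0.3 value and the JSON schema ({"thinking": "...", "column": N}) both come from this guide:
import json

def format_reward(completion: str) -> float:
    """Return +0.3 for valid JSON with a string 'thinking' and an
    integer 'column' in [0, 6]; return 0.0 otherwise."""
    try:
        move = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if (isinstance(move, dict)
            and isinstance(move.get("thinking"), str)
            and isinstance(move.get("column"), int)
            and 0 <= move["column"] <= 6):
        return 0.3
    return 0.0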
📊 W&B Metrics to Track
| Metric | What it shows |
|---|---|
| win_rate | % of games the LLM wins vs the rule-based opponent |
| reward/mean | Average per-step reward |
| kl_divergence | Policy drift from the base model |
| format_reward | % of responses with valid JSON |
| policy/entropy | Exploration vs exploitation balance |
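TRL logs its own training metrics (loss, KL) when report_to="wandb" is set; game-level metrics like win_rate need a manual wandb.log call. A sketch with placeholder tallies:
import wandb

wandb.init(project="connect4-grpo")      # placeholder project name
wins, games = 11, 20                     # example tallies from one eval batch
wandb.log({
    "win_rate": wins / games,
    "format_reward": 18 / 20,            # fraction of valid-JSON responses
})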
🔧 Environment Customization
The Connect4 environment can be extended for more realistic autonomous driving:
# Add to Connect4Action:
speed: float = Field(1.0, ge=0.0, le=3.0) # vehicle speed
lane_change: int = Field(0, ge=-1, le=1) # lane change direction
# Add to reward shaping:
def _safety_reward(self) -> float:
# Penalize high-speed moves near opponent
...
# Add multi-agent (>2 vehicles):
AGENT3 = 3 # second LLM agent
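Put together, a minimal sketch of the extended action model, assuming models.py defines Connect4Action as a Pydantic model with a column field matching the JSON schema above (speed and lane_change are the hypothetical extensions):
from pydantic import BaseModel, Field

class Connect4Action(BaseModel):
    column: int = Field(..., ge=0, le=6)         # which column to drop into
    speed: float = Field(1.0, ge=0.0, le=3.0)    # hypothetical vehicle speed
    lane_change: int = Field(0, ge=-1, le=1)     # hypothetical lane-change direction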
📎 Key Links
- OpenEnv repo: https://github.com/meta-pytorch/OpenEnv
- Unsloth GRPO notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game_BF16.ipynb
- Qwen3 GRPO (faster): https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
- TRL OpenEnv docs: https://huggingface.co/docs/trl/openenv
- Northflank Jupyter: https://northflank.notion.site/Jupyter-Notebook-with-PyTorch-2036d14c7851802abb7ccb4a7c5c96be
✅ Hackathon Checklist
- OpenEnv v0.2.1 environment built
- Connect4 game logic with shaped rewards
- Multi-agent (LLM + rule-based opponent)
- Deploy to HF Spaces via openenv push
- Unsloth GRPO training notebook (H100 BF16)
- W&B experiment tracking
- Closed-loop online RL loop
- Format reward for JSON CoT reasoning
- Evaluation tournament
- Push trained model to HF Hub ← fill in after training