HackathonMarch2026 / HACKATHON_GUIDE.md

🚗 Meta OpenEnv Hackathon — Connect4 Multi-Agent Autonomous Driving

Complete Delivery Guide


🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                     TRAINING LOOP (Colab H100)                  │
│                                                                 │
│  ┌──────────────┐   prompts    ┌─────────────────────────────┐  │
│  │  Unsloth     │◄────────────►│  LLM (Qwen3-4B / gpt-oss)  │  │
│  │  GRPO/TRL    │  completions │  + LoRA Adapter             │  │
│  └──────┬───────┘              └─────────────────────────────┘  │
│         │ rewards                                                │
│  ┌──────▼───────┐    W&B                                        │
│  │  Reward Fns  │───────────► Experiment Tracking              │
│  └──────┬───────┘                                               │
└─────────┼───────────────────────────────────────────────────────┘
          │ step() / reset()
          │ WebSocket
┌─────────▼───────────────────────────────────────────────────────┐
│              HF SPACES (OpenEnv Environment Server)             │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Connect4Environment (FastAPI + OpenEnv v0.2.1)         │   │
│  │  • 6×7 board = intersection grid                        │   │
│  │  • Player 1 (X) = Ego Vehicle (LLM)                    │   │
│  │  • Player 2 (O) = Rule-based opponent                   │   │
│  │  • Shaped rewards: win/loss/block/3-in-row/format       │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

📁 File Structure

connect4_env/                    ← HF Spaces repo (deploy this)
├── __init__.py
├── models.py                    ← Pydantic Action/Observation/State
├── client.py                    ← Connect4Env(EnvClient)
├── openenv.yaml                 ← Manifest
├── pyproject.toml
├── Dockerfile                   ← HF Spaces Docker SDK
├── README.md                    ← HF Space card
└── server/
    ├── app.py                   ← FastAPI entry point
    ├── connect4_environment.py  ← Game logic + reward shaping
    └── requirements.txt

connect4_grpo_training.ipynb     ← Colab training notebook (H100)
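To make the file layout above concrete, here is a hedged sketch of what `models.py` might look like. The class and field names are illustrative (the repo's actual Pydantic models may differ); only the 6×7 board, the X/O player roles, and the use of Pydantic come from this guide.

```python
# Hypothetical sketch of models.py -- field names are assumptions, not the repo's actual API.
from typing import List, Optional

from pydantic import BaseModel, Field


class Connect4Action(BaseModel):
    """One move: drop a piece into a column of the 6x7 board."""
    column: int = Field(..., ge=0, le=6)   # 0-indexed column, validated by Pydantic
    thinking: Optional[str] = None         # chain-of-thought text from the LLM


class Connect4Observation(BaseModel):
    """What the agent sees after each step."""
    board: List[List[int]]                 # 6 rows x 7 cols; 0 empty, 1 ego (X), 2 opponent (O)
    legal_moves: List[int]                 # columns that are not yet full
    done: bool = False
    reward: float = 0.0
```

Keeping the action schema this small matters for GRPO: the LLM only has to emit a valid column index inside its JSON response, and Pydantic rejects out-of-range moves before they reach the game logic.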

🚀 Step-by-Step Deployment

Step 1 — Deploy Environment to HF Spaces

# Install OpenEnv CLI
pip install openenv-core==0.2.1

# Login to HF
huggingface-cli login

# From inside connect4_env/ directory:
cd connect4_env
openenv push --repo-id YOUR_HF_USERNAME/connect4-env

# OR manually:
# 1. Create new Space at https://huggingface.co/new-space
# 2. Set SDK = Docker, hardware = CPU Basic
# 3. Push this folder as the repo

After deployment, your env is live at: https://YOUR_HF_USERNAME-connect4-env.hf.space

Test it with a minimal smoke test (the client class and result fields below follow OpenEnv v0.2.1 conventions; adjust if your version differs):

pip install openenv-core==0.2.1   # or pip install from your HF Space

# Then, in Python:
from connect4_env.client import Connect4Env   # subclasses openenv.core.env_client.EnvClient

env = Connect4Env(base_url="https://YOUR_HF_USERNAME-connect4-env.hf.space")
result = env.reset()
print(result.observation)   # fresh 6x7 board

Step 2 — Run Training on Northflank / Colab

Option A: Google Colab (recommended for hackathon)

  1. Open connect4_grpo_training.ipynb in Colab
  2. Set Runtime → H100 GPU
  3. Update HF_SPACE_URL and HF_MODEL_REPO variables
  4. Run all cells

Option B: Northflank Jupyter PyTorch

  1. Go to https://app.northflank.com/t/openenv-hack-112/project/hackathon/services/jupyter-pytorch
  2. Upload the notebook
  3. The environment has PyTorch + CUDA pre-installed
  4. Install Unsloth: uv pip install unsloth vllm --torch-backend=auto

Step 3 — vLLM GRPO Fix (if issues)

Per the hackathon notes, if GRPO runs with vLLM fail, rebuild in a clean virtual environment:

python -m venv unsloth_env
source unsloth_env/bin/activate
pip install --upgrade pip && pip install uv
uv pip install unsloth vllm --torch-backend=auto
# Always update Unsloth:
pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo

🔬 Training Pipeline Detail

Pre-training → SFT → RLHF → RL+Envs

1. BASE MODEL (Qwen3-4B or gpt-oss-20B)
   Pre-trained on large text corpus

2. IMPLICIT SFT
   Prompt engineering guides the output format:
   {"thinking": "...", "column": N}

3. GRPO (RL without explicit reward model)
   - num_generations=4 rollouts per prompt
   - KL divergence penalty vs reference policy
   - Format reward (JSON structure)
   - Environment reward (win/loss/block)

4. CLOSED-LOOP ONLINE RL
   - Play N games with current policy
   - Collect (prompt, response, reward) tuples
   - Update policy with GRPO
   - Repeat → self-improvement
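The closed-loop collection step above can be sketched as follows. The `env_reset`/`env_step` callables mimic OpenEnv's `reset()`/`step()` contract, and `policy` stands in for the current LoRA-adapted LLM; all names here are illustrative, not the notebook's actual code.

```python
from typing import Callable, List, Tuple


def collect_rollouts(env_reset: Callable[[], str],
                     env_step: Callable[[str], Tuple[str, float, bool]],
                     policy: Callable[[str], str],
                     num_games: int) -> List[Tuple[str, str, float]]:
    """Play num_games with the current policy and return (prompt, response, reward)
    tuples ready for the GRPO update."""
    data = []
    for _ in range(num_games):
        obs, done = env_reset(), False
        while not done:
            response = policy(obs)                 # LLM picks a column
            next_obs, reward, done = env_step(response)
            data.append((obs, response, reward))   # store the prompt the policy saw
            obs = next_obs
    return data
```

After each batch of games, the collected tuples feed one GRPO update, and the improved policy plays the next batch; that alternation is the self-improvement loop.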

Reward Design

The reward function has 3 components:

| Component | Source                 | Value            |
|-----------|------------------------|------------------|
| Outcome   | Environment (terminal) | ±10.0            |
| Shaping   | Environment (per-step) | ±0.5, +0.2, −0.1 |
| Format    | Local function         | +0.3             |

Outcome is propagated back to all moves of a game (+1.0 win, -1.0 loss, +0.1 draw).
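The propagation rule in the sentence above can be written as a small helper. The bonus constants (+1.0 win, −1.0 loss, +0.1 draw) come straight from the text; the function name is illustrative.

```python
def propagate_outcome(step_rewards, outcome):
    """Add the game's terminal outcome bonus to every move of that game,
    so GRPO credits early moves for the final result."""
    bonus = {"win": 1.0, "loss": -1.0, "draw": 0.1}[outcome]
    return [r + bonus for r in step_rewards]
```

Spreading the outcome over all moves is the simplest credit-assignment choice; a discounted variant (larger bonus for later moves) would be a natural refinement.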


📊 W&B Metrics to Track

| Metric         | What it shows                       |
|----------------|-------------------------------------|
| win_rate       | % games LLM wins vs rule-based      |
| reward/mean    | Average per-step reward             |
| kl_divergence  | Policy drift from base model        |
| format_reward  | % responses with valid JSON         |
| policy/entropy | Exploration vs exploitation         |
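Of the metrics above, win_rate and format_reward need local computation before logging (the others come from the trainer). A hedged sketch, with illustrative function and field names:

```python
import json


def compute_eval_metrics(game_results, responses):
    """game_results: list of 'win'/'loss'/'draw' strings, one per evaluation game.
    responses: raw LLM completions. Returns the two locally computed W&B metrics."""
    win_rate = sum(r == "win" for r in game_results) / max(len(game_results), 1)

    def valid_json(text):
        # Format check: response must parse as {"thinking": ..., "column": N}
        try:
            obj = json.loads(text)
            return isinstance(obj, dict) and "column" in obj
        except (json.JSONDecodeError, TypeError):
            return False

    format_rate = sum(valid_json(t) for t in responses) / max(len(responses), 1)
    return {"win_rate": win_rate, "format_reward": format_rate}


# Then log alongside the trainer's metrics, e.g.:
# wandb.log(compute_eval_metrics(results, completions))
```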

🔧 Environment Customization

The Connect4 environment can be extended for more realistic autonomous driving:

# Add to Connect4Action:
speed: float = Field(1.0, ge=0.0, le=3.0)      # vehicle speed
lane_change: int = Field(0, ge=-1, le=1)        # lane change direction

# Add to reward shaping:
def _safety_reward(self) -> float:
    # Penalize high-speed moves near opponent
    ...

# Add multi-agent (>2 vehicles):
AGENT3 = 3  # second LLM agent

📎 Key Links


✅ Hackathon Checklist

  • OpenEnv v0.2.1 environment built
  • Connect4 game logic with shaped rewards
  • Multi-agent (LLM + rule-based opponent)
  • Deploy to HF Spaces via openenv push
  • Unsloth GRPO training notebook (H100 BF16)
  • W&B experiment tracking
  • Closed-loop online RL loop
  • Format reward for JSON CoT reasoning
  • Evaluation tournament
  • Push trained model to HF Hub ← fill in after training