# 🚗 Meta OpenEnv Hackathon — Connect4 Multi-Agent Autonomous Driving

## Complete Delivery Guide

---

## 🏗️ Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                   TRAINING LOOP (Colab H100)                    │
│                                                                 │
│  ┌──────────────┐  prompts     ┌─────────────────────────────┐  │
│  │   Unsloth    │◄────────────►│  LLM (Qwen3-4B / gpt-oss)   │  │
│  │   GRPO/TRL   │  completions │  + LoRA Adapter             │  │
│  └──────┬───────┘              └─────────────────────────────┘  │
│         │ rewards                                               │
│  ┌──────▼───────┐    W&B                                        │
│  │  Reward Fns  │───────────► Experiment Tracking               │
│  └──────┬───────┘                                               │
└─────────┼───────────────────────────────────────────────────────┘
          │ step() / reset()
          │ WebSocket
┌─────────▼───────────────────────────────────────────────────────┐
│              HF SPACES (OpenEnv Environment Server)             │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Connect4Environment (FastAPI + OpenEnv v0.2.1)         │    │
│  │  • 6×7 board = intersection grid                        │    │
│  │  • Player 1 (X) = Ego Vehicle (LLM)                     │    │
│  │  • Player 2 (O) = Rule-based opponent                   │    │
│  │  • Shaped rewards: win/loss/block/3-in-row/format       │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
```

---

## 📁 File Structure

```
connect4_env/                      ← HF Spaces repo (deploy this)
├── __init__.py
├── models.py                      ← Pydantic Action/Observation/State
├── client.py                      ← Connect4Env(EnvClient)
├── openenv.yaml                   ← Manifest
├── pyproject.toml
├── Dockerfile                     ← HF Spaces Docker SDK
├── README.md                      ← HF Space card
└── server/
    ├── app.py                     ← FastAPI entry point
    ├── connect4_environment.py    ← Game logic + reward shaping
    └── requirements.txt

connect4_grpo_training.ipynb       ← Colab training notebook (H100)
```

---

## 🚀 Step-by-Step Deployment

### Step 1 — Deploy Environment to HF Spaces

```bash
# Install the OpenEnv CLI
pip install openenv-core==0.2.1

# Login to HF
huggingface-cli login

# From inside the connect4_env/ directory:
cd connect4_env
openenv push --repo-id YOUR_HF_USERNAME/connect4-env

# OR manually:
# 1. Create a new Space at https://huggingface.co/new-space
# 2. Set SDK = Docker, hardware = CPU Basic
# 3. Push this folder as the repo
```

After deployment, your env is live at:
`https://YOUR_HF_USERNAME-connect4-env.hf.space`

Test it:

```python
# First: pip install openenv-core==0.2.1 (or install the client from your HF Space)
from openenv.core.env_client import EnvClient
# ...
```

---

### Step 2 — Run Training on Northflank / Colab

**Option A: Google Colab (recommended for the hackathon)**

1. Open `connect4_grpo_training.ipynb` in Colab
2. Set Runtime → H100 GPU
3. Update the `HF_SPACE_URL` and `HF_MODEL_REPO` variables
4. Run all cells

**Option B: Northflank Jupyter PyTorch**

1. Go to https://app.northflank.com/t/openenv-hack-112/project/hackathon/services/jupyter-pytorch
2. Upload the notebook
3. The environment has PyTorch + CUDA pre-installed
4. Install Unsloth: `uv pip install unsloth vllm --torch-backend=auto`

---

### Step 3 — vLLM GRPO Fix (if issues)

Per the hackathon notes, if GRPO vLLM runs fail:

```bash
python -m venv unsloth_env
source unsloth_env/bin/activate
pip install --upgrade pip && pip install uv
uv pip install unsloth vllm --torch-backend=auto

# Always update Unsloth:
pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo
```

---

## 🔬 Training Pipeline Detail

### Pre-training → SFT → RLHF → RL+Envs

```
1. BASE MODEL (Qwen3-4B or gpt-oss-20B)
   Pre-trained on a large text corpus

2. IMPLICIT SFT
   Prompt engineering guides the output format:
   {"thinking": "...", "column": N}

3. GRPO (RL without an explicit reward model)
   - num_generations=4 rollouts per prompt
   - KL divergence penalty vs the reference policy
   - Format reward (JSON structure)
   - Environment reward (win/loss/block)

4. CLOSED-LOOP ONLINE RL
   - Play N games with the current policy
   - Collect (prompt, response, reward) tuples
   - Update the policy with GRPO
   - Repeat → self-improvement
```

### Reward Design

The reward function has three components:

| Component   | Source                 | Value            |
|-------------|------------------------|------------------|
| **Outcome** | Environment (terminal) | ±10.0            |
| **Shaping** | Environment (per-step) | ±0.5, +0.2, -0.1 |
| **Format**  | Local function         | +0.3             |

The terminal outcome is also propagated back to every move of the finished game as a per-move bonus (+1.0 win, -1.0 loss, +0.1 draw).

---

## 📊 W&B Metrics to Track

| Metric | What it shows |
|--------|---------------|
| `win_rate` | % of games the LLM wins vs the rule-based opponent |
| `reward/mean` | Average per-step reward |
| `kl_divergence` | Policy drift from the base model |
| `format_reward` | % of responses with valid JSON |
| `policy/entropy` | Exploration vs exploitation |

---

## 🔧 Environment Customization

The Connect4 environment can be extended for more realistic autonomous driving:

```python
# Add to Connect4Action (Field comes from pydantic, as in models.py):
speed: float = Field(1.0, ge=0.0, le=3.0)  # vehicle speed
lane_change: int = Field(0, ge=-1, le=1)   # lane change direction

# Add to reward shaping:
def _safety_reward(self) -> float:
    # Penalize high-speed moves near the opponent
    ...
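# One possible concrete version of _safety_reward, sketched as a standalone
# helper so it can run outside the class. The inputs are hypothetical: the
# environment above does not define a per-move speed or an opponent-distance
# lookup, so treat this purely as an illustration of the shaping idea.
def safety_reward(speed: float, opponent_distance: int) -> float:
    """Penalize fast moves made close to the opponent (hypothetical shaping)."""
    if opponent_distance <= 1 and speed > 2.0:
        return -0.5  # fast move directly next to the opponent
    if opponent_distance <= 2 and speed > 1.5:
        return -0.2  # moderately fast move in the caution zone
    return 0.0       # otherwise no safety penalty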
# Add multi-agent support (>2 vehicles):
AGENT3 = 3  # second LLM agent
```

---

## 📎 Key Links

- **OpenEnv repo**: https://github.com/meta-pytorch/OpenEnv
- **Unsloth GRPO notebook**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game_BF16.ipynb
- **Qwen3 GRPO (faster)**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
- **TRL OpenEnv docs**: https://huggingface.co/docs/trl/openenv
- **Northflank Jupyter**: https://northflank.notion.site/Jupyter-Notebook-with-PyTorch-2036d14c7851802abb7ccb4a7c5c96be

---

## ✅ Hackathon Checklist

- [x] OpenEnv v0.2.1 environment built
- [x] Connect4 game logic with shaped rewards
- [x] Multi-agent (LLM + rule-based opponent)
- [x] Deploy to HF Spaces via `openenv push`
- [x] Unsloth GRPO training notebook (H100 BF16)
- [x] W&B experiment tracking
- [x] Closed-loop online RL loop
- [x] Format reward for JSON CoT reasoning
- [x] Evaluation tournament
- [ ] Push trained model to HF Hub ← fill in after training
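The per-move outcome propagation described under Reward Design can be sketched as a small helper. This is a minimal illustration: the bonus values (+1.0 win, -1.0 loss, +0.1 draw) come from the guide, while the `gamma` discount knob is a hypothetical extra that the guide does not specify.

```python
def propagate_outcome(per_step_rewards, outcome, gamma=1.0):
    """Add the terminal outcome bonus to every move of a finished game.

    With gamma=1.0 every move shares the outcome equally; gamma<1.0 would
    weight moves near the end of the game more heavily (hypothetical option).
    """
    bonus = {"win": 1.0, "loss": -1.0, "draw": 0.1}[outcome]
    n = len(per_step_rewards)
    return [r + bonus * gamma ** (n - 1 - i) for i, r in enumerate(per_step_rewards)]

# Example: shaping rewards for a 3-move win
# propagate_outcome([0.2, -0.1, 0.5], "win") -> [1.2, 0.9, 1.5]
```

Each resulting per-move total is what a (prompt, response, reward) tuple would carry into the GRPO update in the closed-loop online RL loop.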