helshahaby committed
Commit 185e2d2 · verified · 1 Parent(s): b21d658

Upload 6 files

Files changed (6)
  1. Dockerfile +24 -0
  2. HACKATHON_GUIDE.md +215 -0
  3. client.py +25 -0
  4. connect4_environment.py +225 -0
  5. connect4_grpo_training.ipynb +654 -0
  6. models.py +45 -0
Dockerfile ADDED
@@ -0,0 +1,24 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install Python dependencies
+ COPY server/requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy environment source
+ COPY . .
+ RUN pip install -e . --no-cache-dir
+
+ # HF Spaces runs on port 7860
+ EXPOSE 7860
+
+ # Enable web interface for HF Spaces demo
+ ENV ENABLE_WEB_INTERFACE=true
+
+ CMD ["python", "-m", "uvicorn", "connect4_env.server.app:app", "--host", "0.0.0.0", "--port", "7860"]
HACKATHON_GUIDE.md ADDED
@@ -0,0 +1,215 @@
+ # 🚗 Meta OpenEnv Hackathon — Connect4 Multi-Agent Autonomous Driving
+
+ ## Complete Delivery Guide
+
+ ---
+
+ ## 🏗️ Architecture Overview
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ TRAINING LOOP (Colab H100)                                      │
+ │                                                                 │
+ │ ┌──────────────┐   prompts    ┌─────────────────────────────┐   │
+ │ │ Unsloth      │◄────────────►│ LLM (Qwen3-4B / gpt-oss)    │   │
+ │ │ GRPO/TRL     │ completions  │ + LoRA Adapter              │   │
+ │ └──────┬───────┘              └─────────────────────────────┘   │
+ │        │ rewards                                                │
+ │ ┌──────▼───────┐     W&B                                        │
+ │ │ Reward Fns   │───────────► Experiment Tracking                │
+ │ └──────┬───────┘                                                │
+ └─────────┼───────────────────────────────────────────────────────┘
+           │ step() / reset()
+           │ WebSocket
+ ┌─────────▼───────────────────────────────────────────────────────┐
+ │ HF SPACES (OpenEnv Environment Server)                          │
+ │                                                                 │
+ │ ┌─────────────────────────────────────────────────────────┐     │
+ │ │ Connect4Environment (FastAPI + OpenEnv v0.2.1)          │     │
+ │ │ • 6×7 board = intersection grid                         │     │
+ │ │ • Player 1 (X) = Ego Vehicle (LLM)                      │     │
+ │ │ • Player 2 (O) = Rule-based opponent                    │     │
+ │ │ • Shaped rewards: win/loss/block/3-in-row/format        │     │
+ │ └─────────────────────────────────────────────────────────┘     │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
+
+ ---
+
+ ## 📁 File Structure
+
+ ```
+ connect4_env/                     ← HF Spaces repo (deploy this)
+ ├── __init__.py
+ ├── models.py                     ← Pydantic Action/Observation/State
+ ├── client.py                     ← Connect4Env(EnvClient)
+ ├── openenv.yaml                  ← Manifest
+ ├── pyproject.toml
+ ├── Dockerfile                    ← HF Spaces Docker SDK
+ ├── README.md                     ← HF Space card
+ └── server/
+     ├── app.py                    ← FastAPI entry point
+     ├── connect4_environment.py   ← Game logic + reward shaping
+     └── requirements.txt
+
+ connect4_grpo_training.ipynb      ← Colab training notebook (H100)
+ ```
+
+ ---
+
+ ## 🚀 Step-by-Step Deployment
+
+ ### Step 1 — Deploy Environment to HF Spaces
+
+ ```bash
+ # Install OpenEnv CLI
+ pip install openenv-core==0.2.1
+
+ # Login to HF
+ huggingface-cli login
+
+ # From inside connect4_env/ directory:
+ cd connect4_env
+ openenv push --repo-id YOUR_HF_USERNAME/connect4-env
+
+ # OR manually:
+ # 1. Create new Space at https://huggingface.co/new-space
+ # 2. Set SDK = Docker, hardware = CPU Basic
+ # 3. Push this folder as the repo
+ ```
+
+ After deployment, your env is live at:
+ `https://YOUR_HF_USERNAME-connect4-env.hf.space`
+
+ Test it:
+ ```python
+ # First: pip install openenv-core==0.2.1 (or pip install from your HF Space)
+ from connect4_env import Connect4Env, Connect4Action
+
+ with Connect4Env(base_url="https://YOUR_HF_USERNAME-connect4-env.hf.space").sync() as env:
+     obs = env.reset()
+     result = env.step(Connect4Action(column=3))
+ ```
+
+ ---
+
+ ### Step 2 — Run Training on Northflank / Colab
+
+ **Option A: Google Colab (recommended for hackathon)**
+ 1. Open `connect4_grpo_training.ipynb` in Colab
+ 2. Set Runtime → H100 GPU
+ 3. Update the `HF_SPACE_URL` and `HF_MODEL_REPO` variables
+ 4. Run all cells
+
+ **Option B: Northflank Jupyter PyTorch**
+ 1. Go to https://app.northflank.com/t/openenv-hack-112/project/hackathon/services/jupyter-pytorch
+ 2. Upload the notebook
+ 3. The environment has PyTorch + CUDA pre-installed
+ 4. Install Unsloth: `uv pip install unsloth vllm --torch-backend=auto`
+
+ ---
+
+ ### Step 3 — vLLM GRPO Fix (if issues)
+
+ Per the hackathon notes, if GRPO runs fail under vLLM, rebuild in a clean virtualenv:
+ ```bash
+ python -m venv unsloth_env
+ source unsloth_env/bin/activate
+ pip install --upgrade pip && pip install uv
+ uv pip install unsloth vllm --torch-backend=auto
+ # Always update Unsloth:
+ pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo
+ ```
+
+ ---
+
+ ## 🔬 Training Pipeline Detail
+
+ ### Pre-training → SFT → RLHF → RL+Envs
+
+ ```
+ 1. BASE MODEL (Qwen3-4B or gpt-oss-20B)
+    Pre-trained on large text corpus
+
+ 2. SFT (IMPLICIT)
+    Prompt engineering guides format:
+    {"thinking": "...", "column": N}
+
+ 3. GRPO (RL without explicit reward model)
+    - num_generations=4 rollouts per prompt
+    - KL divergence penalty vs reference policy
+    - Format reward (JSON structure)
+    - Environment reward (win/loss/block)
+
+ 4. CLOSED-LOOP ONLINE RL
+    - Play N games with current policy
+    - Collect (prompt, response, reward) tuples
+    - Update policy with GRPO
+    - Repeat → self-improvement
+ ```
+ ```
147
+
148
+ ### Reward Design
149
+
150
+ The reward function has 3 components:
151
+
152
+ | Component | Source | Value |
153
+ |-----------|--------|-------|
154
+ | **Outcome** | Environment (terminal) | ±10.0 |
155
+ | **Shaping** | Environment (per-step) | ±0.5, +0.2, -0.1 |
156
+ | **Format** | Local function | +0.3 |
157
+
158
+ Outcome is propagated back to all moves of a game (+1.0 win, -1.0 loss, +0.1 draw).
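+ Per move, the components simply sum. A toy sketch of the arithmetic (`total_move_reward` is a hypothetical helper mirroring the notebook's accumulation, not code from the repo):
+
+ ```python
+ def total_move_reward(env_reward: float, response_ok: bool,
+                       outcome_bonus: float) -> float:
+     """Sum the shaped env reward, the local format reward, and the
+     propagated game-outcome bonus for a single ego move."""
+     format_part = 0.3 if response_ok else -0.05   # format bonus / penalty
+     return env_reward + format_part + outcome_bonus
+
+ # A blocking move (+0.5) with valid JSON, in a game the agent later wins (+1.0):
+ r = total_move_reward(0.5, True, 1.0)  # 1.8
+ ```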
+
+ ---
+
+ ## 📊 W&B Metrics to Track
+
+ | Metric | What it shows |
+ |--------|---------------|
+ | `win_rate` | % games LLM wins vs rule-based |
+ | `reward/mean` | Average per-step reward |
+ | `kl_divergence` | Policy drift from base model |
+ | `format_reward` | % responses with valid JSON |
+ | `policy/entropy` | Exploration vs exploitation |
+
172
+ ---
173
+
174
+ ## 🔧 Environment Customization
175
+
176
+ The Connect4 environment can be extended for more realistic autonomous driving:
177
+
178
+ ```python
179
+ # Add to Connect4Action:
180
+ speed: float = Field(1.0, ge=0.0, le=3.0) # vehicle speed
181
+ lane_change: int = Field(0, ge=-1, le=1) # lane change direction
182
+
183
+ # Add to reward shaping:
184
+ def _safety_reward(self) -> float:
185
+ # Penalize high-speed moves near opponent
186
+ ...
187
+
188
+ # Add multi-agent (>2 vehicles):
189
+ AGENT3 = 3 # second LLM agent
190
+ ```
191
+
192
+ ---
193
+
194
+ ## 📎 Key Links
195
+
196
+ - **OpenEnv repo**: https://github.com/meta-pytorch/OpenEnv
197
+ - **Unsloth GRPO notebook**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game_BF16.ipynb
198
+ - **Qwen3 GRPO (faster)**: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
199
+ - **TRL OpenEnv docs**: https://huggingface.co/docs/trl/openenv
200
+ - **Northflank Jupyter**: https://northflank.notion.site/Jupyter-Notebook-with-PyTorch-2036d14c7851802abb7ccb4a7c5c96be
201
+
202
+ ---
203
+
204
+ ## ✅ Hackathon Checklist
205
+
206
+ - [x] OpenEnv v0.2.1 environment built
207
+ - [x] Connect4 game logic with shaped rewards
208
+ - [x] Multi-agent (LLM + rule-based opponent)
209
+ - [x] Deploy to HF Spaces via `openenv push`
210
+ - [x] Unsloth GRPO training notebook (H100 BF16)
211
+ - [x] W&B experiment tracking
212
+ - [x] Closed-loop online RL loop
213
+ - [x] Format reward for JSON CoT reasoning
214
+ - [x] Evaluation tournament
215
+ - [ ] Push trained model to HF Hub ← fill in after training
client.py ADDED
@@ -0,0 +1,25 @@
+ """
+ Connect4 Multi-Agent Environment — Client
+ OpenEnv v0.2.1 — connects to HF Space endpoint
+ """
+
+ from openenv.core.env_client import EnvClient
+ from .models import Connect4Action, Connect4Observation
+
+
+ class Connect4Env(EnvClient):
+     """
+     Client for the Connect4 multi-agent driving coordination environment.
+
+     Usage (async):
+         async with Connect4Env(base_url="https://YOUR-HF-SPACE.hf.space") as env:
+             obs = await env.reset()
+             result = await env.step(Connect4Action(column=3))
+
+     Usage (sync, for TRL/Unsloth training loops):
+         with Connect4Env(base_url="...").sync() as env:
+             obs = env.reset()
+             result = env.step(Connect4Action(column=3))
+     """
+     action_type = Connect4Action
+     observation_type = Connect4Observation
connect4_environment.py ADDED
@@ -0,0 +1,225 @@
+ """
+ Connect4 Multi-Agent Environment — Server Side
+ Adapted for autonomous driving scenario:
+ - Agent 1 = "Ego vehicle" (LLM being trained)
+ - Agent 2 = "Opponent vehicle" (rule-based or another LLM)
+
+ The board represents a grid intersection control problem:
+ - Winning = successfully navigating without collision
+ - Rewards shaped for RL post-training
+ """
+
+ import numpy as np
+ from typing import Optional
+ from openenv.core.environment import Environment
+ from ..models import (
+     Connect4Action, Connect4Observation, Connect4State
+ )
+
+
+ ROWS = 6
+ COLS = 7
+ EMPTY = 0
+ AGENT1 = 1  # Ego vehicle / LLM under training
+ AGENT2 = 2  # Opponent / rule-based agent
+
+
+ class Connect4Environment(Environment):
+     """
+     Connect4 as a multi-agent driving coordination environment.
+
+     Observation:
+     - Board state (6x7 grid)
+     - Current player turn
+     - Legal moves
+     - Last move played
+     - Game status
+
+     Reward shaping (for RL):
+         +10.0 → Win (ego agent connects 4)
+         -10.0 → Loss (opponent connects 4)
+         +0.5  → Blocking opponent's winning move
+         +0.2  → Creating a 3-in-a-row
+         -0.1  → Invalid move attempt
+          0.0  → Draw
+     """
+
+     def __init__(self):
+         super().__init__()
+         self.board: np.ndarray = np.zeros((ROWS, COLS), dtype=int)
+         self.current_player: int = AGENT1
+         self.done: bool = False
+         self.winner: Optional[int] = None
+         self.last_move: Optional[int] = None
+         self.move_history: list = []
+
+
56
+ # ------------------------------------------------------------------ #
57
+ # OpenEnv API #
58
+ # ------------------------------------------------------------------ #
59
+
60
+ def reset(self) -> Connect4Observation:
61
+ self.board = np.zeros((ROWS, COLS), dtype=int)
62
+ self.current_player = AGENT1
63
+ self.done = False
64
+ self.winner = None
65
+ self.last_move = None
66
+ self.move_history = []
67
+ return self._make_observation("Game reset. Your turn — you are Player 1 (Ego Vehicle).")
68
+
69
+ def step(self, action: Connect4Action) -> tuple[Connect4Observation, float, bool]:
70
+ if self.done:
71
+ obs = self._make_observation("Game already finished. Call reset() to start a new game.")
72
+ return obs, 0.0, True
73
+
74
+ col = action.column
75
+ reward = 0.0
76
+
77
+ # ---- validate move ----
78
+ if col < 0 or col >= COLS or not self._is_valid(col):
79
+ obs = self._make_observation(f"Invalid move: column {col} is full or out of range.")
80
+ return obs, -0.1, False
81
+
82
+ # ---- check for blocking bonus before placing ----
83
+ reward += self._blocking_bonus(col)
84
+
85
+ # ---- place piece ----
86
+ row = self._drop_piece(col, self.current_player)
87
+ self.last_move = col
88
+ self.move_history.append((self.current_player, col))
89
+
90
+ # ---- 3-in-a-row bonus ----
91
+ if self._count_streak(row, col, self.current_player) >= 3:
92
+ reward += 0.2
93
+
94
+ # ---- check win ----
95
+ if self._check_win(self.current_player):
96
+ self.done = True
97
+ self.winner = self.current_player
98
+ reward += 10.0 if self.current_player == AGENT1 else -10.0
99
+ msg = ("🏆 Ego vehicle wins! Successful navigation."
100
+ if self.current_player == AGENT1
101
+ else "💥 Opponent wins. Collision occurred.")
102
+ obs = self._make_observation(msg)
103
+ return obs, reward, True
104
+
105
+ # ---- check draw ----
106
+ if self._board_full():
107
+ self.done = True
108
+ obs = self._make_observation("🤝 Draw. Stalemate — no collision, no winner.")
109
+ return obs, 0.0, True
110
+
111
+ # ---- switch player ----
112
+ self.current_player = AGENT2 if self.current_player == AGENT1 else AGENT1
113
+ msg = f"Move accepted (col {col}). Now Player {self.current_player}'s turn."
114
+ obs = self._make_observation(msg)
115
+ return obs, reward, False
116
+
+     def state(self) -> Connect4State:
+         return Connect4State(
+             episode_id=self._episode_id,
+             step_count=self._step_count,
+             current_player=self.current_player,
+             done=self.done,
+             winner=self.winner,
+             move_history=self.move_history,
+         )
+
+     # ------------------------------------------------------------------ #
+     # Internal helpers                                                   #
+     # ------------------------------------------------------------------ #
+
+     def _make_observation(self, message: str) -> Connect4Observation:
+         return Connect4Observation(
+             board=self.board.tolist(),
+             board_str=self._render_board(),
+             current_player=self.current_player,
+             legal_moves=self._legal_moves(),
+             last_move=self.last_move,
+             done=self.done,
+             winner=self.winner,
+             message=message,
+         )
+
+     def _render_board(self) -> str:
+         symbols = {EMPTY: ".", AGENT1: "X", AGENT2: "O"}
+         rows = []
+         for r in range(ROWS):
+             rows.append(" ".join(symbols[self.board[r][c]] for c in range(COLS)))
+         rows.append("-" * (COLS * 2 - 1))
+         rows.append(" ".join(str(c) for c in range(COLS)))
+         return "\n".join(rows)
+
+     def _is_valid(self, col: int) -> bool:
+         return self.board[0][col] == EMPTY
+
+     def _legal_moves(self) -> list[int]:
+         return [c for c in range(COLS) if self._is_valid(c)]
+
+     def _drop_piece(self, col: int, player: int) -> int:
+         for row in range(ROWS - 1, -1, -1):
+             if self.board[row][col] == EMPTY:
+                 self.board[row][col] = player
+                 return row
+         return -1
+
+     def _check_win(self, player: int) -> bool:
+         b = self.board
+         # Horizontal
+         for r in range(ROWS):
+             for c in range(COLS - 3):
+                 if all(b[r][c+i] == player for i in range(4)):
+                     return True
+         # Vertical
+         for r in range(ROWS - 3):
+             for c in range(COLS):
+                 if all(b[r+i][c] == player for i in range(4)):
+                     return True
+         # Diagonal /
+         for r in range(3, ROWS):
+             for c in range(COLS - 3):
+                 if all(b[r-i][c+i] == player for i in range(4)):
+                     return True
+         # Diagonal \
+         for r in range(ROWS - 3):
+             for c in range(COLS - 3):
+                 if all(b[r+i][c+i] == player for i in range(4)):
+                     return True
+         return False
+
+     def _board_full(self) -> bool:
+         return all(self.board[0][c] != EMPTY for c in range(COLS))
+
+     def _count_streak(self, row: int, col: int, player: int) -> int:
+         """Count max consecutive pieces for player around (row, col)."""
+         directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
+         best = 1
+         for dr, dc in directions:
+             count = 1
+             for sign in [1, -1]:
+                 r, c = row + sign*dr, col + sign*dc
+                 while 0 <= r < ROWS and 0 <= c < COLS and self.board[r][c] == player:
+                     count += 1
+                     r += sign*dr
+                     c += sign*dc
+             best = max(best, count)
+         return best
+
+ def _blocking_bonus(self, col: int) -> float:
208
+ """+0.5 if placing here blocks opponent's 4-in-a-row."""
209
+ opponent = AGENT2 if self.current_player == AGENT1 else AGENT1
210
+ test_board = self.board.copy()
211
+ for row in range(ROWS - 1, -1, -1):
212
+ if test_board[row][col] == EMPTY:
213
+ test_board[row][col] = opponent
214
+ break
215
+ # Check if opponent would have won
216
+ b = test_board
217
+ for r in range(ROWS):
218
+ for c in range(COLS - 3):
219
+ if all(b[r][c+i] == opponent for i in range(4)):
220
+ return 0.5
221
+ for r in range(ROWS - 3):
222
+ for c in range(COLS):
223
+ if all(b[r+i][c] == opponent for i in range(4)):
224
+ return 0.5
225
+ return 0.0
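
The board convention used throughout (row 0 at the top, `_drop_piece` scanning upward from the bottom row) can be checked in isolation. This standalone snippet mirrors the method on a plain list board; it is illustrative, not part of the repo:

```python
ROWS, COLS, EMPTY = 6, 7, 0

def drop(board, col, player):
    """Place a piece in `col`, falling to the lowest empty row (row 0 is the top)."""
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == EMPTY:
            board[row][col] = player
            return row
    return -1

board = [[EMPTY] * COLS for _ in range(ROWS)]
rows = [drop(board, 3, 1) for _ in range(3)]
# Three pieces in column 3 stack upward from the bottom row: rows 5, 4, 3
```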
connect4_grpo_training.ipynb ADDED
@@ -0,0 +1,654 @@
+ {
+   "cells": [
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "# 🚗 Multi-Agent Autonomous Driving RL — Connect4 + OpenEnv v0.2.1\n",
+         "\n",
+         "**Hackathon Track:** Infra & Control, Tool & API Integration, Safety, Memory, Observability\n",
+         "\n",
+         "**Stack:**\n",
+         "- 🏗️ [OpenEnv v0.2.1](https://github.com/meta-pytorch/OpenEnv) — RL environment framework\n",
+         "- 🦥 [Unsloth](https://unsloth.ai) — fast GRPO fine-tuning (BF16, H100 optimized)\n",
+         "- 🤗 [TRL GRPO](https://huggingface.co/docs/trl) — policy optimization\n",
+         "- 📊 [W&B](https://wandb.ai) — experiment tracking\n",
+         "- ☁️ [HF Spaces](https://huggingface.co/spaces) — environment server\n",
+         "\n",
+         "**Environment:** Connect4 framed as multi-agent intersection coordination\n",
+         "- Player 1 (X) = Ego vehicle LLM (being trained)\n",
+         "- Player 2 (O) = Rule-based opponent vehicle\n",
+         "- Reward shaping encourages strategic, safe navigation decisions\n",
+         "\n",
+         "**Colab Runtime:** H100 GPU (BF16) — reduce `max_steps` for faster iteration"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": ["## 1️⃣ Install Dependencies"]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
+       "source": [
+         "# Install Unsloth (latest, with vLLM for fast inference)\n",
+         "import sys\n",
+         "!{sys.executable} -m pip install --upgrade pip\n",
+         "!{sys.executable} -m pip install uv\n",
+         "\n",
+         "# Use a venv for stability (recommended by the hackathon notes)\n",
+         "# If you hit issues, uncomment and run in a terminal:\n",
+         "# python -m venv unsloth_env && source unsloth_env/bin/activate\n",
+         "# uv pip install unsloth vllm --torch-backend=auto\n",
+         "\n",
+         "!uv pip install unsloth vllm --torch-backend=auto\n",
+         "!uv pip install --upgrade --no-cache-dir --no-deps unsloth unsloth_zoo\n",
+         "# Quote trl>=0.15.0 so the shell does not treat '>' as a redirect\n",
+         "!uv pip install openenv-core==0.2.1 wandb \"trl>=0.15.0\" pydantic numpy"
+       ]
+     },
52
+ {
53
+ "cell_type": "code",
54
+ "execution_count": null,
55
+ "metadata": {},
56
+ "outputs": [],
57
+ "source": [
58
+ "# Install our Connect4 environment from HF Spaces\n",
59
+ "# Replace YOUR_HF_USERNAME with your actual HF username after deploying\n",
60
+ "HF_SPACE_REPO = \"YOUR_HF_USERNAME/connect4-env\" # <-- update this\n",
61
+ "HF_SPACE_URL = f\"https://{HF_SPACE_REPO.replace('/', '-')}.hf.space\"\n",
62
+ "\n",
63
+ "!pip install git+https://huggingface.co/spaces/{HF_SPACE_REPO}\n",
64
+ "\n",
65
+ "print(f\"Environment endpoint: {HF_SPACE_URL}\")"
66
+ ]
67
+ },
68
+ {
69
+ "cell_type": "markdown",
70
+ "metadata": {},
71
+ "source": ["## 2️⃣ W&B Setup + Config"]
72
+ },
73
+ {
74
+ "cell_type": "code",
75
+ "execution_count": null,
76
+ "metadata": {},
77
+ "outputs": [],
78
+ "source": [
79
+ "import wandb\n",
80
+ "wandb.login() # will prompt for API key\n",
81
+ "\n",
82
+ "# ─── Hyperparameters ───────────────────────────────────────────────────\n",
83
+ "CONFIG = {\n",
84
+ " # Model\n",
85
+ " \"model_name\": \"unsloth/Qwen3-4B-unsloth-bnb-4bit\", # fast 4-bit for Colab\n",
86
+ " # \"model_name\": \"unsloth/gpt-oss-20b-bf16\", # H100 BF16 (hackathon default)\n",
87
+ "\n",
88
+ " # Training\n",
89
+ " \"max_steps\": 300, # reduce to 50 for quick test\n",
90
+ " \"num_generations\": 4, # rollouts per prompt\n",
91
+ " \"max_new_tokens\": 64, # per move response\n",
92
+ " \"learning_rate\": 5e-6,\n",
93
+ " \"batch_size\": 2,\n",
94
+ " \"gradient_accumulation_steps\": 4,\n",
95
+ "\n",
96
+ " # LoRA\n",
97
+ " \"lora_r\": 16,\n",
98
+ " \"lora_alpha\": 32,\n",
99
+ " \"fast_inference\": True, # uses vLLM for speed\n",
100
+ "\n",
101
+ " # Environment\n",
102
+ " \"env_url\": HF_SPACE_URL,\n",
103
+ " \"games_per_step\": 4,\n",
104
+ " \"max_moves\": 42, # max moves in Connect4\n",
105
+ "\n",
106
+ " # Reward weights\n",
107
+ " \"reward_win\": 10.0,\n",
108
+ " \"reward_lose\": -10.0,\n",
109
+ " \"reward_block\": 0.5,\n",
110
+ " \"reward_three\": 0.2,\n",
111
+ " \"reward_invalid\": -0.1,\n",
112
+ " \"reward_format\": 0.3, # bonus for correct JSON format\n",
113
+ "}\n",
114
+ "\n",
115
+ "run = wandb.init(\n",
116
+ " project=\"openenv-connect4-autodrive\",\n",
117
+ " config=CONFIG,\n",
118
+ " tags=[\"connect4\", \"grpo\", \"openenv\", \"autonomous-driving\", \"multi-agent\"]\n",
119
+ ")\n",
120
+ "print(\"W&B run:\", run.url)"
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "markdown",
125
+ "metadata": {},
126
+ "source": ["## 3️⃣ Load Model with Unsloth"]
127
+ },
128
+ {
129
+ "cell_type": "code",
130
+ "execution_count": null,
131
+ "metadata": {},
132
+ "outputs": [],
133
+ "source": [
134
+ "from unsloth import FastLanguageModel\n",
135
+ "import torch\n",
136
+ "\n",
137
+ "model, tokenizer = FastLanguageModel.from_pretrained(\n",
138
+ " model_name = CONFIG[\"model_name\"],\n",
139
+ " max_seq_length = 2048,\n",
140
+ " load_in_4bit = True, # set False for BF16 on H100\n",
141
+ " fast_inference = CONFIG[\"fast_inference\"],\n",
142
+ " gpu_memory_utilization = 0.7,\n",
143
+ ")\n",
144
+ "\n",
145
+ "# Add LoRA adapters\n",
146
+ "model = FastLanguageModel.get_peft_model(\n",
147
+ " model,\n",
148
+ " r = CONFIG[\"lora_r\"],\n",
149
+ " lora_alpha = CONFIG[\"lora_alpha\"],\n",
150
+ " target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
151
+ " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
152
+ " lora_dropout = 0,\n",
153
+ " bias = \"none\",\n",
154
+ " use_gradient_checkpointing = \"unsloth\",\n",
155
+ " random_state = 42,\n",
156
+ ")\n",
157
+ "\n",
158
+ "print(f\"✅ Model loaded: {CONFIG['model_name']}\")\n",
159
+ "print(f\" Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}\")"
160
+ ]
161
+ },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": ["## 4️⃣ Connect4 Prompt Engineering + Reward Functions"]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
+       "source": [
+         "import json, re\n",
+         "from typing import Optional\n",
+         "\n",
+         "# ─── System Prompt ──────────────────────────────────────────────────────\n",
+         "SYSTEM_PROMPT = \"\"\"You are an autonomous vehicle navigation AI (Player 1, symbol: X).\n",
+         "You are navigating a 6x7 grid intersection. Your goal is to coordinate your path\n",
+         "to create a connected route of 4 cells (Connect4) before the opponent vehicle (O).\n",
+         "\n",
+         "The board represents intersection occupancy. Each column is a lane (0-6).\n",
+         "Pieces fall to the lowest available row in each column.\n",
+         "\n",
+         "Think step by step about:\n",
+         "1. Your current formation and best extension\n",
+         "2. Opponent threats to block\n",
+         "3. The optimal column to select\n",
+         "\n",
+         "Respond ONLY with valid JSON:\n",
+         "{\"thinking\": \"<your reasoning>\", \"column\": <0-6>}\"\"\"\n",
+         "\n",
+         "\n",
+         "def format_prompt(obs_message: str, board_str: str, legal_moves: list) -> str:\n",
+         "    return f\"\"\"Current board state:\n",
+         "```\n",
+         "{board_str}\n",
+         "```\n",
+         "Legal moves (columns): {legal_moves}\n",
+         "Status: {obs_message}\n",
+         "\n",
+         "Select your move:\"\"\"\n",
+         "\n",
+         "\n",
+         "# ─── Reward Functions ───────────────────────────────────────────────────\n",
+         "def parse_llm_move(response: str) -> Optional[int]:\n",
+         "    \"\"\"Extract column from LLM JSON response.\"\"\"\n",
+         "    try:\n",
+         "        # Try direct JSON parse\n",
+         "        data = json.loads(response.strip())\n",
+         "        return int(data.get(\"column\", -1))\n",
+         "    except Exception:\n",
+         "        pass\n",
+         "    # Fallback: regex\n",
+         "    m = re.search(r'\"column\"\\s*:\\s*(\\d+)', response)\n",
+         "    if m:\n",
+         "        return int(m.group(1))\n",
+         "    # Last resort: find any digit\n",
+         "    digits = re.findall(r'\\b([0-6])\\b', response)\n",
+         "    return int(digits[-1]) if digits else None\n",
+         "\n",
+         "\n",
+         "def format_reward(response: str) -> float:\n",
+         "    \"\"\"Reward correct JSON format with thinking field.\"\"\"\n",
+         "    try:\n",
+         "        data = json.loads(response.strip())\n",
+         "        has_thinking = isinstance(data.get(\"thinking\"), str) and len(data[\"thinking\"]) > 10\n",
+         "        has_column = isinstance(data.get(\"column\"), int)\n",
+         "        return CONFIG[\"reward_format\"] if (has_thinking and has_column) else 0.0\n",
+         "    except Exception:\n",
+         "        return -0.05  # small penalty for unparseable output\n",
+         "\n",
+         "\n",
+         "print(\"✅ Prompt and reward functions defined\")"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": ["## 5️⃣ Rule-Based Opponent (Player 2)"]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
+       "source": [
+         "import random\n",
+         "\n",
+         "def opponent_move(board: list, legal_moves: list) -> int:\n",
+         "    \"\"\"\n",
+         "    Rule-based opponent (Player 2 / O):\n",
+         "    1. Win if possible\n",
+         "    2. Block Player 1's winning move\n",
+         "    3. Otherwise prefer the center column\n",
+         "    \"\"\"\n",
+         "    ROWS, COLS = 6, 7\n",
+         "    P2, P1 = 2, 1\n",
+         "\n",
+         "    def can_win_at(b, col, player):\n",
+         "        import copy\n",
+         "        b2 = copy.deepcopy(b)\n",
+         "        for row in range(ROWS-1, -1, -1):\n",
+         "            if b2[row][col] == 0:\n",
+         "                b2[row][col] = player\n",
+         "                break\n",
+         "        # Check win\n",
+         "        for r in range(ROWS):\n",
+         "            for c in range(COLS-3):\n",
+         "                if all(b2[r][c+i] == player for i in range(4)): return True\n",
+         "        for r in range(ROWS-3):\n",
+         "            for c in range(COLS):\n",
+         "                if all(b2[r+i][c] == player for i in range(4)): return True\n",
+         "        for r in range(3, ROWS):\n",
+         "            for c in range(COLS-3):\n",
+         "                if all(b2[r-i][c+i] == player for i in range(4)): return True\n",
+         "        for r in range(ROWS-3):\n",
+         "            for c in range(COLS-3):\n",
+         "                if all(b2[r+i][c+i] == player for i in range(4)): return True\n",
+         "        return False\n",
+         "\n",
+         "    # 1. Win\n",
+         "    for col in legal_moves:\n",
+         "        if can_win_at(board, col, P2):\n",
+         "            return col\n",
+         "    # 2. Block\n",
+         "    for col in legal_moves:\n",
+         "        if can_win_at(board, col, P1):\n",
+         "            return col\n",
+         "    # 3. Center preference\n",
+         "    center_order = sorted(legal_moves, key=lambda c: abs(c - 3))\n",
+         "    return center_order[0]\n",
+         "\n",
+         "print(\"✅ Rule-based opponent defined\")"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": ["## 6️⃣ OpenEnv Game Loop (Environment Interaction)"]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
+       "source": [
+         "from connect4_env import Connect4Env, Connect4Action\n",
+         "\n",
+         "async def play_game(model, tokenizer, env_url: str, verbose: bool = False):\n",
+         "    \"\"\"\n",
+         "    Run one complete Connect4 game.\n",
+         "    Returns a list of (prompt, response, reward) dicts for GRPO training.\n",
+         "    \"\"\"\n",
+         "    experiences = []\n",
+         "\n",
+         "    async with Connect4Env(base_url=env_url) as env:\n",
+         "        obs = await env.reset()\n",
+         "\n",
+         "        for move_num in range(CONFIG[\"max_moves\"]):\n",
+         "            if obs.done:\n",
+         "                break\n",
+         "\n",
+         "            # ── Player 1: LLM turn ──────────────────────────────────────\n",
+         "            if obs.current_player == 1:\n",
+         "                prompt = format_prompt(obs.message, obs.board_str, obs.legal_moves)\n",
+         "                messages = [\n",
+         "                    {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+         "                    {\"role\": \"user\", \"content\": prompt},\n",
+         "                ]\n",
+         "                input_ids = tokenizer.apply_chat_template(\n",
+         "                    messages, return_tensors=\"pt\", tokenize=True,\n",
+         "                    add_generation_prompt=True\n",
+         "                ).to(model.device)\n",
+         "\n",
+         "                with torch.no_grad():\n",
+         "                    output = model.generate(\n",
+         "                        input_ids,\n",
+         "                        max_new_tokens=CONFIG[\"max_new_tokens\"],\n",
+         "                        temperature=0.7,\n",
+         "                        do_sample=True,\n",
+         "                        pad_token_id=tokenizer.eos_token_id,\n",
+         "                    )\n",
+         "                response = tokenizer.decode(\n",
+         "                    output[0][input_ids.shape[1]:], skip_special_tokens=True\n",
+         "                )\n",
+         "\n",
+         "                col = parse_llm_move(response)\n",
+         "                if col is None or col not in obs.legal_moves:\n",
+         "                    col = random.choice(obs.legal_moves)  # fallback\n",
+         "                    env_reward = CONFIG[\"reward_invalid\"]\n",
+         "                else:\n",
+         "                    env_reward = 0.0\n",
+         "\n",
+         "                result = await env.step(Connect4Action(\n",
+         "                    column=col,\n",
+         "                    reasoning=response[:200]\n",
+         "                ))\n",
+         "\n",
+         "                # Accumulate rewards: env step + invalid-move penalty + format bonus\n",
+         "                total_reward = (\n",
+         "                    result.reward\n",
+         "                    + env_reward\n",
+         "                    + format_reward(response)\n",
+         "                )\n",
+         "\n",
+         "                experiences.append({\n",
+         "                    \"prompt\": tokenizer.apply_chat_template(messages, tokenize=False),\n",
+         "                    \"response\": response,\n",
+         "                    \"reward\": total_reward,\n",
+         "                    \"move\": col,\n",
+         "                    \"move_num\": move_num,\n",
+         "                })\n",
+         "\n",
+         "                obs = result.observation\n",
+         "                if verbose:\n",
+         "                    print(f\"P1 move {col} | reward {total_reward:.2f}\")\n",
+         "                    print(obs.board_str)\n",
+         "\n",
+         "            # ── Player 2: Rule-based opponent ───────────────────────────\n",
+         "            else:\n",
+         "                col = opponent_move(obs.board, obs.legal_moves)\n",
+         "                result = await env.step(Connect4Action(column=col))\n",
+         "                obs = result.observation\n",
+         "                if verbose:\n",
+         "                    print(f\"P2 move {col}\")\n",
+         "\n",
+         "    # Terminal reward propagation — assign game outcome to all moves\n",
+         "    if obs.winner == 1:\n",
+         "        outcome_bonus = 1.0\n",
+         "    elif obs.winner == 2:\n",
+         "        outcome_bonus = -1.0\n",
+         "    else:\n",
+         "        outcome_bonus = 0.1  # draw is slightly positive\n",
+         "\n",
+         "    for exp in experiences:\n",
+         "        exp[\"reward\"] += outcome_bonus\n",
+         "\n",
+         "    return experiences, obs.winner\n",
+         "\n",
+         "\n",
+         "# Quick sanity test (1 game, no training)\n",
+         "# Jupyter/Colab already runs an event loop, so use top-level await, not asyncio.run()\n",
+         "print(\"Running test game...\")\n",
+         "test_exps, winner = await play_game(model, tokenizer, CONFIG[\"env_url\"], verbose=True)\n",
+         "print(f\"\\nTest game winner: Player {winner} | Experiences collected: {len(test_exps)}\")"
+       ]
+     },
409
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": ["## 7️⃣ GRPO Training Loop (Unsloth + TRL)"]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from trl import GRPOTrainer, GRPOConfig\n",
+ "from datasets import Dataset\n",
+ "\n",
+ "# ─── Build initial dataset from self-play ───────────────────────────────\n",
+ "print(\"Collecting initial self-play data...\")\n",
+ "all_experiences = []\n",
+ "wins = 0\n",
+ "\n",
+ "for game_i in range(CONFIG[\"games_per_step\"]):\n",
+ "    exps, winner = asyncio.run(\n",
+ "        play_game(model, tokenizer, CONFIG[\"env_url\"])\n",
+ "    )\n",
+ "    all_experiences.extend(exps)\n",
+ "    if winner == 1:\n",
+ "        wins += 1\n",
+ "    print(f\"  Game {game_i+1}/{CONFIG['games_per_step']} | winner={winner}\")\n",
+ "\n",
+ "print(f\"\\nInitial win rate: {wins}/{CONFIG['games_per_step']} = {wins/CONFIG['games_per_step']:.1%}\")\n",
+ "wandb.log({\"initial_win_rate\": wins / CONFIG[\"games_per_step\"]})\n",
+ "\n",
+ "# Convert to HF Dataset\n",
+ "dataset = Dataset.from_list([\n",
+ "    {\"prompt\": e[\"prompt\"], \"reward\": e[\"reward\"]}\n",
+ "    for e in all_experiences\n",
+ "])\n",
+ "print(f\"Dataset size: {len(dataset)} samples\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# ─── Reward function for the GRPO Trainer ───────────────────────────────\n",
+ "# GRPOTrainer expects reward_funcs mapping (prompts, completions) -> list[float].\n",
+ "# The \"reward\" column above records what the environment returned during\n",
+ "# collection; GRPOTrainer itself only consumes reward_funcs at rollout time,\n",
+ "# so this function supplies the FORMAT reward during GRPO generations.\n",
+ "def grpo_reward_format(completions, **kwargs) -> list[float]:\n",
+ "    return [format_reward(c) for c in completions]\n",
+ "\n",
+ "\n",
+ "# ─── GRPO Config ────────────────────────────────────────────────────────\n",
+ "grpo_config = GRPOConfig(\n",
+ "    output_dir=\"./connect4-grpo-checkpoints\",\n",
+ "    num_train_epochs=1,\n",
+ "    max_steps=CONFIG[\"max_steps\"],  # max_steps takes precedence over epochs\n",
+ "    per_device_train_batch_size=CONFIG[\"batch_size\"],\n",
+ "    gradient_accumulation_steps=CONFIG[\"gradient_accumulation_steps\"],\n",
+ "    learning_rate=CONFIG[\"learning_rate\"],\n",
+ "    num_generations=CONFIG[\"num_generations\"],\n",
+ "    max_completion_length=CONFIG[\"max_new_tokens\"],\n",
+ "    max_prompt_length=1024,\n",
+ "    bf16=True,\n",
+ "    logging_steps=10,\n",
+ "    save_steps=100,\n",
+ "    report_to=\"wandb\",\n",
+ "    run_name=f\"connect4-grpo-{CONFIG['model_name'].split('/')[-1]}\",\n",
+ "    # GRPO-specific\n",
+ "    use_vllm=CONFIG[\"fast_inference\"],\n",
+ "    vllm_gpu_memory_utilization=0.3,\n",
+ "    temperature=0.7,\n",
+ "    beta=0.01,  # KL penalty coefficient\n",
+ ")\n",
+ "\n",
+ "# ─── Trainer ────────────────────────────────────────────────────────────\n",
+ "trainer = GRPOTrainer(\n",
+ "    model=model,\n",
+ "    processing_class=tokenizer,\n",
+ "    reward_funcs=[grpo_reward_format],\n",
+ "    args=grpo_config,\n",
+ "    train_dataset=dataset,\n",
+ ")\n",
+ "\n",
+ "print(\"✅ GRPO Trainer initialized\")\n",
+ "print(f\"   max_steps: {CONFIG['max_steps']}\")\n",
+ "print(f\"   fast_inference (vLLM): {CONFIG['fast_inference']}\")"
+ ]
+ },
500
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# ─── Run Training ────────────────────────────────────────────────────────\n",
+ "print(\"🚀 Starting GRPO training...\")\n",
+ "trainer.train()\n",
+ "print(\"✅ Training complete!\")"
+ ]
+ },
512
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": ["## 8️⃣ Online RL Loop — Closed-Loop Self-Play Training"]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\"\"\"[OPTIONAL - Advanced]\n",
+ "Online RL alternates between:\n",
+ "  (a) collecting fresh game data with the current model\n",
+ "  (b) a GRPO update on that fresh data\n",
+ "\n",
+ "This implements closed-loop learning — the key advantage of RL + Envs.\n",
+ "\"\"\"\n",
+ "\n",
+ "ONLINE_ITERATIONS = 5  # number of collect → train cycles\n",
+ "\n",
+ "win_rates = []\n",
+ "\n",
+ "for iteration in range(ONLINE_ITERATIONS):\n",
+ "    print(f\"\\n{'='*50}\")\n",
+ "    print(f\"Online RL Iteration {iteration+1}/{ONLINE_ITERATIONS}\")\n",
+ "    print('='*50)\n",
+ "\n",
+ "    # ── Collect fresh experience ──────────────────────────────────────\n",
+ "    fresh_exps = []\n",
+ "    wins = 0\n",
+ "    for _ in range(CONFIG[\"games_per_step\"]):\n",
+ "        exps, winner = asyncio.run(\n",
+ "            play_game(model, tokenizer, CONFIG[\"env_url\"])\n",
+ "        )\n",
+ "        fresh_exps.extend(exps)\n",
+ "        if winner == 1:\n",
+ "            wins += 1\n",
+ "\n",
+ "    win_rate = wins / CONFIG[\"games_per_step\"]\n",
+ "    win_rates.append(win_rate)\n",
+ "    print(f\"Win rate: {win_rate:.1%}\")\n",
+ "    wandb.log({\"win_rate\": win_rate, \"iteration\": iteration})\n",
+ "\n",
+ "    # ── Update dataset ────────────────────────────────────────────────\n",
+ "    fresh_dataset = Dataset.from_list([\n",
+ "        {\"prompt\": e[\"prompt\"], \"reward\": e[\"reward\"]}\n",
+ "        for e in fresh_exps\n",
+ "    ])\n",
+ "\n",
+ "    # ── Short GRPO update on fresh data ──────────────────────────────\n",
+ "    trainer.train_dataset = fresh_dataset\n",
+ "    trainer.args.max_steps = 50  # each train() call runs a short 50-step loop\n",
+ "    trainer.train()\n",
+ "\n",
+ "print(f\"\\nFinal win rates across iterations: {win_rates}\")\n",
+ "print(f\"Improvement: {win_rates[0]:.1%} → {win_rates[-1]:.1%}\")"
+ ]
+ },
570
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": ["## 9️⃣ Save & Push to HF Hub"]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Save LoRA adapter\n",
+ "model.save_pretrained(\"connect4-grpo-adapter\")\n",
+ "tokenizer.save_pretrained(\"connect4-grpo-adapter\")\n",
+ "\n",
+ "# Push to HF Hub\n",
+ "HF_MODEL_REPO = \"YOUR_HF_USERNAME/connect4-autonomous-driving-grpo\"  # <-- update\n",
+ "model.push_to_hub(HF_MODEL_REPO)\n",
+ "tokenizer.push_to_hub(HF_MODEL_REPO)\n",
+ "\n",
+ "# Save merged model (optional, for inference)\n",
+ "# model.save_pretrained_merged(\"connect4-merged\", tokenizer)\n",
+ "\n",
+ "print(f\"✅ Model pushed to: https://huggingface.co/{HF_MODEL_REPO}\")\n",
+ "wandb.finish()"
+ ]
+ },
597
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 📊 Evaluation\n",
+ "Test the trained model against the rule-based opponent."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Evaluation: 20-game tournament\n",
+ "FastLanguageModel.for_inference(model)  # switch to inference mode\n",
+ "\n",
+ "EVAL_GAMES = 20\n",
+ "results = {1: 0, 2: 0, None: 0}\n",
+ "\n",
+ "for i in range(EVAL_GAMES):\n",
+ "    _, winner = asyncio.run(play_game(model, tokenizer, CONFIG[\"env_url\"]))\n",
+ "    results[winner] = results.get(winner, 0) + 1\n",
+ "    print(f\"Game {i+1:2d}: winner = Player {winner}\")\n",
+ "\n",
+ "print(f\"\\n{'='*40}\")\n",
+ "print(f\"EVALUATION RESULTS ({EVAL_GAMES} games)\")\n",
+ "print(f\"  LLM wins  (P1): {results[1]:2d}  ({results[1]/EVAL_GAMES:.1%})\")\n",
+ "print(f\"  Rule wins (P2): {results[2]:2d}  ({results[2]/EVAL_GAMES:.1%})\")\n",
+ "print(f\"  Draws         : {results[None]:2d}  ({results[None]/EVAL_GAMES:.1%})\")\n",
+ "\n",
+ "wandb.log({\n",
+ "    \"eval_win_rate\": results[1] / EVAL_GAMES,\n",
+ "    \"eval_loss_rate\": results[2] / EVAL_GAMES,\n",
+ "    \"eval_draw_rate\": results[None] / EVAL_GAMES,\n",
+ "})"
+ ]
+ }
635
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.11.0"
+ },
+ "accelerator": "GPU",
+ "colab": {
+ "gpuType": "H100",
+ "provenance": []
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }
models.py ADDED
@@ -0,0 +1,45 @@
+ """
+ Connect4 Multi-Agent Environment — Models
+ OpenEnv v0.2.1
+ """
+
+ from typing import Optional
+ from pydantic import Field
+ from openenv.core.models import Action, Observation, State
+
+
+ class Connect4Action(Action):
+     """Action: choose which column to drop a piece into (0–6)."""
+     column: int = Field(
+         ...,
+         ge=0,
+         le=6,
+         description="Column index (0-6) to drop the piece into",
+     )
+     reasoning: Optional[str] = Field(
+         None,
+         description="LLM chain-of-thought reasoning for this move (used for reward shaping)",
+     )
+
+
+ class Connect4Observation(Observation):
+     """Full observation returned after each step."""
+     board: list[list[int]] = Field(..., description="6x7 board as nested list (0=empty, 1=P1, 2=P2)")
+     board_str: str = Field(..., description="Human-readable ASCII board")
+     current_player: int = Field(..., description="Which player moves next (1 or 2)")
+     legal_moves: list[int] = Field(..., description="List of valid column indices")
+     last_move: Optional[int] = Field(None, description="Column of the last move played")
+     done: bool = Field(False, description="Whether the game has ended")
+     winner: Optional[int] = Field(None, description="Winner (1 or 2) or None if ongoing/draw")
+     message: str = Field("", description="Human-readable status message")
+
+
+ class Connect4State(State):
+     """Episode-level state metadata."""
+     current_player: int = Field(1)
+     done: bool = Field(False)
+     winner: Optional[int] = Field(None)
+     move_history: list[tuple[int, int]] = Field(
+         default_factory=list,
+         description="List of (player, column) tuples"
+     )
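The observation schema above exposes a 6x7 `board` plus derived fields like `legal_moves`. A minimal, self-contained sketch of how those fields relate, assuming row 0 is the top of the grid (an assumption for illustration; the authoritative rules live in `connect4_environment.py`):

```python
# Sketch of the board conventions behind Connect4Observation.
# Assumes row 0 is the TOP row; 0 = empty, 1 = Player 1, 2 = Player 2.

ROWS, COLS = 6, 7

def legal_moves(board: list[list[int]]) -> list[int]:
    """A column stays playable while its top cell is still empty."""
    return [c for c in range(COLS) if board[0][c] == 0]

def drop_piece(board: list[list[int]], column: int, player: int) -> int:
    """Drop a piece into `column`; returns the row it lands in."""
    for r in range(ROWS - 1, -1, -1):  # scan from the bottom row upward
        if board[r][column] == 0:
            board[r][column] = player
            return r
    raise ValueError(f"column {column} is full")

board = [[0] * COLS for _ in range(ROWS)]
drop_piece(board, 3, 1)   # P1 lands on the bottom row
drop_piece(board, 3, 2)   # P2 stacks on top of it
print(legal_moves(board))  # → [0, 1, 2, 3, 4, 5, 6]
```

This is why the environment can reject a `Connect4Action` whose `column` is outside `legal_moves` even when it passes the pydantic `ge=0, le=6` bounds: the range check is static, while fullness depends on the board state.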