chessecon


suvasis committed on 2 days ago

Commit

a93cec9

1 Parent(s): e4d7d50

added huggingfacehub README

Files changed (1)

README.md +308 -92


@@ -2,7 +2,7 @@
 title: ChessEcon
 emoji: ♟️
 colorFrom: indigo
-colorTo: purple
 sdk: docker
 app_port: 8000
 tags:
@@ -15,40 +15,55 @@ tags:
   - economy
   - two-player
   - game
 license: apache-2.0
 ---
-# ♟️ ChessEcon — OpenEnv 0.1 Compliant Chess Economy Environment
-> **Self-hosted environment** — the live API runs on AdaBoost AI infrastructure.
-> Update this URL if the domain changes.
-**Live API base URL:** `https://chessecon.adaboost.io`
-**env_info:** `https://chessecon.adaboost.io/env/env_info`
 **Dashboard:** `https://chessecon-ui.adaboost.io`
-**Swagger docs:** `https://chessecon.adaboost.io/docs`
 ---
-**Two competing LLM agents play chess for economic stakes.**
-White = `Qwen/Qwen2.5-0.5B-Instruct` (trainable) | Black = `meta-llama/Llama-3.2-1B-Instruct` (fixed)
-Both agents pay an entry fee each game. The winner earns a prize pool.
-The White agent is trained live with **GRPO** (Group Relative Policy Optimisation).
 ---
 ## OpenEnv 0.1 API
-This environment is fully compliant with the [OpenEnv 0.1 spec](https://github.com/huggingface/openenv).
 | Endpoint | Method | Description |
 |---|---|---|
-| `/env/reset` | `POST` | Start a new episode, deduct entry fees, return initial observation |
-| `/env/step` | `POST` | Apply one move (UCI or SAN), return reward + next observation |
-| `/env/state` | `GET` | Inspect current board state — read-only, no side effects |
 | `/env/env_info` | `GET` | Environment metadata for HF Hub discoverability |
-| `/ws` | `WS` | Real-time event stream for the live dashboard |
 | `/health` | `GET` | Health check + model load status |
 | `/docs` | `GET` | Interactive Swagger UI |
@@ -63,17 +78,17 @@ BASE = "https://chessecon.adaboost.io"
 # 1. Start a new episode
 reset = httpx.post(f"{BASE}/env/reset").json()
-print(reset["observation"]["fen"])             # starting FEN
-print(reset["observation"]["legal_moves_uci"]) # all legal moves
-# 2. Play moves
 step = httpx.post(f"{BASE}/env/step", json={"action": "e2e4"}).json()
-print(step["observation"]["fen"])   # board after move
 print(step["reward"])               # per-step reward signal
-print(step["terminated"])           # True if game is over
-print(step["truncated"])            # True if move limit hit
-# 3. Inspect state (non-destructive)
 state = httpx.get(f"{BASE}/env/state").json()
 print(state["step_count"])          # moves played so far
 print(state["status"])              # "active" | "terminated" | "idle"
@@ -81,63 +96,78 @@ print(state["status"])              # "active" | "terminated" | "idle"
 # 4. Environment metadata
 info = httpx.get(f"{BASE}/env/env_info").json()
 print(info["openenv_version"])      # "0.1"
-print(info["agents"])               # white/black model IDs
 ```
-### Drop-in Client for TRL / verl / SkyRL
 ```python
 import httpx
-class ChessEconClient:
-    """OpenEnv 0.1 client — compatible with TRL, verl, SkyRL."""
     def __init__(self, base_url: str = "https://chessecon.adaboost.io"):
         self.base = base_url.rstrip("/")
-        self.client = httpx.Client(timeout=30)
-    def reset(self, seed=None):
         payload = {"seed": seed} if seed is not None else {}
-        r = self.client.post(f"{self.base}/env/reset", json=payload)
         r.raise_for_status()
-        data = r.json()
-        return data["observation"], data["info"]
-    def step(self, action: str):
-        r = self.client.post(f"{self.base}/env/step", json={"action": action})
         r.raise_for_status()
-        data = r.json()
-        return (
-            data["observation"],
-            data["reward"],
-            data["terminated"],
-            data["truncated"],
-            data["info"],
-        )
-    def state(self):
-        return self.client.get(f"{self.base}/env/state").json()
-    def env_info(self):
-        return self.client.get(f"{self.base}/env/env_info").json()
-# Usage
-env = ChessEconClient()
 obs, info = env.reset()
 while True:
-    action = obs["legal_moves_uci"][0]          # replace with your policy
     obs, reward, terminated, truncated, info = env.step(action)
     if terminated or truncated:
         break
 ```
 ---
 ## Observation Schema
-Every response wraps a `ChessObservation` object:
 ```json
 {
@@ -147,7 +177,7 @@ Every response wraps a `ChessObservation` object:
     "move_number": 1,
     "last_move_uci": "e2e4",
     "last_move_san": "e4",
-    "legal_moves_uci": ["e7e5", "d7d5", "g8f6"],
     "is_check": false,
     "wallet_white": 90.0,
     "wallet_black": 90.0,
@@ -158,11 +188,11 @@ Every response wraps a `ChessObservation` object:
 }
 ```
-### Step Response
 ```json
 {
-  "observation": { "...": "see above" },
   "reward": 0.01,
   "terminated": false,
   "truncated": false,
@@ -170,11 +200,11 @@ Every response wraps a `ChessObservation` object:
 }
 ```
-### State Response
 ```json
 {
-  "observation": { "...": "see above" },
   "episode_id": "ep-42",
   "step_count": 1,
   "status": "active",
@@ -182,69 +212,255 @@ Every response wraps a `ChessObservation` object:
 }
 ```
 ---
 ## Reward Structure
-| Event | Reward | Notes |
 |---|---|---|
-| Legal move | `+0.01` | Every valid move |
-| Move gives check | `+0.05` | Additional bonus |
-| Capture | `+0.10` | Additional bonus |
-| Win (checkmate) | `+1.00` | Terminal |
 | Loss | `-1.00` | Terminal |
 | Draw | `0.00` | Terminal |
-| Illegal move | `-0.10` | Episode continues |
-Combined reward: `0.4 × game_reward + 0.6 × economic_reward`
 ---
 ## Economy Model
 | Parameter | Value |
 |---|---|
 | Starting wallet | 100 units |
 | Entry fee | 10 units per agent per game |
 | Prize pool | 18 units (90% of 2 × entry fee) |
-| Draw refund | 5 units each |
 ---
 ## Architecture
 ```
-External RL Trainers (TRL / verl / SkyRL)
-          │  HTTP
-          ▼
-┌─────────────────────────────────────────────┐
-│         OpenEnv 0.1 HTTP API                │
-│  POST /env/reset  POST /env/step            │
-│  GET  /env/state  GET  /env/env_info        │
-│         asyncio.Lock — thread safe          │
-└──────────────┬──────────────────────────────┘
-               │
-       ┌───────┴────────┐
-       ▼                ▼
-┌─────────────┐  ┌──────────────┐
-│ Chess Engine│  │Economy Engine│
-│ python-chess│  │Wallets · Fees│
-│ FEN · UCI   │  │Prize Pool    │
-└──────┬──────┘  └──────────────┘
-       │
-  ┌────┴─────┐
-  ▼          ▼
-♔ Qwen     ♚ Llama
-0.5B       1B
-GRPO↑      Fixed
 ```
 ---
-## Hardware
-Self-hosted on AdaBoost AI infrastructure:
-- 4× NVIDIA RTX 3070 (lambda-quad)
-- Models loaded in 4-bit quantization
-Built by [AdaBoost AI](https://adaboost.io) · Hackathon 2026

 title: ChessEcon
 emoji: ♟️
 colorFrom: indigo
+colorTo: yellow
 sdk: docker
 app_port: 8000
 tags:
   - economy
   - two-player
   - game
+  - textarena
+  - llm-training
 license: apache-2.0
 ---
+<div align="center">
+# ♟️ ChessEcon
+### Multi-Agent Chess Economy · OpenEnv 0.1 · GRPO Live Training
+[![OpenEnv](https://img.shields.io/badge/OpenEnv-0.1-blueviolet?style=flat-square)](https://github.com/huggingface/openenv)
+[![TextArena](https://img.shields.io/badge/TextArena-compatible-orange?style=flat-square)](https://github.com/textarena)
+[![License](https://img.shields.io/badge/license-Apache--2.0-green?style=flat-square)](LICENSE)
+[![Hackathon](https://img.shields.io/badge/Hackathon-2026-gold?style=flat-square)](https://adaboost.io)
+**Live API:** `https://chessecon.adaboost.io`
 **Dashboard:** `https://chessecon-ui.adaboost.io`
+**Swagger:** `https://chessecon.adaboost.io/docs`
+**env_info:** `https://chessecon.adaboost.io/env/env_info`
+</div>
 ---
+## Overview
+ChessEcon is a **two-player LLM chess environment** where agents compete for economic stakes, fully compliant with the [OpenEnv 0.1](https://github.com/huggingface/openenv) specification.
+Two language models play chess head-to-head. Each game costs an entry fee, and the winner earns a prize pool. The White agent trains **live** using **GRPO** (Group Relative Policy Optimisation): every game updates the policy weights in real time. A Bloomberg-style dashboard streams all activity over WebSocket.
+| Agent | Model | Role |
+|---|---|---|
+| ♔ White | `Qwen/Qwen2.5-0.5B-Instruct` | **Trainable** — GRPO updates every game |
+| ♚ Black | `meta-llama/Llama-3.2-1B-Instruct` | **Fixed opponent** — frozen weights |
 ---
 ## OpenEnv 0.1 API
+All endpoints are compatible with TRL, verl, SkyRL, and any OpenEnv 0.1 trainer.
 | Endpoint | Method | Description |
 |---|---|---|
+| `/env/reset` | `POST` | Start new episode · deduct entry fees · return initial observation |
+| `/env/step` | `POST` | Apply one move (UCI or SAN) · return reward + next observation |
+| `/env/state` | `GET` | Read current board state — non-destructive |
 | `/env/env_info` | `GET` | Environment metadata for HF Hub discoverability |
+| `/ws` | `WebSocket` | Real-time event stream (moves, rewards, GRPO metrics) |
 | `/health` | `GET` | Health check + model load status |
 | `/docs` | `GET` | Interactive Swagger UI |
 # 1. Start a new episode
 reset = httpx.post(f"{BASE}/env/reset").json()
+print(reset["observation"]["fen"])              # starting position
+print(reset["observation"]["legal_moves_uci"])  # all legal moves in UCI
+# 2. Play a move (UCI or SAN accepted)
 step = httpx.post(f"{BASE}/env/step", json={"action": "e2e4"}).json()
+print(step["observation"]["fen"])   # updated board
 print(step["reward"])               # per-step reward signal
+print(step["terminated"])           # True when game ends
+print(step["truncated"])            # True if move limit reached
+# 3. Inspect current state (read-only)
 state = httpx.get(f"{BASE}/env/state").json()
 print(state["step_count"])          # moves played so far
 print(state["status"])              # "active" | "terminated" | "idle"
 # 4. Environment metadata
 info = httpx.get(f"{BASE}/env/env_info").json()
 print(info["openenv_version"])      # "0.1"
+print(info["agents"])               # model IDs for white/black
 ```
+---
+## Drop-in Client (TRL / verl / SkyRL)
 ```python
 import httpx
+class ChessEconEnv:
+    """
+    OpenEnv 0.1 client for ChessEcon.
+    Compatible with TRL, verl, SkyRL, and any gym-style RL trainer.
+    """
     def __init__(self, base_url: str = "https://chessecon.adaboost.io"):
         self.base = base_url.rstrip("/")
+        self.http = httpx.Client(timeout=30)
+    def reset(self, seed: int | None = None) -> tuple[dict, dict]:
         payload = {"seed": seed} if seed is not None else {}
+        r = self.http.post(f"{self.base}/env/reset", json=payload)
         r.raise_for_status()
+        d = r.json()
+        return d["observation"], d["info"]
+    def step(self, action: str) -> tuple[dict, float, bool, bool, dict]:
+        """
+        Args:
+            action: Move in UCI (e.g. "e2e4") or SAN (e.g. "e4")
+        Returns:
+            (observation, reward, terminated, truncated, info)
+        """
+        r = self.http.post(f"{self.base}/env/step", json={"action": action})
         r.raise_for_status()
+        d = r.json()
+        return (d["observation"], d["reward"], d["terminated"], d["truncated"], d["info"])
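+    def state(self) -> dict:
+        r = self.http.get(f"{self.base}/env/state")
+        r.raise_for_status()  # surface HTTP errors, as reset() and step() do
+        return r.json()
+    def env_info(self) -> dict:
+        r = self.http.get(f"{self.base}/env/env_info")
+        r.raise_for_status()
+        return r.json()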
+    def close(self):
+        self.http.close()
+# Example: random rollout
+import random
+env = ChessEconEnv()
 obs, info = env.reset()
+total_reward = 0.0
 while True:
+    action = random.choice(obs["legal_moves_uci"])  # replace with your policy
     obs, reward, terminated, truncated, info = env.step(action)
+    total_reward += reward
     if terminated or truncated:
+        print(f"Game over | result={info.get('result')} | total_reward={total_reward:.3f}")
         break
+env.close()
 ```
 ---
 ## Observation Schema
+Every response from `/env/reset`, `/env/step`, and `/env/state` contains a `ChessObservation`:
 ```json
 {
     "move_number": 1,
     "last_move_uci": "e2e4",
     "last_move_san": "e4",
+    "legal_moves_uci": ["e7e5", "d7d5", "g8f6", "..."],
     "is_check": false,
     "wallet_white": 90.0,
     "wallet_black": 90.0,
 }
 ```
+### `/env/step` Response
 ```json
 {
+  "observation": { "...": "ChessObservation — see above" },
   "reward": 0.01,
   "terminated": false,
   "truncated": false,
 }
 ```
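A typed wrapper can make the step response easier to consume in a training loop. The sketch below is one possible shape, not part of the ChessEcon codebase: `StepResult` and `parse_step` are hypothetical names, and the `info` key is assumed to be present because the drop-in client reads it.

```python
import json
from dataclasses import dataclass


@dataclass
class StepResult:
    """Hypothetical container mirroring the /env/step response fields."""
    observation: dict
    reward: float
    terminated: bool
    truncated: bool
    info: dict

    @property
    def done(self) -> bool:
        # An episode ends on either flag, matching the client's loop condition.
        return self.terminated or self.truncated


def parse_step(payload: str) -> StepResult:
    """Parse a JSON string shaped like the /env/step response."""
    d = json.loads(payload)
    return StepResult(
        observation=d["observation"],
        reward=float(d["reward"]),
        terminated=bool(d["terminated"]),
        truncated=bool(d["truncated"]),
        info=d.get("info", {}),  # assumption: tolerate a missing info object
    )


sample = '{"observation": {"is_check": false}, "reward": 0.01, "terminated": false, "truncated": false, "info": {}}'
result = parse_step(sample)
print(result.reward, result.done)
```

Keeping `terminated` and `truncated` separate preserves the distinction between a game that ended on the board and one that hit the move limit.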
+### `/env/state` Response
 ```json
 {
+  "observation": { "...": "ChessObservation — see above" },
   "episode_id": "ep-42",
   "step_count": 1,
   "status": "active",
 }
 ```
+### `/env/env_info` Response
+```json
+{
+  "openenv_version": "0.1",
+  "environment_id": "chessecon-v1",
+  "name": "ChessEcon",
+  "description": "Multi-agent chess economy with live GRPO training",
+  "action_space": "text",
+  "observation_space": "text",
+  "reward_range": [-1.0, 1.0],
+  "max_steps": 40,
+  "agents": {
+    "white": "Qwen/Qwen2.5-0.5B-Instruct",
+    "black": "meta-llama/Llama-3.2-1B-Instruct"
+  },
+  "tags": ["chess", "multi-agent", "economy", "grpo", "openenv"]
+}
+```
 ---
 ## Reward Structure
+Per-step rewards are issued after every move. Terminal rewards are issued at game end.
+| Event | Reward | Type |
 |---|---|---|
+| Legal move played | `+0.01` | Per-step |
+| Move delivers check | `+0.05` | Per-step bonus |
+| Capture | `+0.10` | Per-step bonus |
+| Win (checkmate / material adj.) | `+1.00` | Terminal |
 | Loss | `-1.00` | Terminal |
 | Draw | `0.00` | Terminal |
+| Illegal move attempted | `-0.10` | Per-step penalty |
+> **Combined reward formula:**
+> `R = 0.4 × game_reward + 0.6 × economic_reward`
+>
+> `economic_reward = (prize_income − entry_fee) / entry_fee`
+### Material Adjudication
+Games reaching the move limit are adjudicated by material count (Q=9, R=5, B=3, N=3, P=1). The side with more material wins, so a game that hits the move limit still yields a decisive `+1` / `-1` signal for GRPO training whenever the material is unequal.
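The material count can be sketched directly from a FEN string. This is a minimal illustration using only the piece values listed above; `material_balance` and `adjudicate` are hypothetical helpers, and scoring equal material as `0.0` is an assumption, since the README does not say how ties are resolved.

```python
# Piece values from the README: Q=9, R=5, B=3, N=3, P=1 (kings are not counted).
PIECE_VALUES = {"q": 9, "r": 5, "b": 3, "n": 3, "p": 1}


def material_balance(fen: str) -> int:
    """Return White's material minus Black's material for a FEN string."""
    placement = fen.split()[0]  # the first FEN field is the piece placement
    balance = 0
    for ch in placement:
        value = PIECE_VALUES.get(ch.lower())
        if value is None:
            continue  # digits, '/' separators, and kings contribute nothing
        # Uppercase letters are White pieces, lowercase are Black pieces.
        balance += value if ch.isupper() else -value
    return balance


def adjudicate(fen: str) -> float:
    """Map the material balance to White's terminal reward."""
    balance = material_balance(fen)
    if balance > 0:
        return 1.0
    if balance < 0:
        return -1.0
    return 0.0  # assumption: equal material is treated as a draw


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
black_missing_queen = "rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(material_balance(start))          # 0
print(adjudicate(black_missing_queen))  # 1.0
```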
 ---
 ## Economy Model
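+Both agents pay into a shared prize pool each game, tying economic incentives to the game outcome. Because only 90% of the combined entry fees is paid back out, each game is slightly negative-sum for the two agents taken together.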
 | Parameter | Value |
 |---|---|
 | Starting wallet | 100 units |
 | Entry fee | 10 units per agent per game |
 | Prize pool | 18 units (90% of 2 × entry fee) |
+| Win payout | +18 units → net **+8** |
+| Draw payout | +9 units each → net **−1** |
+| Loss payout | +0 units → net **−10** |
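The payout rows and the combined reward formula from the Reward Structure section can be checked with a few lines of arithmetic. The sketch below assumes that `game_reward` is the terminal reward (`+1`, `0`, `-1`) and that a draw splits the pool evenly, as the table implies; the function names are illustrative only.

```python
ENTRY_FEE = 10.0            # units per agent per game (from the table)
PRIZE_POOL_FRACTION = 0.9   # share of 2 x entry fee paid out as the prize pool


def prize_income(result: str) -> float:
    """Units paid to one agent for 'win', 'draw', or 'loss'."""
    pool = PRIZE_POOL_FRACTION * 2 * ENTRY_FEE  # 18 units
    return {"win": pool, "draw": pool / 2, "loss": 0.0}[result]


def economic_reward(result: str) -> float:
    """economic_reward = (prize_income - entry_fee) / entry_fee."""
    return (prize_income(result) - ENTRY_FEE) / ENTRY_FEE


def combined_reward(game_reward: float, result: str) -> float:
    """R = 0.4 x game_reward + 0.6 x economic_reward."""
    return 0.4 * game_reward + 0.6 * economic_reward(result)


for result, game_reward in [("win", 1.0), ("draw", 0.0), ("loss", -1.0)]:
    net = prize_income(result) - ENTRY_FEE
    print(f"{result:<4} net={net:+.0f} R={combined_reward(game_reward, result):+.2f}")
```

Under these assumptions a win gives R = +0.88, a draw gives R = -0.06, and a loss gives R = -1.00, so a draw is mildly penalised through the economic term even though its game reward is zero.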
+---
+## GRPO Training
+The White agent (`Qwen2.5-0.5B`) trains live using Group Relative Policy Optimisation:
+```
+Per-game update:
+  1. White generates moves: sample log π_θ(a | s) at each position
+  2. Reference log-probs log π_ref(a | s) computed from frozen snapshot
+  3. Terminal reward R ∈ {+1, 0, −1} from material adjudication
+  4. Advantage: A = (R − mean_R) / (std_R + ε)
+  5. Clipped surrogate: L = −min(ratio·A, clip(ratio, 0.8, 1.2)·A)
+  6. KL penalty: KL(π_θ ∥ π_ref), diff clamped to [−10, 10]
+  7. Total: L_total = L + β·KL,  β = 0.04
+  8. AdamW update, grad-norm clip max_norm=1.0
+```
+| Hyperparameter | Value |
+|---|---|
+| LoRA rank | 8 |
+| LoRA target modules | `q_proj`, `v_proj` |
+| Learning rate | `1e-5` |
+| KL coefficient β | `0.04` |
+| Update frequency | Every 1 game |
+| Checkpoint frequency | Every 100 steps |
+| Optimizer | AdamW |
+| Gradient clip | `max_norm=1.0` |
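The per-game update listed above can be illustrated for a single move in plain Python. This is a sketch under stated assumptions, not the project's trainer: the ratio is assumed to be taken against the log-prob recorded at sampling time (`logp_old`), the KL term is assumed to use the common non-negative estimator `exp(d) - d - 1` with `d = log π_ref - log π_θ`, and `EPS` is an assumed value for ε.

```python
import math

CLIP_LOW, CLIP_HIGH = 0.8, 1.2  # clip range from step 5
BETA = 0.04                     # KL coefficient from step 7
EPS = 1e-8                      # assumed value; the README does not state it


def group_advantages(rewards: list[float]) -> list[float]:
    """Step 4: A = (R - mean_R) / (std_R + eps) over a group of rewards."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + EPS) for r in rewards]


def token_loss(logp_new: float, logp_old: float, logp_ref: float,
               advantage: float) -> float:
    """Steps 5-7 for one sampled move: clipped surrogate plus beta * KL."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, CLIP_LOW), CLIP_HIGH)
    surrogate = -min(ratio * advantage, clipped * advantage)
    # Clamp the log-prob difference to [-10, 10] before exponentiating.
    diff = max(-10.0, min(10.0, logp_ref - logp_new))
    kl = math.exp(diff) - diff - 1.0  # assumed estimator, always >= 0
    return surrogate + BETA * kl


advantages = group_advantages([1.0, -1.0, 1.0, -1.0])
loss = token_loss(logp_new=-1.0, logp_old=-1.0, logp_ref=-1.0,
                  advantage=advantages[0])
print(advantages, loss)
```

A real trainer would compute the same quantities on tensors and average over all moves before the AdamW step; the scalar version only makes the clipping and clamping behaviour easy to inspect.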
 ---
 ## Architecture
 ```
+┌──────────────────────────────────────────────────────────────┐
+│               External RL Trainers                           │
+│         TRL · verl · SkyRL · custom OpenEnv clients         │
+└──────────────────────┬───────────────────────────────────────┘
+                       │ HTTP  POST /env/reset  /env/step
+                       │       GET  /env/state  /env/env_info
+                       ▼
+┌──────────────────────────────────────────────────────────────┐
+│                  FastAPI WebSocket Server                    │
+│  ┌──────────────────────┐   ┌───────────────────────────┐   │
+│  │  OpenEnv 0.1 Router  │   │  WebSocket  /ws           │   │
+│  │  asyncio.Lock        │   │  broadcast() → dashboard  │   │
+│  └──────────┬───────────┘   └───────────────────────────┘   │
+│             │                                               │
+│  ┌──────────▼───────────┐   ┌───────────────────────────┐   │
+│  │   Chess Engine        │   │   Economy Engine          │   │
+│  │   python-chess        │   │   Wallets · Entry fees    │   │
+│  │   FEN · UCI · SAN     │   │   Prize pool · P&L        │   │
+│  └──────────┬───────────┘   └───────────────────────────┘   │
+│             │                                               │
+│  ┌──────────▼───────────┐   ┌───────────────────────────┐   │
+│  │  ♔ White Agent        │   │  ♚ Black Agent (fixed)    │   │
+│  │  Qwen2.5-0.5B         │   │  Llama-3.2-1B             │   │
+│  │  LoRA r=8             │   │  Frozen weights           │   │
+│  └──────────┬───────────┘   └───────────────────────────┘   │
+│             │                                               │
+│  ┌──────────▼───────────┐                                   │
+│  │  GRPO Trainer         │──▶  /checkpoints/step_N         │
+│  │  PPO-clip + KL        │                                   │
+│  │  AdamW  LR=1e-5       │                                   │
+│  └──────────────────────┘                                   │
+└──────────────────────┬───────────────────────────────────────┘
+                       │ WebSocket broadcast()
+                       ▼
+┌──────────────────────────────────────────────────────────────┐
+│              React Dashboard (nginx)                         │
+│  Live Board · Wallet History · GRPO Metrics · P&L Chart     │
+│  Architecture View · Live Event Feed                        │
+└──────────────────────────────────────────────────────────────┘
 ```
 ---
+## WebSocket Event Stream
+Connect to `wss://chessecon.adaboost.io/ws` for real-time events:
+```python
+import asyncio, json, websockets
+async def watch():
+    async with websockets.connect("wss://chessecon.adaboost.io/ws") as ws:
+        async for raw in ws:
+            msg = json.loads(raw)
+            match msg["type"]:
+                case "move":
+                    print(f"{msg['data']['player']} plays {msg['data']['move']}")
+                case "game_end":
+                    d = msg["data"]
+                    print(f"Game over: {d['result']} | reward={d['reward']}")
+                case "training_step":
+                    d = msg["data"]
+                    print(f"GRPO step {d['step']} | loss={d['loss']:.4f} kl={d['kl_div']:.4f}")
+                case "status":
+                    print(f"Snapshot: game #{msg['data']['game_id']}")
+asyncio.run(watch())
+```
+### Event Types
+| Type | Key Fields |
+|---|---|
+| `status` | `game_id`, `wallet_white`, `wallet_black`, `grpo_step` |
+| `game_start` | `game_id`, `wallet_white`, `wallet_black`, `prize_pool` |
+| `move` | `player`, `move`, `uci`, `fen`, `move_number` |
+| `game_end` | `result`, `reward`, `wallet_white`, `wallet_black`, `net_pnl_white` |
+| `training_step` | `step`, `loss`, `reward`, `kl_div`, `win_rate` |
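Events from the stream can be folded into a running summary without touching the network code. The sketch below assumes each message has the `{"type": ..., "data": {...}}` shape used in the example above and that `net_pnl_white` is White's per-game profit or loss; `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(events: list[dict]) -> dict:
    """Aggregate already-decoded event messages into a small report."""
    counts: Counter = Counter()
    white_pnl = 0.0
    last_loss = None
    for msg in events:
        kind = msg.get("type")
        data = msg.get("data", {})
        counts[kind] += 1
        if kind == "game_end":
            # Accumulate White's profit and loss across finished games.
            white_pnl += float(data.get("net_pnl_white", 0.0))
        elif kind == "training_step":
            last_loss = data.get("loss")  # keep the most recent GRPO loss
    return {"counts": dict(counts), "white_pnl": white_pnl, "last_loss": last_loss}


# Made-up sample messages for illustration only.
sample_events = [
    {"type": "game_end", "data": {"result": "1-0", "net_pnl_white": 8.0}},
    {"type": "training_step", "data": {"step": 1, "loss": 0.5}},
    {"type": "game_end", "data": {"result": "0-1", "net_pnl_white": -10.0}},
]
report = summarize(sample_events)
print(report)
```

Inside the `watch()` coroutine, the decoded messages could be appended to a list and passed to `summarize` periodically.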
+---
+## Running Locally
+```bash
+git clone https://huggingface.co/spaces/adaboost-ai/chessecon
+cd chessecon
+# Download models (first run only — requires HF token for Llama)
+python3 -c "
+from huggingface_hub import snapshot_download
+snapshot_download('Qwen/Qwen2.5-0.5B-Instruct',
+    local_dir='training/models/Qwen_Qwen2.5-0.5B-Instruct')
+snapshot_download('meta-llama/Llama-3.2-1B-Instruct',
+    local_dir='training/models/meta-llama_Llama-3.2-1B-Instruct')
+"
+# Start backend + dashboard
+docker-compose up -d
+# API:       http://localhost:8008
+# Dashboard: http://localhost:3006
+# Docs:      http://localhost:8008/docs
+```
+### Key Environment Variables
+| Variable | Default | Description |
+|---|---|---|
+| `WHITE_MODEL` | `/models/Qwen_...` | Path to White model |
+| `BLACK_MODEL` | `/models/meta-llama_...` | Path to Black model |
+| `DEVICE` | `cuda` | `cuda` or `cpu` |
+| `MAX_MOVES` | `15` | Moves before material adjudication |
+| `MOVE_DELAY` | `0.05` | Seconds between moves |
+| `ENTRY_FEE` | `10` | Units per agent per game |
+| `PRIZE_POOL_FRACTION` | `0.9` | Fraction of 2×entry returned as prize |
+| `GRPO_LR` | `1e-5` | AdamW learning rate |
+| `GRPO_KL_COEFF` | `0.04` | KL divergence penalty β |
+| `LORA_RANK` | `8` | LoRA adapter rank |
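One way to read these variables with the documented defaults is a small settings loader. This is a sketch, not the project's configuration code: `Settings` and `load_settings` are hypothetical names, and `WHITE_MODEL` / `BLACK_MODEL` are left out because their default paths are abbreviated in the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    device: str
    max_moves: int
    move_delay: float
    entry_fee: float
    prize_pool_fraction: float
    grpo_lr: float
    grpo_kl_coeff: float
    lora_rank: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    return Settings(
        device=env.get("DEVICE", "cuda"),
        max_moves=int(env.get("MAX_MOVES", "15")),
        move_delay=float(env.get("MOVE_DELAY", "0.05")),
        entry_fee=float(env.get("ENTRY_FEE", "10")),
        prize_pool_fraction=float(env.get("PRIZE_POOL_FRACTION", "0.9")),
        grpo_lr=float(env.get("GRPO_LR", "1e-5")),
        grpo_kl_coeff=float(env.get("GRPO_KL_COEFF", "0.04")),
        lora_rank=int(env.get("LORA_RANK", "8")),
    )


defaults = load_settings({})                       # documented defaults only
cpu_run = load_settings({"DEVICE": "cpu", "MAX_MOVES": "40"})
print(defaults.max_moves, cpu_run.device, cpu_run.max_moves)  # 15 cpu 40
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test.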
+---
+## Hardware Requirements
+| Config | Minimum |
+|---|---|
+| CPU-only | 8 GB RAM · `DEVICE=cpu` |
+| GPU (recommended) | 8 GB VRAM · CUDA 11.8+ |
+| Dev server | 4× NVIDIA RTX 3070 (lambda-quad) |
+---
+## Citation
+```bibtex
+@software{chessecon2026,
+  title   = {ChessEcon: Multi-Agent Chess Economy with Live GRPO Training},
+  author  = {AdaBoost AI},
+  year    = {2026},
+  url     = {https://huggingface.co/spaces/adaboost-ai/chessecon},
+  note    = {OpenEnv 0.1 · TextArena + Meta OpenEnv · Hackathon 2026}
+}
+```
+---
+## Links
+- **Live Dashboard:** [chessecon-ui.adaboost.io](https://chessecon-ui.adaboost.io)
+- **API + Swagger:** [chessecon.adaboost.io/docs](https://chessecon.adaboost.io/docs)
+- **AdaBoost AI:** [adaboost.io](https://adaboost.io)
+- **OpenEnv Spec:** [github.com/huggingface/openenv](https://github.com/huggingface/openenv)
+- **GRPO Paper:** [DeepSeek-R1 (arXiv 2501.12948)](https://arxiv.org/abs/2501.12948)
+---
+<div align="center">
+Built by <a href="https://adaboost.io">AdaBoost AI</a> · TextArena + Meta OpenEnv + GRPO · Hackathon 2026
+</div>