---
title: ChessEcon
emoji: ♟️
colorFrom: indigo
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - chess
  - multi-agent
  - grpo
  - rl-environment
  - economy
  - two-player
  - game
  - textarena
  - llm-training
license: apache-2.0
---

# ♟️ ChessEcon

**Multi-Agent Chess Economy · OpenEnv 0.1 · GRPO Live Training**


- **Live API:** https://chessecon.adaboost.io
- **Dashboard:** https://chessecon-ui.adaboost.io
- **Swagger:** https://chessecon.adaboost.io/docs
- **env_info:** https://chessecon.adaboost.io/env/env_info


## Overview

ChessEcon is a two-player LLM chess environment where agents compete for economic stakes, fully compliant with the OpenEnv 0.1 specification.

Two language models play chess head-to-head. Each game costs an entry fee. The winner earns a prize pool. The White agent trains live using GRPO (Group Relative Policy Optimisation) — every game updates the policy weights in real-time. A Bloomberg-style dashboard streams all activity via WebSocket.

| Agent | Model | Role |
|---|---|---|
| ♔ White | Qwen/Qwen2.5-0.5B-Instruct | Trainable — GRPO updates every game |
| ♚ Black | meta-llama/Llama-3.2-1B-Instruct | Fixed opponent — frozen weights |

## OpenEnv 0.1 API

All endpoints are compatible with TRL, verl, SkyRL, and any OpenEnv 0.1 trainer.

| Endpoint | Method | Description |
|---|---|---|
| `/env/reset` | POST | Start new episode · deduct entry fees · return initial observation |
| `/env/step` | POST | Apply one move (UCI or SAN) · return reward + next observation |
| `/env/state` | GET | Read current board state — non-destructive |
| `/env/env_info` | GET | Environment metadata for HF Hub discoverability |
| `/ws` | WebSocket | Real-time event stream (moves, rewards, GRPO metrics) |
| `/health` | GET | Health check + model load status |
| `/docs` | GET | Interactive Swagger UI |

## Quick Start

```python
import httpx

BASE = "https://chessecon.adaboost.io"

# 1. Start a new episode
reset = httpx.post(f"{BASE}/env/reset").json()
print(reset["observation"]["fen"])              # starting position
print(reset["observation"]["legal_moves_uci"])  # all legal moves in UCI

# 2. Play a move (UCI or SAN accepted)
step = httpx.post(f"{BASE}/env/step", json={"action": "e2e4"}).json()
print(step["observation"]["fen"])   # updated board
print(step["reward"])               # per-step reward signal
print(step["terminated"])           # True when game ends
print(step["truncated"])            # True if move limit reached

# 3. Inspect current state (read-only)
state = httpx.get(f"{BASE}/env/state").json()
print(state["step_count"])          # moves played so far
print(state["status"])              # "active" | "terminated" | "idle"

# 4. Environment metadata
info = httpx.get(f"{BASE}/env/env_info").json()
print(info["openenv_version"])      # "0.1"
print(info["agents"])               # model IDs for white/black
```

## Drop-in Client (TRL / verl / SkyRL)

```python
import httpx

class ChessEconEnv:
    """
    OpenEnv 0.1 client for ChessEcon.
    Compatible with TRL, verl, SkyRL, and any gym-style RL trainer.
    """

    def __init__(self, base_url: str = "https://chessecon.adaboost.io"):
        self.base = base_url.rstrip("/")
        self.http = httpx.Client(timeout=30)

    def reset(self, seed: int | None = None) -> tuple[dict, dict]:
        payload = {"seed": seed} if seed is not None else {}
        r = self.http.post(f"{self.base}/env/reset", json=payload)
        r.raise_for_status()
        d = r.json()
        return d["observation"], d["info"]

    def step(self, action: str) -> tuple[dict, float, bool, bool, dict]:
        """
        Args:
            action: Move in UCI (e.g. "e2e4") or SAN (e.g. "e4")
        Returns:
            (observation, reward, terminated, truncated, info)
        """
        r = self.http.post(f"{self.base}/env/step", json={"action": action})
        r.raise_for_status()
        d = r.json()
        return (d["observation"], d["reward"], d["terminated"], d["truncated"], d["info"])

    def state(self) -> dict:
        return self.http.get(f"{self.base}/env/state").json()

    def env_info(self) -> dict:
        return self.http.get(f"{self.base}/env/env_info").json()

    def close(self):
        self.http.close()


# Example: random rollout
import random

env = ChessEconEnv()
obs, info = env.reset()
total_reward = 0.0

while True:
    action = random.choice(obs["legal_moves_uci"])  # replace with your policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        print(f"Game over | result={info.get('result')} | total_reward={total_reward:.3f}")
        break

env.close()
```

## Observation Schema

Every response from `/env/reset`, `/env/step`, and `/env/state` contains a `ChessObservation`:

```json
{
  "observation": {
    "fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1",
    "turn": "black",
    "move_number": 1,
    "last_move_uci": "e2e4",
    "last_move_san": "e4",
    "legal_moves_uci": ["e7e5", "d7d5", "g8f6", "..."],
    "is_check": false,
    "wallet_white": 90.0,
    "wallet_black": 90.0,
    "white_model": "Qwen/Qwen2.5-0.5B-Instruct",
    "black_model": "meta-llama/Llama-3.2-1B-Instruct",
    "info": {}
  }
}
```
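Because `fen`, `turn`, and `legal_moves_uci` are plain strings, a client can sanity-check an observation before feeding it to a policy. A minimal stdlib-only sketch (the `check_observation` helper is illustrative, not part of the API):

```python
import re

# Example ChessObservation fields (copied from the schema above)
obs = {
    "fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1",
    "turn": "black",
    "legal_moves_uci": ["e7e5", "d7d5", "g8f6"],
}

UCI = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")

def check_observation(obs: dict) -> bool:
    """Light structural validation of a ChessObservation payload (illustrative)."""
    fields = obs["fen"].split()
    if len(fields) != 6:   # placement, side, castling, en passant, halfmove, fullmove
        return False
    if fields[1] != {"white": "w", "black": "b"}[obs["turn"]]:
        return False       # side to move in the FEN must agree with `turn`
    return all(UCI.fullmatch(m) for m in obs["legal_moves_uci"])

print(check_observation(obs))  # True
```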

### `/env/step` Response

```json
{
  "observation": { "...": "ChessObservation — see above" },
  "reward": 0.01,
  "terminated": false,
  "truncated": false,
  "info": { "san": "e4", "uci": "e2e4", "move_number": 1 }
}
```

### `/env/state` Response

```json
{
  "observation": { "...": "ChessObservation — see above" },
  "episode_id": "ep-42",
  "step_count": 1,
  "status": "active",
  "info": {}
}
```

### `/env/env_info` Response

```json
{
  "openenv_version": "0.1",
  "environment_id": "chessecon-v1",
  "name": "ChessEcon",
  "description": "Multi-agent chess economy with live GRPO training",
  "action_space": "text",
  "observation_space": "text",
  "reward_range": [-1.0, 1.0],
  "max_steps": 40,
  "agents": {
    "white": "Qwen/Qwen2.5-0.5B-Instruct",
    "black": "meta-llama/Llama-3.2-1B-Instruct"
  },
  "tags": ["chess", "multi-agent", "economy", "grpo", "openenv"]
}
```

## Reward Structure

Per-step rewards are issued after every move. Terminal rewards are issued at game end.

| Event | Reward | Type |
|---|---|---|
| Legal move played | +0.01 | Per-step |
| Move delivers check | +0.05 | Per-step bonus |
| Capture | +0.10 | Per-step bonus |
| Win (checkmate / material adj.) | +1.00 | Terminal |
| Loss | −1.00 | Terminal |
| Draw | 0.00 | Terminal |
| Illegal move attempted | −0.10 | Per-step penalty |

Combined reward formula:

```text
R = 0.4 × game_reward + 0.6 × economic_reward
economic_reward = (prize_income − entry_fee) / entry_fee
```
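Plugging the default economy numbers (10-unit entry fee, 18-unit winner payout, 9 units each on a draw) into the formula above gives the combined reward per outcome. A small sketch, assuming a terminal `game_reward` of +1 / 0 / −1:

```python
def combined_reward(game_reward: float, prize_income: float,
                    entry_fee: float = 10.0) -> float:
    """R = 0.4 * game_reward + 0.6 * economic_reward (weights from the formula above)."""
    economic_reward = (prize_income - entry_fee) / entry_fee
    return 0.4 * game_reward + 0.6 * economic_reward

# Defaults: 10-unit entry fee, 18-unit prize pool (90% of 2 x 10)
print(round(combined_reward(+1.0, 18.0), 4))  # 0.88   (win : 0.4*1 + 0.6*0.8)
print(round(combined_reward(0.0, 9.0), 4))    # -0.06  (draw: 0.6 * -0.1)
print(round(combined_reward(-1.0, 0.0), 4))   # -1.0   (loss: -0.4 - 0.6)
```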

### Material Adjudication

Games reaching the move limit are adjudicated by material count (Q=9, R=5, B=3, N=3, P=1). The side with superior material wins, so nearly every game produces a decisive +1 / −1 signal for GRPO training (level material is scored as a draw).
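The adjudication rule can be sketched directly from the FEN piece-placement field with the values above (stdlib-only; `adjudicate` is an illustrative helper, not the server's implementation):

```python
PIECE_VALUES = {"q": 9, "r": 5, "b": 3, "n": 3, "p": 1}  # kings excluded

def adjudicate(fen: str) -> int:
    """Return +1 if White leads on material, -1 if Black does, 0 if level."""
    placement = fen.split()[0]  # first FEN field: piece placement
    white = sum(PIECE_VALUES.get(c.lower(), 0) for c in placement if c.isupper())
    black = sum(PIECE_VALUES.get(c, 0) for c in placement if c.islower())
    return (white > black) - (white < black)

print(adjudicate("4k3/8/8/8/8/8/8/Q3K3 w - - 0 40"))  # 1   (White up a queen)
print(adjudicate("r3k3/8/8/8/8/8/8/4K3 b - - 0 40"))  # -1  (Black up a rook)
print(adjudicate("4k3/8/8/8/8/8/8/4K3 w - - 0 40"))   # 0   (bare kings: draw)
```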


## Economy Model

Both agents pay into a shared prize pool each game, creating zero-sum economic incentives aligned with game outcome.

| Parameter | Value |
|---|---|
| Starting wallet | 100 units |
| Entry fee | 10 units per agent per game |
| Prize pool | 18 units (90% of 2 × entry fee) |
| Win payout | +18 units → net +8 |
| Draw payout | +9 units each → net −1 |
| Loss payout | +0 units → net −10 |
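The cash flows in the table amount to a small wallet-settlement step each game. A sketch using the default entry fee and prize-pool fraction (the draw case splits the pool, matching the +9-each payout above):

```python
ENTRY_FEE = 10.0
PRIZE_POOL_FRACTION = 0.9  # the remaining 10% of the pot is not paid out

def settle(wallet_white: float, wallet_black: float, result: str):
    """Apply one game's entry fees and payout. result: 'white' | 'black' | 'draw'."""
    pool = PRIZE_POOL_FRACTION * 2 * ENTRY_FEE  # 18 units
    wallet_white -= ENTRY_FEE
    wallet_black -= ENTRY_FEE
    if result == "white":
        wallet_white += pool
    elif result == "black":
        wallet_black += pool
    else:                                       # draw: split the pool evenly
        wallet_white += pool / 2
        wallet_black += pool / 2
    return wallet_white, wallet_black

print(settle(100.0, 100.0, "white"))  # (108.0, 90.0): winner net +8, loser net -10
print(settle(100.0, 100.0, "draw"))   # (99.0, 99.0):  net -1 each
```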

## GRPO Training

The White agent (Qwen2.5-0.5B) trains live using Group Relative Policy Optimisation:

```text
Per-game update:
  1. White generates moves: sample log π_θ(a | s) at each position
  2. Reference log-probs log π_ref(a | s) computed from frozen snapshot
  3. Terminal reward R ∈ {+1, 0, −1} from material adjudication
  4. Advantage: A = (R − mean_R) / (std_R + ε)
  5. Clipped surrogate: L = −min(ratio·A, clip(ratio, 0.8, 1.2)·A)
  6. KL penalty: KL(π_θ ∥ π_ref), diff clamped to [−10, 10]
  7. Total: L_total = L + β·KL,  β = 0.04
  8. AdamW update, grad-norm clip max_norm=1.0
```

| Hyperparameter | Value |
|---|---|
| LoRA rank | 8 |
| LoRA target modules | q_proj, v_proj |
| Learning rate | 1e-5 |
| KL coefficient β | 0.04 |
| Update frequency | Every game |
| Checkpoint frequency | Every 100 steps |
| Optimizer | AdamW |
| Gradient clip | max_norm=1.0 |
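Collapsed to scalars, steps 4–7 of the per-game update can be sketched in plain Python. The per-sample KL estimator `exp(d) − d − 1` is an assumption borrowed from the GRPO paper, and the real trainer works on per-token log-probs rather than a single scalar per game:

```python
import math
from statistics import mean, pstdev

def grpo_loss(logp_new, logp_ref, rewards, i,
              beta=0.04, clip_lo=0.8, clip_hi=1.2, eps=1e-8):
    """Scalar sketch of steps 4-7 for sample i of a reward group (illustrative)."""
    # Step 4: group-relative advantage
    adv = (rewards[i] - mean(rewards)) / (pstdev(rewards) + eps)
    # Step 5: probability ratio and clipped surrogate
    ratio = math.exp(logp_new - logp_ref)
    clipped = min(max(ratio, clip_lo), clip_hi)
    surrogate = -min(ratio * adv, clipped * adv)
    # Step 6: non-negative per-sample estimate of KL(π_θ ‖ π_ref),
    # with the log-prob difference clamped to [-10, 10]
    d = min(max(logp_ref - logp_new, -10.0), 10.0)
    kl = math.exp(d) - d - 1.0
    # Step 7: total loss with β = 0.04
    return surrogate + beta * kl

rewards = [1.0, -1.0, 1.0, -1.0]  # terminal rewards from material adjudication
print(round(grpo_loss(-1.0, -1.2, rewards, i=0), 4))
```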

## Architecture

```text
┌──────────────────────────────────────────────────────────────┐
│               External RL Trainers                           │
│         TRL · verl · SkyRL · custom OpenEnv clients         │
└──────────────────────┬───────────────────────────────────────┘
                       │ HTTP  POST /env/reset  /env/step
                       │       GET  /env/state  /env/env_info
                       ▼
┌──────────────────────────────────────────────────────────────┐
│                  FastAPI WebSocket Server                    │
│  ┌──────────────────────┐   ┌───────────────────────────┐   │
│  │  OpenEnv 0.1 Router  │   │  WebSocket  /ws           │   │
│  │  asyncio.Lock        │   │  broadcast() → dashboard  │   │
│  └──────────┬───────────┘   └───────────────────────────┘   │
│             │                                               │
│  ┌──────────▼───────────┐   ┌───────────────────────────┐   │
│  │   Chess Engine        │   │   Economy Engine          │   │
│  │   python-chess        │   │   Wallets · Entry fees    │   │
│  │   FEN · UCI · SAN     │   │   Prize pool · P&L        │   │
│  └──────────┬───────────┘   └───────────────────────────┘   │
│             │                                               │
│  ┌──────────▼───────────┐   ┌───────────────────────────┐   │
│  │  ♔ White Agent        │   │  ♚ Black Agent (fixed)    │   │
│  │  Qwen2.5-0.5B         │   │  Llama-3.2-1B             │   │
│  │  LoRA r=8             │   │  Frozen weights           │   │
│  └──────────┬───────────┘   └───────────────────────────┘   │
│             │                                               │
│  ┌──────────▼───────────┐                                   │
│  │  GRPO Trainer         │──▶  /checkpoints/step_N         │
│  │  PPO-clip + KL        │                                   │
│  │  AdamW  LR=1e-5       │                                   │
│  └──────────────────────┘                                   │
└──────────────────────┬───────────────────────────────────────┘
                       │ WebSocket broadcast()
                       ▼
┌──────────────────────────────────────────────────────────────┐
│              React Dashboard (nginx)                         │
│  Live Board · Wallet History · GRPO Metrics · P&L Chart     │
│  Architecture View · Live Event Feed                        │
└──────────────────────────────────────────────────────────────┘
```

## WebSocket Event Stream

Connect to wss://chessecon.adaboost.io/ws for real-time events:

```python
import asyncio, json, websockets

async def watch():
    async with websockets.connect("wss://chessecon.adaboost.io/ws") as ws:
        async for raw in ws:
            msg = json.loads(raw)
            match msg["type"]:
                case "move":
                    print(f"{msg['data']['player']} plays {msg['data']['move']}")
                case "game_end":
                    d = msg["data"]
                    print(f"Game over: {d['result']} | reward={d['reward']}")
                case "training_step":
                    d = msg["data"]
                    print(f"GRPO step {d['step']} | loss={d['loss']:.4f} kl={d['kl_div']:.4f}")
                case "status":
                    print(f"Snapshot: game #{msg['data']['game_id']}")

asyncio.run(watch())
```

### Event Types

| Type | Key Fields |
|---|---|
| `status` | `game_id`, `wallet_white`, `wallet_black`, `grpo_step` |
| `game_start` | `game_id`, `wallet_white`, `wallet_black`, `prize_pool` |
| `move` | `player`, `move`, `uci`, `fen`, `move_number` |
| `game_end` | `result`, `reward`, `wallet_white`, `wallet_black`, `net_pnl_white` |
| `training_step` | `step`, `loss`, `reward`, `kl_div`, `win_rate` |

## Models

ChessEcon uses two publicly available HuggingFace models:

| Agent | Model Card | Size | Local Path |
|---|---|---|---|
| ♔ White (trainable) | Qwen/Qwen2.5-0.5B-Instruct | 943 MB | `training/models/Qwen_Qwen2.5-0.5B-Instruct/` |
| ♚ Black (fixed) | meta-llama/Llama-3.2-1B-Instruct | 2.4 GB | `training/models/meta-llama_Llama-3.2-1B-Instruct/` |

> **Note:** Llama-3.2-1B-Instruct requires a HuggingFace account with Meta's license accepted at meta-llama/Llama-3.2-1B-Instruct. Generate a token at huggingface.co/settings/tokens.

### Download Commands

**Option A — Python (recommended):**

```python
from huggingface_hub import snapshot_download

# White agent — Qwen2.5-0.5B-Instruct (no token required)
snapshot_download(
    repo_id="Qwen/Qwen2.5-0.5B-Instruct",
    local_dir="training/models/Qwen_Qwen2.5-0.5B-Instruct",
    local_dir_use_symlinks=False,
)

# Black agent — Llama-3.2-1B-Instruct (requires HF token + Meta license)
snapshot_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    local_dir="training/models/meta-llama_Llama-3.2-1B-Instruct",
    local_dir_use_symlinks=False,
    token="hf_YOUR_TOKEN_HERE",
)
```

**Option B — huggingface-cli:**

```bash
# Install CLI if needed
pip install huggingface_hub

# White agent (no token)
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct \
  --local-dir training/models/Qwen_Qwen2.5-0.5B-Instruct

# Black agent (token required)
huggingface-cli login   # paste your HF token when prompted
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
  --local-dir training/models/meta-llama_Llama-3.2-1B-Instruct
```

**Option C — git lfs:**

```bash
git lfs install

# White agent
git clone https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct \
  training/models/Qwen_Qwen2.5-0.5B-Instruct

# Black agent (must be logged in: huggingface-cli login)
git clone https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct \
  training/models/meta-llama_Llama-3.2-1B-Instruct
```

### Verify Downloads

```bash
# Expected files after download:
ls training/models/Qwen_Qwen2.5-0.5B-Instruct/
# config.json  generation_config.json  model.safetensors  tokenizer*.json  ...

ls training/models/meta-llama_Llama-3.2-1B-Instruct/
# config.json  generation_config.json  model.safetensors  tokenizer*.json  ...

# Check sizes
du -sh training/models/Qwen_Qwen2.5-0.5B-Instruct/model.safetensors
# → 943M

du -sh training/models/meta-llama_Llama-3.2-1B-Instruct/model.safetensors
# → 2.4G
```

## Running Locally

```bash
git clone https://huggingface.co/spaces/adaboost-ai/chessecon
cd chessecon

# 1. Download models (see Models section above)

# 2. Start backend + dashboard
docker-compose up -d

# API:       http://localhost:8008
# Dashboard: http://localhost:3006
# Docs:      http://localhost:8008/docs
```

## Key Environment Variables

| Variable | Default | Description |
|---|---|---|
| `WHITE_MODEL` | `/models/Qwen_...` | Path to White model |
| `BLACK_MODEL` | `/models/meta-llama_...` | Path to Black model |
| `DEVICE` | `cuda` | `cuda` or `cpu` |
| `MAX_MOVES` | `15` | Moves before material adjudication |
| `MOVE_DELAY` | `0.05` | Seconds between moves |
| `ENTRY_FEE` | `10` | Units per agent per game |
| `PRIZE_POOL_FRACTION` | `0.9` | Fraction of 2 × entry fee returned as prize |
| `GRPO_LR` | `1e-5` | AdamW learning rate |
| `GRPO_KL_COEFF` | `0.04` | KL divergence penalty β |
| `LORA_RANK` | `8` | LoRA adapter rank |
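docker-compose reads a `.env` file placed next to `docker-compose.yml`, so (assuming the compose file forwards these variables to the container) a CPU-only run might be configured as:

```shell
# .env — picked up automatically by `docker-compose up`
DEVICE=cpu                 # run both models on CPU
MAX_MOVES=40               # allow longer games before material adjudication
ENTRY_FEE=10
PRIZE_POOL_FRACTION=0.9
```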

## Hardware Requirements

| Config | Minimum |
|---|---|
| CPU-only | 8 GB RAM · `DEVICE=cpu` |
| GPU (recommended) | 8 GB VRAM · CUDA 11.8+ |
| Dev server | 4× NVIDIA RTX 3070 (lambda-quad) |

## Citation

```bibtex
@software{chessecon2026,
  title   = {ChessEcon: Multi-Agent Chess Economy with Live GRPO Training},
  author  = {AdaBoost AI},
  year    = {2026},
  url     = {https://huggingface.co/spaces/adaboost-ai/chessecon},
  note    = {OpenEnv 0.1 · TextArena + Meta OpenEnv · Hackathon 2026}
}
```

Built by AdaBoost AI · TextArena + Meta OpenEnv + GRPO · Hackathon 2026