---
title: Carrom RL Env
emoji: 🎯
colorFrom: yellow
colorTo: red
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
---

# Carrom RL Env

## About

**Carrom RL Env** is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible, physics-based reinforcement-learning environment for the South Asian board game Carrom. Pieces slide on a Pymunk-simulated board with Coulomb (boric-acid-style) kinetic friction, and every shot is scored under the key International Carrom Federation (ICF) rules: due rule, queen cover, foul handling, and colour-based turn continuation.

The environment ships with LLM-friendly text actions (`"aim at queen_0 with strong force from centre"`), rich text-summary observations that include live rule reminders, and a **green-agent evaluator** ([AgentBeats](https://rdi.berkeley.edu/agentx-agentbeats)-style) that owns the task suite and scoring, so any policy (random, heuristic, LLM-behind-an-API, or a freshly GRPO-trained model) can be benchmarked head-to-head on a consistent ICF-compliance score.

It deploys as a single Docker container that exposes a FastAPI + WebSocket OpenEnv API and a live Gradio board at the same root URL, for human play or LLM auto-play.

## Features

- **Coulomb board friction**: per-body `velocity_func` applies constant deceleration (not viscous drag), matching pieces on a boric-acid-powdered carrom surface
- **ICF-compliant rules**: due rule, queen cover, foul handling, colour-based turn continuation
- **LLM-friendly**: text actions (`"aim at queen_0 with strong force"`) and rich board-state observations with rule reminders
- **Multi-agent**: single-agent API with automatic scripted opponent turns
- **Green Agent (evaluator)**: task suite + ICF-aware scoring for purple-agent benchmarking, à la [AgentBeats](https://rdi.berkeley.edu/agentx-agentbeats)
- **Deterministic**: seeded resets for reproducible experiments
- **OpenEnv standard**: `reset()` / `step()` / `state()` API with WebSocket support

## Installation

```bash
pip install -e .
```

Optional rendering:

```bash
pip install -e ".[render]"
```

## Quick Start

### As a client (connecting to a running Space)

```python
import asyncio

from client import CarromEnv
from carrom_env.models import Action


async def main():
    async with CarromEnv(base_url="https://your-space.hf.space") as env:
        result = await env.reset()
        print(result.observation.text_summary)

        result = await env.step(Action(placement_x=0.0, angle=0.1, force=0.6))
        print(f"Reward: {result.reward}, Done: {result.done}")


asyncio.run(main())
```

Synchronous usage:

```python
from client import CarromEnv
from carrom_env.models import Action

with CarromEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset()
    result = env.step(Action(placement_x=0.0, angle=0.0, force=0.6))
```

### Local development

```python
from carrom_env.env import CarromEnv
from carrom_env.models import Action

env = CarromEnv(seed=42)
obs = env.reset()

action = Action(placement_x=0.0, angle=0.0, force=0.6)
obs, reward, terminated, truncated, info = env.step(action)
```

### Text actions (for LLM agents)

```python
action = Action(action_type="text", text="aim at queen_0 with strong force from center")
obs, reward, terminated, truncated, info = env.step(action)
```

## Game Rules (ICF-Compliant)

This environment implements the key rules from the **International Carrom Federation (ICF)**.

### Board & Pieces

- **9 black coins**, **9 white coins**, **1 queen** (red): 19 pieces total
- **Agent plays white**; opponent plays black
- Four corner pockets

### Shooting

- On each turn the player places their striker anywhere on their baseline and shoots
- Striker placement is automatically nudged away from any coin sitting on the baseline

### Scoring & Turn Continuation

- Pocket **your own colour** → +1 point, take another turn
- Pocket the **queen** → +3 points; you must then "cover" it (see below)
- Miss (no own coin pocketed) → turn passes to opponent

### Due Rule

- If you pocket your **opponent's colour**, that coin is returned to the board centre
- You score **nothing** for it and your turn **ends**, even if you also pocketed own coins on the same shot; turn continuation only applies to own-colour pockets

### Queen Cover Rule

- After pocketing the queen you must pocket **one of your own coins** on the same shot or on your next turn to "cover" it
- If you fail to cover, the queen is returned to the board centre and your queen points are reversed

### Foul

- Pocketing the **striker** is a foul
- One of your previously pocketed coins is returned to the board centre
- Your turn ends and passes to the opponent

### Win Condition

All coins cleared from the board → game ends; the player with the higher score wins.

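
The scoring and turn-continuation rules above can be sketched as a small pure function. This is an illustrative reading of the rules, not the environment's code: `ShotOutcome` and `resolve_shot` are hypothetical names, and the sketch assumes own coins pocketed on a due shot still score. Queen cover and the coin returned on a foul are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShotOutcome:
    """What fell into the pockets on one shot (hypothetical container)."""
    own: int = 0           # own-colour coins pocketed
    opponent: int = 0      # opponent-colour coins pocketed (dues)
    queen: bool = False    # queen pocketed on this shot
    striker: bool = False  # striker pocketed, i.e. a foul


def resolve_shot(shot: ShotOutcome) -> tuple[int, bool]:
    """Return (points scored, whether the shooter keeps the turn)."""
    # +1 per own coin, +3 for the queen; due coins score nothing.
    points = shot.own + (3 if shot.queen else 0)
    # Only a clean own-colour pocket extends the turn: a due or a foul ends it.
    keeps_turn = shot.own > 0 and shot.opponent == 0 and not shot.striker
    return points, keeps_turn


print(resolve_shot(ShotOutcome(own=1)))              # clean pocket
print(resolve_shot(ShotOutcome(own=1, opponent=1)))  # due ends the turn
```

Keeping the resolution logic free of physics state makes rule edge cases (due plus own coin, foul plus own coin) easy to unit-test in isolation.
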
### ICF Compliance Table

| Rule | Status | Notes |
|------|--------|-------|
| 9 black + 9 white + 1 queen | ✅ | Full piece complement |
| Agent = white, Opponent = black | ✅ | Enforced throughout |
| Score 1 pt per own coin | ✅ | |
| Queen = 3 pts | ✅ | Simplified from ICF face-value (1–9) |
| Due rule: opponent's coin returns to centre, no score, turn ends | ✅ | |
| Queen cover rule: cover on same/next shot or queen returns | ✅ | |
| Foul: striker pocketed returns own coin, ends turn | ✅ | |
| Turn continuation on own-colour pocket only | ✅ | Due coins do not extend turn |
| Baseline shooting with obstruction check | ✅ | Striker nudged clear of coins |
| Coulomb board friction (boric-acid surface, μ_k ≈ 0.04) | ✅ | `BOARD_DECEL = 2.5 units/s²` via `velocity_func` |
| Elastic rubber cushion walls | ✅ | `ELASTICITY = 0.92` |
| Pocket capture (no corner dead zones) | ✅ | `pocket_capture_radius = 0.09` decoupled from wall gap |
| Numbered coin scoring (ICF 1–9 per colour) | ❌ | Simplified to 1 pt per coin |
| Touch-coin / out-of-turn penalties | ❌ | Not applicable for AI agents |

## Physics Design

### Coulomb Board Friction

Real carrom boards are dusted with boric acid powder, giving a kinetic friction coefficient of roughly μ_k ≈ 0.02–0.05. Unlike viscous drag (speed-proportional), sliding friction produces **constant deceleration** regardless of a piece's current speed.

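
As a rough illustration of that difference, the sketch below integrates a single speed value under both models. It is plain Python rather than the environment's Pymunk callback: `coulomb_step`, `viscous_step`, `time_to_settle`, and the drag coefficient `k` are hypothetical names and values chosen for the example. Only `BOARD_DECEL = 2.5` and `SETTLE_VELOCITY = 0.02` are taken from this README.

```python
BOARD_DECEL = 2.5       # units/s^2, value quoted in this README
SETTLE_VELOCITY = 0.02  # units/s, value quoted in this README


def coulomb_step(speed: float, dt: float, decel: float = BOARD_DECEL) -> float:
    """Constant deceleration: lose decel * dt per step, never drop below zero."""
    return max(0.0, speed - decel * dt)


def viscous_step(speed: float, dt: float, k: float = 1.0) -> float:
    """Speed-proportional drag with a hypothetical coefficient k, for contrast."""
    return speed * (1.0 - k * dt)


def time_to_settle(step, v0: float, dt: float = 0.001, t_max: float = 60.0) -> float:
    """Integrate until the speed falls below SETTLE_VELOCITY; return elapsed time."""
    speed, t = v0, 0.0
    while speed >= SETTLE_VELOCITY and t < t_max:
        speed = step(speed, dt)
        t += dt
    return t


for v0 in (2.5, 5.0):
    coulomb_t = time_to_settle(coulomb_step, v0)
    viscous_t = time_to_settle(viscous_step, v0)
    print(f"v0={v0}: coulomb {coulomb_t:.2f} s, viscous {viscous_t:.2f} s")
```

Under constant deceleration the stopping time is roughly `v0 / BOARD_DECEL`, so doubling the launch speed doubles the time to rest. Under viscous drag the stopping time grows only logarithmically with launch speed.
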
This environment implements Coulomb friction via Pymunk's `body.velocity_func` callback on every piece and the striker:

```
a_friction = BOARD_DECEL   # 2.5 units/s², equivalent to μ_k ≈ 0.04 on a normalised board
```

With `BOARD_DECEL = 2.5`:

- **Full-force shot** (v₀ ≈ 5 units/s): pieces settle in ~2 seconds after bouncing
- **Medium shot** (v₀ ≈ 2.5 units/s): pieces settle in ~1 second
- The simulation ends early once all pieces drop below `SETTLE_VELOCITY = 0.02 units/s`

### Contact Physics

Shape-to-shape contact friction (`FRICTION = 0.15`) handles the interaction between colliding pieces and between pieces and the rubber-cushioned walls. Collision restitution is `ELASTICITY = 0.92`, reflecting the near-elastic bounce of polished wooden pieces off a rubber cushion.

### Pocket Detection Geometry

Pocket capture uses **two separate radii** to handle a subtle geometry problem:

| Field | Value | Purpose |
|-------|-------|---------|
| `pocket_radius` | `0.06` | Visual pocket size; also the wall gap at each corner |
| `pocket_capture_radius` | `0.09` | Radius within which a piece is considered pocketed |

**Why they differ:** walls have a `wall_thickness = 0.02` and end at `pocket_radius` from each corner. Pymunk segments have rounded endcaps, so a piece (radius `0.03`) sliding along a wall is constrained to stay at distance `≥ 0.05` from the wall endcap. A piece can therefore come to rest at e.g. `(-0.44, -0.45)`, which is inside the pocket gap but at distance `≈ 0.078` from the corner and so **outside the old `0.06` detection radius** (a "dead zone"). `pocket_capture_radius = pocket_radius + coin_radius = 0.09` fires as soon as the coin's edge reaches the pocket rim, eliminating the dead zone.

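
The dead-zone arithmetic can be checked with a few lines of Python. This is a standalone sketch, not the environment's detection code: `is_pocketed` is a hypothetical helper. The pocket coordinates assume the board spans `[-0.5, 0.5]` on both axes, which is inferred from the `≈ 0.078` distance quoted above.

```python
import math

POCKET_RADIUS = 0.06  # visual pocket size / wall gap, from the table above
COIN_RADIUS = 0.03    # piece radius quoted above
POCKET_CAPTURE_RADIUS = POCKET_RADIUS + COIN_RADIUS

# Assumption: pockets sit at the corners of a [-0.5, 0.5] x [-0.5, 0.5] board.
POCKETS = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]


def is_pocketed(pos, capture_radius=POCKET_CAPTURE_RADIUS):
    """True if the piece centre lies within capture_radius of any pocket."""
    x, y = pos
    return any(math.hypot(x - px, y - py) <= capture_radius for px, py in POCKETS)


resting = (-0.44, -0.45)  # the resting position used as the example above
print(round(math.hypot(resting[0] + 0.5, resting[1] + 0.5), 3))  # 0.078
print(is_pocketed(resting, capture_radius=POCKET_RADIUS))        # False (dead zone)
print(is_pocketed(resting))                                      # True
```
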
## Action Space

| Field | Type | Description |
|-------|------|-------------|
| `action_type` | `str` | `"numeric"` (default) or `"text"` for natural-language actions |
| `placement_x` | `float` | Striker placement along baseline `[-0.4, 0.4]`, 0 = center |
| `angle` | `float` | Shot angle in radians, 0 = straight ahead toward +y |
| `force` | `float` | Normalized shot force in `[0, 1]` |
| `text` | `str` | Natural-language shot description (when `action_type="text"`) |

## Observation

| Field | Type | Description |
|-------|------|-------------|
| `positions` | `List[List[float]]` | `[N, 2]` positions for striker + coins |
| `velocities` | `List[List[float]]` | `[N, 2]` velocities |
| `pocketed` | `List[bool]` | `[N]` pocketed flags |
| `agent_score` | `int` | Agent's current score |
| `opponent_score` | `int` | Opponent's current score |
| `current_player` | `str` | `"agent"` or `"opponent"` |
| `remaining_coins` | `int` | Coins still on the board |
| `coins` | `List[CoinInfo]` | Per-coin details with nearest pocket info |
| `text_summary` | `str` | Rich text board state for LLM prompting (includes rule reminders) |

## Reward Design

| Event | Reward | Description |
|-------|--------|-------------|
| Each agent turn | −0.01 | Small negative to encourage efficiency |
| Own coin potted | +1.0 | Per own-colour coin pocketed |
| Queen potted | +3.0 | Queen is worth 3× |
| Due coin (opponent's colour potted) | −0.3 | Coin returned to centre; teaches avoidance |
| Foul (striker pocketed) | −1.5 | Score −1 plus −0.5 extra penalty |
| Win (cleared board, agent leads) | +5.0 | Bonus for winning |
| Loss (cleared board, opponent leads) | −2.0 | Penalty for losing |
| Opponent scores | −0.5× | Partial penalty when opponent pots own coins |

## `info` Dict Keys

| Key | Type | Description |
|-----|------|-------------|
| `sim_steps` | `float` | Physics steps taken this turn |
| `energy` | `float` | Cumulative kinetic energy this turn |
| `coin_potted` | `float` | Own coins pocketed this turn |
| `due_coins` | `float` | Opponent's coins returned to centre (due rule) |
| `foul` | `float` | 1.0 if striker was pocketed |
| `queen_cover_pending` | `bool` | True if queen cover is still required |
| `placement_x_actual` | `float` | Actual striker x after obstruction nudge |

## Green Agent (Evaluator)

In the [AgentBeats / AgentX](https://rdi.berkeley.edu/agentx-agentbeats) taxonomy:

- 🟢 **Green Agent**: the evaluator, which defines tasks, environment, and scoring
- 🟣 **Purple Agent**: the competitor, i.e. the AI being tested (any `Callable[[Observation], Action]`)
- 🔴 **Red Agent**: an adversarial tester (not used here)

`GreenCarromAgent` is the green agent for this benchmark. It owns:

1. **A task suite**: curated seeded boards across `easy` / `standard` / `hard` tiers
2. **The environment**: wraps `CarromEnv` with the ICF rules described above
3. **Scoring**: ICF-aware metrics (reward, win rate, ICF compliance from dues/fouls) plus compute efficiency

```python
from carrom_env.green_agent import GreenCarromAgent, Task
from carrom_env.models import Action


def my_purple_agent(obs):
    # Trivial fixed-shot policy; replace with your own agent
    return Action(placement_x=0.0, angle=0.1, force=0.6)


# Default suite: 3 easy + 3 standard + 3 hard tasks
evaluator = GreenCarromAgent()
report = evaluator.evaluate(my_purple_agent, verbose=True)
print(report.summary())
# {'n_tasks': 9, 'avg_reward': ..., 'win_rate': ..., 'icf_compliance': ...,
#  'efficiency_score': ..., ...}

# Or define a custom suite
tasks = [Task(task_id="focus", seed=0, max_turns=30, tier="standard")]
report = GreenCarromAgent(tasks=tasks).evaluate(my_purple_agent)
report.by_tier()  # per-tier breakdown
```

### Scoring metrics

| Metric | Type | Description |
|--------|------|-------------|
| `avg_reward` | game | Mean episode reward across the suite |
| `win_rate` | game | Fraction of tasks where agent beat the opponent |
| `avg_coins_potted` | game | Mean own-coins pocketed per task |
| `avg_dues` | ICF | Mean opponent-coin pockets per task (lower = better) |
| `avg_fouls` | ICF | Mean strikers pocketed per task (lower = better) |
| `icf_compliance` | ICF | `1 − (dues + fouls) / turns`, the fraction of shots obeying ICF rules |
| `total_sim_steps` | compute | Total physics steps across all tasks |
| `efficiency_score` | compute | Coins potted per 1000 sim steps |

### What `max_turns` counts

`Task.max_turns` (and `MAX_STEPS` in the inference script) counts **combined agent + opponent turns**: the env's internal turn counter increments once for every played shot, on either side. A setting of `max_turns=200` therefore caps the episode at ~100 agent shots + ~100 heuristic-opponent shots. Set it to `400` if you want roughly 200 agent shots, or pass a custom `Task` with whatever cap you need.

## Benchmark Results

Full inference logs live in [`inference_runs/`](inference_runs/).

### MiniMaxAI/MiniMax-M2.5-fast (Nebius)

1 task × 200 turns (seed `0`), via `https://api.tokenfactory.us-central1.nebius.com/v1`.
Full log: [`inference_runs/minimax-m2.5-fast_200turns_inference.log`](inference_runs/minimax-m2.5-fast_200turns_inference.log)

| Purple Agent | Reward | Win% | Coins | Dues | ICF% | Efficiency |
|---|---:|---:|---:|---:|---:|---:|
| Random | −4.21 | 0% | 4.0 | 2.00 | 98% | 0.368 |
| Heuristic (ICF-aware) | −6.31 | 0% | 4.0 | 0.00 | 100% | 0.337 |
| **LLM · MiniMax-M2.5-fast** | **+4.07** | **100%** | **8.0** | 1.00 | 97% | **0.759** |

MiniMax-M2.5-fast wins the board at 8 coins potted with 1 due and 5 fouls, beating both baselines on reward and efficiency. The heuristic tanks to −6.31 because it's aggressive about shooting at white coins but hands the opponent easy board position on misses.

### nvidia/Nemotron-3-Super-120b-a12b (Nebius)

1 task × 200 turns (seed `0`), via `https://api.tokenfactory.us-central1.nebius.com/v1`.
Full log: [`inference_runs/nemotron-3-super-120b_200turns_inference.log`](inference_runs/nemotron-3-super-120b_200turns_inference.log)

Wall-clock: 2001.6 s (~33 min) for the full LLM episode.

| Purple Agent | Reward | Win% | Coins | Dues | ICF% | Efficiency |
|---|---:|---:|---:|---:|---:|---:|
| Random | +4.17 | 100% | 8.0 | 1.00 | 100% | 0.712 |
| Heuristic (ICF-aware) | −6.31 | 0% | 4.0 | 0.00 | 100% | 0.337 |
| **LLM · Nemotron-3-Super-120b** | **+11.94** | **100%** | **13.0** | 2.00 | 97% | **1.179** |

Nemotron, NVIDIA's 120B hybrid-MoE reasoning model, beats both baselines on reward, coins potted, and efficiency (Random ties it on win rate) and has the highest compute efficiency I've seen so far at **1.18 coins potted per 1,000 sim steps**. The run had one parse-failure fallback over 106 agent shots, plus 2 dues and 4 fouls. The extra reasoning horsepower shows up as more aggressive scoring play than MiniMax (+11.94 vs +4.07 reward on the same seed) at the cost of one extra due (2 vs 1); ICF compliance is 97% for both.

### Frontier-model comparison (seed `0`)

| Model | Reward | Coins | Dues | Fouls | ICF % | Efficiency |
|---|---:|---:|---:|---:|---:|---:|
| MiniMax-M2.5-fast | +4.07 | 8 | 1 | 5 | 97% | 0.759 |
| **Nemotron-3-Super-120b** | **+11.94** | **13** | 2 | **4** | 97% | **1.179** |

## Setup & Usage

### Prerequisites

- Python 3.10+
- Docker (for containerized deployment)
- `pip install openenv-core pymunk`

### Local Development

```bash
pip install -e ".[dev]"
PYTHONPATH=. uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

### Docker

```bash
docker build -t carrom-env:latest .
docker run -p 8000:8000 carrom-env:latest
```

### Baseline Inference

```bash
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen3-4B"
export HF_TOKEN="hf_..."
python inference.py
```

## Project Structure

```
carrom_rl_env/
├── __init__.py                     # Module exports
├── carrom_env/
│   ├── __init__.py                 # Package exports
│   ├── env.py                      # CarromEnv (physics + ICF game logic)
│   ├── models.py                   # Action, Observation, State models
│   ├── constants.py                # Board config + physics constants (BOARD_DECEL, FRICTION, …)
│   └── green_agent.py              # Green Agent (evaluator: task suite + ICF-aware scoring)
├── client.py                       # CarromEnv (EnvClient)
├── inference.py                    # Baseline inference script
├── server/
│   ├── __init__.py
│   ├── carrom_environment.py       # Server-side Environment wrapper
│   └── app.py                      # FastAPI application
├── examples/
│   ├── train_stub.py               # Quick demo
│   ├── grpo_utils.py               # GRPO training utilities
│   └── grpo_carrom_tutorial.ipynb  # Training notebook
├── tests/
│   └── test_env_basic.py           # Test suite
├── openenv.yaml                    # OpenEnv manifest
├── pyproject.toml                  # Dependencies
├── Dockerfile                      # Container image
└── README.md                       # This file
```