title: Carrom RL Env
emoji: 🎯
colorFrom: yellow
colorTo: red
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv

Carrom RL Env

About

Carrom RL Env is an OpenEnv-compatible, physics-based reinforcement-learning environment for the South Asian board game Carrom. Pieces slide on a Pymunk-simulated board with Coulomb (boric-acid-style) kinetic friction, and every shot is scored under the key International Carrom Federation (ICF) rules: due rule, queen cover, foul handling, and colour-based turn continuation.

The environment ships with LLM-friendly text actions ("aim at queen_0 with strong force from centre"), rich text-summary observations that include live rule reminders, and a green-agent evaluator (AgentBeats-style) that owns the task suite and scoring, so any policy (random, heuristic, LLM-behind-an-API, or a freshly GRPO-trained model) can be benchmarked head-to-head on a consistent ICF-compliance score. It deploys as a single Docker container that exposes both a FastAPI + WebSocket OpenEnv API at the root and a live Gradio board at the same URL for human or LLM auto-play.

Features

  • Coulomb board friction: per-body velocity_func applies constant deceleration (not viscous drag), matching pieces on a boric-acid-powdered carrom surface
  • ICF-compliant rules: due rule, queen cover, foul handling, colour-based turn continuation
  • LLM-friendly: text actions ("aim at queen_0 with strong force") and rich board-state observations with rule reminders
  • Multi-agent: single-agent API with automatic scripted opponent turns
  • Green Agent (evaluator): task suite + ICF-aware scoring for purple-agent benchmarking, à la AgentBeats
  • Deterministic: seeded resets for reproducible experiments
  • OpenEnv standard: reset() / step() / state() API with WebSocket support

Installation

pip install -e .

Optional rendering:

pip install -e ".[render]"

Quick Start

As a client (connecting to a running Space)

import asyncio
from client import CarromEnv
from carrom_env.models import Action

async def main():
    async with CarromEnv(base_url="https://your-space.hf.space") as env:
        result = await env.reset()
        print(result.observation.text_summary)

        result = await env.step(Action(placement_x=0.0, angle=0.1, force=0.6))
        print(f"Reward: {result.reward}, Done: {result.done}")

asyncio.run(main())

Synchronous usage:

from client import CarromEnv
from carrom_env.models import Action

with CarromEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset()
    result = env.step(Action(placement_x=0.0, angle=0.0, force=0.6))

Local development

from carrom_env.env import CarromEnv
from carrom_env.models import Action

env = CarromEnv(seed=42)
obs = env.reset()

action = Action(placement_x=0.0, angle=0.0, force=0.6)
obs, reward, terminated, truncated, info = env.step(action)

Text actions (for LLM agents)

action = Action(action_type="text", text="aim at queen_0 with strong force from center")
obs, reward, terminated, truncated, info = env.step(action)

Game Rules (ICF-Compliant)

This environment implements the key rules from the International Carrom Federation (ICF).

Board & Pieces

  • 9 black coins, 9 white coins, 1 queen (red): 19 pieces total
  • Agent plays white; opponent plays black
  • Four corner pockets

Shooting

  • On each turn the player places their striker anywhere on their baseline and shoots
  • Striker placement is automatically nudged away from any coin sitting on the baseline

Scoring & Turn Continuation

  • Pocket your own colour → +1 point, take another turn
  • Pocket the queen → +3 points; you must then "cover" it (see below)
  • Miss (no own coin pocketed) → turn passes to opponent

Due Rule

  • If you pocket your opponent's colour, that coin is returned to the board centre
  • You score nothing for it and your turn ends, even if you also pocketed your own coins on the same shot. Turn continuation applies only to own-colour pockets

Queen Cover Rule

  • After pocketing the queen you must pocket one of your own coins on the same shot or on your next turn to "cover" it
  • If you fail to cover, the queen is returned to the board centre and your queen points are reversed

Foul

  • Pocketing the striker is a foul
  • One of your previously pocketed coins is returned to the board centre
  • Your turn ends and passes to the opponent
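The turn-passing rules above (scoring, due rule, foul) can be summarised as a small predicate. This is a minimal sketch of the precedence described in this README, not the environment's actual code: `turn_continues` is a hypothetical name, and the queen and its cover rule are deliberately left out.

```python
def turn_continues(own_pocketed: int, opponent_pocketed: int, striker_pocketed: bool) -> bool:
    """Return True if the shooter keeps the turn after a shot.

    Assumed precedence, read off the rules above:
    a foul or a due always ends the turn; otherwise the turn
    continues only when at least one own-colour coin was pocketed.
    """
    if striker_pocketed:
        # Foul: the turn ends and passes to the opponent.
        return False
    if opponent_pocketed > 0:
        # Due rule: the turn ends even if own coins also went in.
        return False
    # Plain shot: continue on an own-colour pocket, pass on a miss.
    return own_pocketed > 0
```

For example, `turn_continues(2, 1, False)` is `False`: two own coins went in, but the due ends the turn.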

Win Condition

All coins cleared from the board → game ends; the player with the higher score wins.

ICF Compliance Table

| Rule | Status | Notes |
| --- | --- | --- |
| 9 black + 9 white + 1 queen | ✅ | Full piece complement |
| Agent = white, Opponent = black | ✅ | Enforced throughout |
| Score 1 pt per own coin | ✅ | |
| Queen = 3 pts | ✅ | Simplified from ICF face-value (1–9) |
| Due rule: opponent's coin returns to centre, no score, turn ends | ✅ | |
| Queen cover rule: cover on same/next shot or queen returns | ✅ | |
| Foul: striker pocketed returns own coin, ends turn | ✅ | |
| Turn continuation on own-colour pocket only | ✅ | Due coins do not extend turn |
| Baseline shooting with obstruction check | ✅ | Striker nudged clear of coins |
| Coulomb board friction (boric-acid surface, μ_k ≈ 0.04) | ✅ | BOARD_DECEL = 2.5 units/s² via velocity_func |
| Elastic rubber cushion walls | ✅ | ELASTICITY = 0.92 |
| Pocket capture (no corner dead zones) | ✅ | pocket_capture_radius = 0.09 decoupled from wall gap |
| Numbered coin scoring (ICF 1–9 per colour) | ❌ | Simplified to 1 pt per coin |
| Touch-coin / out-of-turn penalties | ❌ | Not applicable for AI agents |

Physics Design

Coulomb Board Friction

Real carrom boards are dusted with boric acid powder, giving a kinetic friction coefficient of μ_k ≈ 0.02–0.05. Unlike viscous drag (speed-proportional), sliding friction produces constant deceleration regardless of a piece's current speed.

This environment implements Coulomb friction via Pymunk's body.velocity_func callback on every piece and the striker:

a_friction = BOARD_DECEL   # 2.5 units/s², equivalent to μ_k ≈ 0.04 on a normalised board

With BOARD_DECEL = 2.5:

  • Full-force shot (v₀ ≈ 5 units/s): pieces settle in ~2 seconds after bouncing
  • Medium shot (v₀ ≈ 2.5 units/s): pieces settle in ~1 second
  • The simulation ends early once all pieces drop below SETTLE_VELOCITY = 0.02 units/s
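The settle times above follow from t_stop = v₀ / BOARD_DECEL for a piece that slides without colliding. The sketch below illustrates that constant-deceleration update in pure Python. It is not the environment's velocity_func: `apply_coulomb_friction` and `time_to_stop` are hypothetical names, and collisions are ignored.

```python
import math

BOARD_DECEL = 2.5  # units/s^2, the value quoted in this README


def apply_coulomb_friction(vx, vy, dt, decel=BOARD_DECEL):
    """Reduce speed by decel * dt along the direction of motion.

    The speed is clamped at zero so friction never reverses the motion.
    """
    speed = math.hypot(vx, vy)
    drop = decel * dt
    if speed <= drop:
        return 0.0, 0.0
    scale = (speed - drop) / speed
    return vx * scale, vy * scale


def time_to_stop(v0, dt=1e-3, decel=BOARD_DECEL):
    """Integrate a straight-line slide until the piece is at rest."""
    vx, vy, t = v0, 0.0, 0.0
    while vx != 0.0 or vy != 0.0:
        vx, vy = apply_coulomb_friction(vx, vy, dt, decel)
        t += dt
    return t
```

`time_to_stop(5.0)` comes out at about 2 seconds and `time_to_stop(2.5)` at about 1 second, matching the figures above. In Pymunk, a callback with the signature `(body, gravity, damping, dt)` assigned to `body.velocity_func` could apply the same update to `body.velocity`; how the environment wires this up is not shown here.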

Contact Physics

Shape-to-shape contact friction (FRICTION = 0.15) handles the interaction between colliding pieces and between pieces and the rubber-cushioned walls. Collision restitution is ELASTICITY = 0.92, reflecting the near-elastic bounce of polished wooden pieces off a rubber cushion.

Pocket Detection Geometry

Pocket capture uses two separate radii to handle a subtle geometry problem:

| Field | Value | Purpose |
| --- | --- | --- |
| pocket_radius | 0.06 | Visual pocket size; also the wall gap at each corner |
| pocket_capture_radius | 0.09 | Radius within which a piece is considered pocketed |

Why they differ: walls have wall_thickness = 0.02 and end at pocket_radius from each corner. Pymunk segments have rounded endcaps, so a piece (radius 0.03) sliding along a wall is kept at distance ≥ 0.05 from the wall endcap. A piece can therefore come to rest at, for example, (-0.44, -0.45): inside the pocket gap but ≈ 0.078 from the corner, which was outside the old 0.06 detection radius (a "dead zone"). Setting pocket_capture_radius = pocket_radius + coin_radius = 0.09 registers a capture as soon as the coin's edge reaches the pocket rim, which eliminates the dead zone.
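The dead-zone arithmetic can be checked with a few lines of Python. This is an illustrative sketch, not the environment's detection code: `is_pocketed` is a hypothetical helper, and the corner position (-0.5, -0.5) is an assumption inferred from the 0.078 distance quoted above.

```python
import math

POCKET_RADIUS = 0.06
COIN_RADIUS = 0.03
POCKET_CAPTURE_RADIUS = POCKET_RADIUS + COIN_RADIUS  # 0.09

# Assumption: the bottom-left pocket sits at the corner of a board spanning [-0.5, 0.5].
CORNER = (-0.5, -0.5)


def is_pocketed(pos, pocket=CORNER, capture_radius=POCKET_CAPTURE_RADIUS):
    """A piece counts as pocketed when its centre is within capture_radius of the pocket."""
    return math.dist(pos, pocket) <= capture_radius


resting = (-0.44, -0.45)                # example rest position from the text
distance = math.dist(resting, CORNER)   # about 0.078

missed_by_old_radius = not is_pocketed(resting, capture_radius=POCKET_RADIUS)
caught_by_new_radius = is_pocketed(resting)
```

With the old 0.06 radius the resting piece is missed; with 0.09 it is captured.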

Action Space

| Field | Type | Description |
| --- | --- | --- |
| action_type | str | "numeric" (default) or "text" for natural-language actions |
| placement_x | float | Striker placement along baseline [-0.4, 0.4], 0 = center |
| angle | float | Shot angle in radians, 0 = straight ahead toward +y |
| force | float | Normalized shot force in [0, 1] |
| text | str | Natural-language shot description (when action_type="text") |
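A policy that emits raw numbers (an LLM, for instance) can sanitise them against the documented ranges before building an Action. The helper below is a hypothetical client-side sketch: `numeric_action_kwargs` is not part of this package, and whether the environment itself clamps or rejects out-of-range values is not stated here.

```python
import math

PLACEMENT_X_RANGE = (-0.4, 0.4)  # documented baseline range
FORCE_RANGE = (0.0, 1.0)         # documented normalised force range


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def numeric_action_kwargs(placement_x, angle_degrees, force):
    """Build keyword arguments for a numeric action.

    Takes the angle in degrees for convenience and converts it to the
    radians the action space expects (0 = straight ahead toward +y).
    """
    return {
        "placement_x": clamp(placement_x, *PLACEMENT_X_RANGE),
        "angle": math.radians(angle_degrees),
        "force": clamp(force, *FORCE_RANGE),
    }
```

The result can be passed along as `Action(**numeric_action_kwargs(0.9, 10.0, 1.4))`, which yields placement_x = 0.4 and force = 1.0.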

Observation

| Field | Type | Description |
| --- | --- | --- |
| positions | List[List[float]] | [N, 2] positions for striker + coins |
| velocities | List[List[float]] | [N, 2] velocities |
| pocketed | List[bool] | [N] pocketed flags |
| agent_score | int | Agent's current score |
| opponent_score | int | Opponent's current score |
| current_player | str | "agent" or "opponent" |
| remaining_coins | int | Coins still on the board |
| coins | List[CoinInfo] | Per-coin details with nearest pocket info |
| text_summary | str | Rich text board state for LLM prompting (includes rule reminders) |

Reward Design

| Event | Reward | Description |
| --- | --- | --- |
| Each agent turn | −0.01 | Small negative to encourage efficiency |
| Own coin potted | +1.0 | Per own-colour coin pocketed |
| Queen potted | +3.0 | Queen is worth 3× |
| Due coin (opponent's colour potted) | −0.3 | Coin returned to centre; teaches avoidance |
| Foul (striker pocketed) | −1.5 | Score −1 plus −0.5 extra penalty |
| Win (cleared board, agent leads) | +5.0 | Bonus for winning |
| Loss (cleared board, opponent leads) | −2.0 | Penalty for losing |
| Opponent scores | −0.5× | Partial penalty when opponent pots own coins |
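One way to read the reward table is as a sum of per-event terms. The sketch below is an interpretation, not the environment's reward code: `ShotOutcome` and `shot_reward` are hypothetical names, and it assumes that the terms are additive, that the due penalty applies per coin, and that "−0.5×" multiplies the points the opponent scores.

```python
from dataclasses import dataclass


@dataclass
class ShotOutcome:
    """Hypothetical summary of one agent turn and the opponent reply that follows."""
    own_coins: int = 0            # own-colour coins pocketed
    queen: bool = False           # queen pocketed
    due_coins: int = 0            # opponent-colour coins pocketed
    foul: bool = False            # striker pocketed
    opponent_points: float = 0.0  # points the opponent scored
    win: bool = False             # board cleared with the agent ahead
    loss: bool = False            # board cleared with the opponent ahead


def shot_reward(outcome: ShotOutcome) -> float:
    """Sum the per-event rewards from the table (assumed additive)."""
    reward = -0.01                          # per-turn cost
    reward += 1.0 * outcome.own_coins
    reward += 3.0 if outcome.queen else 0.0
    reward -= 0.3 * outcome.due_coins       # assumption: applied per due coin
    reward -= 1.5 if outcome.foul else 0.0
    reward -= 0.5 * outcome.opponent_points
    reward += 5.0 if outcome.win else 0.0
    reward -= 2.0 if outcome.loss else 0.0
    return reward
```

Under these assumptions a clean two-coin shot is worth 1.99 and a bare foul costs 1.51.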

info Dict Keys

| Key | Type | Description |
| --- | --- | --- |
| sim_steps | float | Physics steps taken this turn |
| energy | float | Cumulative kinetic energy this turn |
| coin_potted | float | Own coins pocketed this turn |
| due_coins | float | Opponent's coins returned to centre (due rule) |
| foul | float | 1.0 if striker was pocketed |
| queen_cover_pending | bool | True if queen cover is still required |
| placement_x_actual | float | Actual striker x after obstruction nudge |

Green Agent (Evaluator)

In the AgentBeats / AgentX taxonomy:

  • 🟢 Green Agent: evaluator that defines tasks, environment, and scoring
  • 🟣 Purple Agent: competitor, the AI being tested (any Callable[[Observation], Action])
  • 🔴 Red Agent: adversarial tester (not used here)

GreenCarromAgent is the green agent for this benchmark. It owns:

  1. A task suite: curated seeded boards across easy / standard / hard tiers
  2. The environment: wraps CarromEnv with full ICF rules
  3. Scoring: ICF-aware metrics (reward, win rate, ICF compliance from dues/fouls) plus compute efficiency

from carrom_env.green_agent import GreenCarromAgent, Task
from carrom_env.models import Action  # needed by my_purple_agent below

def my_purple_agent(obs):
    return Action(placement_x=0.0, angle=0.1, force=0.6)

# Default suite: 3 easy + 3 standard + 3 hard tasks
evaluator = GreenCarromAgent()
report = evaluator.evaluate(my_purple_agent, verbose=True)
print(report.summary())
# {'n_tasks': 9, 'avg_reward': ..., 'win_rate': ..., 'icf_compliance': ...,
#  'efficiency_score': ..., ...}

# Or define a custom suite
tasks = [Task(task_id="focus", seed=0, max_turns=30, tier="standard")]
report = GreenCarromAgent(tasks=tasks).evaluate(my_purple_agent)
report.by_tier()  # per-tier breakdown

Scoring metrics

| Metric | Type | Description |
| --- | --- | --- |
| avg_reward | game | Mean episode reward across the suite |
| win_rate | game | Fraction of tasks where agent beat the opponent |
| avg_coins_potted | game | Mean own-coins pocketed per task |
| avg_dues | ICF | Mean opponent-coin pockets per task (lower = better) |
| avg_fouls | ICF | Mean strikers pocketed per task (lower = better) |
| icf_compliance | ICF | 1 − (dues + fouls) / turns: fraction of shots obeying ICF rules |
| total_sim_steps | compute | Total physics steps across all tasks |
| efficiency_score | compute | Coins potted per 1000 sim steps |
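The two derived metrics are simple ratios. The sketch below restates the formulas from the table. The function names are hypothetical stand-ins rather than the evaluator's API, and the zero-denominator handling is an assumption.

```python
def icf_compliance(dues: int, fouls: int, turns: int) -> float:
    """1 - (dues + fouls) / turns, the fraction of shots that obeyed ICF rules."""
    if turns <= 0:
        raise ValueError("turns must be positive")
    return 1.0 - (dues + fouls) / turns


def efficiency_score(coins_potted: float, sim_steps: float) -> float:
    """Coins potted per 1000 physics steps."""
    if sim_steps <= 0:
        raise ValueError("sim_steps must be positive")
    return 1000.0 * coins_potted / sim_steps
```

As a worked example, 1 due and 5 fouls over 200 turns gives `icf_compliance(1, 5, 200)` = 0.97. With illustrative numbers, 10 coins over 10,000 sim steps gives an efficiency of 1.0.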

What max_turns counts

Task.max_turns (and MAX_STEPS in the inference script) counts combined agent + opponent turns: the env's internal turn counter increments once for every played shot, on either side. A setting of max_turns=200 therefore caps the episode at ~100 agent shots + ~100 heuristic-opponent shots. Set it to 400 if you want roughly 200 agent shots, or pass a custom Task with whatever cap you need.

Benchmark Results

Full inference logs live in inference_runs/.

MiniMaxAI/MiniMax-M2.5-fast (Nebius)

1 task × 200 turns (seed 0), via https://api.tokenfactory.us-central1.nebius.com/v1. Full log: inference_runs/minimax-m2.5-fast_200turns_inference.log

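| Purple Agent | Reward | Win% | Coins | Dues | ICF% | Efficiency |
| --- | --- | --- | --- | --- | --- | --- |
| Random | −4.21 | 0% | 4.0 | 2.00 | 98% | 0.368 |
| Heuristic (ICF-aware) | −6.31 | 0% | 4.0 | 0.00 | 100% | 0.337 |
| LLM · MiniMax-M2.5-fast | +4.07 | 100% | 8.0 | 1.00 | 97% | 0.759 |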

MiniMax-M2.5-fast wins the board with 8 coins potted, 1 due, and 5 fouls, beating both baselines on reward and efficiency. The heuristic tanks to −6.31 because it's aggressive about shooting at white coins but hands the opponent easy board position on misses.

nvidia/Nemotron-3-Super-120b-a12b (Nebius)

1 task × 200 turns (seed 0), via https://api.tokenfactory.us-central1.nebius.com/v1. Full log: inference_runs/nemotron-3-super-120b_200turns_inference.log. Wall-clock: 2001.6 s (~33 min) for the full LLM episode.

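| Purple Agent | Reward | Win% | Coins | Dues | ICF% | Efficiency |
| --- | --- | --- | --- | --- | --- | --- |
| Random | +4.17 | 100% | 8.0 | 1.00 | 100% | 0.712 |
| Heuristic (ICF-aware) | −6.31 | 0% | 4.0 | 0.00 | 100% | 0.337 |
| LLM · Nemotron-3-Super-120b | +11.94 | 100% | 13.0 | 2.00 | 97% | 1.179 |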

Nemotron, NVIDIA's 120B hybrid-MoE reasoning model, outscores both baselines on reward, coins potted, and efficiency (Random also wins its board in this run, so win rate is tied at 100%), and has the highest compute efficiency I've seen so far at 1.18 coins potted per 1000 sim steps. It logged one parse-failure fallback over 106 agent shots, plus 2 dues and 4 fouls. The extra reasoning horsepower shows up as more aggressive scoring play than MiniMax (+11.94 vs +4.07 reward on the same seed) at the cost of one extra due (2 vs 1); with one fewer foul (4 vs 5), ICF compliance is 97% for both.

Frontier-model comparison (seed 0)

| Model | Reward | Coins | Dues | Fouls | ICF % | Efficiency |
| --- | --- | --- | --- | --- | --- | --- |
| MiniMax-M2.5-fast | +4.07 | 8 | 1 | 5 | 97% | 0.759 |
| Nemotron-3-Super-120b | +11.94 | 13 | 2 | 4 | 97% | 1.179 |

Setup & Usage

Prerequisites

  • Python 3.10+
  • Docker (for containerized deployment)
  • pip install openenv-core pymunk

Local Development

pip install -e ".[dev]"
PYTHONPATH=. uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload

Docker

docker build -t carrom-env:latest .
docker run -p 8000:8000 carrom-env:latest

Baseline Inference

export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen3-4B"
export HF_TOKEN="hf_..."
python inference.py

Project Structure

carrom_rl_env/
├── __init__.py              # Module exports
├── carrom_env/
│   ├── __init__.py          # Package exports
│   ├── env.py               # CarromEnv (physics + ICF game logic)
│   ├── models.py            # Action, Observation, State models
│   ├── constants.py         # Board config + physics constants (BOARD_DECEL, FRICTION, …)
│   └── green_agent.py       # Green Agent (evaluator: task suite + ICF-aware scoring)
├── client.py                # CarromEnv (EnvClient)
├── inference.py             # Baseline inference script
├── server/
│   ├── __init__.py
│   ├── carrom_environment.py # Server-side Environment wrapper
│   └── app.py               # FastAPI application
├── examples/
│   ├── train_stub.py        # Quick demo
│   ├── grpo_utils.py        # GRPO training utilities
│   └── grpo_carrom_tutorial.ipynb  # Training notebook
├── tests/
│   └── test_env_basic.py    # Test suite
├── openenv.yaml             # OpenEnv manifest
├── pyproject.toml           # Dependencies
├── Dockerfile               # Container image
└── README.md                # This file