
Discovery Environment

A benchmark for evaluating LLM agents on scientific rule discovery. The agent observes a black-box dynamical system through an MCP interface and must infer the exact update rule governing its evolution.

Overview

Each problem is a 2D cellular automaton with a hidden update rule. The agent can:

  • Query system metadata (get_system_info)
  • Generate random initial states (random_state)
  • Simulate forward from any state (simulate)
  • Submit candidate rules and receive accuracy feedback (submit_rule)

The agent never sees the source code of the rule. It must design experiments, collect data, form hypotheses, and iteratively refine its submission.
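The experiment loop above can be sketched with a self-contained toy (a stand-in `black_box_step` plays the role of the MCP `simulate` tool; the real black box is never visible to the agent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy black box standing in for `simulate`: the agent only ever sees
# input/output pairs, never this function's body.
def black_box_step(s):
    return (s + 1) % 3  # hidden rule: increment every cell mod 3

# Two competing hypotheses the agent has formed.
def candidate_a(s):
    return (s + 1) % 3

def candidate_b(s):
    return (s + 2) % 3

# Experiment loop: probe random states, score each hypothesis by
# exact-match accuracy, keep the best one for submission.
states = [rng.integers(0, 3, size=(5, 5)) for _ in range(20)]
for name, rule in [("a", candidate_a), ("b", candidate_b)]:
    acc = np.mean([np.array_equal(rule(s), black_box_step(s)) for s in states])
    print(name, acc)
```

In the real benchmark the probing goes through the MCP tools rather than a local function, but the hypothesize-test-refine structure is the same.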

Problems

ID   Name                 Grid   Values  Difficulty   Key Challenge
G01  Modular Clock        20x20  0-4     Medium       Conditional rule + modular arithmetic
G02  Asymmetric Survival  30x30  0-1     Easy-Medium  XOR birth condition, split neighborhoods
G03  Gradient Life        20x20  0-9     Medium       Median + mean across two neighborhood types
G04  Territorial Grid     30x30  0-2     Medium       Two-species competition with priority rules
G05  Tired Life           30x30  0-1     Hard         Hidden fatigue state (GoL + memory)
G06  Directional Flow     30x30  0-1     Hard         Hidden per-cell orientation
G07  Parity Cascade       30x30  0-1     Medium-Hard  Dual algebra: XOR parity + even sum
G08  Phase-Locked Grid    20x20  0-7     Medium-Hard  Local synchronization + periodic global reset

Difficulty Axes

Each problem is characterized along five axes:

  • d_R (rule complexity): Number of distinct sub-rules or conditions
  • k (neighborhood complexity): How many neighborhood types are involved
  • v (value complexity): Richness of the cell value space
  • alpha (hidden state): Amount of non-visible state affecting dynamics
  • rho (regularity): How predictable/structured the rule is

Architecture

discovery_env/
  base.py           # BaseEnvironment ABC
  registry.py       # Problem registry (get_problem, list_problems)
  scoring.py        # Scoring: accuracy, parsimony (delta-DL), efficiency
  grids/
    g01_modular_clock.py
    g02_asymmetric_survival.py
    g03_gradient_life.py
    g04_territorial.py
    g05_tired_life.py
    g06_directional_flow.py
    g07_parity_cascade.py
    g08_phase_locked.py

discovery_env_server/
  server.py          # MCP server exposing tools to the agent

Scoring

Each submission is scored on three components (max total: 1.3):

Functional Accuracy (0.0 - 1.0, weight: 77%)

The fraction of 500 held-out random test states where the agent's predict_next(state) exactly matches the true next state (element-wise np.array_equal).

Parsimony Bonus (0.0 - 0.2, weight: 15%)

Based on delta description length (delta-DL): the difference between the agent's code length and the reference (optimal) code length, after stripping comments.

agent_dl   = len(strip_comments(agent_code))
ref_dl     = len(strip_comments(reference_code))
delta_dl   = max(0, agent_dl - ref_dl)
parsimony  = 0.2 * max(0, 1 - delta_dl / 300)

If the agent writes code as concise as the reference, it gets the full 0.2 bonus. Code 300+ characters longer than the reference gets 0.

Efficiency Bonus (0.0 - 0.1, weight: 8%)

Rewards fewer MCP queries (simulate, random_state, etc.):

efficiency = 0.1 * max(0, 1 - queries_used / 60)
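Putting the three components together, the total score follows directly from the formulas above. This is a minimal sketch; the real `strip_comments` lives in scoring.py and is not shown here, so a naive '#'-stripper stands in for it:

```python
def strip_comments(code):
    # Naive stand-in for the real comment stripper: drop '#' comments per line.
    return "\n".join(line.split("#")[0].rstrip() for line in code.splitlines())

def score_components(accuracy, agent_code, reference_code, queries_used):
    # Parsimony: delta description length vs. the reference solution.
    agent_dl = len(strip_comments(agent_code))
    ref_dl = len(strip_comments(reference_code))
    delta_dl = max(0, agent_dl - ref_dl)
    parsimony = 0.2 * max(0, 1 - delta_dl / 300)
    # Efficiency: fewer MCP queries is better, zero bonus at 60+ queries.
    efficiency = 0.1 * max(0, 1 - queries_used / 60)
    return {
        "functional_accuracy": accuracy,
        "parsimony": parsimony,
        "efficiency": efficiency,
        "total": accuracy + parsimony + efficiency,  # max 1.0 + 0.2 + 0.1 = 1.3
    }
```

A perfect submission (accuracy 1.0, code as short as the reference, zero queries) scores the full 1.3.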

Usage

Python API

from discovery_env import get_problem, list_problems

# List all problems
for p in list_problems():
    print(p["id"], p["name"], p["difficulty"])

# Create an environment
env = get_problem("G01")
env.reset(seed=42)

# Observe state
state = env.get_state()  # numpy array

# Step forward
next_state = env.step(1)

# Score a submission
from discovery_env import score_submission
result = score_submission(
    "<law>def predict_next(s): ...</law>",
    env,
    queries_used=10,
)
print(result["functional_accuracy"], result["total"])

MCP Server

The MCP server exposes the environment as tools for an LLM agent:

PROBLEM_ID=G01 python discovery_env_server/server.py

Tools available to the agent:

  • get_system_info() — grid dimensions, value range, description
  • random_state(seed) — generate a random initial condition
  • simulate(state_json, n_steps) — evolve a state forward, get trajectory
  • submit_rule(code) — submit predict_next function, get accuracy feedback
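For illustration, here is the kind of `predict_next` function an agent might pass to submit_rule. This example implements the standard Game of Life rule (the G05 baseline when fatigue is zero); the toroidal wraparound via np.roll is an assumption, since the README does not state the boundary conditions:

```python
import numpy as np

def predict_next(s):
    # Count the 8 Moore neighbors of every cell at once, assuming
    # wraparound (toroidal) boundaries via np.roll.
    s = np.asarray(s)
    n = sum(np.roll(np.roll(s, di, 0), dj, 1)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0))
    # Game of Life: a cell is alive next step if it has exactly 3 live
    # neighbors, or if it is alive now and has exactly 2.
    return ((n == 3) | ((s == 1) & (n == 2))).astype(int)
```

The function takes and returns a full grid as a numpy array, matching the element-wise np.array_equal check used in scoring.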

Docker (Isolated Agent Runs)

# Build the image
docker build -t discovery-env .

# Run an agent on problem G01 with Claude Sonnet
./launch_docker_agents.sh G01

# Monitor via dashboard
python experiments/dashboard.py

Hidden State Problems

Some problems (G05, G06, G08) have internal state not visible in a single snapshot:

  • G05 (Tired Life): Hidden fatigue counters. For single-step scoring, fatigue is assumed to be zero (equivalent to standard Game of Life).
  • G06 (Directional Flow): Hidden per-cell orientations. The orientation map is fixed per session but invisible to the agent.
  • G08 (Phase-Locked): Has a periodic global reset every 5 steps. Single-step scoring tests only the local rule (no global reset).

Adding New Problems

  1. Create discovery_env/grids/g09_your_problem.py
  2. Subclass BaseEnvironment
  3. Implement: _init_state, _step_once, get_state, get_state_shape, _set_visible_state, _true_step, reference_code
  4. Register in registry.py
  5. Export in grids/__init__.py
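The steps above can be sketched as follows. This is a hypothetical skeleton: the real BaseEnvironment ABC lives in discovery_env/base.py and its exact signatures may differ, so a minimal stand-in base class is defined here to make the snippet self-contained:

```python
import numpy as np

class BaseEnvironment:
    # Stand-in for the real ABC in discovery_env/base.py (signatures assumed).
    def reset(self, seed=None):
        self._rng = np.random.default_rng(seed)
        self.state = self._init_state()
        return self.state

    def step(self, n=1):
        for _ in range(n):
            self.state = self._step_once(self.state)
        return self.state

class G09YourProblem(BaseEnvironment):
    """Illustrative rule: each cell takes the max over itself and its
    von Neumann neighbors (wraparound boundaries)."""

    def _init_state(self):
        return self._rng.integers(0, 2, size=(10, 10))

    def _step_once(self, s):
        return self._true_step(s)

    def get_state(self):
        return self.state

    def get_state_shape(self):
        return (10, 10)

    def _set_visible_state(self, s):
        self.state = np.asarray(s)

    def _true_step(self, s):
        shifted = [np.roll(s, d, axis=a) for d in (-1, 1) for a in (0, 1)]
        return np.maximum.reduce([s] + shifted)

    @property
    def reference_code(self):
        return "def predict_next(s): ..."  # the reference solution string
```

After implementing the class, register it in registry.py and export it in grids/__init__.py so get_problem("G09") can find it.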