# Discovery Environment
A benchmark for evaluating LLM agents on scientific rule discovery. The agent observes a black-box dynamical system through an MCP interface and must infer the exact update rule governing its evolution.
## Overview

Each problem is a 2D cellular automaton with a hidden update rule. The agent can:

- Query system metadata (`get_system_info`)
- Generate random initial states (`random_state`)
- Simulate forward from any state (`simulate`)
- Submit candidate rules and receive accuracy feedback (`submit_rule`)
The agent never sees the source code of the rule. It must design experiments, collect data, form hypotheses, and iteratively refine its submission.
## Problems
| ID | Name | Grid | Values | Difficulty | Key Challenge |
|---|---|---|---|---|---|
| G01 | Modular Clock | 20x20 | 0-4 | Medium | Conditional rule + modular arithmetic |
| G02 | Asymmetric Survival | 30x30 | 0-1 | Easy-Medium | XOR birth condition, split neighborhoods |
| G03 | Gradient Life | 20x20 | 0-9 | Medium | Median + mean across two neighborhood types |
| G04 | Territorial Grid | 30x30 | 0-2 | Medium | Two-species competition with priority rules |
| G05 | Tired Life | 30x30 | 0-1 | Hard | Hidden fatigue state (GoL + memory) |
| G06 | Directional Flow | 30x30 | 0-1 | Hard | Hidden per-cell orientation |
| G07 | Parity Cascade | 30x30 | 0-1 | Medium-Hard | Dual algebra: XOR parity + even sum |
| G08 | Phase-Locked Grid | 20x20 | 0-7 | Medium-Hard | Local synchronization + periodic global reset |
## Difficulty Axes
Each problem is characterized along five axes:
- d_R (rule complexity): Number of distinct sub-rules or conditions
- k (neighborhood complexity): How many neighborhood types are involved
- v (value complexity): Richness of the cell value space
- alpha (hidden state): Amount of non-visible state affecting dynamics
- rho (regularity): How predictable/structured the rule is
## Architecture

```
discovery_env/
    base.py          # BaseEnvironment ABC
    registry.py      # Problem registry (get_problem, list_problems)
    scoring.py       # Scoring: accuracy, parsimony (delta-DL), efficiency
    grids/
        g01_modular_clock.py
        g02_asymmetric_survival.py
        g03_gradient_life.py
        g04_territorial.py
        g05_tired_life.py
        g06_directional_flow.py
        g07_parity_cascade.py
        g08_phase_locked.py
discovery_env_server/
    server.py        # MCP server exposing tools to the agent
```
## Scoring
Each submission is scored on three components (max total: 1.3):
### Functional Accuracy (0.0 - 1.0, weight: 77%)

The fraction of 500 held-out random test states where the agent's `predict_next(state)` exactly matches the true next state (element-wise `np.array_equal`).
### Parsimony Bonus (0.0 - 0.2, weight: 15%)

Based on delta description length (delta-DL): the difference between the agent's code length and the reference (optimal) code length, after stripping comments.

```python
agent_dl = len(strip_comments(agent_code))
ref_dl = len(strip_comments(reference_code))
delta_dl = max(0, agent_dl - ref_dl)
parsimony = 0.2 * max(0, 1 - delta_dl / 300)
```

If the agent writes code as concise as the reference, it gets the full 0.2 bonus. Code 300+ characters longer than the reference gets 0.
### Efficiency Bonus (0.0 - 0.1, weight: 8%)

Rewards fewer MCP queries (`simulate`, `random_state`, etc.):

```python
efficiency = 0.1 * max(0, 1 - queries_used / 60)
```
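Putting the three components together, a minimal sketch of how a total score could be combined (the helper name `score_submission_sketch` and its per-component inputs are illustrative; the real `score_submission` also executes the submitted code against the held-out states):

```python
def score_submission_sketch(correct, total, agent_dl, ref_dl, queries_used):
    """Combine the three scoring components described above.

    Sketch only: the actual scorer runs the agent's predict_next to
    obtain `correct`; here all inputs are passed in directly.
    """
    accuracy = correct / total                         # 0.0 - 1.0
    delta_dl = max(0, agent_dl - ref_dl)               # extra code length
    parsimony = 0.2 * max(0, 1 - delta_dl / 300)       # 0.0 - 0.2
    efficiency = 0.1 * max(0, 1 - queries_used / 60)   # 0.0 - 0.1
    return accuracy + parsimony + efficiency           # max total: 1.3
```

A perfect, reference-length submission that used 12 queries would score `1.0 + 0.2 + 0.1 * (1 - 12/60) = 1.28`.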
## Usage

### Python API

```python
from discovery_env import get_problem, list_problems

# List all problems
for p in list_problems():
    print(p["id"], p["name"], p["difficulty"])

# Create an environment
env = get_problem("G01")
env.reset(seed=42)

# Observe state
state = env.get_state()  # numpy array

# Step forward
next_state = env.step(1)

# Score a submission
from discovery_env import score_submission

result = score_submission(
    "<law>def predict_next(s): ...</law>",
    env,
    queries_used=10,
)
print(result["functional_accuracy"], result["total"])
```
### MCP Server

The MCP server exposes the environment as tools for an LLM agent:

```bash
PROBLEM_ID=G01 python discovery_env_server/server.py
```

Tools available to the agent:

- `get_system_info()` — grid dimensions, value range, description
- `random_state(seed)` — generate a random initial condition
- `simulate(state_json, n_steps)` — evolve a state forward, get trajectory
- `submit_rule(code)` — submit a `predict_next` function, get accuracy feedback
### Docker (Isolated Agent Runs)

```bash
# Build the image
docker build -t discovery-env .

# Run an agent on problem G01 with Claude Sonnet
./launch_docker_agents.sh G01

# Monitor via dashboard
python experiments/dashboard.py
```
## Hidden State Problems
Some problems (G05, G06, G08) have internal state not visible in a single snapshot:
- G05 (Tired Life): Hidden fatigue counters. For single-step scoring, fatigue is assumed to be zero (equivalent to standard Game of Life).
- G06 (Directional Flow): Hidden per-cell orientations. The orientation map is fixed per session but invisible to the agent.
- G08 (Phase-Locked): Has a periodic global reset every 5 steps. Single-step scoring tests only the local rule (no global reset).
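Since G05's single-step scoring reduces to standard Game of Life, a submission for it could look like the sketch below (assuming a Moore neighborhood with toroidal wrap-around boundaries; the actual environment's boundary handling may differ):

```python
import numpy as np

def predict_next(s):
    """One step of standard Conway Game of Life (B3/S23).

    Sketch only: assumes wrap-around edges via np.roll.
    """
    s = np.asarray(s)
    # Sum the 8 Moore neighbors using wrap-around shifts.
    n = sum(np.roll(np.roll(s, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return ((n == 3) | ((s == 1) & (n == 2))).astype(s.dtype)
```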
## Adding New Problems

- Create `discovery_env/grids/g09_your_problem.py`
- Subclass `BaseEnvironment`
- Implement: `_init_state`, `_step_once`, `get_state`, `get_state_shape`, `_set_visible_state`, `_true_step`, `reference_code`
- Register in `registry.py`
- Export in `grids/__init__.py`
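A minimal skeleton following these steps might look like the sketch below. The `BaseEnvironment` stub and the toy increment-mod-3 rule are purely illustrative; the real ABC in `discovery_env/base.py` defines the actual contract:

```python
import numpy as np

class BaseEnvironment:
    """Stand-in for discovery_env.base.BaseEnvironment (sketch only)."""
    def reset(self, seed=None):
        self._rng = np.random.default_rng(seed)
        self._state = self._init_state()
        return self._state

    def step(self, n=1):
        for _ in range(n):
            self._state = self._step_once(self._state)
        return self._state

class G09YourProblem(BaseEnvironment):
    """Toy hidden rule: every cell increments modulo 3 each step."""
    def _init_state(self):
        return self._rng.integers(0, 3, size=(10, 10))

    def _step_once(self, state):
        return (state + 1) % 3

    def get_state(self):
        return self._state

    def get_state_shape(self):
        return (10, 10)

    def _set_visible_state(self, state):
        self._state = np.asarray(state)

    def _true_step(self, state):
        return self._step_once(np.asarray(state))

    reference_code = "def predict_next(s):\n    return (s + 1) % 3\n"
```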