# Discovery Environment
A benchmark for evaluating LLM agents on scientific rule discovery. The agent observes a black-box dynamical system through an MCP interface and must infer the exact update rule governing its evolution.
## Overview

Each problem is a 2D cellular automaton with a hidden update rule. The agent can:

- Query system metadata (`get_system_info`)
- Generate random initial states (`random_state`)
- Simulate forward from any state (`simulate`)
- Submit candidate rules and receive accuracy feedback (`submit_rule`)
The agent never sees the source code of the rule. It must design experiments, collect data, form hypotheses, and iteratively refine its submission.
## Problems
| ID | Name | Grid | Values | Difficulty | Key Challenge |
|---|---|---|---|---|---|
| G01 | Modular Clock | 20x20 | 0-4 | Medium | Conditional rule + modular arithmetic |
| G02 | Asymmetric Survival | 30x30 | 0-1 | Easy-Medium | XOR birth condition, split neighborhoods |
| G03 | Gradient Life | 20x20 | 0-9 | Medium | Median + mean across two neighborhood types |
| G04 | Territorial Grid | 30x30 | 0-2 | Medium | Two-species competition with priority rules |
| G05 | Tired Life | 30x30 | 0-1 | Hard | Hidden fatigue state (GoL + memory) |
| G06 | Directional Flow | 30x30 | 0-1 | Hard | Hidden per-cell orientation |
| G07 | Parity Cascade | 30x30 | 0-1 | Medium-Hard | Dual algebra: XOR parity + even sum |
| G08 | Phase-Locked Grid | 20x20 | 0-7 | Medium-Hard | Local synchronization + periodic global reset |
## Difficulty Axes
Each problem is characterized along five axes:
- d_R (rule complexity): Number of distinct sub-rules or conditions
- k (neighborhood complexity): How many neighborhood types are involved
- v (value complexity): Richness of the cell value space
- alpha (hidden state): Amount of non-visible state affecting dynamics
- rho (regularity): How predictable/structured the rule is
## Architecture

```
discovery_env/
    base.py          # BaseEnvironment ABC
    registry.py      # Problem registry (get_problem, list_problems)
    scoring.py       # Scoring: accuracy, parsimony (delta-DL), efficiency
    grids/
        g01_modular_clock.py
        g02_asymmetric_survival.py
        g03_gradient_life.py
        g04_territorial.py
        g05_tired_life.py
        g06_directional_flow.py
        g07_parity_cascade.py
        g08_phase_locked.py
discovery_env_server/
    server.py        # MCP server exposing tools to the agent
```
## Scoring
Each submission is scored on three components (max total: 1.3):
### Functional Accuracy (0.0 - 1.0, weight: 77%)

The fraction of 500 held-out random test states where the agent's `predict_next(state)` exactly matches the true next state (element-wise `np.array_equal`).
### Parsimony Bonus (0.0 - 0.2, weight: 15%)

Based on delta description length (delta-DL): the difference between the agent's code length and the reference (optimal) code length, after stripping comments.

```python
agent_dl = len(strip_comments(agent_code))
ref_dl = len(strip_comments(reference_code))
delta_dl = max(0, agent_dl - ref_dl)
parsimony = 0.2 * max(0, 1 - delta_dl / 300)
```

If the agent writes code as concise as the reference, it gets the full 0.2 bonus. Code 300+ characters longer than the reference gets 0.
### Efficiency Bonus (0.0 - 0.1, weight: 8%)

Rewards fewer MCP queries (`simulate`, `random_state`, etc.):

```python
efficiency = 0.1 * max(0, 1 - queries_used / 60)
```
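Putting the three components together, a minimal sketch of how a total score could be combined (the helper name `score_submission_sketch` and its per-component inputs are illustrative; the real `score_submission` also executes the submitted code against the held-out states):

```python
def score_submission_sketch(correct, total, agent_dl, ref_dl, queries_used):
    """Combine the three scoring components described above.

    Sketch only: the actual scorer runs the agent's predict_next to
    obtain `correct`; here all inputs are passed in directly.
    """
    accuracy = correct / total                         # 0.0 - 1.0
    delta_dl = max(0, agent_dl - ref_dl)               # extra code length
    parsimony = 0.2 * max(0, 1 - delta_dl / 300)       # 0.0 - 0.2
    efficiency = 0.1 * max(0, 1 - queries_used / 60)   # 0.0 - 0.1
    return accuracy + parsimony + efficiency           # max total: 1.3
```

A perfect, reference-length submission that used 12 queries would score `1.0 + 0.2 + 0.1 * (1 - 12/60) = 1.28`.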
## Usage

### Python API

```python
from discovery_env import get_problem, list_problems

# List all problems
for p in list_problems():
    print(p["id"], p["name"], p["difficulty"])

# Create an environment
env = get_problem("G01")
env.reset(seed=42)

# Observe state
state = env.get_state()  # numpy array

# Step forward
next_state = env.step(1)

# Score a submission
from discovery_env import score_submission

result = score_submission(
    "<law>def predict_next(s): ...</law>",
    env,
    queries_used=10,
)
print(result["functional_accuracy"], result["total"])
```
### MCP Server

The MCP server exposes the environment as tools for an LLM agent:

```bash
PROBLEM_ID=G01 python discovery_env_server/server.py
```

Tools available to the agent:

- `get_system_info()` — grid dimensions, value range, description
- `random_state(seed)` — generate a random initial condition
- `simulate(state_json, n_steps)` — evolve a state forward, get trajectory
- `submit_rule(code)` — submit a `predict_next` function, get accuracy feedback
### Docker (Isolated Agent Runs)

```bash
# Build the image
docker build -t discovery-env .

# Run an agent on problem G01 with Claude Sonnet
./launch_docker_agents.sh G01

# Monitor via dashboard
python experiments/dashboard.py
```
## Hidden State Problems
Some problems (G05, G06, G08) have internal state not visible in a single snapshot:
- G05 (Tired Life): Hidden fatigue counters. For single-step scoring, fatigue is assumed to be zero (equivalent to standard Game of Life).
- G06 (Directional Flow): Hidden per-cell orientations. The orientation map is fixed per session but invisible to the agent.
- G08 (Phase-Locked): Has a periodic global reset every 5 steps. Single-step scoring tests only the local rule (no global reset).
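Since G05's single-step scoring reduces to standard Game of Life, a submission for it could look like the sketch below (assuming a Moore neighborhood with toroidal wrap-around boundaries; the actual environment's boundary handling may differ):

```python
import numpy as np

def predict_next(s):
    """One step of standard Conway Game of Life (B3/S23).

    Sketch only: assumes wrap-around edges via np.roll.
    """
    s = np.asarray(s)
    # Sum the 8 Moore neighbors using wrap-around shifts.
    n = sum(np.roll(np.roll(s, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return ((n == 3) | ((s == 1) & (n == 2))).astype(s.dtype)
```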
## Adding New Problems

- Create `discovery_env/grids/g09_your_problem.py`
- Subclass `BaseEnvironment`
- Implement: `_init_state`, `_step_once`, `get_state`, `get_state_shape`, `_set_visible_state`, `_true_step`, `reference_code`
- Register in `registry.py`
- Export in `grids/__init__.py`
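A minimal skeleton following these steps might look like the sketch below. The `BaseEnvironment` stub and the toy increment-mod-3 rule are purely illustrative; the real ABC in `discovery_env/base.py` defines the actual contract:

```python
import numpy as np

class BaseEnvironment:
    """Stand-in for discovery_env.base.BaseEnvironment (sketch only)."""
    def reset(self, seed=None):
        self._rng = np.random.default_rng(seed)
        self._state = self._init_state()
        return self._state

    def step(self, n=1):
        for _ in range(n):
            self._state = self._step_once(self._state)
        return self._state

class G09YourProblem(BaseEnvironment):
    """Toy hidden rule: every cell increments modulo 3 each step."""
    def _init_state(self):
        return self._rng.integers(0, 3, size=(10, 10))

    def _step_once(self, state):
        return (state + 1) % 3

    def get_state(self):
        return self._state

    def get_state_shape(self):
        return (10, 10)

    def _set_visible_state(self, state):
        self._state = np.asarray(state)

    def _true_step(self, state):
        return self._step_once(np.asarray(state))

    reference_code = "def predict_next(s):\n    return (s + 1) % 3\n"
```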