title: Scientific Hypothesis Lab
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
Scientific Hypothesis Lab -- OpenEnv Environment
An RL environment where agents discover hidden causal rules through systematic experimentation. Built for the OpenEnv Hub.
What it does
Each episode, the agent is presented with a set of abstract variables (e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomised causal world. Variable names are deliberately opaque so agents cannot leverage pretrained real-world knowledge -- they must reason purely from experimental evidence.
The hidden rules span 8 single-parent function types (linear, threshold, inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear), multi-parent interaction rules (additive, multiplicative, min, max), and optional hidden confounders that inject unexplainable correlated noise.
The agent must:
- Design experiments -- probe variable relationships using interventions, correlations, counterfactuals, or passive observations
- Update beliefs from noisy experimental results
- Submit a hypothesis -- a structured description of the discovered causal rules
The environment rewards informative experiments, precise hypotheses, calibrated confidence, and efficient budget use.
Quick Start
# Install dependencies
pip install -e .
# Run the server locally
uvicorn server.app:app --port 8000
# In another terminal, run the baseline agent
export OPENAI_API_KEY=sk-...
python baseline_inference.py
Using the Client
import asyncio

from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType

# Async usage
async def main():
    async with HypothesisLabEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(noise_level="low", domain="system_alpha")
        obs = result.observation

        # Run an intervention
        result = await env.run_intervention(
            control_variable=obs.available_variables[0],
            control_value=5.0,
            target_variable=obs.available_variables[1],
        )
        print(result.observation.system_message)

        # Submit hypothesis
        result = await env.submit_hypothesis(
            hypothesis_text="Beta = 2.1 * Alpha + 3.0",
            confidence=0.85,
        )
        print(f"Score: {result.observation.total_episode_reward}")

asyncio.run(main())

# Sync usage
env = HypothesisLabEnv(base_url="http://localhost:8000").sync()
with env:
    result = env.reset(noise_level="low")
    ...
File Structure
hypothesis_lab/
├── openenv.yaml                       # OpenEnv manifest
├── pyproject.toml                     # Project metadata and dependencies
├── requirements.txt                   # Pip fallback dependencies
├── README.md                          # This file
├── models.py                          # Pydantic Action / Observation / State models
├── client.py                          # Typed EnvClient for agents and trainers
├── __init__.py                        # Module exports
├── baseline_inference.py              # Baseline agent using OpenAI API
├── Dockerfile                         # For HF Spaces deployment
├── server/
│   ├── __init__.py
│   ├── app.py                         # FastAPI server (create_app entry point)
│   ├── hypothesis_lab_environment.py  # Core environment logic
│   ├── causal_world.py                # Hidden causal graph generator
│   └── rubric.py                      # Multi-component reward engine
├── tasks/
│   ├── __init__.py
│   ├── task_easy.py                   # Easy: 2 vars, low noise, 12 budget
│   ├── task_medium.py                 # Medium: 3 vars, medium noise, 10 budget
│   └── task_hard.py                   # Hard: 4 vars, high noise, 8 budget
└── tests/
    ├── __init__.py
    └── test_environment.py            # Unit + integration tests
Action Space
HypLabAction has two modes:
| Field | Type | Description |
|---|---|---|
| action_type | "experiment" or "submit" | What the agent is doing |
| experiment_type | "intervention", "correlation", "counterfactual", "passive" | Experiment kind (experiment mode) |
| control_variable | str | Variable to set/vary |
| control_value | float | Value to set (intervention/counterfactual) |
| control_range | [min, max, n] | Sweep range (correlation only) |
| target_variable | str | Variable to observe |
| hypothesis_text | str | Free-text hypothesis (submit mode) |
| hypothesis_equations | list[str] | Structured equations (submit mode) |
| confidence | float [0,1] | Self-reported confidence (submit mode) |
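The two modes can be illustrated with plain dictionaries that mirror the field names in the table. This is only a sketch: the helper names `make_experiment` and `make_submission` are hypothetical, and the real `HypLabAction` is a Pydantic model in models.py whose validation and wire format may differ.

```python
import json


def make_experiment(experiment_type, control_variable, target_variable,
                    control_value=None, control_range=None):
    """Build an experiment-mode payload (hypothetical helper, not the real client API)."""
    action = {
        "action_type": "experiment",
        "experiment_type": experiment_type,
        "control_variable": control_variable,
        "target_variable": target_variable,
    }
    if control_value is not None:
        # Used by intervention and counterfactual experiments
        action["control_value"] = float(control_value)
    if control_range is not None:
        # Used by correlation experiments: [min, max, n]
        lo, hi, n = control_range
        action["control_range"] = [float(lo), float(hi), int(n)]
    return action


def make_submission(hypothesis_text, confidence, hypothesis_equations=None):
    """Build a submit-mode payload (hypothetical helper, not the real client API)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return {
        "action_type": "submit",
        "hypothesis_text": hypothesis_text,
        "hypothesis_equations": list(hypothesis_equations or []),
        "confidence": float(confidence),
    }


intervention = make_experiment("intervention", "Alpha", "Beta", control_value=5.0)
sweep = make_experiment("correlation", "Alpha", "Beta", control_range=(0, 10, 5))
submission = make_submission(
    "Beta = 2.1 * Alpha + 3.0", 0.85, ["Beta = 2.1 * Alpha + 3.0"]
)
print(json.dumps(intervention, indent=2))
```

Only the fields relevant to the chosen mode are populated; the sweep payload, for instance, carries `control_range` and no `control_value`.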
Observation Space
HypLabObservation always contains:
- system_message: Human-readable text the LLM reads
- available_variables: Variable names in this episode
- budget_remaining: Steps left
- done: Whether episode ended
- reward: Step reward
On experiment steps: result_value, noise_sigma, info_gain_reward, is_redundant
On submit: accuracy_score, precision_bonus, calibration_score, efficiency_bonus, contradiction_penalty, total_episode_reward, ground_truth_revealed
Causal Rule Types
The hidden world can contain any of these relationship types:
| Rule | Formula | Shape |
|---|---|---|
| Linear | y = a*x + b | Straight line |
| Threshold | y = high if x > t else low | Step function |
| Inverse | y = a / x | Hyperbola |
| Quadratic | y = a*x² + b*x + c | Parabola |
| Exponential | y = a * exp(k*x) | Growth/decay |
| Logarithmic | y = a * ln(x) + b | Diminishing returns |
| Saturating | y = Vmax * x / (Km + x) | Plateau (Michaelis-Menten) |
| Piecewise-linear | Two slopes with a knot | Regime change |
Additionally, some effects may depend on two parents via interaction rules (additive, multiplicative, min, max), and hidden confounders may inject correlated noise the agent cannot explain.
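As a rough illustration, the formulas in the table translate directly into Python functions. The function names, the parameterisation of the piecewise-linear rule (continuous at the knot), and the way two parent contributions are combined in `INTERACTIONS` are assumptions made for this sketch; the generator in server/causal_world.py may be organised differently.

```python
import math


def linear(x, a, b):
    return a * x + b


def threshold(x, t, low, high):
    return high if x > t else low


def inverse(x, a):
    return a / x  # undefined at x == 0


def quadratic(x, a, b, c):
    return a * x**2 + b * x + c


def exponential(x, a, k):
    return a * math.exp(k * x)


def logarithmic(x, a, b):
    return a * math.log(x) + b  # requires x > 0


def saturating(x, vmax, km):
    return vmax * x / (km + x)


def piecewise_linear(x, knot, slope_lo, slope_hi, intercept=0.0):
    # Assumed form: two slopes joined continuously at the knot
    if x <= knot:
        return intercept + slope_lo * x
    y_knot = intercept + slope_lo * knot
    return y_knot + slope_hi * (x - knot)


# Assumed combiners for two parent contributions u and v
INTERACTIONS = {
    "additive": lambda u, v: u + v,
    "multiplicative": lambda u, v: u * v,
    "min": min,
    "max": max,
}

# Example: an effect driven by two parents through a "min" interaction
effect = INTERACTIONS["min"](saturating(2.0, vmax=10.0, km=2.0), linear(1.0, a=2.0, b=1.0))
print(effect)
```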
Reward Components
| Signal | Value | What it trains |
|---|---|---|
| Information gain | +0.05 to +0.25/step | Designing informative experiments |
| Redundant experiment | -0.10 | Not wasting budget |
| Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer |
| Precision bonus | +0.10 | Quantitative, falsifiable claims |
| Calibration score | 0.0 to +0.20 | Knowing what you don't know |
| Efficiency bonus | +0.15 | Submitting early when confident |
| Contradiction penalty | -0.50 | Contradicting the experimental setup |
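The table lists the individual signals but not how they are combined. The sketch below assumes the simplest possible rule, a plain sum, with each component clamped to the range shown above; it also assumes a redundant experiment receives the penalty instead of any information-gain reward. Both function names are hypothetical, and the actual logic in server/rubric.py may differ.

```python
def step_reward(info_gain, redundant):
    """Per-step reward under the assumptions stated above."""
    if redundant:
        return -0.10
    # Clamp to the +0.05 .. +0.25 range listed in the table
    return min(max(info_gain, 0.05), 0.25)


def terminal_reward(accuracy, calibration, precise, efficient, contradicted):
    """Reward at submission time, assuming the components simply add up."""
    total = min(max(accuracy, 0.0), 1.0)         # Hypothesis accuracy
    total += min(max(calibration, 0.0), 0.20)    # Calibration score
    if precise:
        total += 0.10                            # Precision bonus
    if efficient:
        total += 0.15                            # Efficiency bonus
    if contradicted:
        total -= 0.50                            # Contradiction penalty
    return total


steps = [step_reward(0.2, False), step_reward(0.1, True)]
final = terminal_reward(accuracy=0.8, calibration=0.15,
                        precise=True, efficient=True, contradicted=False)
print(f"step rewards: {steps}, terminal reward: {final:.2f}")
```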
Tasks (3 difficulty levels)
| Task | Noise | Variables | Budget | Domain | Key Challenge |
|---|---|---|---|---|---|
| Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery |
| Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals |
| Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget |
Each task has a deterministic grader that returns a score in [0.0, 1.0].
Design Decisions
Abstract variable names: Variables are named Alpha, Beta, Gamma (or V1, V2, V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents from using pretrained knowledge of real-world physics/economics/biology to shortcut the reasoning process. The agent must reason purely from experimental data.
Diverse rule types: With 8 single-parent types plus interaction rules, the agent cannot memorize a small set of templates. Many rule types look similar in narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to design discriminating experiments.
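The point about narrow ranges can be checked numerically. The sketch below compares y = a * exp(k*x) with its first-order linear approximation a * (1 + k*x) over a narrow and a wide sweep. The parameter values (a = 1, k = 0.1) and the noise level of 0.05 used for comparison are illustrative assumptions, not values taken from the environment.

```python
import math


def max_gap(x_lo, x_hi, a=1.0, k=0.1, n=50):
    """Largest absolute difference between a*exp(k*x) and a*(1 + k*x) on [x_lo, x_hi]."""
    xs = [x_lo + (x_hi - x_lo) * i / (n - 1) for i in range(n)]
    return max(abs(a * math.exp(k * x) - a * (1.0 + k * x)) for x in xs)


noise = 0.05                     # illustrative noise scale
narrow_gap = max_gap(0.0, 0.5)   # sweep confined to small x
wide_gap = max_gap(0.0, 10.0)    # sweep covering a wide range

print(f"narrow sweep gap: {narrow_gap:.4f} (below noise: {narrow_gap < noise})")
print(f"wide sweep gap:   {wide_gap:.4f} (above noise: {wide_gap > noise})")
```

With these values the narrow sweep leaves the two rules indistinguishable under noise, while the wide sweep separates them clearly, which is why the choice of `control_range` matters.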
Deploy to HF Spaces
openenv push --org your-org --token $HF_TOKEN
Run Tests
pytest tests/ -v
Baseline Scores
Baseline agent (gpt-4o-mini, temperature=0.3):
| Task | Score |
|---|---|
| Easy | ~0.65 |
| Medium | ~0.40 |
| Hard | ~0.25 |
| Average | ~0.43 |
These scores are reproducible via python baseline_inference.py with the same model and seed.