metadata
title: Scientific Hypothesis Lab
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

Scientific Hypothesis Lab -- OpenEnv Environment

An RL environment where agents discover hidden causal rules through systematic experimentation. Built for the OpenEnv Hub.

What it does

Each episode, the agent is presented with a set of abstract variables (e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomised causal world. Variable names are deliberately opaque so agents cannot leverage pretrained real-world knowledge -- they must reason purely from experimental evidence.

The hidden rules span 8 single-parent function types (linear, threshold, inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear), multi-parent interaction rules (additive, multiplicative, min, max), and optional hidden confounders that inject unexplainable correlated noise.

The agent must:

  1. Design experiments -- probe variable relationships using interventions, correlations, counterfactuals, or passive observations
  2. Update beliefs from noisy experimental results
  3. Submit a hypothesis -- a structured description of the discovered causal rules

The environment rewards informative experiments, precise hypotheses, calibrated confidence, and efficient budget use.

Quick Start

# Install dependencies
pip install -e .

# Run the server locally
uvicorn server.app:app --port 8000

# In another terminal, run the baseline agent
export OPENAI_API_KEY=sk-...
python baseline_inference.py

Using the Client

import asyncio

from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType

# Async usage: "async with" and "await" must run inside a coroutine
async def main():
    async with HypothesisLabEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(noise_level="low", domain="system_alpha")
        obs = result.observation

        # Run an intervention: set the first variable, observe the second
        result = await env.run_intervention(
            control_variable=obs.available_variables[0],
            control_value=5.0,
            target_variable=obs.available_variables[1],
        )
        print(result.observation.system_message)

        # Submit hypothesis with a self-reported confidence in [0, 1]
        result = await env.submit_hypothesis(
            hypothesis_text="Beta = 2.1 * Alpha + 3.0",
            confidence=0.85,
        )
        print(f"Score: {result.observation.total_episode_reward}")

asyncio.run(main())

# Sync usage
env = HypothesisLabEnv(base_url="http://localhost:8000").sync()
with env:
    result = env.reset(noise_level="low")
    ...

File Structure

hypothesis_lab/
├── openenv.yaml              # OpenEnv manifest
├── pyproject.toml             # Project metadata and dependencies
├── requirements.txt           # Pip fallback dependencies
├── README.md                  # This file
├── models.py                  # Pydantic Action / Observation / State models
├── client.py                  # Typed EnvClient for agents and trainers
├── __init__.py                # Module exports
├── baseline_inference.py      # Baseline agent using OpenAI API
├── Dockerfile                 # For HF Spaces deployment
├── server/
│   ├── __init__.py
│   ├── app.py                 # FastAPI server (create_app entry point)
│   ├── hypothesis_lab_environment.py  # Core environment logic
│   ├── causal_world.py        # Hidden causal graph generator
│   └── rubric.py              # Multi-component reward engine
├── tasks/
│   ├── __init__.py
│   ├── task_easy.py           # Easy: 2 vars, low noise, 12 budget
│   ├── task_medium.py         # Medium: 3 vars, medium noise, 10 budget
│   └── task_hard.py           # Hard: 4 vars, high noise, 8 budget
└── tests/
    ├── __init__.py
    └── test_environment.py    # Unit + integration tests

Action Space

HypLabAction has two modes:

| Field | Type | Description |
|---|---|---|
| action_type | "experiment" or "submit" | What the agent is doing |
| experiment_type | "intervention", "correlation", "counterfactual", "passive" | Experiment kind (experiment mode) |
| control_variable | str | Variable to set/vary |
| control_value | float | Value to set (intervention/counterfactual) |
| control_range | [min, max, n] | Sweep range (correlation only) |
| target_variable | str | Variable to observe |
| hypothesis_text | str | Free-text hypothesis (submit mode) |
| hypothesis_equations | list[str] | Structured equations (submit mode) |
| confidence | float in [0, 1] | Self-reported confidence (submit mode) |
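
To make the two modes concrete, here is a rough sketch of how the fields above might be grouped and checked. `ActionSketch` is a hypothetical stand-in built on a plain dataclass, not the real Pydantic `HypLabAction`, and the validation rules (for example, that a correlation experiment needs `control_range`) are assumptions read off the descriptions in the table.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EXPERIMENT_TYPES = {"intervention", "correlation", "counterfactual", "passive"}


@dataclass
class ActionSketch:
    """Hypothetical mirror of the action fields; not the real HypLabAction."""
    action_type: str                              # "experiment" or "submit"
    experiment_type: Optional[str] = None         # experiment mode
    control_variable: Optional[str] = None
    control_value: Optional[float] = None         # intervention/counterfactual
    control_range: Optional[List[float]] = None   # [min, max, n], correlation
    target_variable: Optional[str] = None
    hypothesis_text: Optional[str] = None         # submit mode
    hypothesis_equations: List[str] = field(default_factory=list)
    confidence: Optional[float] = None            # submit mode, in [0, 1]

    def validate(self) -> None:
        """Assumed checks; the real model may enforce different rules."""
        if self.action_type == "experiment":
            if self.experiment_type not in EXPERIMENT_TYPES:
                raise ValueError(f"unknown experiment_type: {self.experiment_type}")
            if self.experiment_type == "correlation" and self.control_range is None:
                raise ValueError("correlation experiments need control_range")
        elif self.action_type == "submit":
            if self.confidence is None or not 0.0 <= self.confidence <= 1.0:
                raise ValueError("confidence must be in [0, 1]")
        else:
            raise ValueError(f"unknown action_type: {self.action_type}")


# Experiment mode: set Alpha to 5.0 and observe Beta.
experiment = ActionSketch(
    action_type="experiment",
    experiment_type="intervention",
    control_variable="Alpha",
    control_value=5.0,
    target_variable="Beta",
)
experiment.validate()

# Submit mode: free text plus a structured equation and a confidence.
submission = ActionSketch(
    action_type="submit",
    hypothesis_text="Beta = 2.1 * Alpha + 3.0",
    hypothesis_equations=["Beta = 2.1 * Alpha + 3.0"],
    confidence=0.85,
)
submission.validate()
```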

Observation Space

HypLabObservation always contains:

  • system_message: Human-readable text the LLM reads
  • available_variables: Variable names in this episode
  • budget_remaining: Steps left
  • done: Whether episode ended
  • reward: Step reward

On experiment steps: result_value, noise_sigma, info_gain_reward, is_redundant

On submit: accuracy_score, precision_bonus, calibration_score, efficiency_bonus, contradiction_penalty, total_episode_reward, ground_truth_revealed

Causal Rule Types

The hidden world can contain any of these relationship types:

| Rule | Formula | Shape |
|---|---|---|
| Linear | `y = a*x + b` | Straight line |
| Threshold | `y = high if x > t else low` | Step function |
| Inverse | `y = a / x` | Hyperbola |
| Quadratic | `y = a*x² + b*x + c` | Parabola |
| Exponential | `y = a * exp(k*x)` | Growth/decay |
| Logarithmic | `y = a * ln(x) + b` | Diminishing returns |
| Saturating | `y = Vmax * x / (Km + x)` | Plateau (Michaelis-Menten) |
| Piecewise-linear | Two slopes with a knot | Regime change |
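
The eight formulas can be written out as plain Python functions, which is handy for checking a hypothesis against observed values. This is a sketch only: every parameter name and default value below is an illustrative assumption, and the piecewise-linear form assumes the two segments meet continuously at the knot, which the table does not specify.

```python
import math


def linear(x, a=2.0, b=1.0):
    return a * x + b


def threshold(x, t=5.0, low=0.0, high=10.0):
    return high if x > t else low


def inverse(x, a=4.0):
    return a / x  # undefined at x = 0


def quadratic(x, a=1.0, b=0.0, c=0.0):
    return a * x**2 + b * x + c


def exponential(x, a=1.0, k=0.5):
    return a * math.exp(k * x)


def logarithmic(x, a=1.0, b=0.0):
    return a * math.log(x) + b  # requires x > 0


def saturating(x, vmax=10.0, km=2.0):
    return vmax * x / (km + x)  # approaches vmax as x grows


def piecewise_linear(x, knot=5.0, slope_lo=1.0, slope_hi=3.0, intercept=0.0):
    # Assumption: continuous at the knot, with a different slope on each side.
    if x <= knot:
        return intercept + slope_lo * x
    return intercept + slope_lo * knot + slope_hi * (x - knot)


RULES = {
    "linear": linear,
    "threshold": threshold,
    "inverse": inverse,
    "quadratic": quadratic,
    "exponential": exponential,
    "logarithmic": logarithmic,
    "saturating": saturating,
    "piecewise_linear": piecewise_linear,
}

# Evaluate every rule at the same probe points to compare their shapes.
for name, rule in RULES.items():
    print(f"{name:17s}", [round(rule(x), 3) for x in (1.0, 4.0, 8.0)])
```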

Additionally, some effects may depend on two parents via interaction rules (additive, multiplicative, min, max), and hidden confounders may inject correlated noise the agent cannot explain.
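
The four interaction rules can be illustrated with a small lookup table. How the environment actually combines parents is not described here, so this sketch assumes each parent first produces its own contribution (for instance via a single-parent rule) and the two contributions are then combined; `INTERACTIONS` and `combine` are hypothetical names.

```python
from typing import Callable, Dict

# Assumed combination step applied to two per-parent contributions.
INTERACTIONS: Dict[str, Callable[[float, float], float]] = {
    "additive": lambda c1, c2: c1 + c2,
    "multiplicative": lambda c1, c2: c1 * c2,
    "min": min,
    "max": max,
}


def combine(kind: str, contribution_1: float, contribution_2: float) -> float:
    """Combine two parent contributions under the named interaction rule."""
    return INTERACTIONS[kind](contribution_1, contribution_2)


# Sweep the first contribution while the second is held fixed at 3.0.
for kind in INTERACTIONS:
    sweep = [combine(kind, c1, 3.0) for c1 in (1.0, 2.0, 4.0, 8.0)]
    print(f"{kind:15s} {sweep}")
```

Under `min`, the response stops changing once the swept parent exceeds the fixed one; under `max`, it stays flat until that point. Sweeping one parent at two different fixed values of the other is therefore a cheap way to tell these rules apart from additive or multiplicative ones.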

Reward Components

| Signal | Value | What it trains |
|---|---|---|
| Information gain | +0.05 to +0.25/step | Designing informative experiments |
| Redundant experiment | -0.10 | Not wasting budget |
| Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer |
| Precision bonus | +0.10 | Quantitative, falsifiable claims |
| Calibration score | 0.0 to +0.20 | Knowing what you don't know |
| Efficiency bonus | +0.15 | Submitting early when confident |
| Contradiction penalty | -0.50 | Contradicting the experimental setup |
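
As a worked example of how these signals could add up over an episode, the sketch below assumes the total is a plain sum of the per-step rewards and the submit-time components. That is an assumption: the actual logic in `rubric.py` may weight, clip, or combine them differently. The field names follow the observation fields; `SubmitScores` and `episode_reward` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubmitScores:
    accuracy_score: float          # 0.0 to 1.0
    precision_bonus: float         # 0.0 or 0.10
    calibration_score: float       # 0.0 to 0.20
    efficiency_bonus: float        # 0.0 or 0.15
    contradiction_penalty: float   # 0.0 or -0.50


def episode_reward(step_rewards: List[float], submit: SubmitScores) -> float:
    """Assumed aggregation: simple sum of step and submit-time signals."""
    return (
        sum(step_rewards)
        + submit.accuracy_score
        + submit.precision_bonus
        + submit.calibration_score
        + submit.efficiency_bonus
        + submit.contradiction_penalty
    )


# Three informative experiments and one redundant one (-0.10), then a submit.
steps = [0.25, 0.15, -0.10, 0.05]
scores = SubmitScores(
    accuracy_score=0.8,
    precision_bonus=0.10,
    calibration_score=0.15,
    efficiency_bonus=0.15,
    contradiction_penalty=0.0,
)
total = episode_reward(steps, scores)
print(f"total episode reward: {total:.2f}")
```

Under this assumption, a single redundant experiment cancels most of the gain from an average informative one, and a contradiction penalty outweighs every bonus combined.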

Tasks (3 difficulty levels)

| Task | Noise | Variables | Budget | Domain | Key Challenge |
|---|---|---|---|---|---|
| Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery |
| Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals |
| Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget |

Each task has a deterministic grader that returns a score in [0.0, 1.0].
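
The task table can be captured as a small config structure, which also makes the tightening budget explicit. `TaskConfig`, `TASKS`, and `clamp_score` are hypothetical names for illustration; in particular, clamping is only one assumed way a grader could keep its score inside [0.0, 1.0].

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskConfig:
    name: str
    noise: float
    n_variables: int
    budget: int
    domain: Optional[str]  # None stands for a randomly chosen domain


TASKS = {
    "easy": TaskConfig("easy", noise=0.05, n_variables=2, budget=12, domain="system_alpha"),
    "medium": TaskConfig("medium", noise=0.20, n_variables=3, budget=10, domain=None),
    "hard": TaskConfig("hard", noise=0.50, n_variables=4, budget=8, domain=None),
}


def clamp_score(raw: float) -> float:
    """Assumed final step of a grader: keep the score inside [0.0, 1.0]."""
    return max(0.0, min(1.0, raw))


# Experiments available per variable shrink as difficulty rises.
for task in TASKS.values():
    per_variable = task.budget / task.n_variables
    print(f"{task.name:6s} noise={task.noise:.2f} budget/variable={per_variable:.2f}")
```

The hard task leaves only two experiments per variable at ten times the noise of the easy task, so an agent cannot afford to probe every pair of variables.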

Design Decisions

Abstract variable names: Variables are named Alpha, Beta, Gamma (or V1, V2, V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents from using pretrained knowledge of real-world physics/economics/biology to shortcut the reasoning process. The agent must reason purely from experimental data.

Diverse rule types: With 8 single-parent types plus interaction rules, the agent cannot memorize a small set of templates. Many rule types look similar in narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to design discriminating experiments.
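
The exponential-versus-linear case can be checked numerically. The sketch below compares an exponential rule with the straight line that matches it near x = 0 (its first-order Taylor expansion); the parameter values are illustrative assumptions, not values taken from the environment.

```python
import math

A, K = 1.0, 0.1  # illustrative parameters for y = A * exp(K * x)


def exponential(x: float) -> float:
    return A * math.exp(K * x)


def linear_lookalike(x: float) -> float:
    # First-order Taylor expansion of the exponential around x = 0.
    return A * (1.0 + K * x)


def relative_gap(x: float) -> float:
    """Relative difference between the two rules at x."""
    return abs(exponential(x) - linear_lookalike(x)) / exponential(x)


for x in (0.5, 1.0, 10.0, 30.0):
    print(f"x={x:5.1f}  gap={relative_gap(x):.1%}")
```

With these parameters the gap stays under 1% at x = 1 but exceeds 20% by x = 10, so an agent that only probes small control values cannot separate the two rules from noisy results; it has to spend part of its budget on widely spaced values.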

Deploy to HF Spaces

openenv push --org your-org --token $HF_TOKEN

Run Tests

pytest tests/ -v

Baseline Scores

Baseline agent (gpt-4o-mini, temperature=0.3):

| Task | Score |
|---|---|
| Easy | ~0.65 |
| Medium | ~0.40 |
| Hard | ~0.25 |
| Average | ~0.43 |

These scores are reproducible via python baseline_inference.py with the same model and seed.