title: Scientific Hypothesis Lab
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

Scientific Hypothesis Lab -- OpenEnv Environment

An RL environment where agents discover hidden causal rules through systematic experimentation. Built for the OpenEnv Hub.

What it does

Each episode, the agent is presented with a set of abstract variables (e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomized causal world. Variable names are deliberately opaque so agents cannot leverage pretrained real-world knowledge -- they must reason purely from experimental evidence.

The hidden rules span 8 single-parent function types (linear, threshold, inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear), multi-parent interaction rules (additive, multiplicative, min, max), and optional hidden confounders that inject unexplainable correlated noise.

The agent must:

Design experiments -- probe variable relationships using interventions, correlations, counterfactuals, or passive observations
Update beliefs from noisy experimental results
Submit a hypothesis -- a structured description of the discovered causal rules

The environment rewards informative experiments, precise hypotheses, calibrated confidence, and efficient budget use.

Quick Start

# Install dependencies
pip install -e .

# Run the server locally
uvicorn server.app:app --port 8000

# In another terminal, run the baseline agent
export OPENAI_API_KEY=sk-...
python baseline_inference.py

Using the Client

import asyncio

from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType

# Async usage (wrapped in a coroutine so it can run as a script)
async def main():
    async with HypothesisLabEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(noise_level="low", domain="system_alpha")
        obs = result.observation

        # Run an intervention
        result = await env.run_intervention(
            control_variable=obs.available_variables[0],
            control_value=5.0,
            target_variable=obs.available_variables[1],
        )
        print(result.observation.system_message)

        # Submit hypothesis
        result = await env.submit_hypothesis(
            hypothesis_text="Beta = 2.1 * Alpha + 3.0",
            confidence=0.85,
        )
        print(f"Score: {result.observation.total_episode_reward}")

asyncio.run(main())

# Sync usage
env = HypothesisLabEnv(base_url="http://localhost:8000").sync()
with env:
    result = env.reset(noise_level="low")
    ...

File Structure

hypothesis_lab/
├── openenv.yaml              # OpenEnv manifest
├── pyproject.toml             # Project metadata and dependencies
├── requirements.txt           # Pip fallback dependencies
├── README.md                  # This file
├── models.py                  # Pydantic Action / Observation / State models
├── client.py                  # Typed EnvClient for agents and trainers
├── __init__.py                # Module exports
├── baseline_inference.py      # Baseline agent using OpenAI API
├── Dockerfile                 # For HF Spaces deployment
├── server/
│   ├── __init__.py
│   ├── app.py                 # FastAPI server (create_app entry point)
│   ├── hypothesis_lab_environment.py  # Core environment logic
│   ├── causal_world.py        # Hidden causal graph generator
│   └── rubric.py              # Multi-component reward engine
├── tasks/
│   ├── __init__.py
│   ├── task_easy.py           # Easy: 2 vars, low noise, 12 budget
│   ├── task_medium.py         # Medium: 3 vars, medium noise, 10 budget
│   └── task_hard.py           # Hard: 4 vars, high noise, 8 budget
└── tests/
    ├── __init__.py
    └── test_environment.py    # Unit + integration tests

Action Space

HypLabAction has two modes:

| Field | Type | Description |
| --- | --- | --- |
| `action_type` | `"experiment"` or `"submit"` | What the agent is doing |
| `experiment_type` | `"intervention"`, `"correlation"`, `"counterfactual"`, `"passive"` | Experiment kind (experiment mode) |
| `control_variable` | `str` | Variable to set/vary |
| `control_value` | `float` | Value to set (intervention/counterfactual) |
| `control_range` | `[min, max, n]` | Sweep range (correlation only) |
| `target_variable` | `str` | Variable to observe |
| `hypothesis_text` | `str` | Free-text hypothesis (submit mode) |
| `hypothesis_equations` | `list[str]` | Structured equations (submit mode) |
| `confidence` | `float [0,1]` | Self-reported confidence (submit mode) |
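To make the two modes concrete, here is a minimal sketch that mirrors the fields above with a plain dataclass. `ActionSketch`, its validation rules, and the `payload` helper are illustrative assumptions; the real model lives in models.py and may differ. The exact format expected for `hypothesis_equations` is also assumed here to be one equation string per entry.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ActionSketch:
    """Hypothetical stand-in that mirrors the HypLabAction fields in the table."""
    action_type: str                                   # "experiment" or "submit"
    experiment_type: Optional[str] = None
    control_variable: Optional[str] = None
    control_value: Optional[float] = None
    control_range: Optional[List[float]] = None        # [min, max, n]
    target_variable: Optional[str] = None
    hypothesis_text: Optional[str] = None
    hypothesis_equations: Optional[List[str]] = None
    confidence: Optional[float] = None

    def __post_init__(self):
        if self.action_type not in ("experiment", "submit"):
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")

    def payload(self):
        """Drop unset fields so each mode only carries what it uses."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# Experiment mode: set Alpha to 5.0 and observe Beta.
intervention = ActionSketch(
    action_type="experiment",
    experiment_type="intervention",
    control_variable="Alpha",
    control_value=5.0,
    target_variable="Beta",
)

# Experiment mode: sweep Alpha over [0, 10] in 5 steps.
sweep = ActionSketch(
    action_type="experiment",
    experiment_type="correlation",
    control_variable="Alpha",
    control_range=[0.0, 10.0, 5],
    target_variable="Beta",
)

# Submit mode: free text, a structured equation, and a confidence.
submit = ActionSketch(
    action_type="submit",
    hypothesis_text="Beta = 2.1 * Alpha + 3.0",
    hypothesis_equations=["Beta = 2.1 * Alpha + 3.0"],
    confidence=0.85,
)

print(intervention.payload())
print(submit.payload())
```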

Observation Space

HypLabObservation always contains:

system_message: Human-readable text the LLM reads
available_variables: Variable names in this episode
budget_remaining: Steps left
done: Whether episode ended
reward: Step reward

On experiment steps: result_value, noise_sigma, info_gain_reward, is_redundant

On submit: accuracy_score, precision_bonus, calibration_score, efficiency_bonus, contradiction_penalty, total_episode_reward, ground_truth_revealed

Causal Rule Types

The hidden world can contain any of these relationship types:

| Rule | Formula | Shape |
| --- | --- | --- |
| Linear | `y = a*x + b` | Straight line |
| Threshold | `y = high if x > t else low` | Step function |
| Inverse | `y = a / x` | Hyperbola |
| Quadratic | `y = a*x² + b*x + c` | Parabola |
| Exponential | `y = a * exp(k*x)` | Growth/decay |
| Logarithmic | `y = a * ln(x) + b` | Diminishing returns |
| Saturating | `y = Vmax * x / (Km + x)` | Plateau (Michaelis-Menten) |
| Piecewise-linear | Two slopes with a knot | Regime change |

Additionally, some effects may depend on two parents via interaction rules (additive, multiplicative, min, max), and hidden confounders may inject correlated noise the agent cannot explain.
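The rule types above can be written out as plain functions. This is a minimal sketch, not the code in causal_world.py: all parameter defaults are arbitrary, the piecewise-linear form is assumed to be continuous at the knot, and the way interaction rules combine two parents (here, by combining two single-parent contributions) is an assumption.

```python
import math


def linear(x, a=2.0, b=1.0):
    return a * x + b


def threshold(x, t=5.0, low=0.0, high=10.0):
    return high if x > t else low


def inverse(x, a=4.0):
    return a / x  # undefined at x = 0


def quadratic(x, a=1.0, b=0.0, c=0.0):
    return a * x**2 + b * x + c


def exponential(x, a=1.0, k=0.5):
    return a * math.exp(k * x)


def logarithmic(x, a=1.0, b=0.0):
    return a * math.log(x) + b  # requires x > 0


def saturating(x, vmax=10.0, km=2.0):
    return vmax * x / (km + x)


def piecewise_linear(x, knot=5.0, slope_low=1.0, slope_high=3.0):
    # Assumed continuous at the knot; the table only says "two slopes with a knot".
    if x <= knot:
        return slope_low * x
    return slope_low * knot + slope_high * (x - knot)


# Two-parent interaction rules named in the text.
INTERACTIONS = {
    "additive": lambda p, q: p + q,
    "multiplicative": lambda p, q: p * q,
    "min": min,
    "max": max,
}

# Example: a multiplicative effect built from two single-parent contributions.
effect = INTERACTIONS["multiplicative"](linear(2.0), saturating(2.0))
print(effect)
```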

Reward Components

| Signal | Value | What it trains |
| --- | --- | --- |
| Information gain | +0.05 to +0.25/step | Designing informative experiments |
| Redundant experiment | -0.10 | Not wasting budget |
| Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer |
| Precision bonus | +0.10 | Quantitative, falsifiable claims |
| Calibration score | 0.0 to +0.20 | Knowing what you don't know |
| Efficiency bonus | +0.15 | Submitting early when confident |
| Contradiction penalty | -0.50 | Not contradicting the experimental setup |

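The table does not say how the components are combined. One plausible reading, sketched below purely as an assumption, is a plain sum in which each term stays inside its listed range. `total_reward_sketch` and its arguments are hypothetical names; the actual logic lives in rubric.py and may weight or clip things differently.

```python
def clamp(value, low, high):
    """Keep value inside [low, high]."""
    return max(low, min(high, value))


def total_reward_sketch(info_gains, redundant_steps, accuracy, precise,
                        calibration, early, contradiction):
    """Hypothetical aggregation: add up the components from the table."""
    # Per-step signals: information gain (range from the table) and redundancy.
    step_total = sum(clamp(g, 0.05, 0.25) for g in info_gains)
    step_total -= 0.10 * redundant_steps

    # Submit-time signals.
    final = clamp(accuracy, 0.0, 1.0)
    final += 0.10 if precise else 0.0          # precision bonus
    final += clamp(calibration, 0.0, 0.20)     # calibration score
    final += 0.15 if early else 0.0            # efficiency bonus
    final -= 0.50 if contradiction else 0.0    # contradiction penalty
    return step_total + final


good = total_reward_sketch(
    info_gains=[0.2, 0.1], redundant_steps=1, accuracy=0.8,
    precise=True, calibration=0.15, early=True, contradiction=False,
)
bad = total_reward_sketch(
    info_gains=[], redundant_steps=0, accuracy=0.2,
    precise=False, calibration=0.0, early=False, contradiction=True,
)
print(round(good, 2), round(bad, 2))
```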
Tasks (3 difficulty levels)

| Task | Noise | Variables | Budget | Domain | Key Challenge |
| --- | --- | --- | --- | --- | --- |
| Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery |
| Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals |
| Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget |

Each task has a deterministic grader that returns a score in [0.0, 1.0].
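The task settings can be captured as a small config table. The sketch below only restates the numbers above; `TaskSketch` and `TASKS` are hypothetical names and do not reflect how the files under tasks/ are actually written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSketch:
    """Hypothetical container for one row of the task table."""
    name: str
    noise: float
    n_variables: int
    budget: int
    domain: Optional[str]  # None stands for a randomly chosen domain


TASKS = {
    "easy": TaskSketch("easy", 0.05, 2, 12, "system_alpha"),
    "medium": TaskSketch("medium", 0.20, 3, 10, None),
    "hard": TaskSketch("hard", 0.50, 4, 8, None),
}

# Experiments available per variable shrink as difficulty rises.
for task in TASKS.values():
    print(f"{task.name}: {task.budget / task.n_variables:.2f} steps per variable")
```

Dividing budget by variable count shows why the hard task is tight: noise is ten times higher than on the easy task while the per-variable budget drops to a third.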

Design Decisions

Abstract variable names: Variables are named Alpha, Beta, Gamma (or V1, V2, V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents from using pretrained knowledge of real-world physics/economics/biology to shortcut the reasoning process. The agent must reason purely from experimental data.

Diverse rule types: With 8 single-parent types plus interaction rules, the agent cannot memorize a small set of templates. Many rule types look similar in narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to design discriminating experiments.
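The "exponential ≈ linear for small x" point can be checked numerically. The sketch below compares exp(k*x) with its first-order approximation 1 + k*x; the rate k = 0.1 and the probe values are arbitrary choices for illustration.

```python
import math


def relative_gap(x, k=0.1):
    """Relative difference between exp(k*x) and the line 1 + k*x."""
    exact = math.exp(k * x)
    return abs(exact - (1.0 + k * x)) / exact


# Probe close to zero and far from it.
for x in (0.5, 1.0, 5.0, 20.0):
    print(f"x={x:>5}: relative gap = {relative_gap(x):.4f}")
```

Near zero the two curves differ by well under one percent, which is easily hidden by noise, so an agent that only probes a narrow range cannot separate them. Probing far from zero makes the gap large enough to discriminate.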

Deploy to HF Spaces

openenv push --org your-org --token $HF_TOKEN

Run Tests

pytest tests/ -v

Baseline Scores

Baseline agent (gpt-4o-mini, temperature=0.3):

| Task | Score |
| --- | --- |
| Easy | ~0.65 |
| Medium | ~0.40 |
| Hard | ~0.25 |
| Average | ~0.43 |

These scores are reproducible via python baseline_inference.py with the same model and seed.