---
title: Scientific Hypothesis Lab
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Scientific Hypothesis Lab -- OpenEnv Environment

An RL environment where agents discover hidden causal rules through systematic
experimentation. Built for the [OpenEnv Hub](https://huggingface.co/openenv).

## What it does

Each episode, the agent is presented with a set of **abstract** variables
(e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomised causal world.
Variable names are deliberately opaque so agents cannot leverage pretrained
real-world knowledge -- they must reason purely from experimental evidence.

The hidden rules span **8 single-parent function types** (linear, threshold,
inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear),
**multi-parent interaction rules** (additive, multiplicative, min, max), and
optional **hidden confounders** that inject unexplainable correlated noise.

The agent must:

1. **Design experiments** -- probe variable relationships using interventions,
   correlations, counterfactuals, or passive observations
2. **Update beliefs** from noisy experimental results
3. **Submit a hypothesis** -- a structured description of the discovered causal rules

The environment rewards informative experiments, precise hypotheses, calibrated
confidence, and efficient budget use.
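
As an illustration of step 2, here is a minimal sketch of how an agent might turn a handful of noisy intervention results into a quantitative hypothesis. It assumes the hidden rule happens to be linear; `hidden_rule` is a hypothetical stand-in for the environment (its coefficients and noise level are made up), and `fit_line` is ordinary least squares, not something the environment provides.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def hidden_rule(x, rng, sigma=0.05):
    """Hypothetical stand-in for one intervention: a linear rule plus Gaussian noise."""
    return 2.1 * x + 3.0 + rng.gauss(0.0, sigma)


rng = random.Random(0)
xs = [0.0, 2.5, 5.0, 7.5, 10.0]          # control values spread over the range
ys = [hidden_rule(x, rng) for x in xs]   # noisy observations of the target
a, b = fit_line(xs, ys)
print(f"Beta = {a:.2f} * Alpha + {b:.2f}")
```

Spreading the control values over a wide range keeps the slope estimate stable under noise; clustering them in a narrow band would use the same budget for a much less certain fit.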

## Quick Start

```bash
# Install dependencies
pip install -e .

# Run the server locally
uvicorn server.app:app --port 8000

# In another terminal, run the baseline agent
export OPENAI_API_KEY=sk-...
python baseline_inference.py
```

### Using the Client

```python
import asyncio

from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType


async def main() -> None:
    # Async usage
    async with HypothesisLabEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(noise_level="low", domain="system_alpha")
        obs = result.observation

        # Run an intervention: set the first variable and observe the second
        result = await env.run_intervention(
            control_variable=obs.available_variables[0],
            control_value=5.0,
            target_variable=obs.available_variables[1],
        )
        print(result.observation.system_message)

        # Submit hypothesis
        result = await env.submit_hypothesis(
            hypothesis_text="Beta = 2.1 * Alpha + 3.0",
            confidence=0.85,
        )
        print(f"Score: {result.observation.total_episode_reward}")


asyncio.run(main())

# Sync usage
env = HypothesisLabEnv(base_url="http://localhost:8000").sync()
with env:
    result = env.reset(noise_level="low")
    ...
```

## File Structure

```
hypothesis_lab/
├── openenv.yaml              # OpenEnv manifest
├── pyproject.toml             # Project metadata and dependencies
├── requirements.txt           # Pip fallback dependencies
├── README.md                  # This file
├── models.py                  # Pydantic Action / Observation / State models
├── client.py                  # Typed EnvClient for agents and trainers
├── __init__.py                # Module exports
├── baseline_inference.py      # Baseline agent using OpenAI API
├── Dockerfile                 # For HF Spaces deployment
├── server/
│   ├── __init__.py
│   ├── app.py                 # FastAPI server (create_app entry point)
│   ├── hypothesis_lab_environment.py  # Core environment logic
│   ├── causal_world.py        # Hidden causal graph generator
│   └── rubric.py              # Multi-component reward engine
├── tasks/
│   ├── __init__.py
│   ├── task_easy.py           # Easy: 2 vars, low noise, 12 budget
│   ├── task_medium.py         # Medium: 3 vars, medium noise, 10 budget
│   └── task_hard.py           # Hard: 4 vars, high noise, 8 budget
└── tests/
    ├── __init__.py
    └── test_environment.py    # Unit + integration tests
```

## Action Space

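**HypLabAction** has two modes, selected by `action_type`. Experiment mode uses the experiment and variable fields; submit mode uses the hypothesis and confidence fields: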

| Field | Type | Description |
|---|---|---|
| `action_type` | `"experiment"` or `"submit"` | What the agent is doing |
| `experiment_type` | `"intervention"`, `"correlation"`, `"counterfactual"`, `"passive"` | Experiment kind (experiment mode) |
| `control_variable` | `str` | Variable to set/vary |
| `control_value` | `float` | Value to set (intervention/counterfactual) |
| `control_range` | `[min, max, n]` | Sweep range (correlation only) |
| `target_variable` | `str` | Variable to observe |
| `hypothesis_text` | `str` | Free-text hypothesis (submit mode) |
| `hypothesis_equations` | `list[str]` | Structured equations (submit mode) |
| `confidence` | `float [0,1]` | Self-reported confidence (submit mode) |
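
The two modes can be illustrated with plain dictionaries that mirror the fields in the table. This is a minimal sketch only: the real `HypLabAction` is a Pydantic model in `models.py` whose validation rules are not shown here, and the helper names `make_experiment` and `make_submit` are hypothetical.

```python
from typing import Any, Optional

EXPERIMENT_TYPES = {"intervention", "correlation", "counterfactual", "passive"}


def make_experiment(
    experiment_type: str,
    control_variable: str,
    target_variable: str,
    control_value: Optional[float] = None,
    control_range: Optional[list] = None,
) -> dict[str, Any]:
    """Build an experiment-mode payload (hypothetical helper)."""
    if experiment_type not in EXPERIMENT_TYPES:
        raise ValueError(f"unknown experiment_type: {experiment_type!r}")
    payload: dict[str, Any] = {
        "action_type": "experiment",
        "experiment_type": experiment_type,
        "control_variable": control_variable,
        "target_variable": target_variable,
    }
    # control_value is used by interventions and counterfactuals,
    # control_range ([min, max, n]) by correlation sweeps.
    if control_value is not None:
        payload["control_value"] = control_value
    if control_range is not None:
        payload["control_range"] = control_range
    return payload


def make_submit(text: str, equations: list, confidence: float) -> dict[str, Any]:
    """Build a submit-mode payload (hypothetical helper)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return {
        "action_type": "submit",
        "hypothesis_text": text,
        "hypothesis_equations": equations,
        "confidence": confidence,
    }


intervention = make_experiment("intervention", "Alpha", "Beta", control_value=5.0)
sweep = make_experiment("correlation", "Alpha", "Beta", control_range=[0.0, 10.0, 5])
submission = make_submit(
    "Beta = 2.1 * Alpha + 3.0", ["Beta = 2.1 * Alpha + 3.0"], 0.85
)
```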

## Observation Space

**HypLabObservation** always contains:
- `system_message`: Human-readable text the LLM reads
- `available_variables`: Variable names in this episode
- `budget_remaining`: Steps left
- `done`: Whether episode ended
- `reward`: Step reward

Experiment steps also include: `result_value`, `noise_sigma`, `info_gain_reward`, `is_redundant`

Submit steps also include: `accuracy_score`, `precision_bonus`, `calibration_score`, `efficiency_bonus`, `contradiction_penalty`, `total_episode_reward`, `ground_truth_revealed`

## Causal Rule Types

The hidden world can contain any of these relationship types:

| Rule | Formula | Shape |
|---|---|---|
| Linear | `y = a*x + b` | Straight line |
| Threshold | `y = high if x > t else low` | Step function |
| Inverse | `y = a / x` | Hyperbola |
| Quadratic | `y = a*x² + b*x + c` | Parabola |
| Exponential | `y = a * exp(k*x)` | Growth/decay |
| Logarithmic | `y = a * ln(x) + b` | Diminishing returns |
| Saturating | `y = Vmax * x / (Km + x)` | Plateau (Michaelis-Menten) |
| Piecewise-linear | Two slopes with a knot | Regime change |

Additionally, some effects may depend on **two parents** via interaction rules
(additive, multiplicative, min, max), and **hidden confounders** may inject
correlated noise the agent cannot explain.
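
For reference, the formulas in the table can be written as plain Python functions. This is a sketch, not the generator in `server/causal_world.py`: all default parameter values are made up, the piecewise-linear form (continuous at the knot) is one assumed reading of "two slopes with a knot", and the way interaction rules combine two parent contributions is likewise an assumption.

```python
import math


def linear(x, a=2.0, b=1.0):
    return a * x + b


def threshold(x, t=5.0, low=0.0, high=10.0):
    return high if x > t else low


def inverse(x, a=4.0):
    return a / x  # undefined at x = 0


def quadratic(x, a=1.0, b=0.0, c=0.0):
    return a * x**2 + b * x + c


def exponential(x, a=1.0, k=0.5):
    return a * math.exp(k * x)


def logarithmic(x, a=1.0, b=0.0):
    return a * math.log(x) + b  # requires x > 0


def saturating(x, vmax=10.0, km=2.0):
    return vmax * x / (km + x)


def piecewise_linear(x, knot=5.0, slope_lo=1.0, slope_hi=3.0, intercept=0.0):
    # Assumed form: one slope below the knot, another above, continuous at the knot.
    if x <= knot:
        return intercept + slope_lo * x
    return intercept + slope_lo * knot + slope_hi * (x - knot)


# Assumed reading of the two-parent interaction rules: each parent contributes
# a value u or v, and the rule combines the two contributions.
INTERACTIONS = {
    "additive": lambda u, v: u + v,
    "multiplicative": lambda u, v: u * v,
    "min": min,
    "max": max,
}

effect = INTERACTIONS["multiplicative"](linear(1.0), saturating(2.0))
```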

## Reward Components

| Signal | Value | What it trains |
|---|---|---|
| Information gain | +0.05 to +0.25/step | Designing informative experiments |
| Redundant experiment | -0.10 | Not wasting budget |
| Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer |
| Precision bonus | +0.10 | Quantitative, falsifiable claims |
| Calibration score | 0.0 to +0.20 | Knowing what you don't know |
| Efficiency bonus | +0.15 | Submitting early when confident |
| Contradiction penalty | -0.50 | Not contradicting the experimental setup |
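
To make the magnitudes in the table concrete, the sketch below tallies one hypothetical episode. It assumes that the components simply add up, which this README does not state; the actual combination lives in `server/rubric.py`. The class and function names are hypothetical, and how the tally relates to the [0.0, 1.0] grader score is not modelled.

```python
from dataclasses import dataclass


@dataclass
class SubmitScores:
    """Submit-time components, using the field names from the observation."""

    accuracy_score: float         # 0.0 to 1.0
    precision_bonus: float        # up to +0.10
    calibration_score: float      # 0.0 to 0.20
    efficiency_bonus: float       # up to +0.15
    contradiction_penalty: float  # 0.0, or -0.50 when triggered


def tally_episode_reward(step_rewards, submit):
    """ASSUMPTION: per-step rewards and submit components are summed as-is."""
    submit_total = (
        submit.accuracy_score
        + submit.precision_bonus
        + submit.calibration_score
        + submit.efficiency_bonus
        + submit.contradiction_penalty
    )
    return sum(step_rewards) + submit_total


# Two informative experiments and one redundant one, then a good submission.
steps = [0.25, 0.05, -0.10]
scores = SubmitScores(
    accuracy_score=0.80,
    precision_bonus=0.10,
    calibration_score=0.15,
    efficiency_bonus=0.15,
    contradiction_penalty=0.0,
)
total = tally_episode_reward(steps, scores)
print(f"episode tally: {total:.2f}")
```

Under this additive assumption, a single contradiction penalty outweighs the precision, calibration, and efficiency terms combined.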

## Tasks (3 difficulty levels)

| Task | Noise | Variables | Budget | Domain | Key Challenge |
|---|---|---|---|---|---|
| Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery |
| Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals |
| Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget |

Each task has a deterministic grader that returns a score in [0.0, 1.0].
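
The table can also be read as a small configuration object. The sketch below restates it as data; `TaskSpec` and `TASKS` are hypothetical names and do not reflect how `tasks/task_*.py` are actually written. The experiments-per-variable figure is simple arithmetic on the table, included to show how quickly the budget tightens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSpec:
    name: str
    noise: float
    n_variables: int
    budget: int
    domain: Optional[str]  # None stands for a randomly chosen domain

    @property
    def experiments_per_variable(self) -> float:
        return self.budget / self.n_variables


TASKS = {
    "easy": TaskSpec("easy", 0.05, 2, 12, "system_alpha"),
    "medium": TaskSpec("medium", 0.20, 3, 10, None),
    "hard": TaskSpec("hard", 0.50, 4, 8, None),
}

for spec in TASKS.values():
    print(f"{spec.name}: {spec.experiments_per_variable:.2f} experiments per variable")
```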

## Design Decisions

**Abstract variable names:** Variables are named Alpha, Beta, Gamma (or V1, V2,
V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents
from using pretrained knowledge of real-world physics/economics/biology to
shortcut the reasoning process. The agent must reason purely from experimental
data.

**Diverse rule types:** With 8 single-parent types plus interaction rules, the
agent cannot memorize a small set of templates. Many rule types look similar in
narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to
design discriminating experiments.
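
The point about narrow ranges can be checked numerically. The sketch below compares an exponential rule with its first-order (linear) approximation over a narrow and a wide range of the control variable; the coefficients and ranges are arbitrary illustrative choices, not values used by the environment.

```python
import math


def exponential(x, a=1.0, k=0.1):
    return a * math.exp(k * x)


def linear_approx(x, a=1.0, k=0.1):
    # First-order Taylor expansion of a * exp(k * x) around x = 0.
    return a * (1.0 + k * x)


def max_relative_gap(xs):
    """Largest relative difference between the two curves over the points xs."""
    return max(
        abs(exponential(x) - linear_approx(x)) / exponential(x) for x in xs
    )


narrow = [i * 0.1 for i in range(11)]  # x in [0, 1]
wide = [float(i) for i in range(31)]   # x in [0, 30]

print(f"narrow range gap: {max_relative_gap(narrow):.4f}")
print(f"wide range gap:   {max_relative_gap(wide):.4f}")
```

Over the narrow range the two curves differ by well under one percent, which can easily be hidden by measurement noise; over the wide range they diverge sharply. An agent that only probes a narrow band cannot tell the two rule types apart.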

## Deploy to HF Spaces

```bash
openenv push --org your-org --token $HF_TOKEN
```

## Run Tests

```bash
pytest tests/ -v
```

## Baseline Scores

Baseline agent (gpt-4o-mini, temperature=0.3):

| Task | Score |
|---|---|
| Easy | ~0.65 |
| Medium | ~0.40 |
| Hard | ~0.25 |
| Average | ~0.43 |

To reproduce these scores, run `python baseline_inference.py` with the same model and seed. Because the agent samples at temperature 0.3, scores may vary slightly between runs.