---
title: Scientific Hypothesis Lab
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Scientific Hypothesis Lab -- OpenEnv Environment

An RL environment where agents discover hidden causal rules through systematic
experimentation. Built for the [OpenEnv Hub](https://huggingface.co/openenv).

## What it does

Each episode, the agent is presented with a set of **abstract** variables
(e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomised causal world.
Variable names are deliberately opaque so agents cannot leverage pretrained
real-world knowledge -- they must reason purely from experimental evidence.

The hidden rules span **8 single-parent function types** (linear, threshold,
inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear),
**multi-parent interaction rules** (additive, multiplicative, min, max), and
optional **hidden confounders** that inject unexplainable correlated noise.

The agent must:

1. **Design experiments** -- probe variable relationships using interventions,
   correlations, counterfactuals, or passive observations
2. **Update beliefs** from noisy experimental results
3. **Submit a hypothesis** -- a structured description of the discovered causal rules

The environment rewards informative experiments, precise hypotheses, calibrated
confidence, and efficient budget use.
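
As an illustration of the experiment-then-hypothesize loop, the sketch below simulates noisy interventions on a hypothetical linear rule and recovers its coefficients by least squares. It does not call the environment: `simulate_interventions`, `fit_line`, and the true coefficients are assumptions chosen for the example, and a real agent would get its measurements from experiment steps instead.

```python
import random


def simulate_interventions(a, b, sigma, xs, seed=0):
    """Stand-in for real experiments: y = a*x + b plus Gaussian noise (assumed)."""
    rng = random.Random(seed)
    return [a * x + b + rng.gauss(0.0, sigma) for x in xs]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Ten interventions on the control variable, spread across a wide range.
xs = [float(x) for x in range(1, 11)]
ys = simulate_interventions(a=2.0, b=3.0, sigma=0.05, xs=xs)

a_hat, b_hat = fit_line(xs, ys)
# Format the estimate the way a hypothesis string might look.
print(f"Beta = {a_hat:.2f} * Alpha + {b_hat:.2f}")
```

Spreading the probes over a wide range keeps the slope estimate stable under noise; clustering them would spend the same budget for a weaker fit.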

## Quick Start

```bash
# Install dependencies
pip install -e .

# Run the server locally
uvicorn server.app:app --port 8000

# In another terminal, run the baseline agent
export OPENAI_API_KEY=sk-...
python baseline_inference.py
```

### Using the Client

```python
import asyncio

from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType


async def main():
    # Async usage
    async with HypothesisLabEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(noise_level="low", domain="system_alpha")
        obs = result.observation

        # Run an intervention: set the first variable, observe the second
        result = await env.run_intervention(
            control_variable=obs.available_variables[0],
            control_value=5.0,
            target_variable=obs.available_variables[1],
        )
        print(result.observation.system_message)

        # Submit hypothesis
        result = await env.submit_hypothesis(
            hypothesis_text="Beta = 2.1 * Alpha + 3.0",
            confidence=0.85,
        )
        print(f"Score: {result.observation.total_episode_reward}")


asyncio.run(main())

# Sync usage
env = HypothesisLabEnv(base_url="http://localhost:8000").sync()
with env:
    result = env.reset(noise_level="low")
    ...
```

## File Structure

```
hypothesis_lab/
├── openenv.yaml               # OpenEnv manifest
├── pyproject.toml             # Project metadata and dependencies
├── requirements.txt           # Pip fallback dependencies
├── README.md                  # This file
├── models.py                  # Pydantic Action / Observation / State models
├── client.py                  # Typed EnvClient for agents and trainers
├── __init__.py                # Module exports
├── baseline_inference.py      # Baseline agent using OpenAI API
├── Dockerfile                 # For HF Spaces deployment
├── server/
│   ├── __init__.py
│   ├── app.py                 # FastAPI server (create_app entry point)
│   ├── hypothesis_lab_environment.py  # Core environment logic
│   ├── causal_world.py        # Hidden causal graph generator
│   └── rubric.py              # Multi-component reward engine
├── tasks/
│   ├── __init__.py
│   ├── task_easy.py           # Easy: 2 vars, low noise, 12 budget
│   ├── task_medium.py         # Medium: 3 vars, medium noise, 10 budget
│   └── task_hard.py           # Hard: 4 vars, high noise, 8 budget
└── tests/
    ├── __init__.py
    └── test_environment.py    # Unit + integration tests
```

## Action Space

**HypLabAction** has two modes, selected by `action_type`. Its fields:

| Field | Type | Description |
|---|---|---|
| `action_type` | `"experiment"` or `"submit"` | What the agent is doing |
| `experiment_type` | `"intervention"`, `"correlation"`, `"counterfactual"`, `"passive"` | Experiment kind (experiment mode) |
| `control_variable` | `str` | Variable to set/vary |
| `control_value` | `float` | Value to set (intervention/counterfactual) |
| `control_range` | `[min, max, n]` | Sweep range (correlation only) |
| `target_variable` | `str` | Variable to observe |
| `hypothesis_text` | `str` | Free-text hypothesis (submit mode) |
| `hypothesis_equations` | `list[str]` | Structured equations (submit mode) |
| `confidence` | `float [0,1]` | Self-reported confidence (submit mode) |
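
To make the two modes concrete, here is a minimal sketch that mirrors the table with a plain dataclass. `ActionSketch` is a hypothetical stand-in, not the Pydantic model in `models.py`; which fields are optional and the format of `hypothesis_equations` entries are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EXPERIMENT_TYPES = {"intervention", "correlation", "counterfactual", "passive"}


@dataclass
class ActionSketch:
    """Hypothetical mirror of the action fields; not the real HypLabAction."""

    action_type: str
    experiment_type: Optional[str] = None
    control_variable: Optional[str] = None
    control_value: Optional[float] = None
    control_range: Optional[List[float]] = None  # [min, max, n]
    target_variable: Optional[str] = None
    hypothesis_text: Optional[str] = None
    hypothesis_equations: List[str] = field(default_factory=list)
    confidence: Optional[float] = None

    def validate(self) -> None:
        """Check only the constraints the table states explicitly."""
        if self.action_type == "experiment":
            if self.experiment_type not in EXPERIMENT_TYPES:
                raise ValueError(f"unknown experiment_type: {self.experiment_type!r}")
        elif self.action_type == "submit":
            if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
                raise ValueError("confidence must lie in [0, 1]")
        else:
            raise ValueError(f"unknown action_type: {self.action_type!r}")


# Experiment mode: set Alpha to 5.0 and observe Beta.
intervention = ActionSketch(
    action_type="experiment",
    experiment_type="intervention",
    control_variable="Alpha",
    control_value=5.0,
    target_variable="Beta",
)

# Submit mode: free text plus an equation string (string format assumed).
submit = ActionSketch(
    action_type="submit",
    hypothesis_text="Beta = 2.1 * Alpha + 3.0",
    hypothesis_equations=["Beta = 2.1 * Alpha + 3.0"],
    confidence=0.85,
)

intervention.validate()
submit.validate()
```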

## Observation Space

**HypLabObservation** always contains:
- `system_message`: Human-readable text the LLM reads
- `available_variables`: Variable names in this episode
- `budget_remaining`: Steps left
- `done`: Whether episode ended
- `reward`: Step reward

On experiment steps: `result_value`, `noise_sigma`, `info_gain_reward`, `is_redundant`

On submit: `accuracy_score`, `precision_bonus`, `calibration_score`, `efficiency_bonus`, `contradiction_penalty`, `total_episode_reward`, `ground_truth_revealed`

## Causal Rule Types

The hidden world can contain any of these relationship types:

| Rule | Formula | Shape |
|---|---|---|
| Linear | `y = a*x + b` | Straight line |
| Threshold | `y = high if x > t else low` | Step function |
| Inverse | `y = a / x` | Hyperbola |
| Quadratic | `y = a*x² + b*x + c` | Parabola |
| Exponential | `y = a * exp(k*x)` | Growth/decay |
| Logarithmic | `y = a * ln(x) + b` | Diminishing returns |
| Saturating | `y = Vmax * x / (Km + x)` | Plateau (Michaelis-Menten) |
| Piecewise-linear | Two slopes with a knot | Regime change |

Additionally, some effects may depend on **two parents** via interaction rules
(additive, multiplicative, min, max), and **hidden confounders** may inject
correlated noise the agent cannot explain.
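
The formulas in the table translate directly into code. The sketch below is illustrative only: the function and parameter names are assumptions, the piecewise-linear form (continuous at the knot) is one possible parameterization, and how `causal_world.py` combines two parents is not specified here, so `combine` simply applies the named operation to two parent contributions.

```python
import math


def linear(x, a, b):
    return a * x + b


def threshold(x, t, low, high):
    return high if x > t else low


def inverse(x, a):
    return a / x  # undefined at x == 0


def quadratic(x, a, b, c):
    return a * x**2 + b * x + c


def exponential(x, a, k):
    return a * math.exp(k * x)


def logarithmic(x, a, b):
    return a * math.log(x) + b  # requires x > 0


def saturating(x, vmax, km):
    return vmax * x / (km + x)


def piecewise_linear(x, knot, slope_below, slope_above, intercept=0.0):
    """Two slopes joined continuously at the knot (assumed parameterization)."""
    if x <= knot:
        return intercept + slope_below * x
    return intercept + slope_below * knot + slope_above * (x - knot)


# Two-parent interaction rules applied to the parents' contributions (assumed).
INTERACTIONS = {
    "additive": lambda u, v: u + v,
    "multiplicative": lambda u, v: u * v,
    "min": min,
    "max": max,
}


def combine(kind, u, v):
    return INTERACTIONS[kind](u, v)


# Example: a child driven by a linear and a saturating parent, multiplied.
y = combine("multiplicative", linear(2.0, 1.0, 0.0), saturating(1.0, 10.0, 1.0))
```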

## Reward Components

| Signal | Value | What it trains |
|---|---|---|
| Information gain | +0.05 to +0.25/step | Designing informative experiments |
| Redundant experiment | -0.10 | Not wasting budget |
| Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer |
| Precision bonus | +0.10 | Quantitative, falsifiable claims |
| Calibration score | 0.0 to +0.20 | Knowing what you don't know |
| Efficiency bonus | +0.15 | Submitting early when confident |
| Contradiction penalty | -0.50 | Contradicting the experimental setup |
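
As a worked example of how these signals might add up, the sketch below checks each component against the ranges in the table and sums them. The plain sum is an assumption; `rubric.py` is the authoritative definition of `total_episode_reward` and may weight or clip the components differently. `episode_total` and the sample values are hypothetical.

```python
def in_range(value, low, high):
    return low <= value <= high


def episode_total(step_rewards, accuracy, precision_bonus, calibration,
                  efficiency_bonus, contradiction_penalty):
    """Sum the reward signals (ASSUMPTION: components are simply added)."""
    # Per-step signals: information gain up to +0.25, redundancy down to -0.10.
    assert all(in_range(r, -0.10, 0.25) for r in step_rewards)
    # Submission signals, using the ranges from the table.
    assert in_range(accuracy, 0.0, 1.0)
    assert in_range(precision_bonus, 0.0, 0.10)
    assert in_range(calibration, 0.0, 0.20)
    assert in_range(efficiency_bonus, 0.0, 0.15)
    assert in_range(contradiction_penalty, -0.50, 0.0)
    return (sum(step_rewards) + accuracy + precision_bonus + calibration
            + efficiency_bonus + contradiction_penalty)


# Two informative experiments, one redundant one, then a good submission.
total = episode_total(
    step_rewards=[0.25, 0.05, -0.10],
    accuracy=0.8,
    precision_bonus=0.10,
    calibration=0.15,
    efficiency_bonus=0.15,
    contradiction_penalty=0.0,
)
print(f"Illustrative episode total: {total:.2f}")
```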

## Tasks (3 difficulty levels)

| Task | Noise | Variables | Budget | Domain | Key Challenge |
|---|---|---|---|---|---|
| Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery |
| Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals |
| Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget |

Each task has a deterministic grader that returns a score in [0.0, 1.0].

## Design Decisions

**Abstract variable names:** Variables are named Alpha, Beta, Gamma (or V1, V2,
V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents
from using pretrained knowledge of real-world physics/economics/biology to
shortcut the reasoning process. The agent must reason purely from experimental
data.

**Diverse rule types:** With 8 single-parent types plus interaction rules, the
agent cannot memorize a small set of templates. Many rule types look similar in
narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to
design discriminating experiments.
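
The exponential-versus-linear case can be checked numerically. The sketch below compares `exp(k*x)` with its tangent line `1 + k*x` over a narrow and a wide range; the value of `k` and the two ranges are arbitrary choices for illustration.

```python
import math


def max_relative_gap(k, x_max, n=50):
    """Largest relative gap between exp(k*x) and 1 + k*x on [0, x_max]."""
    xs = [x_max * i / n for i in range(n + 1)]
    return max(abs(math.exp(k * x) - (1 + k * x)) / math.exp(k * x) for x in xs)


narrow = max_relative_gap(k=0.5, x_max=0.2)  # probes clustered near zero
wide = max_relative_gap(k=0.5, x_max=5.0)    # probes spread across the range

print(f"narrow range gap: {narrow:.4f}")
print(f"wide range gap:   {wide:.4f}")
```

On the narrow range the two curves differ by well under one percent, a gap that measurement noise can easily hide; on the wide range they diverge sharply, so a sweep that covers it separates the two rule types.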

## Deploy to HF Spaces

```bash
openenv push --org your-org --token $HF_TOKEN
```

## Run Tests

```bash
pytest tests/ -v
```

## Baseline Scores

Baseline agent (gpt-4o-mini, temperature=0.3):

| Task | Score |
|---|---|
| Easy | ~0.65 |
| Medium | ~0.40 |
| Hard | ~0.25 |
| Average | ~0.43 |

These scores are reproducible via `python baseline_inference.py` with the same model and seed.