---
title: Scientific Hypothesis Lab
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Scientific Hypothesis Lab -- OpenEnv Environment
An RL environment where agents discover hidden causal rules through systematic
experimentation. Built for the [OpenEnv Hub](https://huggingface.co/openenv).
## What it does
Each episode, the agent is presented with a set of **abstract** variables
(e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomised causal world.
Variable names are deliberately opaque so agents cannot leverage pretrained
real-world knowledge -- they must reason purely from experimental evidence.
The hidden rules span **8 single-parent function types** (linear, threshold,
inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear),
**multi-parent interaction rules** (additive, multiplicative, min, max), and
optional **hidden confounders** that inject unexplainable correlated noise.
The agent must:
1. **Design experiments** -- probe variable relationships using interventions,
correlations, counterfactuals, or passive observations
2. **Update beliefs** from noisy experimental results
3. **Submit a hypothesis** -- a structured description of the discovered causal rules
The environment rewards informative experiments, precise hypotheses, calibrated
confidence, and efficient budget use.
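The three steps above can be tried offline with a toy stand-in for the environment. This sketch does not talk to the real server: `hidden_rule`, the noise level, and the sweep values are all hypothetical, and ordinary least squares is only one possible way to update beliefs from noisy results.

```python
import random

def hidden_rule(alpha: float) -> float:
    """Hypothetical hidden rule; in the real environment this is unknown."""
    return 2.0 * alpha + 3.0

def run_intervention(alpha: float, rng: random.Random, sigma: float = 0.05) -> float:
    """Set Alpha and observe Beta with Gaussian noise (toy stand-in for the server)."""
    return hidden_rule(alpha) + rng.gauss(0.0, sigma)

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)

# 1. Design experiments: spread interventions over a wide range of Alpha.
xs = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
ys = [run_intervention(x, rng) for x in xs]

# 2. Update beliefs from the noisy results.
slope, intercept = fit_line(xs, ys)

# 3. Turn the estimate into a quantitative hypothesis string.
hypothesis = f"Beta = {slope:.2f} * Alpha + {intercept:.2f}"
print(hypothesis)
```

Spreading the interventions widely keeps the slope estimate stable under noise, which matters more as the budget shrinks.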
## Quick Start
```bash
# Install dependencies
pip install -e .
# Run the server locally
uvicorn server.app:app --port 8000
# In another terminal, run the baseline agent
export OPENAI_API_KEY=sk-...
python baseline_inference.py
```
### Using the Client
```python
import asyncio

from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType


# Async usage
async def main():
    async with HypothesisLabEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(noise_level="low", domain="system_alpha")
        obs = result.observation

        # Run an intervention
        result = await env.run_intervention(
            control_variable=obs.available_variables[0],
            control_value=5.0,
            target_variable=obs.available_variables[1],
        )
        print(result.observation.system_message)

        # Submit hypothesis
        result = await env.submit_hypothesis(
            hypothesis_text="Beta = 2.1 * Alpha + 3.0",
            confidence=0.85,
        )
        print(f"Score: {result.observation.total_episode_reward}")


asyncio.run(main())

# Sync usage
env = HypothesisLabEnv(base_url="http://localhost:8000").sync()
with env:
    result = env.reset(noise_level="low")
    ...
```
## File Structure
```
hypothesis_lab/
├── openenv.yaml                       # OpenEnv manifest
├── pyproject.toml                     # Project metadata and dependencies
├── requirements.txt                   # Pip fallback dependencies
├── README.md                          # This file
├── models.py                          # Pydantic Action / Observation / State models
├── client.py                          # Typed EnvClient for agents and trainers
├── __init__.py                        # Module exports
├── baseline_inference.py              # Baseline agent using OpenAI API
├── Dockerfile                         # For HF Spaces deployment
├── server/
│   ├── __init__.py
│   ├── app.py                         # FastAPI server (create_app entry point)
│   ├── hypothesis_lab_environment.py  # Core environment logic
│   ├── causal_world.py                # Hidden causal graph generator
│   └── rubric.py                      # Multi-component reward engine
├── tasks/
│   ├── __init__.py
│   ├── task_easy.py                   # Easy: 2 vars, low noise, 12 budget
│   ├── task_medium.py                 # Medium: 3 vars, medium noise, 10 budget
│   └── task_hard.py                   # Hard: 4 vars, high noise, 8 budget
└── tests/
    ├── __init__.py
    └── test_environment.py            # Unit + integration tests
```
## Action Space
**HypLabAction** has two modes:
| Field | Type | Description |
|---|---|---|
| `action_type` | `"experiment"` or `"submit"` | What the agent is doing |
| `experiment_type` | `"intervention"`, `"correlation"`, `"counterfactual"`, `"passive"` | Experiment kind (experiment mode) |
| `control_variable` | `str` | Variable to set/vary |
| `control_value` | `float` | Value to set (intervention/counterfactual) |
| `control_range` | `[min, max, n]` | Sweep range (correlation only) |
| `target_variable` | `str` | Variable to observe |
| `hypothesis_text` | `str` | Free-text hypothesis (submit mode) |
| `hypothesis_equations` | `list[str]` | Structured equations (submit mode) |
| `confidence` | `float [0,1]` | Self-reported confidence (submit mode) |
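To make the two modes concrete, here is a minimal sketch that builds action payloads as plain dictionaries. The field names and allowed values come from the table; the builder functions, the choice to omit unset fields, and the validation are hypothetical, and the real Pydantic model in `models.py` may apply different rules.

```python
EXPERIMENT_TYPES = {"intervention", "correlation", "counterfactual", "passive"}

def experiment_action(experiment_type, target_variable, control_variable=None,
                      control_value=None, control_range=None):
    """Build an experiment-mode payload, omitting fields that are not set."""
    if experiment_type not in EXPERIMENT_TYPES:
        raise ValueError(f"unknown experiment_type: {experiment_type!r}")
    fields = {
        "action_type": "experiment",
        "experiment_type": experiment_type,
        "control_variable": control_variable,
        "control_value": control_value,   # intervention / counterfactual
        "control_range": control_range,   # [min, max, n], correlation only
        "target_variable": target_variable,
    }
    return {key: value for key, value in fields.items() if value is not None}

def submit_action(hypothesis_text, confidence, hypothesis_equations=None):
    """Build a submit-mode payload; confidence must lie in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    fields = {
        "action_type": "submit",
        "hypothesis_text": hypothesis_text,
        "hypothesis_equations": hypothesis_equations,
        "confidence": confidence,
    }
    return {key: value for key, value in fields.items() if value is not None}

sweep = experiment_action("correlation", target_variable="Beta",
                          control_variable="Alpha", control_range=[0.0, 10.0, 5])
final = submit_action("Beta = 2.1 * Alpha + 3.0", confidence=0.85,
                      hypothesis_equations=["Beta = 2.1 * Alpha + 3.0"])
print(sweep)
print(final)
```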
## Observation Space
**HypLabObservation** always contains:
- `system_message`: Human-readable text the LLM reads
- `available_variables`: Variable names in this episode
- `budget_remaining`: Steps left
- `done`: Whether episode ended
- `reward`: Step reward
On experiment steps: `result_value`, `noise_sigma`, `info_gain_reward`, `is_redundant`
On submit: `accuracy_score`, `precision_bonus`, `calibration_score`, `efficiency_bonus`, `contradiction_penalty`, `total_episode_reward`, `ground_truth_revealed`
## Causal Rule Types
The hidden world can contain any of these relationship types:
| Rule | Formula | Shape |
|---|---|---|
| Linear | `y = a*x + b` | Straight line |
| Threshold | `y = high if x > t else low` | Step function |
| Inverse | `y = a / x` | Hyperbola |
| Quadratic | `y = a*x² + b*x + c` | Parabola |
| Exponential | `y = a * exp(k*x)` | Growth/decay |
| Logarithmic | `y = a * ln(x) + b` | Diminishing returns |
| Saturating | `y = Vmax * x / (Km + x)` | Plateau (Michaelis-Menten) |
| Piecewise-linear | Two slopes with a knot | Regime change |
Additionally, some effects may depend on **two parents** via interaction rules
(additive, multiplicative, min, max), and **hidden confounders** may inject
correlated noise the agent cannot explain.
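The rule shapes above can be written as small Python functions for experimenting offline. This is a sketch only: the formulas follow the table, but the piecewise-linear parametrisation (continuous at the knot) and the way two parent contributions are combined are assumptions, not a copy of `causal_world.py`.

```python
import math
import operator

def linear(x, a, b):
    return a * x + b

def threshold(x, t, low, high):
    return high if x > t else low

def inverse(x, a):
    return a / x  # undefined at x = 0

def quadratic(x, a, b, c):
    return a * x ** 2 + b * x + c

def exponential(x, a, k):
    return a * math.exp(k * x)

def logarithmic(x, a, b):
    return a * math.log(x) + b  # requires x > 0

def saturating(x, vmax, km):
    return vmax * x / (km + x)

def piecewise_linear(x, knot, slope_lo, slope_hi, offset=0.0):
    """Two slopes joined continuously at `knot` (assumed parametrisation)."""
    if x <= knot:
        return offset + slope_lo * x
    return offset + slope_lo * knot + slope_hi * (x - knot)

# Assumed: an interaction rule combines the two parents' contributions.
INTERACTIONS = {
    "additive": operator.add,
    "multiplicative": operator.mul,
    "min": min,
    "max": max,
}

def interact(kind, contribution_1, contribution_2):
    return INTERACTIONS[kind](contribution_1, contribution_2)

from_alpha = linear(2.0, a=3.0, b=1.0)
from_beta = saturating(2.0, vmax=10.0, km=2.0)
print(interact("multiplicative", from_alpha, from_beta))
```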
## Reward Components
| Signal | Value | What it trains |
|---|---|---|
| Information gain | +0.05 to +0.25/step | Designing informative experiments |
| Redundant experiment | -0.10 | Not wasting budget |
| Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer |
| Precision bonus | +0.10 | Quantitative, falsifiable claims |
| Calibration score | 0.0 to +0.20 | Knowing what you don't know |
| Efficiency bonus | +0.15 | Submitting early when confident |
| Contradiction penalty | -0.50 | Not contradicting the experimental setup |
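As a worked example of how these signals might add up over an episode, the sketch below sums them. The table does not say how the signals are combined, so the plain sum is an assumption and may not match `rubric.py`; the function name and its boolean flags are hypothetical.

```python
def episode_total(step_rewards, accuracy, calibration,
                  precise=False, efficient=False, contradicted=False):
    """Sum the reward signals from the table (additive combination is assumed)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0.0, 1.0]")
    if not 0.0 <= calibration <= 0.20:
        raise ValueError("calibration must be in [0.0, 0.20]")
    total = sum(step_rewards) + accuracy + calibration
    if precise:
        total += 0.10   # precision bonus
    if efficient:
        total += 0.15   # efficiency bonus
    if contradicted:
        total -= 0.50   # contradiction penalty
    return total

# Two informative experiments and one redundant one, then a good submission.
steps = [0.25, 0.10, -0.10]
total = episode_total(steps, accuracy=0.8, calibration=0.15,
                      precise=True, efficient=True)
print(f"{total:.2f}")
```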
## Tasks (3 difficulty levels)
| Task | Noise | Variables | Budget | Domain | Key Challenge |
|---|---|---|---|---|---|
| Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery |
| Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals |
| Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget |
Each task has a deterministic grader that returns a score in [0.0, 1.0].
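For scripts that iterate over the difficulty levels, the table can be mirrored as a small lookup. The values are copied from the table; the `TaskSpec` class and `TASKS` dictionary are hypothetical names for illustration and are not part of the package.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TaskSpec:
    noise: float
    n_variables: int
    budget: int
    domain: Optional[str]  # None stands for a randomly chosen domain

TASKS = {
    "easy": TaskSpec(noise=0.05, n_variables=2, budget=12, domain="system_alpha"),
    "medium": TaskSpec(noise=0.20, n_variables=3, budget=10, domain=None),
    "hard": TaskSpec(noise=0.50, n_variables=4, budget=8, domain=None),
}

for name, spec in TASKS.items():
    # Budget per variable shrinks as difficulty grows.
    print(name, spec.budget / spec.n_variables)
```

The budget per variable drops from 6 experiments on Easy to 2 on Hard while noise rises tenfold, which is why experiment selection matters most on the hard task.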
## Design Decisions
**Abstract variable names:** Variables are named Alpha, Beta, Gamma (or V1, V2,
V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents
from using pretrained knowledge of real-world physics/economics/biology to
shortcut the reasoning process. The agent must reason purely from experimental
data.
**Diverse rule types:** With 8 single-parent types plus interaction rules, the
agent cannot memorize a small set of templates. Many rule types look similar in
narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to
design discriminating experiments.
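The point about similar-looking rules can be checked numerically. The sketch below compares an exponential rule with its first-order linear approximation over a narrow and a wide range; the parameters `a = 1.0`, `k = 0.1` and both ranges are arbitrary choices for illustration.

```python
import math

def exponential(x, a=1.0, k=0.1):
    return a * math.exp(k * x)

def linear_approx(x, a=1.0, k=0.1):
    # First-order Taylor expansion of a*exp(k*x) around x = 0.
    return a * (1.0 + k * x)

def max_gap(xs):
    """Largest absolute difference between the two rules over the sampled xs."""
    return max(abs(exponential(x) - linear_approx(x)) for x in xs)

narrow = [i * 0.1 for i in range(11)]   # x in [0, 1]
wide = [float(i) for i in range(21)]    # x in [0, 20]

print(f"narrow range gap: {max_gap(narrow):.4f}")
print(f"wide range gap:   {max_gap(wide):.4f}")
```

On the narrow range the gap stays below even the Easy task's noise level of 0.05, so the two rules cannot be told apart there; a discriminating experiment has to probe far enough out for the curves to separate.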
## Deploy to HF Spaces
```bash
openenv push --org your-org --token $HF_TOKEN
```
## Run Tests
```bash
pytest tests/ -v
```
## Baseline Scores
Baseline agent (gpt-4o-mini, temperature=0.3):
| Task | Score |
|---|---|
| Easy | ~0.65 |
| Medium | ~0.40 |
| Hard | ~0.25 |
| Average | ~0.43 |
These scores are reproducible via `python baseline_inference.py` with the same model and seed.