---
title: Scientific Hypothesis Lab
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Scientific Hypothesis Lab -- OpenEnv Environment
An RL environment where agents discover hidden causal rules through systematic
experimentation. Built for the [OpenEnv Hub](https://huggingface.co/openenv).
## What it does
Each episode, the agent is presented with a set of **abstract** variables
(e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomised causal world.
Variable names are deliberately opaque so agents cannot leverage pretrained
real-world knowledge -- they must reason purely from experimental evidence.
The hidden rules span **8 single-parent function types** (linear, threshold,
inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear),
**multi-parent interaction rules** (additive, multiplicative, min, max), and
optional **hidden confounders** that inject unexplainable correlated noise.
The agent must:
1. **Design experiments** -- probe variable relationships using interventions,
correlations, counterfactuals, or passive observations
2. **Update beliefs** from noisy experimental results
3. **Submit a hypothesis** -- a structured description of the discovered causal rules
The environment rewards informative experiments, precise hypotheses, calibrated
confidence, and efficient budget use.
## Quick Start
```bash
# Install dependencies
pip install -e .
# Run the server locally
uvicorn server.app:app --port 8000
# In another terminal, run the baseline agent
export OPENAI_API_KEY=sk-...
python baseline_inference.py
```
### Using the Client
```python
import asyncio

from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType


# Async usage
async def main():
    async with HypothesisLabEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(noise_level="low", domain="system_alpha")
        obs = result.observation
        # Run an intervention: set the first variable and observe the second
        result = await env.run_intervention(
            control_variable=obs.available_variables[0],
            control_value=5.0,
            target_variable=obs.available_variables[1],
        )
        print(result.observation.system_message)
        # Submit hypothesis (this ends the episode)
        result = await env.submit_hypothesis(
            hypothesis_text="Beta = 2.1 * Alpha + 3.0",
            confidence=0.85,
        )
        print(f"Score: {result.observation.total_episode_reward}")


asyncio.run(main())

# Sync usage
env = HypothesisLabEnv(base_url="http://localhost:8000").sync()
with env:
    result = env.reset(noise_level="low")
    ...
```
## File Structure
```
hypothesis_lab/
├── openenv.yaml                       # OpenEnv manifest
├── pyproject.toml                     # Project metadata and dependencies
├── requirements.txt                   # Pip fallback dependencies
├── README.md                          # This file
├── models.py                          # Pydantic Action / Observation / State models
├── client.py                          # Typed EnvClient for agents and trainers
├── __init__.py                        # Module exports
├── baseline_inference.py              # Baseline agent using OpenAI API
├── Dockerfile                         # For HF Spaces deployment
├── server/
│   ├── __init__.py
│   ├── app.py                         # FastAPI server (create_app entry point)
│   ├── hypothesis_lab_environment.py  # Core environment logic
│   ├── causal_world.py                # Hidden causal graph generator
│   └── rubric.py                      # Multi-component reward engine
├── tasks/
│   ├── __init__.py
│   ├── task_easy.py                   # Easy: 2 vars, low noise, 12 budget
│   ├── task_medium.py                 # Medium: 3 vars, medium noise, 10 budget
│   └── task_hard.py                   # Hard: 4 vars, high noise, 8 budget
└── tests/
    ├── __init__.py
    └── test_environment.py            # Unit + integration tests
```
## Action Space
**HypLabAction** has two modes:
| Field | Type | Description |
|---|---|---|
| `action_type` | `"experiment"` or `"submit"` | What the agent is doing |
| `experiment_type` | `"intervention"`, `"correlation"`, `"counterfactual"`, `"passive"` | Experiment kind (experiment mode) |
| `control_variable` | `str` | Variable to set/vary |
| `control_value` | `float` | Value to set (intervention/counterfactual) |
| `control_range` | `[min, max, n]` | Sweep range (correlation only) |
| `target_variable` | `str` | Variable to observe |
| `hypothesis_text` | `str` | Free-text hypothesis (submit mode) |
| `hypothesis_equations` | `list[str]` | Structured equations (submit mode) |
| `confidence` | `float [0,1]` | Self-reported confidence (submit mode) |
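
As a rough illustration of the two modes, the sketch below mirrors the table's fields in a plain dataclass and checks which fields each mode needs. `ActionSketch` and `validate` are hypothetical names for this example, not the Pydantic `HypLabAction` defined in `models.py`, and the required-field rules are assumptions inferred from the table rather than the environment's actual validation.

```python
from dataclasses import dataclass
from typing import Optional

EXPERIMENT_TYPES = {"intervention", "correlation", "counterfactual", "passive"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in that mirrors the fields in the table above."""
    action_type: str
    experiment_type: Optional[str] = None
    control_variable: Optional[str] = None
    control_value: Optional[float] = None
    control_range: Optional[list] = None  # [min, max, n]
    target_variable: Optional[str] = None
    hypothesis_text: Optional[str] = None
    hypothesis_equations: Optional[list] = None
    confidence: Optional[float] = None


def validate(action: ActionSketch) -> list:
    """Return a list of problems; an empty list means the action looks well-formed."""
    problems = []
    if action.action_type == "experiment":
        if action.experiment_type not in EXPERIMENT_TYPES:
            problems.append("unknown experiment_type")
        # Assumed rule: a fixed value is needed when setting a variable.
        if (action.experiment_type in ("intervention", "counterfactual")
                and action.control_value is None):
            problems.append("control_value is required")
        # Assumed rule: a sweep range is needed for correlation experiments.
        if action.experiment_type == "correlation" and action.control_range is None:
            problems.append("control_range is required")
    elif action.action_type == "submit":
        if not action.hypothesis_text and not action.hypothesis_equations:
            problems.append("a hypothesis is required")
        if action.confidence is None or not 0.0 <= action.confidence <= 1.0:
            problems.append("confidence must be in [0, 1]")
    else:
        problems.append("action_type must be 'experiment' or 'submit'")
    return problems
```

For example, `validate(ActionSketch(action_type="submit", hypothesis_text="Beta = 2*Alpha", confidence=1.5))` reports only the out-of-range confidence.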
## Observation Space
**HypLabObservation** always contains:
- `system_message`: Human-readable text the LLM reads
- `available_variables`: Variable names in this episode
- `budget_remaining`: Steps left
- `done`: Whether episode ended
- `reward`: Step reward
On experiment steps: `result_value`, `noise_sigma`, `info_gain_reward`, `is_redundant`
On submit: `accuracy_score`, `precision_bonus`, `calibration_score`, `efficiency_bonus`, `contradiction_penalty`, `total_episode_reward`, `ground_truth_revealed`
## Causal Rule Types
The hidden world can contain any of these relationship types:
| Rule | Formula | Shape |
|---|---|---|
| Linear | `y = a*x + b` | Straight line |
| Threshold | `y = high if x > t else low` | Step function |
| Inverse | `y = a / x` | Hyperbola |
| Quadratic | `y = a*x² + b*x + c` | Parabola |
| Exponential | `y = a * exp(k*x)` | Growth/decay |
| Logarithmic | `y = a * ln(x) + b` | Diminishing returns |
| Saturating | `y = Vmax * x / (Km + x)` | Plateau (Michaelis-Menten) |
| Piecewise-linear | Two slopes with a knot | Regime change |
Additionally, some effects may depend on **two parents** via interaction rules
(additive, multiplicative, min, max), and **hidden confounders** may inject
correlated noise the agent cannot explain.
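
To make the formulas concrete, here is a minimal sketch of each rule as a plain Python function. The function names and parameter names are illustrative and are not taken from `causal_world.py`; the piecewise-linear parameterisation (continuous at the knot) and the way two parent contributions are combined are assumptions for this example.

```python
import math


def linear(x, a, b):
    return a * x + b


def threshold(x, t, low, high):
    return high if x > t else low


def inverse(x, a):
    return a / x  # undefined at x == 0


def quadratic(x, a, b, c):
    return a * x**2 + b * x + c


def exponential(x, a, k):
    return a * math.exp(k * x)  # growth for k > 0, decay for k < 0


def logarithmic(x, a, b):
    return a * math.log(x) + b  # requires x > 0


def saturating(x, vmax, km):
    return vmax * x / (km + x)  # approaches vmax as x grows


def piecewise_linear(x, knot, slope_lo, slope_hi, intercept):
    # Assumed form: two slopes joined continuously at the knot.
    if x <= knot:
        return intercept + slope_lo * x
    return intercept + slope_lo * knot + slope_hi * (x - knot)


# Assumed form: each parent yields a contribution, and the pair is combined.
INTERACTIONS = {
    "additive": lambda p, q: p + q,
    "multiplicative": lambda p, q: p * q,
    "min": min,
    "max": max,
}
```

For instance, `INTERACTIONS["multiplicative"](linear(2.0, 1.0, 0.0), saturating(2.0, 10.0, 2.0))` combines a linear and a saturating parent contribution into one effect value.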
## Reward Components
| Signal | Value | What it trains |
|---|---|---|
| Information gain | +0.05 to +0.25/step | Designing informative experiments |
| Redundant experiment | -0.10 | Not wasting budget |
| Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer |
| Precision bonus | +0.10 | Quantitative, falsifiable claims |
| Calibration score | 0.0 to +0.20 | Knowing what you don't know |
| Efficiency bonus | +0.15 | Submitting early when confident |
| Contradiction penalty | -0.50 | Contradicting the experimental setup |
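
The table does not say how these signals are combined, so the following is only a sketch that assumes a plain sum of per-step rewards and submit-time components. `total_reward` is a hypothetical helper, not the logic in `rubric.py`; the constants are the values from the table.

```python
def total_reward(step_rewards, accuracy, precise, calibration, early, contradicted):
    """Sum the reward signals from the table (assumed aggregation: plain sum).

    step_rewards: per-step values (information gain or the redundancy penalty)
    accuracy:     hypothesis accuracy in [0.0, 1.0]
    precise:      True if the hypothesis earned the precision bonus
    calibration:  calibration score in [0.0, 0.20]
    early:        True if the efficiency bonus applies
    contradicted: True if the contradiction penalty applies
    """
    total = sum(step_rewards)
    total += accuracy
    total += 0.10 if precise else 0.0
    total += calibration
    total += 0.15 if early else 0.0
    total -= 0.50 if contradicted else 0.0
    return total


# Two informative experiments, one redundant one, then a good submission.
episode = total_reward(
    step_rewards=[0.25, 0.05, -0.10],
    accuracy=0.8,
    precise=True,
    calibration=0.1,
    early=True,
    contradicted=False,
)
print(f"Illustrative episode total: {episode:.2f}")
```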
## Tasks (3 difficulty levels)
| Task | Noise | Variables | Budget | Domain | Key Challenge |
|---|---|---|---|---|---|
| Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery |
| Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals |
| Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget |
Each task has a deterministic grader that returns a score in [0.0, 1.0].
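
To see why the Hard task's mix of high noise and a small budget is demanding, the sketch below computes the standard error of an averaged reading if the whole budget were spent repeating one experiment. It assumes the Noise column is the standard deviation of independent noise on each reading, which the table does not state; `TASKS` and `standard_error` are illustrative names, not part of the `tasks/` package.

```python
import math

# Values copied from the task table above.
TASKS = {
    "easy": {"noise": 0.05, "variables": 2, "budget": 12},
    "medium": {"noise": 0.20, "variables": 3, "budget": 10},
    "hard": {"noise": 0.50, "variables": 4, "budget": 8},
}


def standard_error(noise_sigma, repeats):
    """Standard error of the mean of `repeats` independent noisy readings."""
    return noise_sigma / math.sqrt(repeats)


for name, cfg in TASKS.items():
    # Best case for a single measurement: the entire budget goes into repeats.
    se = standard_error(cfg["noise"], cfg["budget"])
    print(f"{name}: standard error after {cfg['budget']} repeats = {se:.3f}")
```

In practice the budget also has to cover several distinct variable pairs, so the achievable precision per relationship is lower than this best case.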
## Design Decisions
**Abstract variable names:** Variables are named Alpha, Beta, Gamma (or V1, V2,
V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents
from using pretrained knowledge of real-world physics/economics/biology to
shortcut the reasoning process. The agent must reason purely from experimental
data.
**Diverse rule types:** With 8 single-parent types plus interaction rules, the
agent cannot memorize a small set of templates. Many rule types look similar in
narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to
design discriminating experiments.
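
The exponential-versus-linear case can be checked numerically. The sketch below compares `a * exp(k*x)` with its first-order Taylor expansion `a * (1 + k*x)`; the parameter values `a=1.0` and `k=0.1` are arbitrary choices for illustration, not values used by the environment.

```python
import math


def exponential(x, a=1.0, k=0.1):
    return a * math.exp(k * x)


def linear_approx(x, a=1.0, k=0.1):
    # First-order Taylor expansion of a * exp(k*x) around x = 0.
    return a * (1.0 + k * x)


def relative_gap(x):
    """Relative difference between the exponential and its linear approximation."""
    return abs(exponential(x) - linear_approx(x)) / exponential(x)


for x in (0.5, 1.0, 10.0, 30.0):
    print(f"x={x:>5}: relative gap = {relative_gap(x):.4f}")
```

With these parameters the gap is well under 1% near `x = 0.5`, which noise can easily hide, and exceeds 50% by `x = 30`, so probing a wide range of control values is what separates the two rule types.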
## Deploy to HF Spaces
```bash
openenv push --org your-org --token $HF_TOKEN
```
## Run Tests
```bash
pytest tests/ -v
```
## Baseline Scores
Baseline agent (gpt-4o-mini, temperature=0.3):
| Task | Score |
|---|---|
| Easy | ~0.65 |
| Medium | ~0.40 |
| Hard | ~0.25 |
| Average | ~0.43 |
These approximate scores can be reproduced by running `python baseline_inference.py` with the same model and seed.