| title: Scientific Hypothesis Lab | |
| emoji: 🔬 | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| pinned: false | |
| app_port: 8000 | |
| base_path: /web | |
| tags: | |
| - openenv | |
| # Scientific Hypothesis Lab -- OpenEnv Environment | |
| An RL environment where agents discover hidden causal rules through systematic | |
| experimentation. Built for the [OpenEnv Hub](https://huggingface.co/openenv). | |
| ## What it does | |
| Each episode, the agent is presented with a set of **abstract** variables | |
| (e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomised causal world. | |
| Variable names are deliberately opaque so agents cannot leverage pretrained | |
| real-world knowledge -- they must reason purely from experimental evidence. | |
| The hidden rules span **8 single-parent function types** (linear, threshold, | |
| inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear), | |
| **multi-parent interaction rules** (additive, multiplicative, min, max), and | |
| optional **hidden confounders** that inject unexplainable correlated noise. | |
| The agent must: | |
| 1. **Design experiments** -- probe variable relationships using interventions, | |
| correlations, counterfactuals, or passive observations | |
| 2. **Update beliefs** from noisy experimental results | |
| 3. **Submit a hypothesis** -- a structured description of the discovered causal rules | |
| The environment rewards informative experiments, precise hypotheses, calibrated | |
| confidence, and efficient budget use. | |
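The three steps above can be sketched offline, without the server. This is a minimal illustration, not the environment's code: `hidden_rule`, its coefficients, and the noise level are hypothetical stand-ins for a single linear edge, and the belief update is an ordinary least-squares fit.

```python
import random


def hidden_rule(x: float, rng: random.Random, sigma: float = 0.05) -> float:
    """Hypothetical hidden rule: Beta = 2.0 * Alpha + 3.0 plus Gaussian noise."""
    return 2.0 * x + 3.0 + rng.gauss(0.0, sigma)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


rng = random.Random(0)

# 1. Design experiments: spread interventions over a wide range of Alpha.
control_values = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

# 2. Update beliefs: fit slope and intercept to the noisy results.
results = [hidden_rule(x, rng) for x in control_values]
slope, intercept = fit_line(control_values, results)

# 3. Submit a hypothesis as a quantitative equation.
hypothesis = f"Beta = {slope:.2f} * Alpha + {intercept:.2f}"
print(hypothesis)
```

Spreading the control values widely keeps the slope estimate stable under noise; clustering them near one point would make the fit far more sensitive to each noisy reading.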
| ## Quick Start | |
| ```bash | |
| # Install dependencies | |
| pip install -e . | |
| # Run the server locally | |
| uvicorn server.app:app --port 8000 | |
| # In another terminal, run the baseline agent | |
| export OPENAI_API_KEY=sk-... | |
| python baseline_inference.py | |
| ``` | |
| ### Using the Client | |
| ```python | |
| import asyncio | |
| from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType | |
| # Async usage: `async with` must run inside a coroutine | |
| async def main(): | |
|     async with HypothesisLabEnv(base_url="http://localhost:8000") as env: | |
|         result = await env.reset(noise_level="low", domain="system_alpha") | |
|         obs = result.observation | |
|         # Run an intervention | |
|         result = await env.run_intervention( | |
|             control_variable=obs.available_variables[0], | |
|             control_value=5.0, | |
|             target_variable=obs.available_variables[1], | |
|         ) | |
|         print(result.observation.system_message) | |
|         # Submit hypothesis | |
|         result = await env.submit_hypothesis( | |
|             hypothesis_text="Beta = 2.1 * Alpha + 3.0", | |
|             confidence=0.85, | |
|         ) | |
|         print(f"Score: {result.observation.total_episode_reward}") | |
| asyncio.run(main()) | |
| # Sync usage | |
| env = HypothesisLabEnv(base_url="http://localhost:8000").sync() | |
| with env: | |
|     result = env.reset(noise_level="low") | |
|     ... | |
| ``` | |
| ## File Structure | |
| ``` | |
| hypothesis_lab/ | |
| ├── openenv.yaml # OpenEnv manifest | |
| ├── pyproject.toml # Project metadata and dependencies | |
| ├── requirements.txt # Pip fallback dependencies | |
| ├── README.md # This file | |
| ├── models.py # Pydantic Action / Observation / State models | |
| ├── client.py # Typed EnvClient for agents and trainers | |
| ├── __init__.py # Module exports | |
| ├── baseline_inference.py # Baseline agent using OpenAI API | |
| ├── Dockerfile # For HF Spaces deployment | |
| ├── server/ | |
| │   ├── __init__.py | |
| │   ├── app.py # FastAPI server (create_app entry point) | |
| │   ├── hypothesis_lab_environment.py # Core environment logic | |
| │   ├── causal_world.py # Hidden causal graph generator | |
| │   └── rubric.py # Multi-component reward engine | |
| ├── tasks/ | |
| │   ├── __init__.py | |
| │   ├── task_easy.py # Easy: 2 vars, low noise, 12 budget | |
| │   ├── task_medium.py # Medium: 3 vars, medium noise, 10 budget | |
| │   └── task_hard.py # Hard: 4 vars, high noise, 8 budget | |
| └── tests/ | |
|     ├── __init__.py | |
|     └── test_environment.py # Unit + integration tests | |
| ``` | |
| ## Action Space | |
| **HypLabAction** has two modes, `experiment` and `submit`, selected by `action_type`: | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `action_type` | `"experiment"` or `"submit"` | What the agent is doing | | |
| | `experiment_type` | `"intervention"`, `"correlation"`, `"counterfactual"`, `"passive"` | Experiment kind (experiment mode) | | |
| | `control_variable` | `str` | Variable to set/vary | | |
| | `control_value` | `float` | Value to set (intervention/counterfactual) | | |
| | `control_range` | `[min, max, n]` | Sweep range (correlation only) | | |
| | `target_variable` | `str` | Variable to observe | | |
| | `hypothesis_text` | `str` | Free-text hypothesis (submit mode) | | |
| | `hypothesis_equations` | `list[str]` | Structured equations (submit mode) | | |
| | `confidence` | `float [0,1]` | Self-reported confidence (submit mode) | | |
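To make the two modes concrete, here is a minimal sketch of the action schema using only the standard library. `ActionSketch` is a hypothetical stand-in, not the Pydantic model in `models.py`; the field names follow the table, while the validation rules (for example, that a correlation experiment requires a three-element `control_range`) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EXPERIMENT_TYPES = {"intervention", "correlation", "counterfactual", "passive"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for HypLabAction; field names follow the table."""

    action_type: str                              # "experiment" or "submit"
    experiment_type: Optional[str] = None
    control_variable: Optional[str] = None
    control_value: Optional[float] = None
    control_range: Optional[List[float]] = None   # [min, max, n]
    target_variable: Optional[str] = None
    hypothesis_text: Optional[str] = None
    hypothesis_equations: List[str] = field(default_factory=list)
    confidence: Optional[float] = None

    def validate(self) -> None:
        """Raise ValueError if the fields do not fit the chosen mode (assumed rules)."""
        if self.action_type == "experiment":
            if self.experiment_type not in EXPERIMENT_TYPES:
                raise ValueError(f"unknown experiment_type: {self.experiment_type!r}")
            if self.experiment_type == "correlation":
                if self.control_range is None or len(self.control_range) != 3:
                    raise ValueError("correlation needs control_range=[min, max, n]")
        elif self.action_type == "submit":
            if not self.hypothesis_text and not self.hypothesis_equations:
                raise ValueError("submit needs hypothesis text or equations")
            if self.confidence is None or not 0.0 <= self.confidence <= 1.0:
                raise ValueError("confidence must lie in [0, 1]")
        else:
            raise ValueError(f"unknown action_type: {self.action_type!r}")


# Experiment mode: sweep Alpha from 0 to 10 in 5 steps and observe Beta.
sweep = ActionSketch(
    action_type="experiment",
    experiment_type="correlation",
    control_variable="Alpha",
    control_range=[0.0, 10.0, 5],
    target_variable="Beta",
)
sweep.validate()

# Submit mode: a quantitative hypothesis with self-reported confidence.
final = ActionSketch(
    action_type="submit",
    hypothesis_text="Beta = 2.1 * Alpha + 3.0",
    hypothesis_equations=["Beta = 2.1 * Alpha + 3.0"],
    confidence=0.85,
)
final.validate()
```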
| ## Observation Space | |
| **HypLabObservation** always contains: | |
| - `system_message`: Human-readable text the LLM reads | |
| - `available_variables`: Variable names in this episode | |
| - `budget_remaining`: Steps left | |
| - `done`: Whether episode ended | |
| - `reward`: Step reward | |
| On experiment steps: `result_value`, `noise_sigma`, `info_gain_reward`, `is_redundant` | |
| On submit: `accuracy_score`, `precision_bonus`, `calibration_score`, `efficiency_bonus`, `contradiction_penalty`, `total_episode_reward`, `ground_truth_revealed` | |
| ## Causal Rule Types | |
| The hidden world can contain any of these relationship types: | |
| | Rule | Formula | Shape | | |
| |---|---|---| | |
| | Linear | `y = a*x + b` | Straight line | | |
| | Threshold | `y = high if x > t else low` | Step function | | |
| | Inverse | `y = a / x` | Hyperbola | | |
| | Quadratic | `y = a*x² + b*x + c` | Parabola | | |
| | Exponential | `y = a * exp(k*x)` | Growth/decay | | |
| | Logarithmic | `y = a * ln(x) + b` | Diminishing returns | | |
| | Saturating | `y = Vmax * x / (Km + x)` | Plateau (Michaelis-Menten) | | |
| | Piecewise-linear | Two slopes with a knot | Regime change | | |
| Additionally, some effects may depend on **two parents** via interaction rules | |
| (additive, multiplicative, min, max), and **hidden confounders** may inject | |
| correlated noise the agent cannot explain. | |
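The formulas in the table translate directly into small Python functions. This is a sketch for experimenting offline, not the code in `server/causal_world.py`: the parameter names are illustrative, the piecewise-linear parameterization (continuous at the knot) is an assumption, and so is the way the interaction rules combine two parent contributions.

```python
import math


def linear(x, a, b):
    return a * x + b


def threshold(x, t, low, high):
    return high if x > t else low


def inverse(x, a):
    return a / x  # undefined at x = 0


def quadratic(x, a, b, c):
    return a * x**2 + b * x + c


def exponential(x, a, k):
    return a * math.exp(k * x)


def logarithmic(x, a, b):
    return a * math.log(x) + b  # requires x > 0


def saturating(x, vmax, km):
    return vmax * x / (km + x)


def piecewise_linear(x, knot, slope_lo, slope_hi, intercept):
    """Two slopes joined continuously at the knot (assumed parameterization)."""
    if x <= knot:
        return intercept + slope_lo * x
    return intercept + slope_lo * knot + slope_hi * (x - knot)


# Assumed form of the two-parent interaction rules: each parent contributes
# a value u or v, and the rule combines the two contributions.
INTERACTIONS = {
    "additive": lambda u, v: u + v,
    "multiplicative": lambda u, v: u * v,
    "min": min,
    "max": max,
}

# Example: an effect driven by two parents through a multiplicative rule.
effect = INTERACTIONS["multiplicative"](linear(2.0, 3.0, 1.0), saturating(4.0, 10.0, 4.0))
```

At `x = Km` the saturating rule returns exactly half of `Vmax`, which makes that point a useful probe when telling a plateau apart from a logarithmic curve.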
| ## Reward Components | |
| | Signal | Value | What it trains | | |
| |---|---|---| | |
| | Information gain | +0.05 to +0.25/step | Designing informative experiments | | |
| | Redundant experiment | -0.10 | Not wasting budget | | |
| | Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer | | |
| | Precision bonus | +0.10 | Quantitative, falsifiable claims | | |
| | Calibration score | 0.0 to +0.20 | Knowing what you don't know | | |
| | Efficiency bonus | +0.15 | Submitting early when confident | | |
| | Contradiction penalty | -0.50 | Contradicting the experimental setup | | |
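The table lists the components but not how they combine. As a rough sketch, assuming the episode return is a plain sum of per-step rewards plus the submit-time terms, the values could be put together as follows. `step_reward` and `submit_reward` are hypothetical helpers; the actual logic lives in `server/rubric.py` and may weight or clip the components differently.

```python
def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def step_reward(info_gain: float, is_redundant: bool) -> float:
    """Per-experiment reward: a flat penalty if redundant, else a bounded bonus."""
    if is_redundant:
        return -0.10
    return clamp(info_gain, 0.05, 0.25)  # assumed clipping to the table's range


def submit_reward(accuracy: float, precise: bool, calibration: float,
                  early: bool, contradicts: bool) -> float:
    """Submit-time reward, assuming the components simply add."""
    total = clamp(accuracy, 0.0, 1.0) + clamp(calibration, 0.0, 0.20)
    total += 0.10 if precise else 0.0      # precision bonus
    total += 0.15 if early else 0.0        # efficiency bonus
    total -= 0.50 if contradicts else 0.0  # contradiction penalty
    return total


# Two informative experiments, one redundant one, then a good submission.
steps = [(0.25, False), (0.10, False), (0.0, True)]
episode_return = sum(step_reward(gain, redundant) for gain, redundant in steps)
episode_return += submit_reward(accuracy=0.8, precise=True, calibration=0.15,
                                early=True, contradicts=False)
```

Under this additive assumption a single contradiction costs more than the precision, calibration, and efficiency bonuses combined (0.45), so it dominates the secondary terms.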
| ## Tasks (3 difficulty levels) | |
| | Task | Noise | Variables | Budget | Domain | Key Challenge | | |
| |---|---|---|---|---|---| | |
| | Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery | | |
| | Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals | | |
| | Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget | | |
| Each task has a deterministic grader that returns a score in [0.0, 1.0]. | |
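The task table can be captured as plain configuration data, which also makes the tightening budget easy to quantify. `TaskSpec` and `experiments_per_variable` are hypothetical names for illustration; the real definitions live under `tasks/`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container mirroring one row of the task table."""

    name: str
    noise: float
    n_variables: int
    budget: int
    domain: str


TASKS = {
    "easy": TaskSpec("easy", 0.05, 2, 12, "system_alpha"),
    "medium": TaskSpec("medium", 0.20, 3, 10, "random"),
    "hard": TaskSpec("hard", 0.50, 4, 8, "random"),
}


def experiments_per_variable(spec: TaskSpec) -> float:
    """Budget steps available per variable, a crude measure of how tight a task is."""
    return spec.budget / spec.n_variables


for spec in TASKS.values():
    print(f"{spec.name}: {experiments_per_variable(spec):.2f} steps per variable")
```

The ratio drops from 6 steps per variable on Easy to 2 on Hard while the noise level rises tenfold, so the harder tasks leave little room for repeated measurements.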
| ## Design Decisions | |
| **Abstract variable names:** Variables are named Alpha, Beta, Gamma (or V1, V2, | |
| V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents | |
| from using pretrained knowledge of real-world physics/economics/biology to | |
| shortcut the reasoning process. The agent must reason purely from experimental | |
| data. | |
| **Diverse rule types:** With 8 single-parent types plus interaction rules, the | |
| agent cannot memorize a small set of templates. Many rule types look similar in | |
| narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to | |
| design discriminating experiments. | |
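The exponential-versus-linear example can be checked numerically. The sketch below compares `a * exp(k*x)` with its first-order Taylor expansion `a * (1 + k*x)`; the parameters `a = 2.0` and `k = 0.1` are hypothetical, chosen only to illustrate the effect.

```python
import math

A, K = 2.0, 0.1  # hypothetical rule parameters


def exponential(x: float) -> float:
    return A * math.exp(K * x)


def tangent_line(x: float) -> float:
    """First-order Taylor expansion of the exponential around x = 0."""
    return A * (1.0 + K * x)


def relative_gap(x: float) -> float:
    """Relative difference between the exponential and its linear look-alike."""
    return abs(exponential(x) - tangent_line(x)) / exponential(x)


# Narrow range: the two rules differ by well under 1%.
narrow_gap = max(relative_gap(x) for x in (0.0, 0.5, 1.0))

# Wide range: a single probe at x = 20 separates them clearly.
wide_gap = relative_gap(20.0)

print(f"max gap on [0, 1]: {narrow_gap:.4f}, gap at x = 20: {wide_gap:.4f}")
```

With these parameters the absolute difference on `[0, 1]` stays near 0.01, below even the Easy task's noise level of 0.05, so only interventions far from the origin can tell the two rule types apart.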
| ## Deploy to HF Spaces | |
| ```bash | |
| openenv push --org your-org --token $HF_TOKEN | |
| ``` | |
| ## Run Tests | |
| ```bash | |
| pytest tests/ -v | |
| ``` | |
| ## Baseline Scores | |
| Baseline agent (gpt-4o-mini, temperature=0.3): | |
| | Task | Score | | |
| |---|---| | |
| | Easy | ~0.65 | | |
| | Medium | ~0.40 | | |
| | Hard | ~0.25 | | |
| | Average | ~0.43 | | |
| These approximate scores can be reproduced with `python baseline_inference.py` using the same model and seed. | |
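The Average row is the unweighted mean of the three task scores, which a few lines of Python confirm. The dictionary simply restates the table's approximate values.

```python
# Approximate baseline scores copied from the table above.
scores = {"Easy": 0.65, "Medium": 0.40, "Hard": 0.25}

# Unweighted mean over the three tasks.
average = sum(scores.values()) / len(scores)
print(f"Average: ~{average:.2f}")
```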