---
title: Scientific Hypothesis Lab
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Scientific Hypothesis Lab -- OpenEnv Environment

An RL environment where agents discover hidden causal rules through systematic experimentation. Built for the [OpenEnv Hub](https://huggingface.co/openenv).

## What it does

Each episode, the agent is presented with a set of **abstract** variables (e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomised causal world. Variable names are deliberately opaque so agents cannot leverage pretrained real-world knowledge -- they must reason purely from experimental evidence.

The hidden rules span **8 single-parent function types** (linear, threshold, inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear), **multi-parent interaction rules** (additive, multiplicative, min, max), and optional **hidden confounders** that inject unexplainable correlated noise.

The agent must:

1. **Design experiments** -- probe variable relationships using interventions, correlations, counterfactuals, or passive observations
2. **Update beliefs** from noisy experimental results
3. **Submit a hypothesis** -- a structured description of the discovered causal rules

The environment rewards informative experiments, precise hypotheses, calibrated confidence, and efficient budget use.

## Quick Start

```bash
# Install dependencies
pip install -e .

# Run the server locally
uvicorn server.app:app --port 8000

# In another terminal, run the baseline agent
export OPENAI_API_KEY=sk-...
python baseline_inference.py
```

### Using the Client

```python
import asyncio

from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType


# Async usage (wrapped in a coroutine so it also runs as a plain script)
async def main():
    async with HypothesisLabEnv(base_url="http://localhost:8000") as env:
        result = await env.reset(noise_level="low", domain="system_alpha")
        obs = result.observation

        # Run an intervention
        result = await env.run_intervention(
            control_variable=obs.available_variables[0],
            control_value=5.0,
            target_variable=obs.available_variables[1],
        )
        print(result.observation.system_message)

        # Submit hypothesis
        result = await env.submit_hypothesis(
            hypothesis_text="Beta = 2.1 * Alpha + 3.0",
            confidence=0.85,
        )
        print(f"Score: {result.observation.total_episode_reward}")


asyncio.run(main())

# Sync usage
env = HypothesisLabEnv(base_url="http://localhost:8000").sync()
with env:
    result = env.reset(noise_level="low")
    ...
```

## File Structure

```
hypothesis_lab/
├── openenv.yaml                      # OpenEnv manifest
├── pyproject.toml                    # Project metadata and dependencies
├── requirements.txt                  # Pip fallback dependencies
├── README.md                         # This file
├── models.py                         # Pydantic Action / Observation / State models
├── client.py                         # Typed EnvClient for agents and trainers
├── __init__.py                       # Module exports
├── baseline_inference.py             # Baseline agent using OpenAI API
├── Dockerfile                        # For HF Spaces deployment
├── server/
│   ├── __init__.py
│   ├── app.py                        # FastAPI server (create_app entry point)
│   ├── hypothesis_lab_environment.py # Core environment logic
│   ├── causal_world.py               # Hidden causal graph generator
│   └── rubric.py                     # Multi-component reward engine
├── tasks/
│   ├── __init__.py
│   ├── task_easy.py                  # Easy: 2 vars, low noise, 12 budget
│   ├── task_medium.py                # Medium: 3 vars, medium noise, 10 budget
│   └── task_hard.py                  # Hard: 4 vars, high noise, 8 budget
└── tests/
    ├── __init__.py
    └── test_environment.py           # Unit + integration tests
```

## Action Space

**HypLabAction** has two modes:

| Field | Type | Description |
|---|---|---|
| `action_type` | `"experiment"` or `"submit"` | What the agent is doing |
| `experiment_type` | `"intervention"`, `"correlation"`, `"counterfactual"`, `"passive"` | Experiment kind (experiment mode) |
| `control_variable` | `str` | Variable to set/vary |
| `control_value` | `float` | Value to set (intervention/counterfactual) |
| `control_range` | `[min, max, n]` | Sweep range (correlation only) |
| `target_variable` | `str` | Variable to observe |
| `hypothesis_text` | `str` | Free-text hypothesis (submit mode) |
| `hypothesis_equations` | `list[str]` | Structured equations (submit mode) |
| `confidence` | `float [0,1]` | Self-reported confidence (submit mode) |

## Observation Space

**HypLabObservation** always contains:

- `system_message`: Human-readable text the LLM reads
- `available_variables`: Variable names in this episode
- `budget_remaining`: Steps left
- `done`: Whether episode ended
- `reward`: Step reward

On experiment steps: `result_value`, `noise_sigma`, `info_gain_reward`, `is_redundant`

On submit: `accuracy_score`, `precision_bonus`, `calibration_score`, `efficiency_bonus`, `contradiction_penalty`, `total_episode_reward`, `ground_truth_revealed`

## Causal Rule Types

The hidden world can contain any of these relationship types:

| Rule | Formula | Shape |
|---|---|---|
| Linear | `y = a*x + b` | Straight line |
| Threshold | `y = high if x > t else low` | Step function |
| Inverse | `y = a / x` | Hyperbola |
| Quadratic | `y = a*x² + b*x + c` | Parabola |
| Exponential | `y = a * exp(k*x)` | Growth/decay |
| Logarithmic | `y = a * ln(x) + b` | Diminishing returns |
| Saturating | `y = Vmax * x / (Km + x)` | Plateau (Michaelis-Menten) |
| Piecewise-linear | Two slopes with a knot | Regime change |

Additionally, some effects may depend on **two parents** via interaction rules (additive, multiplicative, min, max), and **hidden confounders** may inject correlated noise the agent cannot explain.

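
The formulas in the table can be written out as plain Python functions, which may help when fitting candidate rules to experimental results. This is a minimal sketch, not the contents of `server/causal_world.py`: the function names, the piecewise-linear parameterisation (continuous at the knot), and the way the two-parent interaction rules combine their inputs are all assumptions made for illustration.

```python
import math

# Illustrative re-statement of the rule table; names and parameterisations
# are hypothetical and may differ from the environment's implementation.

def linear(x, a, b):
    return a * x + b

def threshold(x, t, low, high):
    return high if x > t else low

def inverse(x, a):
    return a / x  # undefined at x == 0

def quadratic(x, a, b, c):
    return a * x**2 + b * x + c

def exponential(x, a, k):
    return a * math.exp(k * x)

def logarithmic(x, a, b):
    return a * math.log(x) + b  # requires x > 0

def saturating(x, vmax, km):
    return vmax * x / (km + x)

def piecewise_linear(x, knot, slope_lo, slope_hi, intercept):
    # Assumed form: two slopes joined continuously at the knot.
    if x <= knot:
        return intercept + slope_lo * x
    return intercept + slope_lo * knot + slope_hi * (x - knot)

# Assumed semantics: each interaction rule combines the contributions
# p and q of two parent variables into a single effect.
INTERACTIONS = {
    "additive": lambda p, q: p + q,
    "multiplicative": lambda p, q: p * q,
    "min": min,
    "max": max,
}

if __name__ == "__main__":
    print(saturating(2.0, vmax=10.0, km=2.0))   # half of Vmax when x == Km
    print(INTERACTIONS["min"](linear(1.0, 2.0, 0.0), quadratic(1.0, 1.0, 0.0, 5.0)))
```

An agent-side fitting routine could evaluate each candidate function against the observed `result_value` readings and keep the one with the smallest residual.
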
## Reward Components

| Signal | Value | What it trains |
|---|---|---|
| Information gain | +0.05 to +0.25/step | Designing informative experiments |
| Redundant experiment | -0.10 | Not wasting budget |
| Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer |
| Precision bonus | +0.10 | Quantitative, falsifiable claims |
| Calibration score | 0.0 to +0.20 | Knowing what you don't know |
| Efficiency bonus | +0.15 | Submitting early when confident |
| Contradiction penalty | -0.50 | Contradicting the experimental setup |

## Tasks (3 difficulty levels)

| Task | Noise | Variables | Budget | Domain | Key Challenge |
|---|---|---|---|---|---|
| Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery |
| Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals |
| Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget |

Each task has a deterministic grader that returns a score in [0.0, 1.0].

## Design Decisions

**Abstract variable names:** Variables are named Alpha, Beta, Gamma (or V1, V2, V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents from using pretrained knowledge of real-world physics/economics/biology to shortcut the reasoning process. The agent must reason purely from experimental data.

**Diverse rule types:** With 8 single-parent types plus interaction rules, the agent cannot memorize a small set of templates. Many rule types look similar in narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to design discriminating experiments.

## Deploy to HF Spaces

```bash
openenv push --org your-org --token $HF_TOKEN
```

## Run Tests

```bash
pytest tests/ -v
```

## Baseline Scores

Baseline agent (gpt-4o-mini, temperature=0.3):

| Task | Score |
|---|---|
| Easy | ~0.65 |
| Medium | ~0.40 |
| Hard | ~0.25 |
| Average | ~0.43 |

These scores are reproducible via `python baseline_inference.py` with the same model and seed.
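
The Design Decisions note that an exponential rule looks linear over a narrow range can be checked numerically. The sketch below compares `a * exp(k*x)` with its first-order approximation `a * (1 + k*x)` over a narrow and a wide sweep. The coefficients, the sweep ranges, and the reading of the Easy task's noise value 0.05 as a standard deviation are assumptions chosen for illustration, not values taken from the environment.

```python
import math

def exponential(x, a=1.0, k=0.1):
    return a * math.exp(k * x)

def linear_tangent(x, a=1.0, k=0.1):
    # First-order Taylor expansion of a * exp(k*x) around x = 0.
    return a * (1.0 + k * x)

def max_gap(xs, a=1.0, k=0.1):
    """Largest absolute difference between the two rules over the sweep."""
    return max(abs(exponential(x, a, k) - linear_tangent(x, a, k)) for x in xs)

# Hypothetical sweeps, similar in spirit to a correlation experiment's
# control_range of [min, max, n].
narrow = [i * 0.1 for i in range(11)]   # x in [0, 1]
wide = [i * 3.0 for i in range(11)]     # x in [0, 30]

NOISE_SIGMA = 0.05  # assumed: the Easy task's noise level treated as a sigma

if __name__ == "__main__":
    # On the narrow sweep the two rules differ by less than the noise level.
    print(max_gap(narrow) < NOISE_SIGMA)
    # On the wide sweep the gap is far larger than the noise.
    print(max_gap(wide) > NOISE_SIGMA)
```

Under these assumptions, a sweep confined to small x cannot separate the two hypotheses from noise, while a sweep that reaches larger x can, which is why spending budget on a wider control range may pay off.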