	---
	title: Scientific Hypothesis Lab
	emoji: 🔬
	colorFrom: blue
	colorTo: green
	sdk: docker
	pinned: false
	app_port: 8000
	base_path: /web
	tags:
	- openenv
	---

	# Scientific Hypothesis Lab -- OpenEnv Environment

	An RL environment where agents discover hidden causal rules through systematic
	experimentation. Built for the [OpenEnv Hub](https://huggingface.co/openenv).

	## What it does

	Each episode, the agent is presented with a set of abstract variables
	(e.g. Alpha, Beta, Gamma or V1, V2, V3) from a randomised causal world.
	Variable names are deliberately opaque so agents cannot leverage pretrained
	real-world knowledge -- they must reason purely from experimental evidence.

	The hidden rules span 8 single-parent function types (linear, threshold,
	inverse, quadratic, exponential, logarithmic, saturating, piecewise-linear),
	multi-parent interaction rules (additive, multiplicative, min, max), and
	optional hidden confounders that inject unexplainable correlated noise.

	The agent must:

	1. Design experiments -- probe variable relationships using interventions,
	correlations, counterfactuals, or passive observations
	2. Update beliefs from noisy experimental results
	3. Submit a hypothesis -- a structured description of the discovered causal rules

	The environment rewards informative experiments, precise hypotheses, calibrated
	confidence, and efficient budget use.
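The experiment-then-update loop described above can be illustrated without the server. The following is a minimal sketch, assuming a hypothetical hidden linear rule and Gaussian noise; `hidden_rule`, `run_intervention`, and `fit_line` are stand-ins written for this example and are not part of the environment's API.

```python
import random


def hidden_rule(x: float) -> float:
    """Hypothetical ground truth, chosen only for this example."""
    return 2.1 * x + 3.0


def run_intervention(x: float, sigma: float, rng: random.Random) -> float:
    """Stand-in for an intervention: set x, observe a noisy y."""
    return hidden_rule(x) + rng.gauss(0.0, sigma)


def fit_line(xs, ys):
    """Ordinary least squares estimate of (a, b) in y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


rng = random.Random(0)
xs = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]  # six well-spread interventions
ys = [run_intervention(x, sigma=0.05, rng=rng) for x in xs]
a, b = fit_line(xs, ys)
print(f"Estimated rule: y = {a:.2f} * x + {b:.2f}")
```

Spreading the interventions across a wide range keeps the slope estimate tight even with noisy readings, which is the kind of experiment design the information-gain reward is meant to encourage.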

	## Quick Start

	```bash
	# Install dependencies
	pip install -e .

	# Run the server locally
	uvicorn server.app:app --port 8000

	# In another terminal, run the baseline agent
	export OPENAI_API_KEY=sk-...
	python baseline_inference.py
	```

	### Using the Client

	```python
	import asyncio

	from hypothesis_lab import HypothesisLabEnv, HypLabAction, ActionType


	# Async usage
	async def main():
	    async with HypothesisLabEnv(base_url="http://localhost:8000") as env:
	        result = await env.reset(noise_level="low", domain="system_alpha")
	        obs = result.observation

	        # Run an intervention: set the first variable, observe the second
	        result = await env.run_intervention(
	            control_variable=obs.available_variables[0],
	            control_value=5.0,
	            target_variable=obs.available_variables[1],
	        )
	        print(result.observation.system_message)

	        # Submit hypothesis
	        result = await env.submit_hypothesis(
	            hypothesis_text="Beta = 2.1 * Alpha + 3.0",
	            confidence=0.85,
	        )
	        print(f"Score: {result.observation.total_episode_reward}")


	asyncio.run(main())

	# Sync usage
	env = HypothesisLabEnv(base_url="http://localhost:8000").sync()
	with env:
	    result = env.reset(noise_level="low")
	    ...
	```

	## File Structure

	```
	hypothesis_lab/
	├── openenv.yaml # OpenEnv manifest
	├── pyproject.toml # Project metadata and dependencies
	├── requirements.txt # Pip fallback dependencies
	├── README.md # This file
	├── models.py # Pydantic Action / Observation / State models
	├── client.py # Typed EnvClient for agents and trainers
	├── __init__.py # Module exports
	├── baseline_inference.py # Baseline agent using OpenAI API
	├── Dockerfile # For HF Spaces deployment
	├── server/
	│   ├── __init__.py
	│   ├── app.py # FastAPI server (create_app entry point)
	│   ├── hypothesis_lab_environment.py # Core environment logic
	│   ├── causal_world.py # Hidden causal graph generator
	│   └── rubric.py # Multi-component reward engine
	├── tasks/
	│   ├── __init__.py
	│   ├── task_easy.py # Easy: 2 vars, low noise, 12 budget
	│   ├── task_medium.py # Medium: 3 vars, medium noise, 10 budget
	│   └── task_hard.py # Hard: 4 vars, high noise, 8 budget
	└── tests/
	    ├── __init__.py
	    └── test_environment.py # Unit + integration tests
	```

	## Action Space

	`HypLabAction` has two modes, experiment and submit, selected by `action_type`:

	| Field | Type | Description |
	|---|---|---|
	| `action_type` | `"experiment"` or `"submit"` | What the agent is doing |
	| `experiment_type` | `"intervention"`, `"correlation"`, `"counterfactual"`, `"passive"` | Experiment kind (experiment mode) |
	| `control_variable` | `str` | Variable to set/vary |
	| `control_value` | `float` | Value to set (intervention/counterfactual) |
	| `control_range` | `[min, max, n]` | Sweep range (correlation only) |
	| `target_variable` | `str` | Variable to observe |
	| `hypothesis_text` | `str` | Free-text hypothesis (submit mode) |
	| `hypothesis_equations` | `list[str]` | Structured equations (submit mode) |
	| `confidence` | `float [0,1]` | Self-reported confidence (submit mode) |
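To make the two modes concrete, here is a hedged sketch that builds action payloads as plain dictionaries and checks them against the table above. `validate_action` is a hypothetical helper, not the project's Pydantic model, and which fields are mandatory is an assumption made for illustration; the equation string format is likewise assumed.

```python
from typing import Any

EXPERIMENT_TYPES = {"intervention", "correlation", "counterfactual", "passive"}


def validate_action(action: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    mode = action.get("action_type")
    if mode == "experiment":
        kind = action.get("experiment_type")
        if kind not in EXPERIMENT_TYPES:
            problems.append(f"unknown experiment_type: {kind!r}")
        # Assumed: a single value is needed for interventions and counterfactuals.
        if kind in {"intervention", "counterfactual"} and "control_value" not in action:
            problems.append("control_value is required")
        # Assumed: a correlation sweep needs the full [min, max, n] triple.
        if kind == "correlation" and len(action.get("control_range", [])) != 3:
            problems.append("control_range must be [min, max, n]")
    elif mode == "submit":
        if not action.get("hypothesis_text") and not action.get("hypothesis_equations"):
            problems.append("a hypothesis is required")
        confidence = action.get("confidence")
        if confidence is None or not 0.0 <= confidence <= 1.0:
            problems.append("confidence must be in [0, 1]")
    else:
        problems.append(f"unknown action_type: {mode!r}")
    return problems


sweep = {
    "action_type": "experiment",
    "experiment_type": "correlation",
    "control_variable": "Alpha",
    "control_range": [0.0, 10.0, 5],
    "target_variable": "Beta",
}
submit = {
    "action_type": "submit",
    "hypothesis_text": "Beta = 2.1 * Alpha + 3.0",
    "hypothesis_equations": ["Beta = 2.1 * Alpha + 3.0"],
    "confidence": 0.85,
}
bad = {"action_type": "submit", "hypothesis_text": "Beta rises with Alpha", "confidence": 1.5}

for payload in (sweep, submit, bad):
    print(validate_action(payload))
```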

	## Observation Space

	`HypLabObservation` always contains:
	- `system_message`: Human-readable text the LLM reads
	- `available_variables`: Variable names in this episode
	- `budget_remaining`: Steps left
	- `done`: Whether episode ended
	- `reward`: Step reward

	On experiment steps: `result_value`, `noise_sigma`, `info_gain_reward`, `is_redundant`

	On submit: `accuracy_score`, `precision_bonus`, `calibration_score`, `efficiency_bonus`, `contradiction_penalty`, `total_episode_reward`, `ground_truth_revealed`

	## Causal Rule Types

	The hidden world can contain any of these relationship types:

	| Rule | Formula | Shape |
	|---|---|---|
	| Linear | `y = a*x + b` | Straight line |
	| Threshold | `y = high if x > t else low` | Step function |
	| Inverse | `y = a / x` | Hyperbola |
	| Quadratic | `y = a*x² + b*x + c` | Parabola |
	| Exponential | `y = a * exp(k*x)` | Growth/decay |
	| Logarithmic | `y = a * ln(x) + b` | Diminishing returns |
	| Saturating | `y = Vmax * x / (Km + x)` | Plateau (Michaelis-Menten) |
	| Piecewise-linear | Two slopes with a knot | Regime change |

	Additionally, some effects may depend on two parents via interaction rules
	(additive, multiplicative, min, max), and hidden confounders may inject
	correlated noise the agent cannot explain.
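The rule families above can be written out as plain Python functions. This is a sketch only: the parameter defaults are illustrative, the piecewise-linear form (continuous at the knot, through the origin) is an assumption, and applying the interaction rules to each parent's contribution is also an assumption rather than a description of `causal_world.py`.

```python
import math


# Parameter defaults are illustrative, not the environment's values.
def linear(x, a=2.0, b=1.0):
    return a * x + b


def threshold(x, t=5.0, low=0.0, high=10.0):
    return high if x > t else low


def inverse(x, a=4.0):
    return a / x  # undefined at x = 0


def quadratic(x, a=1.0, b=0.0, c=0.0):
    return a * x**2 + b * x + c


def exponential(x, a=1.0, k=0.1):
    return a * math.exp(k * x)


def logarithmic(x, a=1.0, b=0.0):
    return a * math.log(x) + b  # requires x > 0


def saturating(x, vmax=10.0, km=2.0):
    return vmax * x / (km + x)


def piecewise_linear(x, knot=5.0, slope_left=1.0, slope_right=3.0):
    # Assumed form: two slopes joined continuously at the knot.
    if x <= knot:
        return slope_left * x
    return slope_left * knot + slope_right * (x - knot)


# Two-parent interaction rules, applied here to each parent's contribution.
INTERACTIONS = {
    "additive": lambda u, v: u + v,
    "multiplicative": lambda u, v: u * v,
    "min": min,
    "max": max,
}

for name, combine in INTERACTIONS.items():
    print(name, combine(linear(1.0), saturating(2.0)))
```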

	## Reward Components

	| Signal | Value | What it trains |
	|---|---|---|
	| Information gain | +0.05 to +0.25/step | Designing informative experiments |
	| Redundant experiment | -0.10 | Not wasting budget |
	| Hypothesis accuracy | 0.0 to +1.0 | Getting the right answer |
	| Precision bonus | +0.10 | Quantitative, falsifiable claims |
	| Calibration score | 0.0 to +0.20 | Knowing what you don't know |
	| Efficiency bonus | +0.15 | Submitting early when confident |
	| Contradiction penalty | -0.50 | Contradicting the experimental setup |
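As a worked example of how these signals might combine, the sketch below assumes the episode total is a plain sum of step rewards and submit-time components. That is an assumption: the actual logic lives in `server/rubric.py` and may weight, clip, or normalise the terms differently. `episode_reward` is a hypothetical helper.

```python
def episode_reward(step_rewards, accuracy, precise, calibration, early, contradiction):
    """Sum the reward signals from the table, assuming they simply add."""
    total = sum(step_rewards)                 # information gain and redundancy penalties
    total += accuracy                         # 0.0 to 1.0
    total += 0.10 if precise else 0.0         # precision bonus
    total += calibration                      # 0.0 to 0.20
    total += 0.15 if early else 0.0           # efficiency bonus
    total -= 0.50 if contradiction else 0.0   # contradiction penalty
    return total


# Three informative experiments and one redundant one, then a good submission.
total = episode_reward(
    step_rewards=[0.25, 0.15, -0.10, 0.05],
    accuracy=0.8,
    precise=True,
    calibration=0.15,
    early=True,
    contradiction=False,
)
print(f"Episode total: {total:.2f}")
```

Under this additive assumption the example sums to 1.55, of which the single redundant experiment cost 0.10.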

	## Tasks (3 difficulty levels)

	| Task | Noise | Variables | Budget | Domain | Key Challenge |
	|---|---|---|---|---|---|
	| Easy | 0.05 | 2 | 12 | system_alpha | Single-edge discovery |
	| Medium | 0.20 | 3 | 10 | Random | Multi-edge, noisy signals |
	| Hard | 0.50 | 4 | 8 | Random | Complex graph + interactions, tight budget |

	Each task has a deterministic grader that returns a score in [0.0, 1.0].

	## Design Decisions

	Abstract variable names: Variables are named Alpha, Beta, Gamma (or V1, V2,
	V3, etc.) rather than Temperature, Pressure, Volume. This prevents LLM agents
	from using pretrained knowledge of real-world physics/economics/biology to
	shortcut the reasoning process. The agent must reason purely from experimental
	data.

	Diverse rule types: With 8 single-parent types plus interaction rules, the
	agent cannot memorize a small set of templates. Many rule types look similar in
	narrow ranges (e.g. exponential ≈ linear for small x), forcing the agent to
	design discriminating experiments.
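The "exponential ≈ linear for small x" point can be checked numerically. The sketch below compares an exponential rule with its first-order Taylor expansion; the parameters `a` and `k` are illustrative choices, not values used by the environment.

```python
import math

a, k = 1.0, 0.1  # illustrative parameters


def exponential(x):
    return a * math.exp(k * x)


def linear_lookalike(x):
    # First-order Taylor expansion of the exponential around x = 0.
    return a * (1.0 + k * x)


gaps = {}
for x in (0.5, 5.0, 20.0):
    gaps[x] = exponential(x) - linear_lookalike(x)
    print(f"x = {x:>4}: gap = {gaps[x]:.4f}")
```

At x = 0.5 the two rules differ by roughly 0.001, which would be buried if the Easy task's 0.05 noise figure is read as a standard deviation (an assumption). At x = 20 the gap exceeds 4, so a single intervention at a large control value separates the two hypotheses far more effectively than several near zero.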

	## Deploy to HF Spaces

	```bash
	openenv push --org your-org --token $HF_TOKEN
	```

	## Run Tests

	```bash
	pytest tests/ -v
	```

	## Baseline Scores

	Baseline agent (gpt-4o-mini, temperature=0.3):

	| Task | Score |
	|---|---|
	| Easy | ~0.65 |
	| Medium | ~0.40 |
	| Hard | ~0.25 |
	| Average | ~0.43 |

	To reproduce these scores, run `python baseline_inference.py` with the same model and seed. The values are approximate, so expect small run-to-run variation.