Ev3Dev committed 5c3cfae (verified · 1 parent: 4db0438): Upload folder using huggingface_hub
README.md CHANGED
@@ -21,7 +21,7 @@ The environment is designed as a partially observable Markov decision process (P
21
  - visible task metadata, resource usage, step history, and intermediate outputs
22
  - dense step-wise reward plus terminal reward for conclusion quality
23
 
24
- ## What "how it works" means here
25
 
26
  At a high level, each episode looks like this:
27
 
@@ -68,11 +68,15 @@ This separation is what makes the environment a POMDP rather than a fully observ
68
 
69
  Defines the contracts that all other modules use:
70
 
71
- - `ExperimentAction`: one structured step chosen by the agent
72
- - `ExperimentObservation`: what the agent can see after each step
73
- - `TaskSpec`: the problem statement, budget, time limit, assays, tools, and expected findings
74
- - `IntermediateOutput`: the simulated artifact returned by a step
75
- - `ConclusionClaim`: structured claims used for final synthesis
76
 
77
  The action vocabulary is intentionally broad enough to mix wet-lab, computational, and meta-planning actions.
78
 
@@ -80,27 +84,30 @@ The action vocabulary is intentionally broad enough to mix wet-lab, computationa
80
 
81
  This is where episodes come from.
82
 
83
- - `scenarios.py` defines a small library of curated biological scenarios
84
- - `generator.py` turns a scenario into a `(TaskSpec, FullLatentState)` pair
85
- - optional domain randomization perturbs budget, time, noise, batch effects, cell proportions, and effect sizes
86
 
87
- Right now the scenario library includes:
88
 
89
- - `cardiac_disease_de`: disease vs healthy differential expression in heart tissue
90
- - `hematopoiesis_trajectory`: developmental trajectory inference in bone marrow
91
- - `perturbation_immune`: treatment response under JAK inhibition
92
- - `biomarker_validation_lung`: follow-up validation of `SPP1` in IPF
93
 
94
  ### `server/simulator/`
95
 
96
  This is the simulator itself.
97
 
98
- - `latent_state.py` defines hidden biological, technical, progress, and resource state
99
- - `noise.py` centralizes stochasticity so episodes are reproducible from a seed
100
- - `output_generator.py` turns an action plus hidden state into a realistic `IntermediateOutput`
101
- - `transition.py` applies action costs, updates progress flags, propagates artifacts, and decides whether the episode is done
102
 
103
- The output generator does not simply echo the action. It conditions outputs on the hidden state, then injects realistic noise such as dropout, false positives, false negatives, and imperfect clustering.
104
 
105
  ### `server/rules/engine.py`
106
 
@@ -109,32 +116,41 @@ The rule engine enforces scientific and procedural constraints before each actio
109
  - hard violations block the action entirely
110
  - soft violations allow the action, but reduce output quality and add reward penalties
111
 
112
- Examples:
113
 
114
- - sequencing before library prep is a hard violation
115
- - running QC twice is a soft redundancy violation
116
- - making causal claims without enough evidence is a soft validity violation
 
117
 
118
  ### `server/rewards/reward.py`
119
 
120
  Rewards are decomposed rather than being a single opaque number.
121
 
122
- Per-step reward includes:
123
 
124
- - validity
125
- - ordering
126
- - information gain
127
- - efficiency
128
- - novelty
129
- - penalties
130
- - potential-based shaping
131
 
132
  Terminal reward adds:
133
 
134
- - pipeline completeness
135
- - calibration of conclusions against hidden truth
136
- - remaining budget and time efficiency
137
- - overconfidence penalties for strong but incorrect claims
138
 
139
  This makes the environment easier to debug, benchmark, and train against.
140
 
@@ -145,7 +161,7 @@ This is the orchestration layer that ties everything together.
145
  On `reset()` it:
146
 
147
  - seeds the noise model
148
- - generates a task and latent state
149
  - clears history, outputs, discoveries, conclusions, and cumulative reward
150
 
151
  On `step()` it:
@@ -171,12 +187,34 @@ Here is the concrete order of operations for `env.step(action)`:
171
  5. Otherwise deduct budget and time based on `ACTION_COSTS`.
172
  6. Update latent progress flags like `samples_collected`, `qc_performed`, or `de_performed`.
173
  7. Generate a structured simulated output for the chosen action.
174
- 8. If there were soft violations, degrade output quality and attach warnings.
175
  9. Propagate artifacts back into latent state, such as discovered DE genes or cluster names.
176
  10. Compute decomposed reward from state transition plus output quality.
177
  11. If the episode is ending, compute terminal reward from completeness and conclusion calibration.
178
  12. Return an observation that exposes the visible summary but not the hidden truth.
179
 
180
  ## Typical successful pipeline
181
 
182
  Most scenarios reward a sensible experiment order similar to:
@@ -193,12 +231,47 @@ Most scenarios reward a sensible experiment order similar to:
193
  `regulatory_network_inference`, `marker_selection`, `validate_marker`
194
  9. `synthesize_conclusion`
195
 
196
- The exact best sequence depends on the scenario. For example:
197
 
198
  - trajectory scenarios benefit from `trajectory_analysis` and regulatory inference
199
  - biomarker scenarios benefit from DE, marker selection, and validation
200
  - perturbation scenarios benefit from pathway-level interpretation
201
 
202
  ## Interfaces you can use
203
 
204
  ### 1. In-process environment
@@ -224,6 +297,10 @@ print(obs.latest_output.summary if obs.latest_output else "No output yet")
224
  print(obs.reward)
225
  ```
226
 
227
  ### 2. OpenEnv client/server mode
228
 
229
  Use the FastAPI app when you want to serve the environment over HTTP and WebSocket:
@@ -233,6 +310,16 @@ uv sync --extra dev
233
  uv run uvicorn server.app:app --reload
234
  ```
235
 
236
  Then connect with the client:
237
 
238
  ```python
@@ -247,40 +334,133 @@ with BioExperimentEnv(base_url="http://localhost:8000") as env:
247
 
248
  The environment class supports concurrent sessions, but the bundled server is currently configured with `max_concurrent_envs=1` in `server/app.py`.
249
 
250
- ### 3. Gymnasium wrapper
251
 
252
- Use `training/gym_wrapper.py` when you want a classic RL interface:
253
 
254
- ```python
255
- from training.gym_wrapper import BioExperimentGymEnv
256
-
257
- env = BioExperimentGymEnv()
258
- obs, info = env.reset()
259
- obs, reward, terminated, truncated, info = env.step({
260
- "action_type": 0,
261
- "confidence": 0.7,
262
- })
263
  ```
264
 
265
- This wrapper vectorizes the structured observation into arrays and reduces the action interface to:
266
 
267
- - a discrete action type index
268
- - a scalar confidence value
269
 
270
- ### 4. Benchmark and scripted agents
271
 
272
  - `training/literature_benchmark.py` runs paper-aligned action sequences and compares outcomes against curated expected findings
273
  - `run_agent.py` runs a local language model planner against the environment
274
  - `training/trajectory.py` stores trajectories for offline RL, imitation learning, replay, and evaluation
275
  - `training/evaluation.py` computes online, benchmark, expert-review, and fidelity-oriented metrics
276
 
277
- ## Episode termination
278
 
279
- An episode ends when one of the following happens:
280
 
281
- - the agent chooses `synthesize_conclusion`
282
- - resources are exhausted
283
- - the environment reaches `MAX_STEPS` which is currently `30`
284
 
285
  ## Why this is useful
286
 
@@ -299,31 +479,51 @@ That makes it suitable for:
299
  - literature-grounded evaluation
300
  - comparing structured policies against LLM-driven planners
301
 
302
- ## Minimal project map
303
 
304
  ```text
305
  .
306
- ├── client.py # OpenEnv client
307
  ├── models.py # Shared action / observation / task schemas
308
  ├── server/
309
- │ ├── app.py # FastAPI/OpenEnv server
 
310
  │ ├── hackathon_environment.py # Main environment orchestration
311
- │ ├── rewards/ # Reward model
312
- │ ├── rules/ # Constraint checking
313
- │ ├── simulator/ # Latent state, noise, outputs, transitions
314
- │ ├── tasks/ # Scenario library and task generation
315
  ├── training/
316
- │ ├── evaluation.py # Metrics
317
- │ ├── gym_wrapper.py # Gymnasium wrapper
318
  │ ├── literature_benchmark.py # Paper-backed benchmark flow
319
- │ └── trajectory.py # Trajectory serialization
320
- └── tests/ # Unit and integration tests
321
  ```
322
 
323
  ## Quick sanity check
324
 
325
- The current implementation was sanity-checked with:
326
-
327
  ```bash
328
  uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q
329
  ```
@@ -335,3 +535,9 @@ Those tests verify:
335
  - conclusion termination
336
  - literature-backed scenario selection
337
  - benchmark matching for curated expected findings
21
  - visible task metadata, resource usage, step history, and intermediate outputs
22
  - dense step-wise reward plus terminal reward for conclusion quality
23
 
24
+ ## How it works
25
 
26
  At a high level, each episode looks like this:
27
 
 
68
 
69
  Defines the contracts that all other modules use:
70
 
71
+ - `ActionType`: 21 discrete experiment steps, grouped into three frozensets — `WET_LAB_ACTIONS` (8), `COMPUTATIONAL_ACTIONS` (10), and `META_ACTIONS` (3)
72
+ - `SubagentType`: 9 sub-agent delegate roles (e.g. `wet_lab_planner`, `computational_analyst`, `causal_reasoning_agent`)
73
+ - `ExperimentAction`: one structured step chosen by the agent; fields include `action_type`, `method`, `parameters`, `justification`, `confidence` (clamped to `[0, 1]`), `invoked_subagent`, `tool_call_spec`, `input_targets`
74
+ - `ExperimentObservation`: what the agent can see after each step; includes `task`, `pipeline_history`, `resource_usage`, `latest_output`, `all_outputs`, `discovered_markers`, `candidate_mechanisms`, `conclusions`, `rule_violations`, `step_reward_breakdown`
75
+ - `TaskSpec`: the problem statement, organism, tissue, conditions, budget, time limit, assays, tools, paper references, and expected findings
76
+ - `IntermediateOutput`: the simulated artifact returned by a step; carries `output_type`, `success`, `quality_score`, `summary`, `data`, `uncertainty`, `warnings`, `artifacts_available`
77
+ - `ConclusionClaim`: structured claims used for final synthesis; carries `claim`, `evidence_steps`, `confidence`, `claim_type`, `supporting_data`
78
+ - `PipelineStepRecord`: compact observable record of one past step stored in history
79
+ - `ResourceUsage`: budget and time tracking visible to the agent
80
 
81
  The action vocabulary is intentionally broad enough to mix wet-lab, computational, and meta-planning actions.
82
 
 
84
 
85
  This is where episodes come from.
86
 
87
+ - `scenarios.py` defines a curated library of four biological scenarios as `Scenario` dataclass objects, each bundling a `TaskSpec`, a `LatentBiologicalState`, a `TechnicalState`, hidden failure conditions, and tags
88
+ - `generator.py` turns a scenario into a `(TaskSpec, FullLatentState)` pair via `TaskGenerator.generate()`; optional domain randomisation perturbs budget (±30%), time (±20%), technical noise, batch effects, cell proportions, and effect sizes
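The randomisation ranges above can be sketched as follows; the helper name `randomise_task` is illustrative, and only the ±30% / ±20% ranges and the single seeded generator come from the text:

```python
import numpy as np

def randomise_task(budget: float, time_days: float, seed: int):
    """Sketch of the documented perturbations: budget +/-30%, time +/-20%.
    All randomness flows through one seeded generator, so the same seed
    reproduces the same perturbed task."""
    rng = np.random.default_rng(seed)
    new_budget = budget * rng.uniform(0.7, 1.3)
    new_time = time_days * rng.uniform(0.8, 1.2)
    return new_budget, new_time
```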
 
89
 
90
+ The four scenarios are:
91
 
92
+ | Name | Difficulty | Tissue | Problem | Budget | Time |
93
+ |---|---|---|---|---|---|
94
+ | `cardiac_disease_de` | easy | heart | Differential expression between healthy and dilated cardiomyopathy cardiomyocytes | $80 K | 120 days |
95
+ | `hematopoiesis_trajectory` | medium | bone marrow | Infer HSC → mature lineage trajectory with three branches | $100 K | 150 days |
96
+ | `perturbation_immune` | hard | synovial fluid | JAK inhibitor effect on T-cell states in rheumatoid arthritis | $120 K | 180 days |
97
+ | `biomarker_validation_lung` | medium | lung | Validate SPP1 as biomarker for pro-fibrotic macrophages in IPF | $90 K | 150 days |
98
+
99
+ Each scenario carries paper references with DOIs, true DE genes with log2FC values, true pathway activities, true regulatory networks, and ground-truth causal mechanisms used for terminal reward calibration.
100
 
101
  ### `server/simulator/`
102
 
103
  This is the simulator itself.
104
 
105
+ - `latent_state.py` defines `FullLatentState`, the root aggregate of all hidden state. Key sub-structures are `LatentBiologicalState` (true DE genes, pathways, gene programs, trajectory, regulatory network, markers, causal mechanisms), `TechnicalState` (dropout, doublets, ambient RNA, sample quality), `ExperimentProgress` (18 boolean milestone flags plus counts), and `ResourceState` (internal budget and time tracking with exhaustion properties)
106
+ - `noise.py` centralises stochasticity in `NoiseModel`. All randomness flows through a single seeded `numpy.Generator`. Methods include `add_expression_noise`, `sample_effect_sizes`, `sample_p_values`, `generate_false_positives`, `generate_false_negatives`, `quality_degradation`, `sample_qc_metric`, `sample_cluster_count`, `shuffle_ranking`, and `coin_flip`
107
+ - `output_generator.py` turns an action plus hidden state into a realistic `IntermediateOutput`. Every action type has a dedicated handler conditioned on the latent state; noise is then injected — dropout in expression data, false positives and false negatives in DE and marker results, over/under-clustering, and pathway contamination
108
+ - `transition.py` applies action costs from `ACTION_COSTS`, updates progress flags, calls the output generator, degrades quality on soft violations, propagates discovered DE genes and cluster names back into latent state, and decides whether the episode is done
109
 
110
+ The output generator does not simply echo the action. It conditions outputs on the hidden state, then injects realistic noise.
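The seeded-noise idea can be sketched as below. `NoiseModelSketch` is a simplified illustration, not the real `NoiseModel`, though the method names (`coin_flip`, `generate_false_positives`) follow the list above:

```python
import numpy as np

class NoiseModelSketch:
    """Sketch of centralised stochasticity: one seeded numpy Generator
    drives everything, so the same seed reproduces the same episode noise."""

    def __init__(self, seed: int):
        self.rng = np.random.default_rng(seed)

    def coin_flip(self, p: float) -> bool:
        return bool(self.rng.random() < p)

    def generate_false_positives(self, true_genes, candidate_pool, rate=0.1):
        # Pad the true DE hits with spurious genes drawn from the rest
        # of the pool, at a configurable false-positive rate.
        extras = [g for g in candidate_pool if g not in true_genes]
        n_fp = min(int(len(true_genes) * rate), len(extras))
        fp = list(self.rng.choice(extras, size=n_fp, replace=False)) if n_fp else []
        return list(true_genes) + fp
```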
111
 
112
  ### `server/rules/engine.py`
113
 
 
116
  - hard violations block the action entirely
117
  - soft violations allow the action, but reduce output quality and add reward penalties
118
 
119
+ The four rule families are:
120
 
121
+ 1. **Prerequisites (HARD)** — each computational step requires the appropriate upstream milestone flag. For example: `normalize_data` requires `data_filtered`, `differential_expression` requires `data_normalized`, `validate_marker` requires `markers_discovered`
122
+ 2. **Resource constraints (HARD/SOFT)** — budget or time exhausted is a hard block; action cost exceeding remaining budget (when budget > 0) is a soft warning
123
+ 3. **Redundancy (SOFT)** — repeating an already-completed step such as `run_qc` or `normalize_data`
124
+ 4. **Causal validity (SOFT)** — synthesizing conclusions without prior DE or clustering; making causal claims without validation evidence; pathway enrichment before DE
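A sketch of how such checks might look. `check_action` and its return shape are hypothetical; the prerequisite pairs and rule families come from the list above:

```python
def check_action(action_type, progress, completed, budget):
    """Sketch of the documented rule families.
    Returns (hard_violations, soft_violations)."""
    hard, soft = [], []
    # Family 1: hard prerequisite flags (example pairs from the README).
    prerequisites = {
        "normalize_data": "data_filtered",
        "differential_expression": "data_normalized",
        "validate_marker": "markers_discovered",
        "sequence_cells": "library_prepared",
    }
    flag = prerequisites.get(action_type)
    if flag and not progress.get(flag, False):
        hard.append(f"{action_type} requires {flag}")
    # Family 2: exhausted resources are a hard block.
    if budget <= 0:
        hard.append("budget exhausted")
    # Family 3: repeating a completed step is a soft redundancy violation.
    if action_type in completed and action_type in {"run_qc", "normalize_data"}:
        soft.append(f"redundant repeat of {action_type}")
    return hard, soft
```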
125
 
126
  ### `server/rewards/reward.py`
127
 
128
  Rewards are decomposed rather than being a single opaque number.
129
 
130
+ Per-step reward formula:
131
 
132
+ ```
133
+ R_t = r_validity + r_ordering + r_info_gain + r_efficiency + r_novelty + r_penalty + γ[φ(s_{t+1}) − φ(s_t)]
134
+ ```
135
+
136
+ | Component | Weight | Description |
137
+ |---|---|---|
138
+ | `validity` | 0.3 | `1.0` if output succeeded, `−1.0` if hard violation |
139
+ | `ordering` | 0.2 | `1.0` if natural next step, `0.3` otherwise |
140
+ | `info_gain` | 0.4 | `quality_score × (1 − uncertainty)` |
141
+ | `efficiency` | 0.3 | `max(0, 1 − 5 × budget_fraction_used)` |
142
+ | `novelty` | +0.1 | Bonus when no soft violations |
143
+ | `penalty` | −0.15/violation | Per soft violation |
144
+ | `shaping` | γ = 0.99 | Potential-based over 12 progress milestones |
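Assuming the table weights multiply the listed component values, the per-step formula can be sketched as:

```python
def step_reward(success, natural_order, quality, uncertainty,
                budget_fraction_used, n_soft_violations,
                phi_next, phi_prev, gamma=0.99):
    """Sketch of R_t using the component weights from the table above
    (the weight-times-value interpretation is an assumption)."""
    r_validity = 0.3 * (1.0 if success else -1.0)
    r_ordering = 0.2 * (1.0 if natural_order else 0.3)
    r_info_gain = 0.4 * quality * (1.0 - uncertainty)
    r_efficiency = 0.3 * max(0.0, 1.0 - 5.0 * budget_fraction_used)
    r_novelty = 0.1 if n_soft_violations == 0 else 0.0
    r_penalty = -0.15 * n_soft_violations
    shaping = gamma * (phi_next - phi_prev)  # potential-based term
    return (r_validity + r_ordering + r_info_gain + r_efficiency
            + r_novelty + r_penalty + shaping)
```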
145
 
146
  Terminal reward adds:
147
 
148
+ | Component | Weight | Description |
149
+ |---|---|---|
150
+ | Pipeline completeness | 3.0 | Fraction of 7 core milestones completed |
151
+ | Calibration | 4.0 | How well conclusions match hidden markers and mechanisms |
152
+ | Budget + time efficiency | 1.0 | Average fraction of budget and time remaining |
153
+ | Overconfidence penalty | −0.5/claim | For high-confidence claims (`> 0.8`) that are wrong |

154
 
155
  This makes the environment easier to debug, benchmark, and train against.
156
 
 
161
  On `reset()` it:
162
 
163
  - seeds the noise model
164
+ - generates a task and latent state via `TaskGenerator`
165
  - clears history, outputs, discoveries, conclusions, and cumulative reward
166
 
167
  On `step()` it:
 
187
  5. Otherwise deduct budget and time based on `ACTION_COSTS`.
188
  6. Update latent progress flags like `samples_collected`, `qc_performed`, or `de_performed`.
189
  7. Generate a structured simulated output for the chosen action.
190
+ 8. If there were soft violations, degrade output quality (×0.5) and attach warnings.
191
  9. Propagate artifacts back into latent state, such as discovered DE genes or cluster names.
192
  10. Compute decomposed reward from state transition plus output quality.
193
  11. If the episode is ending, compute terminal reward from completeness and conclusion calibration.
194
  12. Return an observation that exposes the visible summary but not the hidden truth.
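The steps above can be condensed into a control-flow skeleton. The collaborator functions are injected stubs, not the real implementations; the ×0.5 quality degradation on soft violations is from the documented step 8:

```python
def step_sketch(action, state, check_rules, costs, generate_output, compute_reward):
    """Skeleton of the documented step() flow; collaborators are injected
    so the control flow itself is what this sketch demonstrates."""
    hard, soft = check_rules(action, state)
    if hard:  # hard violations block the action entirely
        return {"blocked": True, "violations": hard, "reward": -0.3}  # validity weight x -1
    budget_cost, time_cost = costs[action]
    state["budget"] -= budget_cost          # deduct resources
    state["time"] -= time_cost
    output = generate_output(action, state)  # simulated artifact
    if soft:                                 # soft violations degrade quality
        output["quality"] *= 0.5
        output["warnings"] = soft
    reward = compute_reward(state, output)
    done = (state["budget"] <= 0 or state["time"] <= 0
            or action == "synthesize_conclusion")
    return {"blocked": False, "output": output, "reward": reward, "done": done}

# Demo with stub collaborators: a soft redundancy violation halves quality.
demo = step_sketch(
    "run_qc", {"budget": 1000, "time": 10},
    lambda a, s: ([], ["redundant run_qc"]),
    {"run_qc": (100, 0.5)},
    lambda a, s: {"quality": 1.0},
    lambda s, o: o["quality"],
)
```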
195
 
196
+ ## Action costs
197
+
198
+ Each action deducts from the episode's budget and time. Computational steps also accrue compute hours.
199
+
200
+ | Action | Budget | Time (days) |
201
+ |---|---|---|
202
+ | `sequence_cells` | $15,000 | 5 |
203
+ | `prepare_library` | $8,000 | 3 |
204
+ | `collect_sample` | $5,000 | 7 |
205
+ | `validate_marker` | $5,000 | 14 |
206
+ | `culture_cells` | $3,000 | 14 |
207
+ | `perturb_gene` | $2,000 | 3 |
208
+ | `perturb_compound` | $1,000 | 2 |
209
+ | `select_cohort` | $500 | 1 |
210
+ | `run_qc` | $100 | 0.5 |
211
+ | `integrate_batches` | $300 | 1 |
212
+ | `regulatory_network_inference` | $200 | 1 |
213
+ | `cluster_cells` | $150 | 0.5 |
214
+ | `differential_expression`, `trajectory_analysis`, `pathway_enrichment` | $100–200 | 0.5–1 |
215
+ | `filter_data`, `normalize_data`, `marker_selection` | $50–100 | 0.25–0.5 |
216
+ | `synthesize_conclusion`, `design_followup_experiment`, `request_subagent_review` | $0 | 0.25–0.5 |
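A few rows of the table, transcribed into the dictionary-lookup form the text implies. `ACTION_COSTS_SKETCH` and `apply_cost` are illustrative names, not the real `ACTION_COSTS` contents:

```python
# A handful of entries from the cost table above: (budget in USD, time in days).
ACTION_COSTS_SKETCH = {
    "sequence_cells": (15_000, 5.0),
    "prepare_library": (8_000, 3.0),
    "collect_sample": (5_000, 7.0),
    "run_qc": (100, 0.5),
    "synthesize_conclusion": (0, 0.25),
}

def apply_cost(budget, time_days, action):
    """Deduct an action's cost from the remaining budget and time."""
    cost, days = ACTION_COSTS_SKETCH[action]
    return budget - cost, time_days - days
```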
217
+
218
  ## Typical successful pipeline
219
 
220
  Most scenarios reward a sensible experiment order similar to:
 
231
  `regulatory_network_inference`, `marker_selection`, `validate_marker`
232
  9. `synthesize_conclusion`
233
 
234
+ The exact best sequence depends on the scenario:
235
 
236
  - trajectory scenarios benefit from `trajectory_analysis` and regulatory inference
237
  - biomarker scenarios benefit from DE, marker selection, and validation
238
  - perturbation scenarios benefit from pathway-level interpretation
239
 
240
+ ## Episode termination
241
+
242
+ An episode ends when one of the following happens:
243
+
244
+ - the agent chooses `synthesize_conclusion`
245
+ - resources are exhausted
246
+ - the environment reaches `MAX_STEPS` which is currently `30`
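The three conditions can be sketched as a single predicate (hypothetical helper; only the conditions themselves and `MAX_STEPS = 30` come from the text):

```python
MAX_STEPS = 30  # documented cap

def episode_done(action_type, budget, time_days, step_count):
    """Sketch of the three documented termination conditions."""
    return (
        action_type == "synthesize_conclusion"  # agent concludes
        or budget <= 0 or time_days <= 0        # resources exhausted
        or step_count >= MAX_STEPS              # step cap reached
    )
```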
247
+
248
+ ## Installation
249
+
250
+ Dependencies are managed with `uv`. The package requires Python ≥ 3.10.
251
+
252
+ ```bash
253
+ # Core environment only
254
+ uv sync
255
+
256
+ # With dev/test tools
257
+ uv sync --extra dev
258
+
259
+ # With training dependencies (TRL, transformers, torch)
260
+ uv sync --extra train
261
+
262
+ # With bioinformatics extras (scanpy, biopython, gseapy)
263
+ uv sync --extra bio
264
+ ```
265
+
266
+ Key dependency groups from `pyproject.toml`:
267
+
268
+ | Group | Key packages |
269
+ |---|---|
270
+ | core | `openenv-core[core]>=0.2.0`, `numpy`, `scipy`, `pydantic>=2.0` |
271
+ | train | `trl>=0.29`, `transformers>=5.3`, `accelerate`, `datasets`, `torch`, `matplotlib` |
272
+ | bio | `scanpy`, `biopython`, `gseapy` |
273
+ | dev | `pytest`, `pytest-cov` |
274
+
275
  ## Interfaces you can use
276
 
277
  ### 1. In-process environment
 
297
  print(obs.reward)
298
  ```
299
 
300
+ The constructor accepts:
301
+ - `scenario_name: Optional[str]` — pin to a specific scenario; `None` picks randomly each episode
302
+ - `domain_randomise: bool = True` — perturbs scenario parameters for generalization
303
+
304
  ### 2. OpenEnv client/server mode
305
 
306
  Use the FastAPI app when you want to serve the environment over HTTP and WebSocket:
 
310
  uv run uvicorn server.app:app --reload
311
  ```
312
 
313
+ The server exposes five endpoints:
314
+
315
+ | Method | Path | Description |
316
+ |---|---|---|
317
+ | `POST` | `/reset` | Start a new episode |
318
+ | `POST` | `/step` | Execute one action |
319
+ | `GET` | `/state` | Current environment state |
320
+ | `GET` | `/schema` | Action/observation JSON schemas |
321
+ | `WS` | `/ws` | WebSocket for persistent sessions |
322
+
323
  Then connect with the client:
324
 
325
  ```python
 
334
 
335
  The environment class supports concurrent sessions, but the bundled server is currently configured with `max_concurrent_envs=1` in `server/app.py`.
336
 
337
+ ### 3. Running a local agent
338
 
339
+ `run_agent.py` runs a single interactive episode using a local Hugging Face model:
340
 
341
+ ```bash
342
+ uv run python run_agent.py
343
  ```
344
 
345
+ Configuration is via environment variables:
346
+
347
+ | Variable | Default | Description |
348
+ |---|---|---|
349
+ | `RUN_AGENT_USE_PIPELINE` | `0` | Use HF `pipeline()` path instead of direct generate |
350
+ | `RUN_AGENT_MAX_EPISODE_STEPS` | `12` | Maximum number of planning steps |
351
+
352
+ The local model defaults to `Qwen/Qwen3.5-0.8B` with sampling parameters `temperature=0.7`, `top_p=0.8`, `top_k=20`, `repetition_penalty=1.3`. The episode runs up to `MAX_EPISODE_STEPS = 12` steps. When action parsing fails, the script falls back to an observation-aware action that respects prerequisites.
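The parse-with-fallback behaviour can be sketched as follows; `parse_action` is a hypothetical helper, not the script's actual function:

```python
import json

def parse_action(raw_text, fallback_action):
    """Sketch of parse-with-fallback: try to read the model's JSON action;
    on any parse failure, return an observation-aware fallback instead."""
    try:
        action = json.loads(raw_text)
        if "action_type" not in action:
            raise ValueError("missing action_type")
        return action
    except (json.JSONDecodeError, ValueError, TypeError):
        return fallback_action
```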
353
+
354
+ PowerShell note: older PowerShell versions do not support `&&`. Run commands from the target directory directly, or use `;` as the command separator.
355
+
356
+ Windows runtime warnings:
357
+ - If you see HuggingFace symlink-cache warnings, functionality is unaffected; optionally set `HF_HUB_DISABLE_SYMLINKS_WARNING=1`.
358
+ - If you see flash attention / causal-conv fallback warnings, execution continues with a slower PyTorch path.
359
 
360
+ ### 4. GRPO training
361
+
362
+ `training_script.py` follows the TRL GRPO pattern and uses OpenEnv rewards to score generated action JSON against this environment.
363
+
364
+ ```bash
365
+ uv sync --extra train
366
+ uv run python training_script.py --dry-run
367
+ uv run python training_script.py --model-id Qwen/Qwen3.5-0.8B
368
+ ```
369
 
370
+ Key arguments:
371
+
372
+ | Argument | Default | Description |
373
+ |---|---|---|
374
+ | `--model-id` | `Qwen/Qwen2.5-7B-Instruct` | Base model to fine-tune |
375
+ | `--output-dir` | `training/grpo-output` | Save directory |
376
+ | `--dataset-episodes` | `8` | Rollout episodes for prompt dataset |
377
+ | `--rollout-steps` | `6` | Steps per episode during collection |
378
+ | `--collection-policy` | `heuristic` | `random` or `heuristic` |
379
+ | `--reward-backend` | `local` | `local` (in-process) or `remote` (live server) |
380
+ | `--base-url` | `http://localhost:8000` | Server URL for remote backend |
381
+ | `--scenario-name` | all | Repeatable; restricts which scenarios are used |
382
+ | `--domain-randomise` | off | Enable domain randomisation |
383
+ | `--num-generations` | `4` | GRPO generations per prompt |
384
+ | `--max-completion-length` | `220` | Max tokens for model completions |
385
+ | `--max-prompt-length` | `768` | Max tokens for prompts |
386
+ | `--learning-rate` | `5e-6` | AdamW learning rate |
387
+ | `--dry-run` | off | Build data and test reward without training |
388
+
389
+ By default the reward function reconstructs prompt states locally so the prompt and reward stay aligned. Switch to a live server-backed reward loop with `--reward-backend remote --base-url http://localhost:8000`.
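A local reward backend can be sketched in the callable shape TRL's GRPO trainer accepts, returning one float per completion. The validity scoring below is purely illustrative; the real script scores generated actions against the environment:

```python
import json

def env_reward(completions, **kwargs):
    """Sketch of a GRPO reward function: score each generated completion,
    here by whether it parses as an action JSON with a known action_type."""
    valid_actions = {"run_qc", "normalize_data", "differential_expression"}
    rewards = []
    for text in completions:
        try:
            action = json.loads(text)
            rewards.append(1.0 if action.get("action_type") in valid_actions else 0.2)
        except (json.JSONDecodeError, AttributeError, TypeError):
            rewards.append(-1.0)  # unparseable completions are penalised
    return rewards
```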
390
+
391
+ After training, the script saves plots to the output directory:
392
+
393
+ - `training_loss.png`
394
+ - `training_reward.png`
395
+ - `training_metric.png`
396
+ - `training_dashboard.png`
397
+ - `training_plot_manifest.json`
398
+
399
+ Use `--plot-metric-key <logged_key>` to force a specific extra metric on the third chart; otherwise the script auto-selects a useful logged metric such as KL or gradient norm.
400
+
401
+ ### 5. Rollout collection
402
+
403
+ `training/rollout_collection.py` collects direct environment rollouts into trajectory files:
404
+
405
+ ```bash
406
+ uv run python -m training.rollout_collection
407
+ ```
408
+
409
+ This runs N episodes with a `random` or `heuristic` policy, saves JSON trajectories, and prints evaluation metrics.
410
+
411
+ ### 6. Benchmark and scripted agents
412
 
413
  - `training/literature_benchmark.py` runs paper-aligned action sequences and compares outcomes against curated expected findings
414
+ - `training/rollout_collection.py` collects direct environment rollouts into trajectory files
415
+ - `training_script.py` trains a GRPO policy with OpenEnv reward calls
416
  - `run_agent.py` runs a local language model planner against the environment
417
  - `training/trajectory.py` stores trajectories for offline RL, imitation learning, replay, and evaluation
418
  - `training/evaluation.py` computes online, benchmark, expert-review, and fidelity-oriented metrics
419
 
420
+ ## Training utilities
421
 
422
+ ### `training/trajectory.py`
423
 
424
+ Provides `TrajectoryStep`, `Trajectory`, and `TrajectoryDataset` for episode serialization.
425
+
426
+ - `TrajectoryStep` stores `action`, `observation`, `reward`, `done`, `reward_breakdown`, and an optional `latent_snapshot`
427
+ - `Trajectory` accumulates steps with `add_step()`, computes `total_reward`, and exposes `save(path)` / `load(path)`
428
+ - `TrajectoryDataset` wraps a list of trajectories with `filter_successful()`, `save_dir()`, `load_dir()`, and `summary()` (n, success_rate, mean_reward, mean_length, max/min reward)
429
+
430
+ ### `training/evaluation.py`
431
+
432
+ `EvaluationSuite` is a stateless class with four families of `@staticmethod` methods:
433
+
434
+ | Family | Method | Metrics |
435
+ |---|---|---|
436
+ | Online RL | `online_metrics(trajectories)` | `mean_return`, `median_return`, `std_return`, `mean_episode_length`, `success_rate` |
437
+ | Offline benchmark | `benchmark_metrics(dataset)` | `pipeline_validity_rate`, `ordering_score`, `action_diversity`, `mean_conclusion_confidence` |
438
+ | Expert review | `expert_review_metrics(...)` | Placeholder; averages provided scores |
439
+ | Simulator fidelity | `simulator_fidelity_metrics(sim, real)` | `reward_distribution_gap` |
440
+
441
+ ### `training/literature_benchmark.py`
442
+
443
+ `run_paper_benchmark(problem_statement, scenario_name, domain_randomise)` runs a paper-aligned action pipeline and scores against `expected_findings` using keyword matching. Returns a `PaperBenchmarkResult` with `match_ratio`.
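The keyword-matching idea can be sketched as follows; `match_ratio` here is a hypothetical reduction of the scoring, where the real benchmark compares structured outcomes against curated `expected_findings`:

```python
def match_ratio(expected_findings, conclusion_text):
    """Sketch of keyword-based benchmark scoring: the fraction of expected
    findings whose keywords all appear in the final conclusion text."""
    text = conclusion_text.lower()
    hits = sum(
        1 for keywords in expected_findings
        if all(k.lower() in text for k in keywords)
    )
    return hits / len(expected_findings) if expected_findings else 0.0
```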
444
+
445
+ ## Docker deployment
446
+
447
+ The server ships with a `server/Dockerfile`. It uses a multi-stage build based on `openenv-base`, installs dependencies via `uv`, and starts `uvicorn server.app:app` on port 8000.
448
+
449
+ ```bash
450
+ docker build -f server/Dockerfile -t bio-experiment-env .
451
+ docker run -p 8000:8000 bio-experiment-env
452
+ ```
453
+
454
+ The `openenv.yaml` file configures the deployment for the OpenEnv platform:
455
+
456
+ ```yaml
457
+ spec_version: 1
458
+ name: hackathon
459
+ type: space
460
+ runtime: fastapi
461
+ app: server.app:app
462
+ port: 8000
463
+ ```
464
 
465
  ## Why this is useful
466
 
 
479
  - literature-grounded evaluation
480
  - comparing structured policies against LLM-driven planners
481
 
482
+ ## Project map
483
 
484
  ```text
485
  .
486
+ ├── client.py # OpenEnv HTTP/WebSocket client
487
  ├── models.py # Shared action / observation / task schemas
488
+ ├── openenv.yaml # OpenEnv platform deployment config
489
+ ├── pyproject.toml # Package metadata and dependency groups
490
+ ├── run_agent.py # Single-episode interactive agent runner
491
  ├── server/
492
+ │ ├── app.py # FastAPI/OpenEnv server entry point
493
+ │ ├── Dockerfile # Multi-stage Docker build
494
  │ ├── hackathon_environment.py # Main environment orchestration
495
+ │ ├── requirements.txt # Minimal server dependencies
496
+ │ ├── rewards/
497
+ │ │ └── reward.py # Decomposed reward model
498
+ │ ├── rules/
499
+ │ │ └── engine.py # Biological constraint checking
500
+ │ ├── simulator/
501
+ │ │ ├── latent_state.py # Hidden biological, technical, progress, resource state
502
+ │ │ ├── noise.py # Seeded stochastic noise model
503
+ │ │ ├── output_generator.py # Per-action simulated output generation
504
+ │ │ └── transition.py # State transition engine and ACTION_COSTS table
505
+ │ ├── subagents/ # Placeholder for future sub-agent integration
506
+ │ └── tasks/
507
+ │ ├── generator.py # TaskGenerator with domain randomisation
508
+ │ └── scenarios.py # SCENARIO_LIBRARY with 4 curated scenarios
509
  ├── training/
510
+ │ ├── evaluation.py # EvaluationSuite metrics
 
511
  │ ├── literature_benchmark.py # Paper-backed benchmark flow
512
+ │ ├── rollout_collection.py # Direct rollout collection helper
513
+ │ └── trajectory.py # Trajectory serialization and dataset utilities
514
+ ├── training_script.py # TRL GRPO training entry point
515
+ └── tests/
516
+ ├── test_environment.py
517
+ ├── test_literature_benchmark.py
518
+ ├── test_models.py
519
+ ├── test_rewards.py
520
+ ├── test_rules.py
521
+ ├── test_simulator.py
522
+ └── test_training_script.py
523
  ```
524
 
525
  ## Quick sanity check
526
 
527
  ```bash
528
  uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q
529
  ```
 
535
  - conclusion termination
536
  - literature-backed scenario selection
537
  - benchmark matching for curated expected findings
538
+
539
+ Run the full suite with coverage:
540
+
541
+ ```bash
542
+ uv run pytest tests/ --cov -q
543
+ ```
__init__.py CHANGED
@@ -1,48 +1,75 @@
1
  try: # pragma: no cover - package import path
2
  from .client import BioExperimentEnv
3
  from .models import (
 
4
  ActionType,
 
5
  ConclusionClaim,
6
  ExpectedFinding,
7
  ExperimentAction,
8
  ExperimentObservation,
9
  IntermediateOutput,
 
 
10
  OutputType,
11
  PaperReference,
12
  PipelineStepRecord,
13
  ResourceUsage,
14
  SubagentType,
 
15
  TaskSpec,
 
16
  )
17
  except ImportError: # pragma: no cover - direct module import path
18
  from client import BioExperimentEnv
19
  from models import (
 
20
  ActionType,
 
21
  ConclusionClaim,
22
  ExpectedFinding,
23
  ExperimentAction,
24
  ExperimentObservation,
25
  IntermediateOutput,
 
 
26
  OutputType,
27
  PaperReference,
28
  PipelineStepRecord,
29
  ResourceUsage,
30
  SubagentType,
 
31
  TaskSpec,
32
  )
33
 
34
  __all__ = [
 
35
  "ActionType",
 
36
  "BioExperimentEnv",
37
  "ConclusionClaim",
38
  "ExpectedFinding",
39
  "ExperimentAction",
40
  "ExperimentObservation",
41
  "IntermediateOutput",
 
 
42
  "OutputType",
43
  "PaperReference",
44
  "PipelineStepRecord",
45
  "ResourceUsage",
46
  "SubagentType",
 
47
  "TaskSpec",
48
  ]
 
1
  try: # pragma: no cover - package import path
2
  from .client import BioExperimentEnv
3
  from .models import (
4
+ ASSAY_REGISTRY,
5
  ActionType,
6
+ AssaySpec,
7
  ConclusionClaim,
8
  ExpectedFinding,
9
  ExperimentAction,
10
  ExperimentObservation,
11
  IntermediateOutput,
12
+ MODALITY_REGISTRY,
13
+ ModalitySpec,
14
  OutputType,
15
  PaperReference,
16
  PipelineStepRecord,
17
  ResourceUsage,
18
  SubagentType,
19
+ TOOL_REGISTRY,
20
  TaskSpec,
21
+ ToolSpec,
22
+ assays_for_modality,
23
+ tools_by_category,
24
+ tools_for_modality,
25
  )
26
  except ImportError: # pragma: no cover - direct module import path
27
  from client import BioExperimentEnv
28
  from models import (
29
+ ASSAY_REGISTRY,
30
  ActionType,
31
+ AssaySpec,
32
  ConclusionClaim,
33
  ExpectedFinding,
34
  ExperimentAction,
35
  ExperimentObservation,
36
  IntermediateOutput,
37
+ MODALITY_REGISTRY,
38
+ ModalitySpec,
39
  OutputType,
40
  PaperReference,
41
  PipelineStepRecord,
42
  ResourceUsage,
43
  SubagentType,
44
+ TOOL_REGISTRY,
45
  TaskSpec,
46
+ ToolSpec,
47
+ assays_for_modality,
48
+ tools_by_category,
49
+ tools_for_modality,
50
  )
51
 
52
  __all__ = [
53
+ "ASSAY_REGISTRY",
54
  "ActionType",
55
+ "AssaySpec",
56
  "BioExperimentEnv",
57
  "ConclusionClaim",
58
  "ExpectedFinding",
59
  "ExperimentAction",
60
  "ExperimentObservation",
61
  "IntermediateOutput",
62
+ "MODALITY_REGISTRY",
63
+ "ModalitySpec",
64
  "OutputType",
65
  "PaperReference",
66
  "PipelineStepRecord",
67
  "ResourceUsage",
68
  "SubagentType",
69
+ "TOOL_REGISTRY",
70
  "TaskSpec",
71
+ "ToolSpec",
72
+ "assays_for_modality",
73
+ "tools_by_category",
74
+ "tools_for_modality",
75
  ]
_dashboard_state.json ADDED
@@ -0,0 +1,207 @@
+ {
+   "timestamp": 1772961529.009918,
+   "step": 4,
+   "episode_done": false,
+   "cumulative_reward": 1.8781281693998408,
+   "gen_time_s": 15.02,
+   "model_response_raw": "{\n\"action_type\": \"prepare_library\",\n\"method\": \"\",\n\"parameters\": {\n\"cutoff_value\": \"\"\n},\n\"Justification\": \"The error 'Cannot sequence without library preparation' indicates that while sample collection was completed successfully via dry-polymerization technology ('collectible'), mature cDNA libraries have yet to be constructed from those frozen tissues. Library prep requires reverse transcription followed by amplification/cleanup using polymerases/ligation enzymes. Since all other required conditions met but materials remain unprocessed, preparing high-throughput Illumina libraries first allows",
+   "model_thinking": "",
+   "thinking_enabled": false,
+   "task": {
+     "problem_statement": "Determine the effect of anti-TNF biologic therapy on cell states in synovium tissue affected by rheumatoid arthritis.",
+     "organism": "human",
+     "tissue": "synovium",
+     "modality": "scRNA-seq",
+     "conditions": [
+       "untreated_rheumatoid_arthritis",
+       "anti_TNF_treated"
+     ],
+     "budget_limit": 80340.57829173621,
+     "time_limit_days": 140.67464277202689
+   },
+   "resources": {
+     "budget_used": 13500.0,
+     "budget_remaining": 66840.58,
+     "time_used_days": 11.0,
+     "time_remaining_days": 129.7,
+     "samples_consumed": 0,
+     "compute_hours_used": 0.0
+   },
+   "pipeline_history": [
+     {
+       "step_index": 1,
+       "action_type": "collect_sample",
+       "method": null,
+       "output_summary": "Collected 6 samples (quality=0.54)",
+       "success": true,
+       "quality_score": 0.541,
+       "resource_cost": 5000.0,
+       "time_cost_days": 7.0
+     },
+     {
+       "step_index": 2,
+       "action_type": "select_cohort",
+       "method": null,
+       "output_summary": "Selected cohort of 4 samples with criteria {}",
+       "success": true,
+       "quality_score": 1.0,
+       "resource_cost": 500.0,
+       "time_cost_days": 1.0
+     },
+     {
+       "step_index": 3,
+       "action_type": "sequence_cells",
+       "method": null,
+       "output_summary": "Action blocked: Cannot sequence without library preparation",
+       "success": false,
+       "quality_score": 1.0,
+       "resource_cost": 15000.0,
+       "time_cost_days": 5.0
+     },
+     {
+       "step_index": 4,
+       "action_type": "prepare_library",
+       "method": null,
+       "output_summary": "Library prepared (complexity=0.72)",
+       "success": true,
+       "quality_score": 0.718,
+       "resource_cost": 8000.0,
+       "time_cost_days": 3.0
+     }
+   ],
+   "current_action": {
+     "action_type": "prepare_library",
+     "method": null,
+     "parameters": {
+       "cutoff_value": ""
+     },
+     "justification": "The error 'Cannot sequence without library preparation' indicates that while sample collection was completed successfully via dry-polymerization technology ('collectible'), mature cDNA libraries have yet to be constructed from those frozen tissues. Library prep requires reverse transcription followed by amplification/cleanup using polymerases/ligation enzymes. Since all other required conditions met but materials remain unprocessed, preparing high-throughput Illumina libraries first allows conversion of input DNA/RNA directly into sequencable formats.",
+     "confidence": 0.95
+   },
+   "latest_output": {
+     "summary": "Library prepared (complexity=0.72)",
+     "success": true,
+     "quality_score": 0.718,
+     "uncertainty": 0.0,
+     "warnings": [],
+     "data_preview": "{\"library_complexity\": 0.7177962559040082, \"method\": \"10x_chromium\"}"
+   },
+   "discovered_markers": [],
+   "candidate_mechanisms": [],
+   "rule_violations": [],
+   "uncertainty_summary": {
+     "avg_uncertainty": 0.0,
+     "avg_quality": 0.815
+   },
+   "reward_breakdown": {},
+   "latent": {
+     "cell_populations": [
+       {
+         "name": "fibroblast",
+         "proportion": 0.445,
+         "marker_genes": [
+           "COL1A1",
+           "FAP",
+           "THY1"
+         ],
+         "state": "activated"
+       },
+       {
+         "name": "CD4_T_cell",
+         "proportion": 0.179,
+         "marker_genes": [
+           "CD3D",
+           "CD4",
+           "IL7R"
+         ],
+         "state": "quiescent"
+       },
+       {
+         "name": "CD8_T_cell",
+         "proportion": 0.139,
+         "marker_genes": [
+           "CD3D",
+           "CD8A",
+           "GZMB"
+         ],
+         "state": "activated"
+       },
+       {
+         "name": "B_cell",
+         "proportion": 0.142,
+         "marker_genes": [
+           "CD19",
+           "MS4A1",
+           "CD79A"
+         ],
+         "state": "quiescent"
+       },
+       {
+         "name": "endothelial",
+         "proportion": 0.096,
+         "marker_genes": [
+           "PECAM1",
+           "VWF"
+         ],
+         "state": "quiescent"
+       }
+     ],
+     "true_markers": [
+       "TNF",
+       "IL6",
+       "MMP3",
+       "CXCL13"
+     ],
+     "causal_mechanisms": [
+       "TNF/NF-kB-driven synovial inflammation",
+       "Th17-mediated cartilage destruction via MMPs"
+     ],
+     "true_pathways": {
+       "JAK_STAT_signalling": 0.785,
+       "TNF_signalling": 0.723,
+       "Th17_differentiation": 0.633,
+       "NF_kB_signalling": 0.826,
+       "matrix_metalloproteinase_activity": 0.847
+     },
+     "true_de_genes_count": 9,
+     "true_regulatory_network_size": 16,
+     "confounders": {},
+     "n_true_cells": 15873,
+     "technical": {
+       "ambient_rna_fraction": 0.037873267501661645,
+       "doublet_rate": 0.03797665930677535,
+       "dropout_rate": 0.14738025069803395,
+       "sample_quality": 0.9068064354870293,
+       "library_complexity": 0.8,
+       "capture_efficiency": 0.6
+     },
+     "progress": {
+       "samples_collected": true,
+       "cohort_selected": true,
+       "cells_cultured": false,
+       "library_prepared": true,
+       "perturbation_applied": false,
+       "cells_sequenced": false,
+       "qc_performed": false,
+       "data_filtered": false,
+       "data_normalized": false,
+       "batches_integrated": false,
+       "cells_clustered": false,
+       "de_performed": false,
+       "trajectories_inferred": false,
+       "pathways_analyzed": false,
+       "networks_inferred": false,
+       "markers_discovered": false,
+       "markers_validated": false,
+       "followup_designed": false,
+       "subagent_review_requested": false,
+       "conclusion_reached": false,
+       "n_cells_sequenced": null,
+       "n_cells_after_filter": null,
+       "n_clusters_found": null,
+       "n_de_genes_found": null,
+       "n_markers_found": null
+     },
+     "hidden_failure_conditions": []
+   }
+ }
dashboard.html ADDED
@@ -0,0 +1,522 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+ <meta charset="utf-8" />
+ <meta name="viewport" content="width=device-width, initial-scale=1" />
+ <title>Bio-Experiment Agent Dashboard</title>
+ <link rel="preconnect" href="https://fonts.googleapis.com" />
+ <link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;600&family=DM+Sans:wght@400;500;700&display=swap" rel="stylesheet" />
+ <style>
+ :root {
+ --bg: #0c0e14;
+ --surface: #151822;
+ --surface2: #1c2030;
+ --border: #2a2f42;
+ --text: #e2e4ea;
+ --text-dim: #8b90a5;
+ --accent: #5ce0d8;
+ --accent2: #7c6cf0;
+ --green: #4ade80;
+ --red: #f87171;
+ --amber: #fbbf24;
+ --blue: #60a5fa;
+ --pink: #f472b6;
+ }
+ *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
+ body { background: var(--bg); color: var(--text); font-family: 'DM Sans', system-ui, sans-serif; line-height: 1.5; min-height: 100vh; }
+ .mono { font-family: 'JetBrains Mono', monospace; }
+
+ .header { display: flex; align-items: center; justify-content: space-between; padding: 14px 28px; border-bottom: 1px solid var(--border); background: var(--surface); }
+ .header h1 { font-size: 18px; font-weight: 700; letter-spacing: -.3px; }
+ .header h1 span { color: var(--accent); }
+ .header-right { display: flex; align-items: center; gap: 10px; }
+ .status-pill { font-size: 12px; padding: 4px 14px; border-radius: 20px; font-weight: 600; text-transform: uppercase; letter-spacing: .5px; }
+ .status-pill.live { background: rgba(76,222,128,.15); color: var(--green); }
+ .status-pill.done { background: rgba(248,113,113,.15); color: var(--red); }
+ .status-pill.waiting { background: rgba(139,144,165,.15); color: var(--text-dim); }
+
+ .btn { padding: 6px 16px; border-radius: 8px; border: 1px solid var(--border); background: var(--surface2); color: var(--text); font-size: 12px; font-weight: 600; cursor: pointer; transition: all .15s; }
+ .btn:hover { border-color: var(--accent); color: var(--accent); }
+ .btn.primary { background: rgba(92,224,216,.12); border-color: var(--accent); color: var(--accent); }
+ .btn.primary:hover { background: rgba(92,224,216,.25); }
+ .btn.danger { border-color: var(--red); color: var(--red); }
+ .btn.danger:hover { background: rgba(248,113,113,.12); }
+
+ .grid { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 16px; padding: 20px 28px; max-width: 1600px; }
+ @media (max-width: 1100px) { .grid { grid-template-columns: 1fr 1fr; } }
+ @media (max-width: 700px) { .grid { grid-template-columns: 1fr; } }
+
+ .card { background: var(--surface); border: 1px solid var(--border); border-radius: 12px; padding: 18px 20px; overflow: hidden; }
+ .card h2 { font-size: 11px; font-weight: 600; text-transform: uppercase; letter-spacing: 1px; color: var(--text-dim); margin-bottom: 12px; }
+ .card.span2 { grid-column: span 2; }
+ .card.span3 { grid-column: span 3; }
+ @media (max-width: 700px) { .card.span2, .card.span3 { grid-column: span 1; } }
+
+ .gauge-row { display: flex; gap: 14px; flex-wrap: wrap; }
+ .gauge { flex: 1; min-width: 130px; background: var(--surface2); border-radius: 10px; padding: 14px; }
+ .gauge-label { font-size: 11px; color: var(--text-dim); margin-bottom: 6px; text-transform: uppercase; letter-spacing: .5px; }
+ .gauge-value { font-size: 22px; font-weight: 700; }
+ .gauge-bar { height: 5px; border-radius: 3px; background: var(--border); margin-top: 8px; overflow: hidden; }
+ .gauge-bar-fill { height: 100%; border-radius: 3px; transition: width .6s ease; }
+
+ .timeline { position: relative; padding-left: 20px; }
+ .timeline::before { content: ''; position: absolute; left: 6px; top: 0; bottom: 0; width: 2px; background: var(--border); }
+ .timeline-item { position: relative; margin-bottom: 14px; padding-left: 18px; }
+ .timeline-item::before { content: ''; position: absolute; left: -18px; top: 6px; width: 10px; height: 10px; border-radius: 50%; border: 2px solid var(--accent); background: var(--bg); }
+ .timeline-item.fail::before { border-color: var(--red); }
+ .tl-action { font-weight: 600; font-size: 14px; }
+ .tl-meta { font-size: 12px; color: var(--text-dim); margin-top: 2px; }
+
+ .mini-table { width: 100%; font-size: 13px; border-collapse: collapse; }
+ .mini-table td { padding: 5px 8px; border-bottom: 1px solid var(--border); vertical-align: top; }
+ .mini-table td:first-child { color: var(--text-dim); white-space: nowrap; width: 40%; }
+
+ .tag-list { display: flex; flex-wrap: wrap; gap: 6px; }
+ .tag { font-size: 12px; padding: 3px 10px; border-radius: 6px; background: var(--surface2); border: 1px solid var(--border); font-family: 'JetBrains Mono', monospace; }
+ .tag.green { border-color: rgba(76,222,128,.3); color: var(--green); }
+ .tag.pink { border-color: rgba(244,114,182,.3); color: var(--pink); }
+ .tag.amber { border-color: rgba(251,191,36,.3); color: var(--amber); }
+ .tag.red { border-color: rgba(248,113,113,.3); color: var(--red); }
+ .tag.match { background: rgba(76,222,128,.15); }
+ .tag.miss { background: rgba(248,113,113,.08); }
+
+ .code-block { background: var(--surface2); border: 1px solid var(--border); border-radius: 8px; padding: 12px 14px; font-family: 'JetBrains Mono', monospace; font-size: 12px; white-space: pre-wrap; word-break: break-all; max-height: 220px; overflow-y: auto; color: var(--text-dim); line-height: 1.6; }
+
+ .progress-grid { display: grid; grid-template-columns: repeat(auto-fill, minmax(150px, 1fr)); gap: 6px; }
+ .progress-item { display: flex; align-items: center; gap: 6px; font-size: 12px; }
+ .dot { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; background: var(--border); }
+ .dot.done { background: var(--green); }
+
+ .pop-bar-container { margin-bottom: 10px; }
+ .pop-bar-label { font-size: 12px; margin-bottom: 3px; display: flex; justify-content: space-between; }
+ .pop-bar { height: 14px; border-radius: 4px; background: var(--surface2); overflow: hidden; }
+ .pop-bar-fill { height: 100%; border-radius: 4px; }
+
+ #reward-chart { width: 100%; height: 120px; }
+ ::-webkit-scrollbar { width: 6px; }
+ ::-webkit-scrollbar-track { background: transparent; }
+ ::-webkit-scrollbar-thumb { background: var(--border); border-radius: 3px; }
+
+ .conclusion-card { background: var(--surface2); border: 1px solid var(--border); border-radius: 10px; padding: 14px 16px; margin-bottom: 12px; }
+ .conclusion-card .cc-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 8px; }
+ .cc-type { font-size: 11px; padding: 2px 10px; border-radius: 4px; font-weight: 600; text-transform: uppercase; letter-spacing: .5px; }
+ .cc-type.causal { background: rgba(244,114,182,.15); color: var(--pink); }
+ .cc-type.correlative { background: rgba(96,165,250,.15); color: var(--blue); }
+ .cc-type.descriptive { background: rgba(139,144,165,.15); color: var(--text-dim); }
+ .cc-conf { font-family: 'JetBrains Mono', monospace; font-size: 13px; font-weight: 600; }
+ .cc-claim { font-size: 14px; margin-bottom: 8px; line-height: 1.5; }
+ .cc-section-label { font-size: 10px; color: var(--text-dim); text-transform: uppercase; letter-spacing: .5px; margin-bottom: 3px; margin-top: 8px; }
+
+ /* ── control panel ────────────────────────────── */
+ .control-panel { background: var(--surface); border: 1px solid var(--border); border-radius: 12px; margin: 20px 28px 0; padding: 18px 20px; }
+ .control-panel summary { cursor: pointer; font-size: 13px; font-weight: 600; color: var(--accent); }
+ .control-panel[open] summary { margin-bottom: 14px; }
+ .form-row { display: flex; gap: 12px; margin-bottom: 10px; flex-wrap: wrap; align-items: end; }
+ .form-field { display: flex; flex-direction: column; gap: 4px; }
+ .form-field label { font-size: 11px; color: var(--text-dim); text-transform: uppercase; letter-spacing: .5px; }
+ .form-field input, .form-field textarea, .form-field select {
+ background: var(--surface2); border: 1px solid var(--border); border-radius: 6px;
+ color: var(--text); padding: 7px 10px; font-size: 13px; font-family: inherit; outline: none;
+ }
+ .form-field input:focus, .form-field textarea:focus, .form-field select:focus { border-color: var(--accent); }
+ .form-field textarea { min-height: 60px; resize: vertical; }
+
+ /* ── final report ─────────────────────────────── */
+ .report-overlay { display: none; position: fixed; inset: 0; z-index: 100; background: rgba(12,14,20,.85); backdrop-filter: blur(6px); overflow-y: auto; padding: 40px 20px; }
+ .report-overlay.visible { display: flex; justify-content: center; align-items: flex-start; }
+ .report-card { background: var(--surface); border: 1px solid var(--border); border-radius: 16px; padding: 32px 36px; max-width: 900px; width: 100%; }
+ .report-card h2 { font-size: 22px; font-weight: 700; margin-bottom: 4px; color: var(--text); text-transform: none; letter-spacing: normal; }
+ .report-card .subtitle { font-size: 13px; color: var(--text-dim); margin-bottom: 20px; }
+ .report-section { margin-bottom: 20px; }
+ .report-section h3 { font-size: 12px; color: var(--accent); text-transform: uppercase; letter-spacing: 1px; margin-bottom: 8px; }
+ .comparison-row { display: flex; gap: 20px; margin-bottom: 16px; }
+ .comparison-col { flex: 1; }
+ .comparison-col h4 { font-size: 11px; color: var(--text-dim); text-transform: uppercase; margin-bottom: 6px; }
+
+ .pulse { animation: pulse 1.5s ease-in-out infinite; }
+ @keyframes pulse { 0%,100% { opacity: 1; } 50% { opacity: .5; } }
+ </style>
+ </head>
+ <body>
+
+ <div class="header">
+ <h1><span>BioExp</span> Agent Dashboard</h1>
+ <div class="header-right">
+ <span id="thinking-badge" class="mono" style="font-size:11px;color:var(--accent2);display:none">REASONING ON</span>
+ <span id="step-label" class="mono" style="font-size:13px;color:var(--text-dim)">Step 0</span>
+ <span id="status-pill" class="status-pill waiting">Waiting</span>
+ <button class="btn primary" onclick="doRestart()">Restart</button>
+ <button class="btn" onclick="showReport()">Report</button>
+ </div>
+ </div>
+
+ <!-- Control Panel (collapsible) -->
+ <details class="control-panel" id="control-panel">
+ <summary>New Task / Custom Ground Truth</summary>
+ <div class="form-row">
+ <div class="form-field" style="flex:2">
+ <label>Scenario (leave blank for random)</label>
+ <select id="f-scenario"><option value="">— random —</option></select>
+ </div>
+ <div class="form-field" style="flex:1">
+ <label>True Markers (comma-separated)</label>
+ <input id="f-markers" placeholder="e.g. MYH7, TNNT2, ACTA1" />
+ </div>
+ <div class="form-field" style="flex:1">
+ <label>Causal Mechanisms (comma-separated)</label>
+ <input id="f-mechanisms" placeholder="e.g. sarcomere dysfunction" />
+ </div>
+ </div>
+ <div class="form-row">
+ <div class="form-field" style="flex:2">
+ <label>True Pathways (name:score, comma-sep)</label>
+ <input id="f-pathways" placeholder="e.g. Wnt_signaling:0.8, MAPK:0.6" />
+ </div>
+ <div class="form-field">
+ <button class="btn primary" onclick="doCustomRun()">Run with Ground Truth</button>
+ </div>
+ </div>
+ </details>
+
+ <div class="grid">
+ <div class="card span2" id="card-task">
+ <h2>Task</h2>
+ <div id="task-statement" style="font-size:15px;font-weight:500;margin-bottom:8px;">—</div>
+ <div id="task-meta" style="font-size:13px;color:var(--text-dim)"></div>
+ </div>
+
+ <div class="card">
+ <h2>Reward</h2>
+ <div id="reward-value" class="mono" style="font-size:32px;font-weight:700;margin-bottom:6px;">0.000</div>
+ <canvas id="reward-chart"></canvas>
+ </div>
+
+ <div class="card span3"><h2>Resources</h2><div class="gauge-row" id="gauges"></div></div>
+
+ <div class="card span2" style="max-height:460px;overflow-y:auto">
+ <h2>Pipeline History <span style="color:var(--accent);font-size:10px">OBSERVABLE</span></h2>
+ <div class="timeline" id="timeline"></div>
+ </div>
+
+ <div class="card">
+ <h2>Current Action</h2>
+ <table class="mini-table" id="action-table"><tbody></tbody></table>
+ <h2 style="margin-top:14px;display:none" id="thinking-header">Model Reasoning</h2>
+ <div class="code-block" id="model-thinking" style="display:none;border-color:rgba(124,108,240,.2);max-height:140px;margin-bottom:10px">—</div>
+ <h2 style="margin-top:10px">Model Raw Output</h2>
+ <div class="code-block" id="model-response">—</div>
+ </div>
+
+ <div class="card">
+ <h2>Discovered Markers <span style="color:var(--accent);font-size:10px">OBSERVABLE</span></h2>
+ <div class="tag-list" id="markers-list"><span class="tag" style="color:var(--text-dim)">none yet</span></div>
+ <h2 style="margin-top:14px">Candidate Mechanisms</h2>
+ <div class="tag-list" id="mechanisms-list"><span class="tag" style="color:var(--text-dim)">none yet</span></div>
+ </div>
+
+ <div class="card">
+ <h2>Rule Violations</h2>
+ <div id="violations" style="font-size:13px;color:var(--text-dim)">None</div>
+ <h2 style="margin-top:14px">Uncertainty Summary</h2>
+ <table class="mini-table" id="uncertainty-table"><tbody></tbody></table>
+ <h2 style="margin-top:14px">Reward Breakdown</h2>
+ <table class="mini-table" id="reward-breakdown-table"><tbody></tbody></table>
+ </div>
+
+ <div class="card">
+ <h2>Latest Output</h2>
+ <table class="mini-table" id="output-table"><tbody></tbody></table>
+ <div class="code-block" id="output-data" style="margin-top:10px;max-height:140px">—</div>
+ </div>
+
+ <div class="card span3" id="card-conclusions" style="display:none;border-color:rgba(76,222,128,.25)">
+ <h2 style="color:var(--green)">Synthesized Conclusions</h2>
+ <div id="conclusions-list"></div>
+ </div>
+
+ <!-- Ground Truth Comparison (shown when episode done + has conclusions) -->
+ <div class="card span3" id="card-gt-comparison" style="display:none;border-color:rgba(251,191,36,.25)">
+ <h2 style="color:var(--amber)">Ground Truth Comparison</h2>
+ <div class="comparison-row">
+ <div class="comparison-col">
+ <h4>Agent's Markers</h4>
+ <div class="tag-list" id="gt-agent-markers"></div>
+ </div>
+ <div class="comparison-col">
+ <h4>True Markers</h4>
+ <div class="tag-list" id="gt-true-markers"></div>
+ </div>
+ </div>
+ <div class="comparison-row">
+ <div class="comparison-col">
+ <h4>Agent's Mechanisms</h4>
+ <div class="tag-list" id="gt-agent-mechs"></div>
+ </div>
+ <div class="comparison-col">
+ <h4>True Mechanisms</h4>
+ <div class="tag-list" id="gt-true-mechs"></div>
+ </div>
+ </div>
+ <div id="gt-score" style="margin-top:8px;font-size:14px;font-weight:600"></div>
+ </div>
+
+ <div class="card" style="border-color:rgba(124,108,240,.25)">
+ <h2 style="color:var(--accent2)">Cell Populations <span style="font-size:10px">HIDDEN</span></h2>
+ <div id="populations"></div>
+ </div>
+ <div class="card" style="border-color:rgba(124,108,240,.25)">
+ <h2 style="color:var(--accent2)">Ground Truth <span style="font-size:10px">HIDDEN</span></h2>
+ <div style="margin-bottom:8px"><span style="font-size:11px;color:var(--text-dim);text-transform:uppercase">True Markers</span><div class="tag-list" id="true-markers" style="margin-top:4px"></div></div>
+ <div style="margin-bottom:8px"><span style="font-size:11px;color:var(--text-dim);text-transform:uppercase">Causal Mechanisms</span><div class="tag-list" id="true-mechanisms" style="margin-top:4px"></div></div>
+ <div><span style="font-size:11px;color:var(--text-dim);text-transform:uppercase">Top Pathways</span><table class="mini-table" id="pathways-table" style="margin-top:4px"><tbody></tbody></table></div>
+ </div>
+ <div class="card" style="border-color:rgba(124,108,240,.25)">
+ <h2 style="color:var(--accent2)">Technical State <span style="font-size:10px">HIDDEN</span></h2>
+ <table class="mini-table" id="technical-table"><tbody></tbody></table>
+ <h2 style="margin-top:14px;color:var(--accent2)">Failure Conditions <span style="font-size:10px">HIDDEN</span></h2>
+ <div class="tag-list" id="failure-conditions"></div>
+ </div>
+ <div class="card span3" style="border-color:rgba(124,108,240,.25)">
+ <h2 style="color:var(--accent2)">Experiment Progress <span style="font-size:10px">HIDDEN</span></h2>
+ <div class="progress-grid" id="progress-grid"></div>
+ </div>
+ </div>
+
+ <!-- Final Report Overlay -->
+ <div class="report-overlay" id="report-overlay" onclick="if(event.target===this)hideReport()">
+ <div class="report-card" id="report-content"></div>
+ </div>
+
+ <script>
+ const POLL_MS = 1200;
+ const POP_COLORS = ['#5ce0d8','#7c6cf0','#f472b6','#60a5fa','#fbbf24','#4ade80','#f87171','#c084fc','#fb923c','#38bdf8'];
+ let rewardHistory = [];
+ let lastTimestamp = 0;
+ let latestState = null;
+
+ function $(id) { return document.getElementById(id); }
+ function setHTML(id, html) { $(id).innerHTML = html; }
+ function tagsHTML(arr, cls) {
+ if (!arr || !arr.length) return '<span class="tag" style="color:var(--text-dim)">—</span>';
+ return arr.map(t => `<span class="tag ${cls||''}">${esc(t)}</span>`).join('');
+ }
+ function esc(s) { if (s == null) return '—'; const d = document.createElement('div'); d.textContent = String(s); return d.innerHTML; }
+ function pct(used, total) { if (!total) return 0; return Math.min(100, Math.max(0, (used / total) * 100)); }
+ function gaugeColor(p) { return p < 50 ? 'var(--green)' : p < 80 ? 'var(--amber)' : 'var(--red)'; }
+ function fmt(n) { if (n == null) return '0'; return Number(n).toLocaleString('en-US', { maximumFractionDigits: 0 }); }
+ function gauge(label, value, pctVal, inv) {
+ let bar = '';
+ if (pctVal != null) { const c = inv ? gaugeColor(100-pctVal) : gaugeColor(pctVal); bar = `<div class="gauge-bar"><div class="gauge-bar-fill" style="width:${pctVal.toFixed(1)}%;background:${c}"></div></div>`; }
+ return `<div class="gauge"><div class="gauge-label">${label}</div><div class="gauge-value mono">${value}</div>${bar}</div>`;
+ }
+ function miniRows(obj) { return Object.entries(obj).map(([k,v]) => `<tr><td>${esc(k)}</td><td>${esc(v)}</td></tr>`).join(''); }
+
+ function drawRewardChart(canvas, data) {
+ const ctx = canvas.getContext('2d'); const W = canvas.width = canvas.offsetWidth * 2; const H = canvas.height = canvas.offsetHeight * 2;
+ ctx.clearRect(0, 0, W, H); if (data.length < 2) return;
+ const vals = data.map(d => d.v); const minV = Math.min(0, ...vals); const maxV = Math.max(0.1, ...vals); const range = maxV - minV || 1; const pad = 8;
+ ctx.strokeStyle = 'rgba(92,224,216,.4)'; ctx.lineWidth = 2; ctx.beginPath();
+ const yZ = H - pad - ((0 - minV) / range) * (H - 2*pad); ctx.moveTo(pad, yZ); ctx.lineTo(W-pad, yZ); ctx.stroke();
+ ctx.strokeStyle = '#5ce0d8'; ctx.lineWidth = 3; ctx.beginPath();
+ data.forEach((d,i) => { const x = pad+(i/(data.length-1))*(W-2*pad); const y = H-pad-((d.v-minV)/range)*(H-2*pad); i===0?ctx.moveTo(x,y):ctx.lineTo(x,y); }); ctx.stroke();
+ data.forEach((d,i) => { const x = pad+(i/(data.length-1))*(W-2*pad); const y = H-pad-((d.v-minV)/range)*(H-2*pad); ctx.fillStyle = d.v>=0?'#4ade80':'#f87171'; ctx.beginPath(); ctx.arc(x,y,5,0,Math.PI*2); ctx.fill(); });
+ }
+
+ function comparedTags(agentArr, trueArr, cls) {
+ if (!agentArr || !agentArr.length) return '<span class="tag" style="color:var(--text-dim)">—</span>';
+ const trueSet = new Set((trueArr||[]).map(t => t.toUpperCase()));
+ return agentArr.map(t => {
+ const hit = trueSet.has(t.toUpperCase());
+ return `<span class="tag ${cls} ${hit?'match':'miss'}">${esc(t)} ${hit?'✓':'✗'}</span>`;
+ }).join('');
+ }
+
+ // ── API actions ──
+ async function doRestart() {
+ rewardHistory = []; lastTimestamp = 0;
+ await fetch('/api/restart', { method: 'POST' });
+ }
+
+ async function doCustomRun() {
+ const scenario = $('f-scenario').value || undefined;
+ const markers = $('f-markers').value.split(',').map(s=>s.trim()).filter(Boolean);
+ const mechs = $('f-mechanisms').value.split(',').map(s=>s.trim()).filter(Boolean);
+ const pwRaw = $('f-pathways').value.split(',').map(s=>s.trim()).filter(Boolean);
+ const pathways = {};
+ pwRaw.forEach(p => { const [k,v] = p.split(':'); if (k && v) pathways[k.trim()] = parseFloat(v); });
+ const gt = {};
+ if (markers.length) gt.true_markers = markers;
+ if (mechs.length) gt.causal_mechanisms = mechs;
+ if (Object.keys(pathways).length) gt.true_pathways = pathways;
+ rewardHistory = []; lastTimestamp = 0;
+ await fetch('/api/run', { method: 'POST', headers: {'Content-Type':'application/json'}, body: JSON.stringify({ scenario_name: scenario, ground_truth: Object.keys(gt).length ? gt : undefined }) });
+ }
+
+ function showReport() {
+ const s = latestState; if (!s) return;
+ const rc = $('report-content');
+ const t = s.task || {};
+ const lat = s.latent || {};
+ const conc = s.conclusions || [];
+ const trueM = lat.true_markers || [];
+ const trueMech = lat.causal_mechanisms || [];
+ const agentM = s.discovered_markers || [];
+ const markerHits = agentM.filter(m => trueM.some(t => t.toUpperCase() === m.toUpperCase()));
+ const r = s.resources || {};
+
+ let html = `<h2>Experiment Report</h2>
+ <div class="subtitle">${esc(t.problem_statement)}</div>
+ <div class="report-section"><h3>Summary</h3>
+ <table class="mini-table"><tbody>
+ <tr><td>Status</td><td>${s.episode_done ? 'Completed' : 'In Progress'}</td></tr>
+ <tr><td>Steps</td><td>${s.step}</td></tr>
+ <tr><td>Cumulative Reward</td><td style="color:${(s.cumulative_reward||0)>=0?'var(--green)':'var(--red)'}">${((s.cumulative_reward||0)>=0?'+':'')}${(s.cumulative_reward||0).toFixed(3)}</td></tr>
+ <tr><td>Budget Used</td><td>$${fmt(r.budget_used)} / $${fmt((r.budget_used||0)+(r.budget_remaining||0))}</td></tr>
+ <tr><td>Time Used</td><td>${(r.time_used_days||0).toFixed(0)}d / ${((r.time_used_days||0)+(r.time_remaining_days||0)).toFixed(0)}d</td></tr>
+ <tr><td>Markers Found</td><td>${agentM.length} (${markerHits.length} match ground truth)</td></tr>
+ </tbody></table>
+ </div>`;
+
+ if (conc.length) {
+ html += `<div class="report-section"><h3>Conclusions</h3>`;
+ conc.forEach(c => {
+ html += `<div class="conclusion-card"><div class="cc-header"><span class="cc-type ${(c.claim_type||'').toLowerCase()}">${esc(c.claim_type)}</span><span class="cc-conf" style="color:${c.confidence>=.7?'var(--green)':c.confidence>=.4?'var(--amber)':'var(--red)'}">${((c.confidence||0)*100).toFixed(0)}%</span></div>`;
+ if (c.claim) html += `<div class="cc-claim">${esc(c.claim)}</div>`;
+ if (c.top_markers?.length) html += `<div class="cc-section-label">Top Markers</div><div class="tag-list">${c.top_markers.map(m=>`<span class="tag green">${esc(m)}</span>`).join('')}</div>`;
+ if (c.causal_mechanisms?.length) html += `<div class="cc-section-label">Causal Mechanisms</div><div class="tag-list">${c.causal_mechanisms.map(m=>`<span class="tag pink">${esc(m)}</span>`).join('')}</div>`;
+ if (c.predicted_pathways && Object.keys(c.predicted_pathways).length) html += `<div class="cc-section-label">Predicted Pathways</div><table class="mini-table"><tbody>${Object.entries(c.predicted_pathways).map(([k,v])=>`<tr><td>${esc(k)}</td><td>${Number(v).toFixed(3)}</td></tr>`).join('')}</tbody></table>`;
+ html += `</div>`;
+ });
+ html += `</div>`;
+ }
+
+ html += `<div class="report-section"><h3>Ground Truth Comparison</h3>
+ <div class="comparison-row"><div class="comparison-col"><h4>Agent's Markers</h4><div class="tag-list">${comparedTags(agentM, trueM, 'green')}</div></div>
+ <div class="comparison-col"><h4>True Markers</h4><div class="tag-list">${tagsHTML(trueM,'green')}</div></div></div>
396
+ <div class="comparison-row"><div class="comparison-col"><h4>Agent's Mechanisms</h4><div class="tag-list">${comparedTags(s.candidate_mechanisms, trueMech, 'pink')}</div></div>
397
+ <div class="comparison-col"><h4>True Mechanisms</h4><div class="tag-list">${tagsHTML(trueMech,'pink')}</div></div></div>
398
+ </div>`;
399
+
400
+ const hist = s.pipeline_history || [];
401
+ if (hist.length) {
402
+ html += `<div class="report-section"><h3>Pipeline Steps</h3><table class="mini-table"><tbody>`;
403
+ hist.forEach(h => { html += `<tr><td>${h.success?'✓':'✗'} ${esc(h.action_type)}</td><td>${esc(h.output_summary)} · q=${h.quality_score}</td></tr>`; });
404
+ html += `</tbody></table></div>`;
405
+ }
406
+
407
+ html += `<div style="margin-top:20px;text-align:right"><button class="btn" onclick="hideReport()">Close</button> <button class="btn primary" onclick="doRestart();hideReport()">New Run</button></div>`;
408
+ rc.innerHTML = html;
409
+ $('report-overlay').classList.add('visible');
410
+ }
411
+ function hideReport() { $('report-overlay').classList.remove('visible'); }
412
+
413
+ function renderState(s) {
414
+ latestState = s;
415
+ if (s.error) { $('status-pill').className='status-pill waiting'; $('status-pill').textContent='Waiting'; $('task-statement').textContent=s.error; return; }
416
+
417
+ const pill = $('status-pill');
418
+ if (s.episode_done) { pill.className='status-pill done'; pill.textContent='Done'; } else { pill.className='status-pill live'; pill.textContent='Live'; }
419
+ $('step-label').textContent = `Step ${s.step}`;
420
+
421
+ if (s.thinking_enabled) { $('thinking-badge').style.display = ''; } else { $('thinking-badge').style.display = 'none'; }
422
+
423
+ const t = s.task || {};
424
+ $('task-statement').textContent = t.problem_statement || '—';
425
+ $('task-meta').innerHTML = [t.organism, t.tissue, t.modality, t.conditions ? t.conditions.join(' vs ') : null].filter(Boolean).map(v => `<span class="tag">${esc(v)}</span>`).join(' ');
426
+
427
+ const cum = s.cumulative_reward || 0;
428
+ $('reward-value').textContent = (cum >= 0 ? '+' : '') + cum.toFixed(3);
429
+ $('reward-value').style.color = cum >= 0 ? 'var(--green)' : 'var(--red)';
430
+ if (s.timestamp !== lastTimestamp && s.step > 0) { rewardHistory.push({ step: s.step, v: cum }); lastTimestamp = s.timestamp; }
431
+ drawRewardChart($('reward-chart'), rewardHistory);
432
+
433
+ const r = s.resources || {};
434
+ const bT = (r.budget_used||0)+(r.budget_remaining||0), tT = (r.time_used_days||0)+(r.time_remaining_days||0);
435
+ const bP = pct(r.budget_used, bT), tP = pct(r.time_used_days, tT);
436
+ $('gauges').innerHTML = [gauge('Budget Used',`$${fmt(r.budget_used)}`,bP), gauge('Budget Left',`$${fmt(r.budget_remaining)}`,100-bP,true), gauge('Time Used',`${(r.time_used_days||0).toFixed(0)}d`,tP), gauge('Time Left',`${(r.time_remaining_days||0).toFixed(0)}d`,100-tP,true), gauge('Samples',String(r.samples_consumed||0),null), gauge('Compute',`${(r.compute_hours_used||0).toFixed(1)}h`,null)].join('');
437
+
438
+ const hist = s.pipeline_history || [];
439
+ $('timeline').innerHTML = hist.length ? hist.map(h => `<div class="timeline-item ${!h.success?'fail':''}"><div class="tl-action">${esc(h.action_type)}${h.method?` <span style="color:var(--text-dim);font-weight:400;font-size:12px">${esc(h.method)}</span>`:''}</div><div class="tl-meta">${h.success?'✓':'✗'} ${esc(h.output_summary)} · q=${h.quality_score} · $${fmt(h.resource_cost)} · ${h.time_cost_days}d</div></div>`).join('') : '<div style="color:var(--text-dim);font-size:13px">No steps yet</div>';
440
+
441
+ const a = s.current_action;
442
+ if (a) { $('action-table').querySelector('tbody').innerHTML = miniRows({'Type':a.action_type,'Method':a.method||'—','Confidence':a.confidence?.toFixed(2),'Justification':a.justification||'—','Fallback?':s.used_fallback?'YES':'no'}); }
443
+
444
+ if (s.model_thinking) { $('model-thinking').style.display=''; $('model-thinking').textContent = s.model_thinking; } else { $('model-thinking').style.display='none'; }
445
+ $('model-response').textContent = s.model_response_raw || '—';
446
+
447
+ setHTML('markers-list', tagsHTML(s.discovered_markers, 'green'));
448
+ setHTML('mechanisms-list', tagsHTML(s.candidate_mechanisms, 'pink'));
449
+
450
+ const v = s.rule_violations || [];
451
+ $('violations').innerHTML = v.length ? v.map(x=>`<div class="tag red" style="margin-bottom:4px">${esc(x)}</div>`).join('') : '<span style="color:var(--text-dim)">None</span>';
452
+ $('uncertainty-table').querySelector('tbody').innerHTML = miniRows(s.uncertainty_summary || {});
453
+ const rb = s.reward_breakdown || {};
454
+ $('reward-breakdown-table').querySelector('tbody').innerHTML = miniRows(Object.fromEntries(Object.entries(rb).map(([k,v])=>[k,(v>=0?'+':'')+v.toFixed(4)])));
455
+
456
+ const lo = s.latest_output;
457
+ if (lo) { $('output-table').querySelector('tbody').innerHTML = miniRows({'Summary':lo.summary,'Success':lo.success?'✓':'✗','Quality':lo.quality_score,'Uncertainty':lo.uncertainty,'Warnings':(lo.warnings||[]).join('; ')||'—'}); $('output-data').textContent = lo.data_preview||'—'; }
458
+
459
+ const conc = s.conclusions || [];
460
+ if (conc.length) {
461
+ $('card-conclusions').style.display = '';
462
+ $('conclusions-list').innerHTML = conc.map(c => {
463
+ const confColor = c.confidence>=.7?'var(--green)':c.confidence>=.4?'var(--amber)':'var(--red)';
464
+ let h = `<div class="conclusion-card"><div class="cc-header"><span class="cc-type ${(c.claim_type||'').toLowerCase()}">${esc(c.claim_type||'unknown')}</span><span class="cc-conf" style="color:${confColor}">${((c.confidence||0)*100).toFixed(0)}%</span></div>`;
465
+ if (c.claim) h += `<div class="cc-claim">${esc(c.claim)}</div>`;
466
+ if (c.top_markers?.length) h += `<div class="cc-section-label">Top Markers</div><div class="tag-list">${c.top_markers.map(m=>`<span class="tag green">${esc(m)}</span>`).join('')}</div>`;
467
+ if (c.causal_mechanisms?.length) h += `<div class="cc-section-label">Causal Mechanisms</div><div class="tag-list">${c.causal_mechanisms.map(m=>`<span class="tag pink">${esc(m)}</span>`).join('')}</div>`;
468
+ if (c.predicted_pathways && Object.keys(c.predicted_pathways).length) h += `<div class="cc-section-label">Predicted Pathways</div><table class="mini-table"><tbody>${Object.entries(c.predicted_pathways).map(([k,v])=>`<tr><td>${esc(k)}</td><td>${Number(v).toFixed(3)}</td></tr>`).join('')}</tbody></table>`;
469
+ return h + '</div>';
470
+ }).join('');
471
+ } else { $('card-conclusions').style.display = 'none'; }
472
+
473
+ // Ground truth comparison (visible when done or has conclusions)
474
+ const lat = s.latent;
475
+ if ((s.episode_done || conc.length) && lat) {
476
+ $('card-gt-comparison').style.display = '';
477
+ setHTML('gt-agent-markers', comparedTags(s.discovered_markers, lat.true_markers, 'green'));
478
+ setHTML('gt-true-markers', tagsHTML(lat.true_markers, 'green'));
479
+ setHTML('gt-agent-mechs', comparedTags(s.candidate_mechanisms, lat.causal_mechanisms, 'pink'));
480
+ setHTML('gt-true-mechs', tagsHTML(lat.causal_mechanisms, 'pink'));
481
+ const hits = (s.discovered_markers||[]).filter(m => (lat.true_markers||[]).some(t => t.toUpperCase()===m.toUpperCase()));
482
+ $('gt-score').innerHTML = `Marker accuracy: <span style="color:var(--accent)">${hits.length}</span> / ${(lat.true_markers||[]).length} true markers recovered`;
483
+ } else { $('card-gt-comparison').style.display = 'none'; }
484
+
485
+ if (!lat) return;
486
+ const pops = lat.cell_populations || [];
487
+ $('populations').innerHTML = pops.map((p,i) => { const c = POP_COLORS[i%POP_COLORS.length]; const w = (p.proportion*100).toFixed(1); return `<div class="pop-bar-container"><div class="pop-bar-label"><span>${esc(p.name)} <span style="color:var(--text-dim);font-size:11px">${p.state}</span></span><span class="mono" style="font-size:12px">${w}%</span></div><div class="pop-bar"><div class="pop-bar-fill" style="width:${w}%;background:${c}"></div></div><div class="tag-list" style="margin-top:3px">${p.marker_genes.map(g=>`<span class="tag" style="font-size:11px">${esc(g)}</span>`).join('')}</div></div>`; }).join('') || '<span style="color:var(--text-dim)">—</span>';
488
+
489
+ setHTML('true-markers', tagsHTML(lat.true_markers, 'green'));
490
+ setHTML('true-mechanisms', tagsHTML(lat.causal_mechanisms, 'pink'));
491
+ const pw = lat.true_pathways || {};
492
+ $('pathways-table').querySelector('tbody').innerHTML = miniRows(Object.fromEntries(Object.entries(pw).slice(0,10).map(([k,v])=>[k,v.toFixed(3)])));
493
+ $('technical-table').querySelector('tbody').innerHTML = miniRows(lat.technical || {});
494
+ setHTML('failure-conditions', tagsHTML(lat.hidden_failure_conditions, 'red'));
495
+ const prog = lat.progress || {};
496
+ const bK = Object.entries(prog).filter(([,v])=>typeof v==='boolean'), nK = Object.entries(prog).filter(([,v])=>typeof v!=='boolean');
497
+ $('progress-grid').innerHTML = bK.map(([k,v])=>`<div class="progress-item"><div class="dot ${v?'done':''}"></div>${k.replace(/_/g,' ')}</div>`).join('') + nK.map(([k,v])=>`<div class="progress-item" style="color:var(--accent)"><span class="mono" style="font-size:11px;margin-right:4px">${v??'—'}</span>${k.replace(/_/g,' ')}</div>`).join('');
498
+
499
+ if (s.episode_done && !reportShownForTimestamp && s.timestamp) { reportShownForTimestamp = s.timestamp; setTimeout(showReport, 800); }
500
+ }
501
+
502
+ let reportShownForTimestamp = null;
503
+
504
+ async function loadScenarios() {
505
+ try {
506
+ const res = await fetch('/api/scenarios');
507
+ const data = await res.json();
508
+ const sel = $('f-scenario');
509
+ (data.scenarios || []).forEach(n => { const o = document.createElement('option'); o.value = n; o.textContent = n; sel.appendChild(o); });
510
+ } catch(e) {}
511
+ }
512
+
513
+ async function poll() {
514
+ try { const res = await fetch('/api/state',{cache:'no-store'}); const data = await res.json(); renderState(data); } catch(e) {}
515
+ setTimeout(poll, POLL_MS);
516
+ }
517
+
518
+ loadScenarios();
519
+ poll();
520
+ </script>
521
+ </body>
522
+ </html>
dashboard.py ADDED
@@ -0,0 +1,129 @@
+ """Lightweight dashboard server for the bio-experiment agent.
+
+ No external dependencies — uses only the Python standard library.
+
+ Usage:
+ python dashboard.py # serves on http://localhost:8050
+ python dashboard.py --port 9000
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ from http.server import HTTPServer, SimpleHTTPRequestHandler
+ from pathlib import Path
+
+ ROOT = Path(__file__).parent
+ STATE_FILE = ROOT / "_dashboard_state.json"
+ CMD_FILE = ROOT / "_dashboard_cmd.json"
+ DASHBOARD_HTML = ROOT / "dashboard.html"
+
+
+ class DashboardHandler(SimpleHTTPRequestHandler):
+ def do_GET(self):
+ if self.path == "/" or self.path == "/index.html":
+ self._serve_file(DASHBOARD_HTML, "text/html")
+ elif self.path == "/api/state":
+ self._serve_state()
+ elif self.path == "/api/scenarios":
+ self._serve_scenarios()
+ else:
+ self.send_error(404)
+
+ def do_POST(self):
+ if self.path == "/api/restart":
+ self._handle_command({"action": "restart"})
+ elif self.path == "/api/run":
+ body = self._read_body()
+ if body is None:
+ return
+ body["action"] = "restart"
+ self._handle_command(body)
+ else:
+ self.send_error(404)
+
+ def do_OPTIONS(self):
+ self.send_response(204)
+ self.send_header("Access-Control-Allow-Origin", "*")
+ self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
+ self.send_header("Access-Control-Allow-Headers", "Content-Type")
+ self.end_headers()
+
+ def _read_body(self):
+ length = int(self.headers.get("Content-Length", 0))
+ if length == 0:
+ return {}
+ raw = self.rfile.read(length)
+ try:
+ return json.loads(raw)
+ except json.JSONDecodeError:
+ self._json_response(400, {"error": "Invalid JSON"})
+ return None
+
+ def _handle_command(self, cmd: dict):
+ CMD_FILE.write_text(json.dumps(cmd), encoding="utf-8")
+ self._json_response(200, {"ok": True, "command": cmd.get("action")})
+
+ def _serve_state(self):
+ self.send_response(200)
+ self.send_header("Content-Type", "application/json")
+ self.send_header("Access-Control-Allow-Origin", "*")
+ self.send_header("Cache-Control", "no-cache")
+ self.end_headers()
+ try:
+ data = STATE_FILE.read_bytes()
+ except FileNotFoundError:
+ data = b'{"error": "No state file yet. Run run_agent.py to start an episode."}'
+ self.wfile.write(data)
+
+ def _serve_scenarios(self):
+ try:
+ from server.tasks.scenarios import SCENARIO_LIBRARY
+ names = [s.name for s in SCENARIO_LIBRARY]
+ except Exception:
+ names = []
+ self._json_response(200, {"scenarios": names})
+
+ def _serve_file(self, path: Path, content_type: str):
+ try:
+ body = path.read_bytes()
+ except FileNotFoundError:
+ self.send_error(404, f"{path.name} not found")
+ return
+ self.send_response(200)
+ self.send_header("Content-Type", content_type)
+ self.send_header("Content-Length", str(len(body)))
+ self.end_headers()
+ self.wfile.write(body)
+
+ def _json_response(self, code: int, obj: dict):
+ body = json.dumps(obj).encode()
+ self.send_response(code)
+ self.send_header("Content-Type", "application/json")
+ self.send_header("Access-Control-Allow-Origin", "*")
+ self.send_header("Content-Length", str(len(body)))
+ self.end_headers()
+ self.wfile.write(body)
+
+ def log_message(self, format, *args):
+ pass
+
+
+ def main():
+ parser = argparse.ArgumentParser(description="Bio-experiment dashboard server")
+ parser.add_argument("--port", type=int, default=8050)
+ args = parser.parse_args()
+
+ server = HTTPServer(("0.0.0.0", args.port), DashboardHandler)
+ print(f"Dashboard running at http://localhost:{args.port}")
+ print("Waiting for agent state from run_agent.py ...")
+ try:
+ server.serve_forever()
+ except KeyboardInterrupt:
+ print("\nShutting down.")
+ server.server_close()
+
+
+ if __name__ == "__main__":
+ main()
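Note that `dashboard.py` never talks to the agent process directly: the server and the agent coordinate only through the two JSON files `_dashboard_state.json` and `_dashboard_cmd.json` next to the module. A minimal sketch of that handshake is below; it uses temporary paths instead of the real `ROOT`, and the agent-side polling loop in `run_agent.py` is not part of this diff, so its behavior here is an assumption.

```python
# Sketch of the file-based handshake dashboard.py relies on. Paths are
# temporary stand-ins; the real module uses files next to dashboard.py.
import json
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
state_file = root / "_dashboard_state.json"
cmd_file = root / "_dashboard_cmd.json"

# Agent side: publish a state snapshot each step; /api/state serves it verbatim.
state_file.write_text(json.dumps({"step": 3, "episode_done": False}), encoding="utf-8")

# Dashboard side: a POST to /api/run writes a command file for the agent
# (the JS client sends scenario_name and optional ground_truth overrides).
cmd_file.write_text(json.dumps({"action": "restart", "scenario_name": "cardiac_disease_de"}), encoding="utf-8")

# Agent side (assumed): pick up the command, act on it, then clear it.
cmd = json.loads(cmd_file.read_text(encoding="utf-8"))
cmd_file.unlink()
print(cmd["action"])  # restart
```

The file-based design keeps the dashboard dependency-free and crash-isolated: either process can restart without the other noticing anything worse than a stale snapshot.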
demo.html ADDED
@@ -0,0 +1,1639 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+ <meta charset="UTF-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <title>BioEnv</title>
+ <style>
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500;600&display=swap');
+
+ :root {
+ --bg: #07090d;
+ --bg-surface: #0c0f16;
+ --bg-raised: #111827;
+ --bg-hover: #1a2235;
+ --border: #1e293b;
+ --border-active: #334155;
+ --text: #e2e8f0;
+ --text-dim: #94a3b8;
+ --text-muted: #475569;
+ --accent: #38bdf8;
+ --accent-dim: rgba(56,189,248,0.12);
+ --green: #34d399;
+ --green-dim: rgba(52,211,153,0.10);
+ --amber: #fbbf24;
+ --amber-dim: rgba(251,191,36,0.10);
+ --red: #f87171;
+ --red-dim: rgba(248,113,113,0.10);
+ --cyan: #22d3ee;
+ --cyan-dim: rgba(34,211,238,0.10);
+ --pink: #f472b6;
+ --purple: #a78bfa;
+ }
+
+ * { margin: 0; padding: 0; box-sizing: border-box; }
+ html, body { height: 100%; overflow: hidden; }
+
+ body {
+ font-family: 'Inter', -apple-system, sans-serif;
+ background: var(--bg);
+ color: var(--text);
+ display: flex;
+ flex-direction: column;
+ }
+
+ /* ---- Top Bar ---- */
+ .topbar {
+ height: 48px;
+ min-height: 48px;
+ background: var(--bg-surface);
+ border-bottom: 1px solid var(--border);
+ display: flex;
+ align-items: center;
+ padding: 0 20px;
+ gap: 16px;
+ z-index: 10;
+ }
+ .topbar-logo {
+ font-size: 15px;
+ font-weight: 800;
+ letter-spacing: -0.5px;
+ background: linear-gradient(135deg, #38bdf8, #22d3ee);
+ -webkit-background-clip: text;
+ -webkit-text-fill-color: transparent;
+ }
+ .topbar-sep { width: 1px; height: 20px; background: var(--border); }
+ .topbar-env {
+ font-size: 12px;
+ color: var(--text-dim);
+ font-family: 'JetBrains Mono', monospace;
+ }
+ .topbar-status {
+ display: flex;
+ align-items: center;
+ gap: 6px;
+ margin-left: auto;
+ font-size: 12px;
+ color: var(--text-dim);
+ }
+ .status-dot {
+ width: 7px; height: 7px;
+ border-radius: 50%;
+ background: var(--text-muted);
+ }
+ .status-dot.live {
+ background: var(--green);
+ box-shadow: 0 0 8px var(--green);
+ animation: pulse 2s infinite;
+ }
+ @keyframes pulse {
+ 0%, 100% { opacity: 1; }
+ 50% { opacity: 0.5; }
+ }
+ .topbar-btn {
+ font-size: 12px;
+ font-weight: 600;
+ padding: 6px 14px;
+ border-radius: 6px;
+ border: none;
+ cursor: pointer;
+ transition: all 0.15s;
+ font-family: inherit;
+ }
+ .btn-primary { background: var(--accent); color: #07090d; font-weight: 700; }
+ .btn-primary:hover { background: #7dd3fc; }
+ .btn-primary:disabled { opacity: 0.4; cursor: not-allowed; }
+ .btn-ghost {
+ background: transparent;
+ color: var(--text-dim);
+ border: 1px solid var(--border);
+ }
+ .btn-ghost:hover { background: var(--bg-hover); color: var(--text); }
+
+ /* ---- Main Layout ---- */
+ .main {
+ flex: 1;
+ display: grid;
+ grid-template-columns: 260px 1fr 340px;
+ overflow: hidden;
+ }
+
+ /* ---- Left Sidebar ---- */
+ .sidebar {
+ background: var(--bg-surface);
+ border-right: 1px solid var(--border);
+ display: flex;
+ flex-direction: column;
+ overflow-y: auto;
+ }
+ .sidebar-section {
+ padding: 16px;
+ border-bottom: 1px solid var(--border);
+ }
+ .sidebar-heading {
+ font-size: 10px;
+ font-weight: 600;
+ text-transform: uppercase;
+ letter-spacing: 1.5px;
+ color: var(--text-muted);
+ margin-bottom: 10px;
+ }
+ .scenario-list { display: flex; flex-direction: column; gap: 4px; }
+ .scenario-opt {
+ display: flex;
+ align-items: center;
+ gap: 10px;
+ padding: 8px 10px;
+ border-radius: 6px;
+ cursor: pointer;
+ transition: all 0.15s;
+ border: 1px solid transparent;
+ }
+ .scenario-opt:hover { background: var(--bg-hover); }
+ .scenario-opt.active {
+ background: var(--accent-dim);
+ border-color: rgba(56,189,248,0.2);
+ }
+ .scenario-opt .sc-dot { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; }
+ .scenario-opt .sc-name {
+ font-size: 12px; font-weight: 500; flex: 1;
+ white-space: nowrap; overflow: hidden; text-overflow: ellipsis;
+ }
+ .scenario-opt .sc-diff {
+ font-size: 10px; font-weight: 600;
+ text-transform: uppercase; letter-spacing: 0.5px;
+ }
+ .gauge { margin-bottom: 14px; }
+ .gauge:last-child { margin-bottom: 0; }
+ .gauge-header {
+ display: flex; justify-content: space-between;
+ align-items: baseline; margin-bottom: 6px;
+ }
+ .gauge-label { font-size: 12px; color: var(--text-dim); font-weight: 500; }
+ .gauge-value {
+ font-size: 12px; font-weight: 600;
+ font-family: 'JetBrains Mono', monospace;
+ }
+ .gauge-track {
+ height: 4px; background: var(--bg-hover);
+ border-radius: 4px; overflow: hidden;
+ }
+ .gauge-fill {
+ height: 100%; border-radius: 4px;
+ transition: width 0.8s cubic-bezier(0.4,0,0.2,1);
+ }
+ .pipeline-steps { display: flex; flex-direction: column; gap: 2px; }
+ .pipe-step {
+ display: flex; align-items: center; gap: 8px;
+ padding: 5px 8px; border-radius: 4px;
+ font-size: 11px; font-family: 'JetBrains Mono', monospace;
+ color: var(--text-muted);
+ opacity: 0; transform: translateX(-8px);
+ transition: all 0.3s ease;
+ }
+ .pipe-step.visible { opacity: 1; transform: translateX(0); }
+ .pipe-step.active { color: var(--text); background: var(--accent-dim); }
+ .pipe-step.done { color: var(--text-dim); }
+ .pipe-step .step-icon {
+ width: 16px; height: 16px; border-radius: 50%;
+ border: 1.5px solid var(--text-muted);
+ display: flex; align-items: center; justify-content: center;
+ font-size: 8px; flex-shrink: 0; transition: all 0.3s;
+ }
+ .pipe-step.done .step-icon {
+ background: var(--green-dim); border-color: var(--green); color: var(--green);
+ }
+ .pipe-step.active .step-icon {
+ border-color: var(--accent); background: var(--accent-dim);
+ color: var(--accent); animation: pulse 1.5s infinite;
+ }
+
+ /* ---- Center: Lab + Terminal ---- */
+ .center {
+ display: flex;
+ flex-direction: column;
+ overflow: hidden;
+ background: var(--bg);
+ }
+
+ /* Lab canvas */
+ .lab-panel {
+ height: 300px;
+ min-height: 300px;
+ background: var(--bg-surface);
+ border-bottom: 1px solid var(--border);
+ position: relative;
+ overflow: hidden;
+ }
+ .lab-panel canvas {
+ display: block;
+ width: 100%;
+ height: 100%;
+ }
+ .lab-label {
+ position: absolute;
+ top: 8px;
+ left: 12px;
+ font-size: 10px;
+ font-weight: 600;
+ text-transform: uppercase;
+ letter-spacing: 1.5px;
+ color: var(--text-muted);
+ z-index: 2;
+ pointer-events: none;
+ }
+ .lab-action-label {
+ position: absolute;
+ bottom: 10px;
+ left: 50%;
+ transform: translateX(-50%);
+ font-size: 11px;
+ font-family: 'JetBrains Mono', monospace;
+ color: var(--text-dim);
+ background: rgba(12,15,22,0.85);
+ padding: 4px 14px;
+ border-radius: 100px;
+ border: 1px solid var(--border);
+ z-index: 2;
+ pointer-events: none;
+ opacity: 0;
+ transition: opacity 0.3s;
+ }
+ .lab-action-label.visible { opacity: 1; }
+
+ .center-header {
+ height: 36px;
+ min-height: 36px;
+ display: flex;
+ align-items: center;
+ padding: 0 16px;
+ background: var(--bg-surface);
+ border-bottom: 1px solid var(--border);
+ gap: 8px;
+ }
+ .tab {
+ font-size: 11px; font-weight: 500;
+ padding: 4px 12px; border-radius: 4px;
+ color: var(--text-dim); cursor: pointer;
+ transition: all 0.15s;
+ }
+ .tab.active { color: var(--text); background: var(--bg-hover); }
+ .tab:hover { color: var(--text); }
+
+ .terminal {
+ flex: 1;
+ overflow-y: auto;
+ padding: 16px 20px;
+ font-family: 'JetBrains Mono', monospace;
+ font-size: 12.5px;
+ line-height: 1.9;
+ scrollbar-width: thin;
+ scrollbar-color: var(--border) transparent;
+ }
+ .terminal::-webkit-scrollbar { width: 6px; }
+ .terminal::-webkit-scrollbar-track { background: transparent; }
+ .terminal::-webkit-scrollbar-thumb { background: var(--border); border-radius: 3px; }
+
+ .t-line {
+ white-space: pre-wrap;
+ opacity: 0;
+ animation: lineIn 0.25s ease forwards;
+ }
+ @keyframes lineIn {
+ from { opacity: 0; transform: translateY(4px); }
+ to { opacity: 1; transform: translateY(0); }
+ }
+ .t-prompt { color: var(--green); }
+ .t-cmd { color: var(--text); }
+ .t-dim { color: var(--text-muted); }
+ .t-label { color: var(--accent); }
+ .t-str { color: var(--amber); }
+ .t-kw { color: var(--pink); }
+ .t-fn { color: var(--cyan); }
+ .t-num { color: var(--purple); }
+ .t-ok { color: var(--green); }
+ .t-warn { color: var(--amber); }
+ .t-err { color: var(--red); }
+ .t-sub { color: var(--text-dim); }
+
+ /* ---- Right Panel ---- */
+ .right {
+ background: var(--bg-surface);
+ border-left: 1px solid var(--border);
+ display: flex;
+ flex-direction: column;
+ overflow-y: auto;
+ scrollbar-width: thin;
+ scrollbar-color: var(--border) transparent;
+ }
+ .panel-section {
+ padding: 16px;
+ border-bottom: 1px solid var(--border);
+ }
+ .panel-heading {
+ font-size: 10px; font-weight: 600;
+ text-transform: uppercase; letter-spacing: 1.5px;
+ color: var(--text-muted); margin-bottom: 12px;
+ display: flex; align-items: center; justify-content: space-between;
+ }
+ .reward-row {
+ display: flex; align-items: center; gap: 10px; margin-bottom: 8px;
+ }
+ .reward-row:last-child { margin-bottom: 0; }
+ .rw-label {
+ font-size: 11px; font-weight: 500; width: 80px;
+ color: var(--text-dim); text-align: right;
+ }
+ .rw-track {
+ flex: 1; height: 18px;
+ background: rgba(255,255,255,0.03);
+ border-radius: 4px; overflow: hidden; position: relative;
+ }
+ .rw-fill {
+ height: 100%; border-radius: 4px; width: 0%;
+ transition: width 0.6s cubic-bezier(0.4,0,0.2,1);
+ display: flex; align-items: center; justify-content: flex-end;
+ padding-right: 6px; font-size: 10px; font-weight: 600;
+ font-family: 'JetBrains Mono', monospace;
+ color: rgba(255,255,255,0.85); min-width: fit-content;
+ }
+ .rw-fill.validity { background: linear-gradient(90deg, rgba(52,211,153,0.5), rgba(52,211,153,0.85)); }
+ .rw-fill.ordering { background: linear-gradient(90deg, rgba(34,211,238,0.5), rgba(34,211,238,0.85)); }
+ .rw-fill.info_gain { background: linear-gradient(90deg, rgba(56,189,248,0.5), rgba(56,189,248,0.85)); }
+ .rw-fill.efficiency { background: linear-gradient(90deg, rgba(251,191,36,0.5), rgba(251,191,36,0.85)); }
+ .rw-fill.novelty { background: linear-gradient(90deg, rgba(167,139,250,0.5), rgba(167,139,250,0.85)); }
+ .rw-fill.penalty { background: linear-gradient(90deg, rgba(248,113,113,0.5), rgba(248,113,113,0.85)); }
+ .cumulative-row {
+ display: flex; align-items: baseline; justify-content: space-between;
+ margin-top: 12px; padding-top: 12px; border-top: 1px solid var(--border);
+ }
+ .cum-label { font-size: 11px; color: var(--text-dim); }
+ .cum-value {
+ font-size: 20px; font-weight: 700;
+ font-family: 'JetBrains Mono', monospace; color: var(--green);
+ }
+ .discovery-list { display: flex; flex-direction: column; gap: 6px; }
+ .discovery {
+ display: flex; align-items: flex-start; gap: 8px;
+ padding: 8px 10px; background: var(--bg-raised);
+ border-radius: 6px; border: 1px solid var(--border);
+ opacity: 0; transform: scale(0.95); transition: all 0.3s ease;
+ }
+ .discovery.visible { opacity: 1; transform: scale(1); }
+ .disc-icon {
+ width: 20px; height: 20px; border-radius: 4px;
+ display: flex; align-items: center; justify-content: center;
+ font-size: 10px; flex-shrink: 0; margin-top: 1px;
+ }
+ .disc-body { flex: 1; }
+ .disc-title { font-size: 11px; font-weight: 600; }
+ .disc-detail {
+ font-size: 10px; color: var(--text-dim); margin-top: 2px;
+ font-family: 'JetBrains Mono', monospace;
+ }
+ .empty-state {
+ font-size: 11px; color: var(--text-muted);
+ font-style: italic; padding: 8px 0;
+ }
+ .step-reward-mini {
+ display: flex; align-items: center; justify-content: space-between;
+ padding: 6px 10px; background: var(--bg-raised);
+ border-radius: 6px; margin-bottom: 4px;
+ font-size: 11px; font-family: 'JetBrains Mono', monospace;
+ opacity: 0; transition: all 0.3s;
+ }
+ .step-reward-mini.visible { opacity: 1; }
+ .step-reward-mini .srm-name { color: var(--text-dim); }
+ .step-reward-mini .srm-val { font-weight: 600; }
+ .step-reward-mini .srm-val.pos { color: var(--green); }
+ .step-reward-mini .srm-val.neg { color: var(--red); }
+ </style>
+ </head>
+ <body>
+
+ <!-- Top Bar -->
+ <div class="topbar">
+ <div class="topbar-logo">BioEnv</div>
+ <div class="topbar-sep"></div>
+ <div class="topbar-env">biomarker_validation_lung</div>
+ <div class="topbar-status">
+ <div class="status-dot" id="statusDot"></div>
+ <span id="statusText">Ready</span>
+ </div>
+ <button class="topbar-btn btn-ghost" id="resetBtn" onclick="resetDemo()">Reset</button>
424
+ <button class="topbar-btn btn-primary" id="runBtn" onclick="startDemo()">Run Episode</button>
425
+ </div>
426
+
427
+ <div class="main">
428
+ <!-- Left Sidebar -->
429
+ <div class="sidebar">
430
+ <div class="sidebar-section">
431
+ <div class="sidebar-heading">Scenario</div>
432
+ <div class="scenario-list">
433
+ <div class="scenario-opt" onclick="selectScenario(this)">
434
+ <div class="sc-dot" style="background: var(--green);"></div>
435
+ <span class="sc-name">Cardiac Disease DE</span>
436
+ <span class="sc-diff" style="color: var(--green);">Easy</span>
437
+ </div>
438
+ <div class="scenario-opt" onclick="selectScenario(this)">
439
+ <div class="sc-dot" style="background: var(--amber);"></div>
440
+ <span class="sc-name">Hematopoiesis Trajectory</span>
441
+ <span class="sc-diff" style="color: var(--amber);">Med</span>
442
+ </div>
443
+ <div class="scenario-opt" onclick="selectScenario(this)">
444
+ <div class="sc-dot" style="background: var(--amber);"></div>
445
+ <span class="sc-name">Perturbation Immune</span>
446
+ <span class="sc-diff" style="color: var(--amber);">Med</span>
447
+ </div>
448
+ <div class="scenario-opt active" onclick="selectScenario(this)">
449
+ <div class="sc-dot" style="background: var(--red);"></div>
450
+ <span class="sc-name">Biomarker Validation (Lung)</span>
451
+ <span class="sc-diff" style="color: var(--red);">Hard</span>
452
+ </div>
453
+ </div>
454
+ </div>
455
+ <div class="sidebar-section">
456
+ <div class="sidebar-heading">Environment State</div>
457
+ <div class="gauge">
458
+ <div class="gauge-header">
459
+ <span class="gauge-label">Budget</span>
460
+ <span class="gauge-value" id="budgetVal">$100,000</span>
461
+ </div>
462
+ <div class="gauge-track"><div class="gauge-fill" id="budgetFill" style="width:100%;background:var(--green);"></div></div>
463
+ </div>
464
+ <div class="gauge">
465
+ <div class="gauge-header">
466
+ <span class="gauge-label">Time</span>
467
+ <span class="gauge-value" id="timeVal">180 / 180 days</span>
468
+ </div>
469
+ <div class="gauge-track"><div class="gauge-fill" id="timeFill" style="width:100%;background:var(--cyan);"></div></div>
470
+ </div>
471
+ <div class="gauge">
472
+ <div class="gauge-header">
473
+ <span class="gauge-label">Steps</span>
474
+ <span class="gauge-value" id="stepVal">0 / 30</span>
475
+ </div>
476
+ <div class="gauge-track"><div class="gauge-fill" id="stepFill" style="width:0%;background:var(--accent);"></div></div>
477
+ </div>
478
+ </div>
479
+ <div class="sidebar-section" style="flex:1;overflow-y:auto;">
480
+ <div class="sidebar-heading">Pipeline</div>
481
+ <div class="pipeline-steps" id="pipelineSteps"></div>
482
+ </div>
483
+ </div>
484
+
485
+ <!-- Center: Lab + Terminal -->
486
+ <div class="center">
487
+ <div class="lab-panel">
488
+ <div class="lab-label">Virtual Lab</div>
489
+ <div class="lab-action-label" id="labActionLabel"></div>
490
+ <canvas id="labCanvas"></canvas>
491
+ </div>
492
+ <div class="center-header">
493
+ <div class="tab active">Agent Log</div>
494
+ <div class="tab">Raw JSON</div>
495
+ </div>
496
+ <div class="terminal" id="terminal"></div>
497
+ </div>
498
+
499
+ <!-- Right Panel -->
500
+ <div class="right">
501
+ <div class="panel-section">
502
+ <div class="panel-heading">
503
+ Step Reward
504
+ <span id="stepRewardLabel" style="font-family:'JetBrains Mono',monospace;font-size:11px;color:var(--text-dim);">--</span>
505
+ </div>
506
+ <div id="rewardBars">
507
+ <div class="reward-row"><span class="rw-label">Validity</span><div class="rw-track"><div class="rw-fill validity" id="rw-validity"></div></div></div>
508
+ <div class="reward-row"><span class="rw-label">Ordering</span><div class="rw-track"><div class="rw-fill ordering" id="rw-ordering"></div></div></div>
509
+ <div class="reward-row"><span class="rw-label">Info Gain</span><div class="rw-track"><div class="rw-fill info_gain" id="rw-info_gain"></div></div></div>
510
+ <div class="reward-row"><span class="rw-label">Efficiency</span><div class="rw-track"><div class="rw-fill efficiency" id="rw-efficiency"></div></div></div>
511
+ <div class="reward-row"><span class="rw-label">Novelty</span><div class="rw-track"><div class="rw-fill novelty" id="rw-novelty"></div></div></div>
512
+ <div class="reward-row"><span class="rw-label">Penalty</span><div class="rw-track"><div class="rw-fill penalty" id="rw-penalty"></div></div></div>
513
+ </div>
514
+ <div class="cumulative-row">
515
+ <span class="cum-label">Cumulative Reward</span>
516
+ <span class="cum-value" id="cumReward">0.00</span>
517
+ </div>
518
+ </div>
519
+ <div class="panel-section">
520
+ <div class="panel-heading">Reward History</div>
521
+ <div id="rewardHistory"><div class="empty-state">No steps yet</div></div>
522
+ </div>
523
+ <div class="panel-section">
524
+ <div class="panel-heading">Discoveries</div>
525
+ <div class="discovery-list" id="discoveries"><div class="empty-state">No discoveries yet</div></div>
526
+ </div>
527
+ <div class="panel-section">
528
+ <div class="panel-heading">Violations</div>
529
+ <div id="violations"><div class="empty-state">No violations</div></div>
530
+ </div>
531
+ </div>
532
+ </div>
533
+
534
+ <script>
535
+ // =====================================================
536
+ // VIRTUAL LAB - Canvas rendering
537
+ // =====================================================
538
+ const labCanvas = document.getElementById('labCanvas');
539
+ const ctx = labCanvas.getContext('2d');
540
+ let labW, labH, dpr;
541
+
542
+ function resizeLab() {
543
+ const rect = labCanvas.parentElement.getBoundingClientRect();
544
+ dpr = window.devicePixelRatio || 1;
545
+ labW = rect.width;
546
+ labH = rect.height;
547
+ labCanvas.width = labW * dpr;
548
+ labCanvas.height = labH * dpr;
549
+ ctx.setTransform(dpr, 0, 0, dpr, 0, 0);
550
+ }
551
+ resizeLab();
552
+ window.addEventListener('resize', () => { resizeLab(); });
553
+
554
+ // Lab stations (positions as fractions of canvas, converted in draw)
555
+ const STATIONS = {
556
+ idle: { fx: 0.06, fy: 0.55, label: 'ENTRANCE', icon: 'door', color: '#475569' },
557
+ sample: { fx: 0.20, fy: 0.35, label: 'SAMPLE BENCH', icon: 'bench', color: '#34d399' },
558
+ cohort: { fx: 0.20, fy: 0.75, label: 'COHORT SELECT', icon: 'people', color: '#34d399' },
559
+ prep: { fx: 0.38, fy: 0.35, label: 'LIBRARY PREP', icon: 'flask', color: '#2dd4bf' },
560
+ sequencer: { fx: 0.38, fy: 0.75, label: 'SEQUENCER', icon: 'machine', color: '#22d3ee' },
561
+ computer: { fx: 0.62, fy: 0.50, label: 'COMPUTE', icon: 'screen', color: '#38bdf8' },
562
+ whiteboard: { fx: 0.84, fy: 0.45, label: 'SYNTHESIS', icon: 'board', color: '#a78bfa' },
563
+ };
564
+
565
+ // Map actions to stations
566
+ const ACTION_STATION = {
567
+ collect_sample: 'sample',
568
+ select_cohort: 'cohort',
569
+ prepare_library: 'prep',
570
+ sequence_cells: 'sequencer',
571
+ run_qc: 'computer',
572
+ normalize_data: 'computer',
573
+ cluster_cells: 'computer',
574
+ differential_expression: 'computer',
575
+ pathway_enrichment: 'computer',
576
+ marker_selection: 'computer',
577
+ validate_marker: 'computer',
578
+ synthesize_conclusion: 'whiteboard',
579
+ };
580
+
581
+ // Agent state
582
+ let agent = { x: 0, y: 0, targetX: 0, targetY: 0, station: 'idle', working: false };
583
+ let agentTrail = [];
584
+ let workingTick = 0;
585
+ let terminalLines = []; // fake terminal on computer screen
586
+ let activeStationKey = null;
587
+ let particlesLab = [];
588
+
589
+ function stationPos(key) {
590
+ const s = STATIONS[key];
591
+ return { x: s.fx * labW, y: s.fy * labH };
592
+ }
593
+
594
+ function initAgent() {
595
+ const p = stationPos('idle');
596
+ agent.x = p.x; agent.y = p.y;
597
+ agent.targetX = p.x; agent.targetY = p.y;
598
+ agent.station = 'idle';
599
+ agent.working = false;
600
+ agent.facing = 1;
601
+ agentTrail = [];
602
+ terminalLines = [];
603
+ activeStationKey = null;
604
+ particlesLab = [];
605
+ }
606
+ initAgent();
607
+
608
+ function moveAgentTo(stationKey) {
609
+ const p = stationPos(stationKey);
610
+ agent.targetX = p.x;
611
+ agent.targetY = p.y;
612
+ agent.station = stationKey;
613
+ agent.working = false;
614
+ activeStationKey = stationKey;
615
+ }
616
+
617
+ function setAgentWorking(actionName) {
618
+ agent.working = true;
619
+ workingTick = 0;
620
+ // If at computer, set up terminal lines
621
+ if (agent.station === 'computer') {
622
+ terminalLines = [];
623
+ typeComputerLines(actionName);
624
+ }
625
+ }
626
+
627
+ const COMP_COMMANDS = {
628
+ run_qc: ['$ scanpy.pp.filter_cells()', ' filtering 11847 cells...', ' 10234 passed QC', ' doublet rate: 3.2%'],
629
+ normalize_data: ['$ scran.normalize(adata)', ' computing size factors...', ' log1p transform', ' HVGs: 3000 selected'],
630
+ cluster_cells: ['$ sc.tl.leiden(adata, 0.8)', ' building kNN graph...', ' optimizing modularity', ' 14 clusters found'],
631
+ differential_expression: ['$ DESeq2.run(IPF, Ctrl)', ' fitting GLM...', ' 1847 DE genes', ' SPP1 log2FC=3.42 ***'],
632
+ pathway_enrichment: ['$ gseapy.enrich(de_genes)', ' KEGG + Reactome...', ' ECM-receptor p=4.2e-12', ' TGF-beta p=1.8e-09'],
633
+ marker_selection: ['$ rank_markers(candidates)', ' SPP1 AUROC: 0.94', ' MMP7 AUROC: 0.87', ' COL1A1 AUROC: 0.81'],
634
+ validate_marker: ['$ cross_validate("SPP1")', ' fold 1: 0.93', ' fold 2: 0.89', ' mean AUROC: 0.91 OK'],
635
+ };
636
+
637
+ async function typeComputerLines(actionName) {
638
+ const lines = COMP_COMMANDS[actionName] || ['$ processing...', ' computing...', ' done'];
639
+ for (let i = 0; i < lines.length; i++) {
640
+ await wait(250);
641
+ terminalLines.push(lines[i]);
642
+ if (terminalLines.length > 5) terminalLines.shift();
643
+ }
644
+ }
645
+
646
+ // Particles burst
647
+ function spawnParticles(x, y, color, count = 8) {
648
+ for (let i = 0; i < count; i++) {
649
+ const angle = (Math.PI * 2 / count) * i + Math.random() * 0.5;
650
+ particlesLab.push({
651
+ x, y,
652
+ vx: Math.cos(angle) * (1.5 + Math.random() * 2),
653
+ vy: Math.sin(angle) * (1.5 + Math.random() * 2),
654
+ life: 1,
655
+ color,
656
+ size: 2 + Math.random() * 2,
657
+ });
658
+ }
659
+ }
660
+
661
+ // ---- Draw loop ----
662
+ let frameCount = 0;
663
+ const FLOOR_COLOR = '#0f1520';
664
+ const WALL_COLOR = '#1a2332';
665
+ const FLOOR_TILE_A = '#0d1219';
666
+ const FLOOR_TILE_B = '#10161f';
667
+
668
+ function drawLab() {
669
+ frameCount++;
670
+ ctx.clearRect(0, 0, labW, labH);
671
+
672
+ // Floor - checkerboard tiles
673
+ const tileSize = 24;
674
+ for (let ty = 0; ty < labH; ty += tileSize) {
675
+ for (let tx = 0; tx < labW; tx += tileSize) {
676
+ const checker = ((Math.floor(tx / tileSize) + Math.floor(ty / tileSize)) % 2 === 0);
677
+ ctx.fillStyle = checker ? FLOOR_TILE_A : FLOOR_TILE_B;
678
+ ctx.fillRect(tx, ty, tileSize, tileSize);
679
+ }
680
+ }
681
+
682
+ // Walls - top and bottom border
683
+ ctx.fillStyle = WALL_COLOR;
684
+ ctx.fillRect(0, 0, labW, 18);
685
+ ctx.fillRect(0, labH - 8, labW, 8);
686
+ ctx.strokeStyle = '#253040';
687
+ ctx.lineWidth = 1;
688
+ ctx.beginPath(); ctx.moveTo(0, 18); ctx.lineTo(labW, 18); ctx.stroke();
689
+
690
+ // Draw equipment at each station (behind the person)
691
+ for (const [key, s] of Object.entries(STATIONS)) {
692
+ const pos = stationPos(key);
693
+ const isActive = key === activeStationKey;
694
+ drawEquipment(key, pos.x, pos.y, s.color, isActive);
695
+ }
696
+
697
+ // Draw walking path (subtle floor markings)
698
+ ctx.strokeStyle = 'rgba(56,189,248,0.06)';
699
+ ctx.lineWidth = 16;
700
+ ctx.lineCap = 'round';
701
+ ctx.lineJoin = 'round';
702
+ const pathOrder = ['idle','sample','prep','computer','whiteboard'];
703
+ ctx.beginPath();
704
+ const p0 = stationPos(pathOrder[0]);
705
+ ctx.moveTo(p0.x, p0.y + 10);
706
+ for (let i = 1; i < pathOrder.length; i++) {
707
+ const p = stationPos(pathOrder[i]);
708
+ ctx.lineTo(p.x, p.y + 10);
709
+ }
710
+ ctx.stroke();
711
+ // Lower path
712
+ ctx.beginPath();
713
+ const pl0 = stationPos('idle');
714
+ ctx.moveTo(pl0.x, pl0.y + 10);
715
+ const pl1 = stationPos('cohort');
716
+ ctx.lineTo(pl1.x, pl1.y + 10);
717
+ const pl2 = stationPos('sequencer');
718
+ ctx.lineTo(pl2.x, pl2.y + 10);
719
+ const pl3 = stationPos('computer');
720
+ ctx.lineTo(pl3.x, pl3.y + 10);
721
+ ctx.stroke();
722
+ ctx.lineCap = 'butt';
723
+
724
+ // Floating terminal popup at computer
725
+ if (agent.station === 'computer' && agent.working && terminalLines.length > 0) {
726
+ const cp = stationPos('computer');
727
+ const sx = cp.x + 55, sy = cp.y - 65;
728
+ const sw = 170, sh = 95;
729
+
730
+ // Shadow
731
+ ctx.fillStyle = 'rgba(0,0,0,0.4)';
732
+ roundRect(ctx, sx + 3, sy + 3, sw, sh, 6);
733
+ ctx.fill();
734
+
735
+ ctx.fillStyle = 'rgba(7,9,13,0.97)';
736
+ ctx.strokeStyle = 'rgba(56,189,248,0.3)';
737
+ ctx.lineWidth = 1;
738
+ roundRect(ctx, sx, sy, sw, sh, 6);
739
+ ctx.fill(); ctx.stroke();
740
+
741
+ // Title bar
742
+ ctx.fillStyle = 'rgba(30,41,59,0.5)';
743
+ ctx.fillRect(sx + 1, sy + 1, sw - 2, 14);
744
+ ctx.fillStyle = '#475569';
745
+ ctx.font = '500 7px Inter, sans-serif';
746
+ ctx.textAlign = 'left';
747
+ ctx.fillText('terminal', sx + 6, sy + 10);
748
+ // dots
749
+ ctx.fillStyle = '#f87171'; ctx.beginPath(); ctx.arc(sx + sw - 28, sy + 7, 3, 0, Math.PI*2); ctx.fill();
750
+ ctx.fillStyle = '#fbbf24'; ctx.beginPath(); ctx.arc(sx + sw - 18, sy + 7, 3, 0, Math.PI*2); ctx.fill();
751
+ ctx.fillStyle = '#34d399'; ctx.beginPath(); ctx.arc(sx + sw - 8, sy + 7, 3, 0, Math.PI*2); ctx.fill();
752
+
753
+ ctx.font = '500 9px JetBrains Mono, monospace';
754
+ const startY = sy + 28;
755
+ for (let i = 0; i < terminalLines.length; i++) {
756
+ const line = terminalLines[i];
757
+ ctx.fillStyle = line.startsWith('$') ? '#34d399' : line.includes('***') || line.includes('OK') ? '#34d399' : '#94a3b8';
758
+ ctx.fillText(terminalLines[i].substring(0, 24), sx + 8, startY + i * 14);
759
+ }
760
+ if (frameCount % 60 < 30) {
761
+ ctx.fillStyle = '#34d399';
762
+ ctx.fillRect(sx + 8, startY + terminalLines.length * 14 - 8, 6, 11);
763
+ }
764
+ }
765
+
766
+ // Whiteboard popup
767
+ if (agent.station === 'whiteboard' && agent.working) {
768
+ const wp = stationPos('whiteboard');
769
+ const bx = wp.x - 60, by = wp.y - 75;
770
+ const bw = 120, bh = 72;
771
+ ctx.fillStyle = 'rgba(0,0,0,0.3)';
772
+ roundRect(ctx, bx + 3, by + 3, bw, bh, 6);
773
+ ctx.fill();
774
+ ctx.fillStyle = 'rgba(17,24,39,0.95)';
775
+ ctx.strokeStyle = 'rgba(167,139,250,0.3)';
776
+ ctx.lineWidth = 1;
777
+ roundRect(ctx, bx, by, bw, bh, 6);
778
+ ctx.fill(); ctx.stroke();
779
+ ctx.font = '600 8px JetBrains Mono, monospace';
780
+ ctx.textAlign = 'left';
781
+ ctx.fillStyle = '#a78bfa';
782
+ ctx.fillText('CONCLUSION', bx + 8, by + 14);
783
+ ctx.font = '400 7.5px JetBrains Mono, monospace';
784
+ const synthLines = ['SPP1 validated', 'AUROC = 0.91', 'Confidence: 0.85', 'Match: 4/5'];
785
+ for (let i = 0; i < synthLines.length; i++) {
786
+ ctx.fillStyle = i === 0 ? '#34d399' : '#94a3b8';
787
+ ctx.fillText(synthLines[i], bx + 8, by + 28 + i * 12);
788
+ }
789
+ }
790
+
791
+ // Activity text above active station
792
+ if (agent.working && activeStationKey && activeStationKey !== 'idle') {
793
+ const sp = stationPos(activeStationKey);
794
+ const actTexts = {
795
+ sample: 'collecting tissue...', cohort: 'selecting cohort...',
796
+ prep: 'preparing library...', sequencer: 'sequencing...',
797
+ computer: 'computing...', whiteboard: 'synthesizing...',
798
+ };
799
+ ctx.fillStyle = STATIONS[activeStationKey].color;
800
+ ctx.font = '500 9px JetBrains Mono, monospace';
801
+ ctx.textAlign = 'center';
802
+ ctx.globalAlpha = 0.5 + 0.3 * Math.sin(frameCount * 0.06);
803
+ const yOff = ['sample','prep'].includes(activeStationKey) ? -55 : -50;
804
+ ctx.fillText(actTexts[activeStationKey] || 'working...', sp.x, sp.y + yOff);
805
+ ctx.globalAlpha = 1;
806
+ }
807
+
808
+ // Move agent smoothly
809
+ const dx = agent.targetX - agent.x;
810
+ const dy = agent.targetY - agent.y;
811
+ const dist = Math.sqrt(dx * dx + dy * dy);
812
+ const isWalking = dist > 2;
813
+ if (isWalking) {
814
+ const speed = 0.05;
815
+ agent.x += dx * speed;
816
+ agent.y += dy * speed;
817
+ agent.facing = dx > 0 ? 1 : dx < -0.5 ? -1 : agent.facing;
818
+ }
819
+
820
+ // Draw person
821
+ drawPerson(agent.x, agent.y, isWalking, agent.working, agent.facing || 1);
822
+
823
+ // Particles
824
+ for (let i = particlesLab.length - 1; i >= 0; i--) {
825
+ const p = particlesLab[i];
826
+ p.x += p.vx; p.y += p.vy;
827
+ p.vx *= 0.95; p.vy *= 0.95;
828
+ p.life -= 0.02;
829
+ if (p.life <= 0) { particlesLab.splice(i, 1); continue; }
830
+ ctx.globalAlpha = p.life * 0.6;
831
+ ctx.fillStyle = p.color;
832
+ ctx.beginPath();
833
+ ctx.arc(p.x, p.y, p.size * p.life, 0, Math.PI * 2);
834
+ ctx.fill();
835
+ }
836
+ ctx.globalAlpha = 1;
837
+
838
+ // Station labels
839
+ for (const [key, s] of Object.entries(STATIONS)) {
840
+ if (key === 'idle') continue;
841
+ const pos = stationPos(key);
842
+ const isActive = key === activeStationKey;
843
+ ctx.fillStyle = isActive ? s.color : '#334155';
844
+ ctx.font = `600 ${isActive ? 9 : 8}px Inter, sans-serif`;
845
+ ctx.textAlign = 'center';
846
+ const ly = key === 'cohort' || key === 'sequencer' ? pos.y + 45 : pos.y + 42;
847
+ ctx.fillText(s.label, pos.x, ly);
848
+ }
849
+
850
+ requestAnimationFrame(drawLab);
851
+ }
852
+
853
+ // ---- Draw person (lab coat researcher) ----
854
+ function drawPerson(x, y, walking, working, facing) {
855
+ const f = facing;
856
+ const t = frameCount;
857
+ // Walking cycle
858
+ const walkCycle = walking ? Math.sin(t * 0.15) : 0;
859
+ const bobY = walking ? Math.abs(Math.sin(t * 0.15)) * 2 : 0;
860
+ // Working arm animation
861
+ const workArm = working ? Math.sin(t * 0.08) * 0.3 : 0;
862
+
863
+ const py = y - bobY; // feet position base
864
+
865
+ ctx.save();
866
+ ctx.translate(x, py);
867
+
868
+ // Shadow
869
+ ctx.fillStyle = 'rgba(0,0,0,0.25)';
870
+ ctx.beginPath();
871
+ ctx.ellipse(0, 12, 10, 4, 0, 0, Math.PI * 2);
872
+ ctx.fill();
873
+
874
+ // Legs
875
+ const legSpread = walking ? walkCycle * 5 : 0;
876
+ ctx.strokeStyle = '#1e3a5f';
877
+ ctx.lineWidth = 3;
878
+ ctx.lineCap = 'round';
879
+ // Left leg
880
+ ctx.beginPath();
881
+ ctx.moveTo(-3, 4);
882
+ ctx.lineTo(-3 + legSpread, 12);
883
+ ctx.stroke();
884
+ // Right leg
885
+ ctx.beginPath();
886
+ ctx.moveTo(3, 4);
887
+ ctx.lineTo(3 - legSpread, 12);
888
+ ctx.stroke();
889
+ // Shoes
890
+ ctx.fillStyle = '#1e293b';
891
+ ctx.beginPath(); ctx.arc(-3 + legSpread, 12, 2.5, 0, Math.PI * 2); ctx.fill();
892
+ ctx.beginPath(); ctx.arc(3 - legSpread, 12, 2.5, 0, Math.PI * 2); ctx.fill();
893
+
894
+ // Body / lab coat
895
+ ctx.fillStyle = '#e2e8f0'; // white lab coat
896
+ ctx.beginPath();
897
+ ctx.moveTo(-7, -4);
898
+ ctx.lineTo(-6, 6);
899
+ ctx.lineTo(6, 6);
900
+ ctx.lineTo(7, -4);
901
+ ctx.quadraticCurveTo(7, -10, 0, -10);
902
+ ctx.quadraticCurveTo(-7, -10, -7, -4);
903
+ ctx.fill();
904
+ // Coat outline
905
+ ctx.strokeStyle = '#94a3b8';
906
+ ctx.lineWidth = 0.5;
907
+ ctx.stroke();
908
+ // Coat split at bottom
909
+ ctx.beginPath();
910
+ ctx.moveTo(0, 1);
911
+ ctx.lineTo(0, 6);
912
+ ctx.strokeStyle = '#cbd5e1';
913
+ ctx.lineWidth = 0.5;
914
+ ctx.stroke();
915
+ // Pocket
916
+ ctx.strokeStyle = '#94a3b8';
917
+ ctx.lineWidth = 0.5;
918
+ ctx.strokeRect(f > 0 ? 1 : -5, -1, 4, 3);
919
+
920
+ // Arms
921
+ ctx.strokeStyle = '#e2e8f0';
922
+ ctx.lineWidth = 3.5;
923
+ ctx.lineCap = 'round';
924
+ // Back arm
925
+ const backArmSwing = walking ? -walkCycle * 4 : 0;
926
+ ctx.beginPath();
927
+ ctx.moveTo(-f * 6, -6);
928
+ ctx.lineTo(-f * 6 + backArmSwing, 2);
929
+ ctx.stroke();
930
+ // Front arm (active arm)
931
+ if (working) {
932
+ // Arm reaching forward/up for work
933
+ ctx.beginPath();
934
+ ctx.moveTo(f * 6, -6);
935
+ ctx.lineTo(f * 10 + workArm * 5, -8 + workArm * 3);
936
+ ctx.stroke();
937
+ // Hand/tool
938
+ ctx.fillStyle = '#fde68a';
939
+ ctx.beginPath();
940
+ ctx.arc(f * 10 + workArm * 5, -8 + workArm * 3, 2, 0, Math.PI * 2);
941
+ ctx.fill();
942
+ } else {
943
+ const frontArmSwing = walking ? walkCycle * 4 : 0;
944
+ ctx.beginPath();
945
+ ctx.moveTo(f * 6, -6);
946
+ ctx.lineTo(f * 6 + frontArmSwing, 2);
947
+ ctx.stroke();
948
+ }
949
+ // Skin for hands
950
+ ctx.fillStyle = '#fde68a';
951
+ ctx.beginPath(); ctx.arc(-f * 6 + backArmSwing, 2, 1.8, 0, Math.PI * 2); ctx.fill();
952
+ if (!working) {
953
+ const fs = walking ? walkCycle * 4 : 0;
954
+ ctx.beginPath(); ctx.arc(f * 6 + fs, 2, 1.8, 0, Math.PI * 2); ctx.fill();
955
+ }
956
+
957
+ // Head
958
+ ctx.fillStyle = '#fde68a'; // skin
959
+ ctx.beginPath();
960
+ ctx.arc(0, -15, 7, 0, Math.PI * 2);
961
+ ctx.fill();
962
+ // Hair
963
+ ctx.fillStyle = '#1e293b';
964
+ ctx.beginPath();
965
+ ctx.arc(0, -17, 7, Math.PI, 0);
966
+ ctx.fill();
967
+ // Face details
968
+ ctx.fillStyle = '#1e293b';
969
+ // Eyes
970
+ ctx.beginPath();
971
+ ctx.arc(f * 2.5, -15.5, 1, 0, Math.PI * 2);
972
+ ctx.fill();
973
+ ctx.beginPath();
974
+ ctx.arc(f * -1.5, -15.5, 1, 0, Math.PI * 2);
975
+ ctx.fill();
976
+ // Glasses
977
+ ctx.strokeStyle = '#475569';
978
+ ctx.lineWidth = 0.7;
979
+ ctx.beginPath();
980
+ ctx.arc(f * 2.5, -15.5, 2.5, 0, Math.PI * 2);
981
+ ctx.stroke();
982
+ ctx.beginPath();
983
+ ctx.arc(f * -1.5, -15.5, 2.5, 0, Math.PI * 2);
984
+ ctx.stroke();
985
+ ctx.beginPath();
986
+ ctx.moveTo(f * 0.5, -15.5);
987
+ ctx.lineTo(f * -0.5, -15.5);
988
+ ctx.stroke();
989
+ // Mouth
990
+ if (working) {
991
+ ctx.fillStyle = '#1e293b';
992
+ ctx.beginPath();
993
+ ctx.arc(f * 0.5, -12.5, 1, 0, Math.PI);
994
+ ctx.fill();
995
+ }
996
+
997
+ // ID Badge
998
+ ctx.fillStyle = '#38bdf8';
999
+ ctx.fillRect(f > 0 ? -6 : 2, -3, 4, 5);
1000
+ ctx.fillStyle = '#fff';
1001
+ ctx.font = 'bold 3px Inter, sans-serif';
1002
+ ctx.textAlign = 'center';
1003
+ ctx.fillText('AI', f > 0 ? -4 : 4, 0.5);
1004
+
1005
+ ctx.restore();
1006
+ }
1007
+
1008
+ // ---- Draw lab equipment ----
1009
+ function drawEquipment(stationKey, cx, cy, color, active) {
1010
+ ctx.save();
1011
+
1012
+ switch (stationKey) {
1013
+ case 'idle':
1014
+ // Door frame
1015
+ ctx.strokeStyle = '#334155';
1016
+ ctx.lineWidth = 2;
1017
+ ctx.strokeRect(cx - 12, cy - 30, 24, 40);
1018
+ ctx.fillStyle = '#1a2332';
1019
+ ctx.fillRect(cx - 10, cy - 28, 20, 36);
1020
+ ctx.fillStyle = '#475569';
1021
+ ctx.beginPath(); ctx.arc(cx + 6, cy - 10, 2, 0, Math.PI * 2); ctx.fill();
1022
+ break;
1023
+
1024
+ case 'sample':
1025
+ // Lab bench with sample tubes
1026
+ // Bench surface
1027
+ ctx.fillStyle = '#1a2332';
1028
+ ctx.fillRect(cx - 30, cy - 8, 60, 6);
1029
+ // Bench legs
1030
+ ctx.fillStyle = '#253040';
1031
+ ctx.fillRect(cx - 28, cy - 2, 4, 20);
1032
+ ctx.fillRect(cx + 24, cy - 2, 4, 20);
1033
+ // Tube rack
1034
+ ctx.fillStyle = '#253040';
1035
+ ctx.fillRect(cx - 18, cy - 18, 36, 10);
1036
+ // Test tubes
1037
+ const tubeColors = ['#34d399', '#22d3ee', '#fbbf24', '#f472b6', '#34d399', '#22d3ee'];
1038
+ for (let i = 0; i < 6; i++) {
1039
+ const tx = cx - 14 + i * 6;
1040
+ ctx.fillStyle = active ? tubeColors[i] : '#334155';
1041
+ ctx.globalAlpha = active ? 0.7 : 0.4;
1042
+ ctx.fillRect(tx, cy - 28, 4, 12);
1043
+ // Tube caps
1044
+ ctx.globalAlpha = 1;
1045
+ ctx.fillStyle = active ? tubeColors[i] : '#475569';
1046
+ ctx.fillRect(tx - 0.5, cy - 29, 5, 2);
1047
+ }
1048
+ ctx.globalAlpha = 1;
1049
+ // Pipette if active
1050
+ if (active) {
1051
+ const pipY = cy - 32 + Math.sin(frameCount * 0.08) * 4;
1052
+ ctx.strokeStyle = '#94a3b8';
1053
+ ctx.lineWidth = 2;
1054
+ ctx.beginPath();
1055
+ ctx.moveTo(cx + 5, pipY);
1056
+ ctx.lineTo(cx + 5, pipY - 14);
1057
+ ctx.stroke();
1058
+ ctx.fillStyle = '#64748b';
1059
+ ctx.fillRect(cx + 3, pipY - 18, 5, 6);
1060
+ // Droplet
1061
+ if (frameCount % 60 < 20) {
1062
+ ctx.fillStyle = '#34d399';
1063
+ ctx.globalAlpha = 0.6;
1064
+ ctx.beginPath();
1065
+ ctx.arc(cx + 5, pipY + 3, 1.5, 0, Math.PI * 2);
1066
+ ctx.fill();
1067
+ ctx.globalAlpha = 1;
1068
+ }
1069
+ }
1070
+ break;
1071
+
1072
+ case 'cohort':
1073
+ // Filing cabinet / patient records
1074
+ ctx.fillStyle = '#1a2332';
1075
+ ctx.fillRect(cx - 20, cy - 22, 40, 40);
1076
+ ctx.strokeStyle = '#253040';
1077
+ ctx.lineWidth = 1;
1078
+ for (let i = 0; i < 3; i++) {
1079
+ const dy = cy - 18 + i * 13;
1080
+ ctx.strokeRect(cx - 18, dy, 36, 11);
1081
+ ctx.fillStyle = active ? '#475569' : '#253040';
1082
+ ctx.fillRect(cx - 4, dy + 4, 8, 3);
1083
+ }
1084
+ // Clipboard
1085
+ ctx.fillStyle = '#253040';
1086
+ ctx.fillRect(cx + 24, cy - 16, 14, 20);
1087
+ ctx.strokeStyle = '#475569';
1088
+ ctx.lineWidth = 0.5;
1089
+ for (let i = 0; i < 4; i++) {
1090
+ ctx.beginPath();
1091
+ ctx.moveTo(cx + 27, cy - 12 + i * 4);
1092
+ ctx.lineTo(cx + 35, cy - 12 + i * 4);
1093
+ ctx.stroke();
1094
+ }
1095
+ if (active) {
1096
+ ctx.fillStyle = color;
1097
+ ctx.globalAlpha = 0.5;
1098
+ ctx.beginPath(); ctx.arc(cx + 31, cy - 14, 2, 0, Math.PI * 2); ctx.fill();
1099
+ ctx.globalAlpha = 1;
1100
+ }
1101
+ break;
1102
+
1103
+ case 'prep':
1104
+ // Library prep station - PCR machine + bench
1105
+ // Bench
1106
+ ctx.fillStyle = '#1a2332';
1107
+ ctx.fillRect(cx - 28, cy - 6, 56, 6);
1108
+ ctx.fillStyle = '#253040';
1109
+ ctx.fillRect(cx - 26, cy, 4, 18);
1110
+ ctx.fillRect(cx + 22, cy, 4, 18);
1111
+ // PCR/thermocycler machine
1112
+ ctx.fillStyle = active ? '#192535' : '#172030';
1113
+ ctx.strokeStyle = active ? color : '#253040';
1114
+ ctx.lineWidth = 1;
1115
+ roundRect(ctx, cx - 18, cy - 26, 36, 20, 3);
1116
+ ctx.fill(); ctx.stroke();
1117
+ // Display on machine
1118
+ ctx.fillStyle = active ? 'rgba(45,212,191,0.15)' : 'rgba(30,41,59,0.3)';
1119
+ ctx.fillRect(cx - 14, cy - 22, 16, 8);
1120
+ if (active) {
1121
+ ctx.fillStyle = color;
1122
+ ctx.font = '500 6px JetBrains Mono, monospace';
1123
+ ctx.textAlign = 'left';
1124
+ ctx.fillText('72.0°C', cx - 12, cy - 16);
1125
+ // LED
1126
+ ctx.fillStyle = color;
1127
+ ctx.beginPath(); ctx.arc(cx + 12, cy - 18, 2, 0, Math.PI * 2); ctx.fill();
1128
+ }
1129
+ // Microplate
1130
+ ctx.fillStyle = '#1e293b';
1131
+ ctx.fillRect(cx - 20, cy - 3, 18, 12);
1132
+ ctx.strokeStyle = '#334155';
1133
+ ctx.lineWidth = 0.3;
1134
+ for (let r = 0; r < 3; r++) {
1135
+ for (let c = 0; c < 4; c++) {
1136
+ ctx.beginPath();
1137
+ ctx.arc(cx - 17 + c * 4.5, cy + 1 + r * 3.5, 1.2, 0, Math.PI * 2);
1138
+ ctx.stroke();
1139
+ }
1140
+ }
1141
+ break;
1142
+
1143
+ case 'sequencer':
1144
+ // Big sequencing machine (NovaSeq-like)
1145
+ // Machine body
1146
+ ctx.fillStyle = '#172030';
1147
+ ctx.strokeStyle = active ? color : '#253040';
1148
+ ctx.lineWidth = active ? 1.5 : 1;
1149
+ roundRect(ctx, cx - 24, cy - 28, 48, 44, 4);
1150
+ ctx.fill(); ctx.stroke();
1151
+ // Front panel / screen
1152
+ ctx.fillStyle = active ? 'rgba(34,211,238,0.1)' : 'rgba(30,41,59,0.3)';
1153
+ roundRect(ctx, cx - 18, cy - 22, 36, 18, 2);
1154
+ ctx.fill();
1155
+ if (active) {
1156
+ // Progress bar on screen
1157
+ ctx.fillStyle = 'rgba(34,211,238,0.2)';
1158
+ ctx.fillRect(cx - 14, cy - 12, 28, 4);
1159
+ const progress = (frameCount % 120) / 120;
1160
+ ctx.fillStyle = color;
1161
+ ctx.fillRect(cx - 14, cy - 12, 28 * progress, 4);
1162
+ ctx.fillStyle = color;
1163
+ ctx.font = '500 6px JetBrains Mono, monospace';
1164
+ ctx.textAlign = 'center';
1165
+ ctx.fillText('SEQUENCING', cx, cy - 16);
1166
+ }
1167
+ // Slot
1168
+ ctx.fillStyle = '#0f1520';
1169
+ ctx.fillRect(cx - 10, cy, 20, 4);
1170
+ // Status LEDs
1171
+ ctx.fillStyle = active ? '#34d399' : '#334155';
1172
+ ctx.beginPath(); ctx.arc(cx - 14, cy + 10, 2, 0, Math.PI * 2); ctx.fill();
1173
+ if (active && frameCount % 30 < 15) {
1174
+ ctx.fillStyle = '#fbbf24';
1175
+ } else {
1176
+ ctx.fillStyle = '#334155';
1177
+ }
1178
+ ctx.beginPath(); ctx.arc(cx - 8, cy + 10, 2, 0, Math.PI * 2); ctx.fill();
1179
+ break;
1180
+
1181
+ case 'computer':
1182
+ // Computer desk with dual monitors
1183
+ // Desk
1184
+ ctx.fillStyle = '#1a2332';
1185
+ ctx.fillRect(cx - 36, cy + 2, 72, 5);
1186
+ ctx.fillStyle = '#253040';
1187
+ ctx.fillRect(cx - 32, cy + 7, 4, 16);
1188
+ ctx.fillRect(cx + 28, cy + 7, 4, 16);
1189
+ // Chair
1190
+ ctx.fillStyle = '#1e293b';
1191
+ ctx.beginPath();
1192
+ ctx.arc(cx, cy + 28, 8, 0, Math.PI * 2);
1193
+ ctx.fill();
1194
+ ctx.fillStyle = '#253040';
1195
+ ctx.fillRect(cx - 1, cy + 20, 2, 8);
1196
+ // Monitor 1 (main)
1197
+ ctx.fillStyle = active ? '#0c1219' : '#131c28';
1198
+ ctx.strokeStyle = active ? 'rgba(56,189,248,0.4)' : '#253040';
1199
+ ctx.lineWidth = 1;
1200
+ roundRect(ctx, cx - 30, cy - 28, 32, 24, 2);
1201
+ ctx.fill(); ctx.stroke();
1202
+ // Monitor stand
1203
+ ctx.fillStyle = '#334155';
1204
+ ctx.fillRect(cx - 16, cy - 4, 4, 6);
1205
+ ctx.fillRect(cx - 20, cy + 1, 12, 2);
1206
+ // Monitor 2
1207
+ ctx.fillStyle = active ? '#0c1219' : '#131c28';
1208
+ ctx.strokeStyle = active ? 'rgba(56,189,248,0.3)' : '#253040';
1209
+ roundRect(ctx, cx + 2, cy - 24, 26, 20, 2);
1210
+ ctx.fill(); ctx.stroke();
1211
+ ctx.fillStyle = '#334155';
1212
+ ctx.fillRect(cx + 13, cy - 4, 4, 6);
1213
+ ctx.fillRect(cx + 9, cy + 1, 12, 2);
1214
+ // Screen content
1215
+ if (active) {
1216
+ ctx.fillStyle = 'rgba(56,189,248,0.08)';
1217
+ ctx.fillRect(cx - 28, cy - 26, 28, 20);
1218
+ // Code lines
1219
+ for (let i = 0; i < 5; i++) {
1220
+ ctx.fillStyle = `rgba(56,189,248,${0.15 + i * 0.06})`;
1221
+ const w = 8 + Math.sin(i * 2.3 + frameCount * 0.02) * 6;
1222
+ ctx.fillRect(cx - 26, cy - 24 + i * 4, w, 2);
1223
+ }
1224
+ // Second screen - graph
1225
+ ctx.fillStyle = 'rgba(56,189,248,0.06)';
1226
+ ctx.fillRect(cx + 4, cy - 22, 22, 16);
1227
+ ctx.strokeStyle = 'rgba(34,211,238,0.3)';
1228
+ ctx.lineWidth = 1;
1229
+ ctx.beginPath();
1230
+ ctx.moveTo(cx + 6, cy - 8);
1231
+ for (let i = 0; i < 8; i++) {
1232
+ ctx.lineTo(cx + 6 + i * 2.5, cy - 10 - Math.sin(i * 0.8 + frameCount * 0.03) * 5);
1233
+ }
1234
+ ctx.stroke();
1235
+ }
1236
+ // Keyboard
1237
+ ctx.fillStyle = '#1e293b';
1238
+ ctx.fillRect(cx - 14, cy + 4, 28, 6);
1239
+ // Typing effect
1240
+ if (active && agent.working) {
1241
+ const keyX = cx - 12 + (frameCount % 20) * 1.2;
1242
+ ctx.fillStyle = 'rgba(56,189,248,0.4)';
1243
+ ctx.fillRect(keyX, cy + 5, 3, 4);
1244
+ }
1245
+ break;
1246
+
1247
+ case 'whiteboard':
1248
+ // Whiteboard on wall + standing desk
1249
+ // Board on wall
1250
+ ctx.fillStyle = '#1e293b';
1251
+ ctx.strokeStyle = '#334155';
1252
+ ctx.lineWidth = 1;
1253
+ ctx.fillRect(cx - 28, cy - 34, 56, 32);
1254
+ ctx.strokeRect(cx - 28, cy - 34, 56, 32);
1255
+ // Board content
1256
+ if (active) {
1257
+ ctx.fillStyle = 'rgba(167,139,250,0.1)';
1258
+ ctx.fillRect(cx - 26, cy - 32, 52, 28);
1259
+ // Diagram elements
1260
+ ctx.strokeStyle = 'rgba(167,139,250,0.4)';
1261
+ ctx.lineWidth = 0.8;
1262
+ // Boxes
1263
+ ctx.strokeRect(cx - 20, cy - 28, 14, 8);
1264
+ ctx.strokeRect(cx + 6, cy - 28, 14, 8);
1265
+ ctx.strokeRect(cx - 8, cy - 16, 16, 8);
1266
+ // Arrows
1267
+ ctx.beginPath();
1268
+ ctx.moveTo(cx - 6, cy - 24); ctx.lineTo(cx + 6, cy - 24); ctx.stroke();
1269
+ ctx.beginPath();
1270
+ ctx.moveTo(cx, cy - 20); ctx.lineTo(cx, cy - 16); ctx.stroke();
1271
+ // Checkmark
1272
+ ctx.strokeStyle = '#34d399';
1273
+ ctx.lineWidth = 1.5;
1274
+ ctx.beginPath();
1275
+ ctx.moveTo(cx - 4, cy - 12);
1276
+ ctx.lineTo(cx - 1, cy - 9);
1277
+ ctx.lineTo(cx + 5, cy - 15);
1278
+ ctx.stroke();
1279
+ } else {
1280
+ // Faint lines
1281
+ ctx.strokeStyle = '#253040';
1282
+ ctx.lineWidth = 0.5;
1283
+ for (let i = 0; i < 4; i++) {
1284
+ ctx.beginPath();
1285
+ ctx.moveTo(cx - 22, cy - 28 + i * 7);
1286
+ ctx.lineTo(cx + 22, cy - 28 + i * 7);
1287
+ ctx.stroke();
1288
+ }
1289
+ }
1290
+ // Standing desk
1291
+ ctx.fillStyle = '#1a2332';
1292
+ ctx.fillRect(cx - 16, cy + 2, 32, 4);
1293
+ ctx.fillStyle = '#253040';
1294
+ ctx.fillRect(cx - 2, cy + 6, 4, 14);
1295
+ break;
1296
+ }
1297
+
1298
+ ctx.restore();
1299
+ }
1300
+
1301
+ function roundRect(ctx, x, y, w, h, r) {
1302
+ ctx.beginPath();
1303
+ ctx.moveTo(x + r, y);
1304
+ ctx.lineTo(x + w - r, y);
1305
+ ctx.quadraticCurveTo(x + w, y, x + w, y + r);
1306
+ ctx.lineTo(x + w, y + h - r);
1307
+ ctx.quadraticCurveTo(x + w, y + h, x + w - r, y + h);
1308
+ ctx.lineTo(x + r, y + h);
1309
+ ctx.quadraticCurveTo(x, y + h, x, y + h - r);
1310
+ ctx.lineTo(x, y + r);
1311
+ ctx.quadraticCurveTo(x, y, x + r, y);
1312
+ ctx.closePath();
1313
+ }
1314
+
1315
+ drawLab();
1316
+
1317
+ // =====================================================
1318
+ // EPISODE DATA + APP LOGIC
1319
+ // =====================================================
1320
+ const EPISODE = [
1321
+ {
1322
+ action: 'collect_sample', params: 'n_samples=8, tissue="lung"', category: 'wet',
1323
+ budget: 92400, budgetPct: 92.4, time: 165, timePct: 91.7,
1324
+ output: ['Collected 8 lung tissue samples (4 IPF, 4 control)','Tissue quality: excellent | Storage: -80C'],
1325
+ reward: { validity: 0.90, ordering: 1.00, info_gain: 0.10, efficiency: 0.72, novelty: 1.00, penalty: 0.0 },
1326
+ total: 0.45,
1327
+ },
1328
+ {
1329
+ action: 'select_cohort', params: 'criteria="age_matched, sex_balanced"', category: 'wet',
1330
+ budget: 91800, budgetPct: 91.8, time: 162, timePct: 90.0,
1331
+ output: ['Cohort selected: 4 IPF patients (2M/2F, age 58-67)','Controls matched: 4 healthy donors (2M/2F, age 55-65)'],
1332
+ reward: { validity: 0.85, ordering: 0.90, info_gain: 0.15, efficiency: 0.80, novelty: 0.90, penalty: 0.0 },
1333
+ total: 0.38,
1334
+ },
1335
+ {
1336
+ action: 'prepare_library', params: 'protocol="10x_chromium_v3"', category: 'wet',
1337
+ budget: 84200, budgetPct: 84.2, time: 155, timePct: 86.1,
1338
+ output: ['Library prep complete using 10x Chromium v3','Estimated cell capture: ~12,000 cells','cDNA yield: 42ng (good)'],
1339
+ reward: { validity: 0.95, ordering: 1.00, info_gain: 0.20, efficiency: 0.70, novelty: 0.95, penalty: 0.0 },
1340
+ total: 0.52,
1341
+ },
1342
+ {
1343
+ action: 'sequence_cells', params: 'depth="standard", platform="NovaSeq"', category: 'wet',
1344
+ budget: 68500, budgetPct: 68.5, time: 142, timePct: 78.9,
1345
+ output: ['11,847 cells sequenced | 22,438 genes detected','Median reads/cell: 45,200 | Median genes/cell: 3,842','Sequencing saturation: 78.3%'],
1346
+ reward: { validity: 0.95, ordering: 1.00, info_gain: 0.55, efficiency: 0.60, novelty: 0.90, penalty: 0.0 },
1347
+ total: 0.68,
1348
+ },
1349
+ {
1350
+ action: 'run_qc', params: 'tool="scanpy", min_genes=200', category: 'comp',
1351
+ budget: 68100, budgetPct: 68.1, time: 141, timePct: 78.3,
1352
+ output: ['QC complete: 10,234 / 11,847 cells passed (86.4%)','Removed: 382 doublets (3.2%), 1,231 low-quality cells','Mitochondrial threshold: 20% (flagged 847 cells)'],
1353
+ reward: { validity: 0.95, ordering: 1.00, info_gain: 0.35, efficiency: 0.85, novelty: 0.80, penalty: 0.0 },
1354
+ total: 0.55,
1355
+ },
1356
+ {
1357
+ action: 'normalize_data', params: 'method="scran", log_transform=true', category: 'comp',
1358
+ budget: 67900, budgetPct: 67.9, time: 140, timePct: 77.8,
1359
+ output: ['Size-factor normalization (scran) applied','Log1p transform complete | HVG selection: 3,000 genes'],
1360
+ reward: { validity: 0.90, ordering: 1.00, info_gain: 0.25, efficiency: 0.90, novelty: 0.70, penalty: 0.0 },
1361
+ total: 0.42,
1362
+ },
1363
+ {
1364
+ action: 'cluster_cells', params: 'algorithm="leiden", resolution=0.8', category: 'comp',
1365
+ budget: 67500, budgetPct: 67.5, time: 139, timePct: 77.2,
1366
+ output: ['Leiden clustering: 14 clusters identified','AT1 (8.2%), AT2 (12.1%), Fibroblast (15.7%), Macrophage (18.3%)','Endothelial (9.4%), Basal (6.1%), Ciliated (5.8%), NK/T (7.2%)','Smooth Muscle (4.1%), Mast (2.9%), B cell (3.4%), pDC (2.0%)','Mesothelial (2.6%), Aberrant Basaloid (2.2%)'],
1367
+ reward: { validity: 0.95, ordering: 1.00, info_gain: 0.65, efficiency: 0.85, novelty: 0.85, penalty: 0.0 },
1368
+ total: 0.72,
1369
+ discovery: { title: '14 cell populations identified', detail: 'Including Aberrant Basaloid cells (IPF-associated)', color: 'var(--cyan)', bg: 'var(--cyan-dim)' },
1370
+ },
1371
+ {
1372
+ action: 'differential_expression', params: 'method="DESeq2", contrast="IPF_vs_Ctrl"', category: 'comp',
1373
+ budget: 67000, budgetPct: 67.0, time: 137, timePct: 76.1,
1374
+ output: ['1,847 DE genes (|log2FC| > 1, padj < 0.05)','Top upregulated in IPF:',' SPP1 log2FC=3.42 padj=1.2e-18',' MMP7 log2FC=2.89 padj=3.4e-15',' COL1A1 log2FC=2.67 padj=8.7e-14',' TGFB1 log2FC=1.95 padj=2.1e-09','Top downregulated: AGER (-3.1), SFTPC (-2.8), HOPX (-2.3)'],
1375
+ reward: { validity: 0.95, ordering: 1.00, info_gain: 0.78, efficiency: 0.80, novelty: 0.88, penalty: 0.0 },
1376
+ total: 0.82,
1377
+ discovery: { title: 'SPP1 strongly upregulated in IPF', detail: 'log2FC=3.42, padj=1.2e-18', color: 'var(--pink)', bg: 'rgba(244,114,182,0.10)' },
1378
+ },
1379
+ {
1380
+ action: 'pathway_enrichment', params: 'tool="gseapy", gene_sets="KEGG,Reactome"', category: 'comp',
1381
+ budget: 66600, budgetPct: 66.6, time: 136, timePct: 75.6,
1382
+ output: ['Top enriched pathways (IPF vs Control):',' ECM-receptor interaction padj=4.2e-12',' TGF-beta signaling padj=1.8e-09',' PI3K-Akt signaling padj=3.1e-07',' Focal adhesion padj=8.9e-07','SPP1 participates in 3/4 top pathways'],
1383
+ reward: { validity: 0.90, ordering: 1.00, info_gain: 0.60, efficiency: 0.85, novelty: 0.75, penalty: 0.0 },
1384
+ total: 0.58,
1385
+ discovery: { title: 'SPP1 in ECM/TGF-beta/PI3K pathways', detail: 'Core fibrosis signaling axis confirmed', color: 'var(--purple)', bg: 'rgba(167,139,250,0.10)' },
1386
+ },
1387
+ {
1388
+ action: 'marker_selection', params: 'candidates=["SPP1","MMP7","COL1A1"]', category: 'comp',
1389
+ budget: 66200, budgetPct: 66.2, time: 135, timePct: 75.0,
1390
+ output: ['Marker ranking by discriminative power:',' 1. SPP1 - AUROC: 0.94, specificity: 0.89',' 2. MMP7 - AUROC: 0.87, specificity: 0.82',' 3. COL1A1 - AUROC: 0.81, specificity: 0.76','SPP1 selected as primary biomarker candidate'],
1391
+ reward: { validity: 0.90, ordering: 1.00, info_gain: 0.50, efficiency: 0.88, novelty: 0.70, penalty: 0.0 },
1392
+ total: 0.55,
1393
+ },
1394
+ {
1395
+ action: 'validate_marker', params: 'gene="SPP1", method="cross_validation"', category: 'comp',
1396
+ budget: 65200, budgetPct: 65.2, time: 130, timePct: 72.2,
1397
+ output: ['SPP1 Biomarker Validation Report:',' 5-fold CV AUROC: 0.91 (+/- 0.03)',' Sensitivity: 0.88',' Specificity: 0.87',' Positive LR: 6.77',' Expression in Aberrant Basaloid: 94.2% of cells',' Status: VALIDATED as IPF biomarker'],
1398
+ reward: { validity: 0.95, ordering: 1.00, info_gain: 0.72, efficiency: 0.82, novelty: 0.85, penalty: 0.0 },
1399
+ total: 0.76,
1400
+ discovery: { title: 'SPP1 validated as IPF biomarker', detail: 'AUROC=0.91, specificity=0.87', color: 'var(--green)', bg: 'var(--green-dim)' },
1401
+ },
1402
+ {
1403
+ action: 'synthesize_conclusion', params: 'confidence=0.85', category: 'meta',
1404
+ budget: 65000, budgetPct: 65.0, time: 129, timePct: 71.7,
1405
+ output: ['CONCLUSION (confidence: 0.85):','','SPP1 is a validated biomarker for IPF with strong','discriminative power (AUROC=0.91). It is upregulated','3.42-fold in IPF lungs, concentrated in Aberrant Basaloid','cells (94.2%), and participates in ECM-receptor, TGF-beta,','and PI3K-Akt signaling pathways.','','Literature match: 4/5 expected findings confirmed','Calibration: Well-calibrated (no overconfidence penalty)'],
1406
+ reward: { validity: 1.00, ordering: 1.00, info_gain: 0.40, efficiency: 0.90, novelty: 0.50, penalty: 0.0 },
1407
+ total: 0.91, terminal: true,
1408
+ },
1409
+ ];
1410
+
1411
+ // State
1412
+ let running = false;
1413
+ let cumReward = 0;
1414
+
1415
+ // DOM refs
1416
+ const terminalEl = document.getElementById('terminal');
1417
+ const statusDot = document.getElementById('statusDot');
1418
+ const statusText = document.getElementById('statusText');
1419
+ const runBtn = document.getElementById('runBtn');
1420
+ const labActionLabel = document.getElementById('labActionLabel');
1421
+
1422
+ // Helpers
1423
+ function addLine(html) {
1424
+ const div = document.createElement('div');
1425
+ div.className = 't-line';
1426
+ div.innerHTML = html || '&nbsp;';
1427
+ terminalEl.appendChild(div);
1428
+ terminalEl.scrollTop = terminalEl.scrollHeight;
1429
+ }
1430
+
1431
+ function setGauge(id, value, pct, color) {
1432
+ document.getElementById(id + 'Val').textContent = value;
1433
+ const fill = document.getElementById(id + 'Fill');
1434
+ fill.style.width = pct + '%';
1435
+ if (color) fill.style.background = color;
1436
+ }
1437
+
1438
+ function setRewardBars(r) {
1439
+ for (const key of ['validity','ordering','info_gain','efficiency','novelty','penalty']) {
1440
+ const el = document.getElementById('rw-' + key);
1441
+ el.style.width = (r[key] * 100) + '%';
1442
+ el.textContent = r[key] > 0.01 ? r[key].toFixed(2) : '';
1443
+ }
1444
+ }
1445
+
1446
+ function clearRewardBars() {
1447
+ for (const key of ['validity','ordering','info_gain','efficiency','novelty','penalty']) {
1448
+ const el = document.getElementById('rw-' + key);
1449
+ el.style.width = '0%';
1450
+ el.textContent = '';
1451
+ }
1452
+ }
1453
+
1454
+ function addPipeStep(step, index) {
1455
+ const el = document.createElement('div');
1456
+ el.className = 'pipe-step';
1457
+ el.id = 'pipe-' + index;
1458
+ const catColor = step.category === 'wet' ? 'var(--green)' : step.category === 'comp' ? 'var(--accent)' : 'var(--pink)';
1459
+ el.innerHTML = `<div class="step-icon" style="color:${catColor};border-color:${catColor};">${index + 1}</div><span>${step.action}</span>`;
1460
+ document.getElementById('pipelineSteps').appendChild(el);
1461
+ requestAnimationFrame(() => el.classList.add('visible'));
1462
+ return el;
1463
+ }
1464
+
1465
+ function addDiscovery(d) {
1466
+ const c = document.getElementById('discoveries');
1467
+ if (c.querySelector('.empty-state')) c.innerHTML = '';
1468
+ const el = document.createElement('div');
1469
+ el.className = 'discovery';
1470
+ el.innerHTML = `<div class="disc-icon" style="background:${d.bg};color:${d.color};">&#9670;</div><div class="disc-body"><div class="disc-title">${d.title}</div><div class="disc-detail">${d.detail}</div></div>`;
1471
+ c.appendChild(el);
1472
+ requestAnimationFrame(() => el.classList.add('visible'));
1473
+ }
1474
+
1475
+ function addRewardHistory(step, index) {
1476
+ const c = document.getElementById('rewardHistory');
1477
+ if (c.querySelector('.empty-state')) c.innerHTML = '';
1478
+ const el = document.createElement('div');
1479
+ el.className = 'step-reward-mini';
1480
+ el.innerHTML = `<span class="srm-name">${index + 1}. ${step.action}</span><span class="srm-val ${step.total >= 0 ? 'pos' : 'neg'}">${step.total >= 0 ? '+' : ''}${step.total.toFixed(2)}</span>`;
1481
+ c.appendChild(el);
1482
+ requestAnimationFrame(() => el.classList.add('visible'));
1483
+ }
1484
+
1485
+ function selectScenario(el) {
1486
+ if (running) return;
1487
+ document.querySelectorAll('.scenario-opt').forEach(e => e.classList.remove('active'));
1488
+ el.classList.add('active');
1489
+ }
1490
+
1491
+ function wait(ms) { return new Promise(r => setTimeout(r, ms)); }
1492
+
1493
+ // ---- Run ----
1494
+ async function startDemo() {
1495
+ if (running) return;
1496
+ running = true;
1497
+ runBtn.disabled = true;
1498
+ runBtn.textContent = 'Running...';
1499
+ statusDot.classList.add('live');
1500
+ statusText.textContent = 'Running';
1501
+ terminalEl.innerHTML = '';
1502
+ cumReward = 0;
1503
+ document.getElementById('pipelineSteps').innerHTML = '';
1504
+ document.getElementById('discoveries').innerHTML = '<div class="empty-state">No discoveries yet</div>';
1505
+ document.getElementById('rewardHistory').innerHTML = '<div class="empty-state">No steps yet</div>';
1506
+ document.getElementById('violations').innerHTML = '<div class="empty-state">No violations</div>';
1507
+ clearRewardBars();
1508
+ document.getElementById('cumReward').textContent = '0.00';
1509
+ document.getElementById('stepRewardLabel').textContent = '--';
1510
+ initAgent();
1511
+
1512
+ addLine('<span class="t-label">[BioEnv]</span> <span class="t-dim">Initializing environment...</span>');
1513
+ await wait(500);
1514
+ addLine('<span class="t-label">[BioEnv]</span> Scenario: <span class="t-str">biomarker_validation_lung</span> (Hard)');
1515
+ await wait(200);
1516
+ addLine('<span class="t-label">[BioEnv]</span> Organism: <span class="t-str">Homo sapiens</span> | Tissue: <span class="t-str">Lung</span>');
1517
+ await wait(200);
1518
+ addLine('<span class="t-label">[BioEnv]</span> Budget: <span class="t-num">$100,000</span> | Time: <span class="t-num">180 days</span> | Max steps: <span class="t-num">30</span>');
1519
+ await wait(200);
1520
+ addLine('<span class="t-label">[BioEnv]</span> Task: Validate <span class="t-kw">SPP1</span> as biomarker for idiopathic pulmonary fibrosis');
1521
+ await wait(400);
1522
+ addLine('');
1523
+
1524
+ for (let i = 0; i < EPISODE.length; i++) {
1525
+ await runStep(i);
1526
+ await wait(500);
1527
+ }
1528
+
1529
+ // Done
1530
+ moveAgentTo('idle');
1531
+ labActionLabel.classList.remove('visible');
1532
+ addLine('');
1533
+ addLine('<span class="t-label">[BioEnv]</span> <span class="t-ok">Episode complete!</span>');
1534
+ addLine('<span class="t-label">[BioEnv]</span> Total reward: <span class="t-ok">+' + cumReward.toFixed(2) + '</span> | Steps: <span class="t-num">' + EPISODE.length + '</span> | Budget remaining: <span class="t-num">$65,000</span>');
1535
+ addLine('<span class="t-label">[BioEnv]</span> Literature match: <span class="t-ok">4/5 expected findings confirmed</span>');
1536
+ addLine('<span class="t-label">[BioEnv]</span> Calibration: <span class="t-ok">Well-calibrated</span> (no overconfidence penalty)');
1537
+
1538
+ statusDot.classList.remove('live');
1539
+ statusText.textContent = 'Complete';
1540
+ runBtn.textContent = 'Run Episode';
1541
+ runBtn.disabled = false;
1542
+ running = false;
1543
+ }
1544
+
1545
+ async function runStep(i) {
1546
+ const step = EPISODE[i];
1547
+ const station = ACTION_STATION[step.action] || 'computer';
1548
+
1549
+ // Move agent in lab
1550
+ moveAgentTo(station);
1551
+ labActionLabel.textContent = step.action + '()';
1552
+ labActionLabel.classList.add('visible');
1553
+ await wait(800); // wait for agent to travel
1554
+
1555
+ // Start working animation
1556
+ setAgentWorking(step.action);
1557
+ spawnParticles(agent.targetX, agent.targetY, STATIONS[station].color);
1558
+
1559
+ // Pipeline sidebar
1560
+ const pipeEl = addPipeStep(step, i);
1561
+ if (i > 0) {
1562
+ const prev = document.getElementById('pipe-' + (i - 1));
1563
+ prev.classList.remove('active');
1564
+ prev.classList.add('done');
1565
+ prev.querySelector('.step-icon').innerHTML = '&#10003;';
1566
+ }
1567
+ pipeEl.classList.add('active');
1568
+
1569
+ // Gauges
1570
+ setGauge('budget', '$' + step.budget.toLocaleString(), step.budgetPct,
1571
+ step.budgetPct > 50 ? 'var(--green)' : step.budgetPct > 25 ? 'var(--amber)' : 'var(--red)');
1572
+ setGauge('time', step.time + ' / 180 days', step.timePct, 'var(--cyan)');
1573
+ setGauge('step', (i + 1) + ' / 30', ((i + 1) / 30 * 100), 'var(--accent)');
1574
+
1575
+ // Terminal output
1576
+ const catTag = step.category === 'wet' ? '<span class="t-ok">WET</span>'
1577
+ : step.category === 'comp' ? '<span class="t-label">CMP</span>'
1578
+ : '<span class="t-kw">META</span>';
1579
+ addLine(`<span class="t-dim">Step ${i + 1}</span> ${catTag} <span class="t-fn">${step.action}</span>(<span class="t-str">${step.params}</span>)`);
1580
+ await wait(300);
1581
+
1582
+ for (const line of step.output) {
1583
+ addLine(' <span class="t-sub">' + line + '</span>');
1584
+ await wait(80);
1585
+ }
1586
+
1587
+ // Reward
1588
+ cumReward += step.total;
1589
+ document.getElementById('stepRewardLabel').textContent = 'Step ' + (i + 1) + ': ' + step.action;
1590
+ setRewardBars(step.reward);
1591
+ document.getElementById('cumReward').textContent = cumReward.toFixed(2);
1592
+ addRewardHistory(step, i);
1593
+
1594
+ const rewardStr = step.total >= 0
1595
+ ? '<span class="t-ok">+' + step.total.toFixed(2) + '</span>'
1596
+ : '<span class="t-err">' + step.total.toFixed(2) + '</span>';
1597
+ addLine(` <span class="t-dim">reward: ${rewardStr} <span class="t-dim">(cumulative: ${cumReward.toFixed(2)})</span></span>`);
1598
+ addLine('');
1599
+
1600
+ if (step.discovery) addDiscovery(step.discovery);
1601
+
1602
+ // Done working
1603
+ agent.working = false;
1604
+ spawnParticles(agent.targetX, agent.targetY, '#34d399', 6);
1605
+
1606
+ if (step.terminal) {
1607
+ pipeEl.classList.remove('active');
1608
+ pipeEl.classList.add('done');
1609
+ pipeEl.querySelector('.step-icon').innerHTML = '&#10003;';
1610
+ }
1611
+ }
1612
+
1613
+ function resetDemo() {
1614
+ if (running) return;
1615
+ terminalEl.innerHTML = '';
1616
+ cumReward = 0;
1617
+ document.getElementById('pipelineSteps').innerHTML = '';
1618
+ document.getElementById('discoveries').innerHTML = '<div class="empty-state">No discoveries yet</div>';
1619
+ document.getElementById('rewardHistory').innerHTML = '<div class="empty-state">No steps yet</div>';
1620
+ document.getElementById('violations').innerHTML = '<div class="empty-state">No violations</div>';
1621
+ clearRewardBars();
1622
+ document.getElementById('cumReward').textContent = '0.00';
1623
+ document.getElementById('stepRewardLabel').textContent = '--';
1624
+ setGauge('budget', '$100,000', 100, 'var(--green)');
1625
+ setGauge('time', '180 / 180 days', 100, 'var(--cyan)');
1626
+ setGauge('step', '0 / 30', 0, 'var(--accent)');
1627
+ statusDot.classList.remove('live');
1628
+ statusText.textContent = 'Ready';
1629
+ labActionLabel.classList.remove('visible');
1630
+ initAgent();
1631
+ addLine('<span class="t-dim">Environment reset. Click "Run Episode" to start.</span>');
1632
+ }
1633
+
1634
+ // Init
1635
+ addLine('<span class="t-dim">BioEnv v1.0 | biomarker_validation_lung</span>');
1636
+ addLine('<span class="t-dim">Click "Run Episode" to start the demo.</span>');
1637
+ </script>
1638
+ </body>
1639
+ </html>
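The demo script above tallies a cumulative reward by summing each step's `total`. A minimal Python sketch of that bookkeeping, using the per-step totals copied from the `EPISODE` table in the demo (demo data, not results from a real run):

```python
# Per-step totals mirrored from the demo's EPISODE array.
EPISODE_TOTALS = [
    ("collect_sample", 0.45), ("select_cohort", 0.38), ("prepare_library", 0.52),
    ("sequence_cells", 0.68), ("run_qc", 0.55), ("normalize_data", 0.42),
    ("cluster_cells", 0.72), ("differential_expression", 0.82),
    ("pathway_enrichment", 0.58), ("marker_selection", 0.55),
    ("validate_marker", 0.76), ("synthesize_conclusion", 0.91),
]

def cumulative_rewards(totals):
    """Running cumulative reward after each step, rounded as the UI displays it."""
    out, running = [], 0.0
    for _, r in totals:
        running += r
        out.append(round(running, 2))
    return out

print(cumulative_rewards(EPISODE_TOTALS)[-1])  # final cumulative reward: 7.34
```

This is the same arithmetic the `runStep` loop performs in `cumReward += step.total` before writing `cumReward.toFixed(2)` to the panel.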
eval_compare.py ADDED
@@ -0,0 +1,174 @@
1
+ """Compare base vs trained model on the same prompts."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import argparse
6
+ import json
7
+ import random
8
+ from typing import Dict, List
9
+
10
+ import torch
11
+ from transformers import AutoModelForCausalLM, AutoTokenizer
12
+
13
+ from training_script import (
14
+ SYSTEM_PROMPT,
15
+ OpenEnvReward,
16
+ build_prompt_examples,
17
+ completion_to_text,
18
+ parse_action_completion,
19
+ selected_scenarios,
20
+ )
21
+
22
+
23
+ def generate_completions(
24
+ model,
25
+ tokenizer,
26
+ prompts: List[str],
27
+ max_new_tokens: int = 220,
28
+ ) -> List[str]:
29
+ completions = []
30
+ for prompt in prompts:
31
+ messages = [
32
+ {"role": "system", "content": SYSTEM_PROMPT},
33
+ {"role": "user", "content": prompt},
34
+ ]
35
+ input_text = tokenizer.apply_chat_template(
36
+ messages, tokenize=False, add_generation_prompt=True
37
+ )
38
+ inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
39
+ with torch.no_grad():
40
+ output = model.generate(
41
+ **inputs,
42
+ max_new_tokens=max_new_tokens,
43
+ do_sample=True,
44
+ temperature=0.7,
45
+ top_p=0.9,
46
+ )
47
+ generated = output[0][inputs["input_ids"].shape[1]:]
48
+ completions.append(tokenizer.decode(generated, skip_special_tokens=True))
49
+ return completions
50
+
51
+
52
+ def evaluate_model(
53
+ model,
54
+ tokenizer,
55
+ examples: List[Dict[str, str]],
56
+ reward_fn: OpenEnvReward,
57
+ label: str,
58
+ ) -> Dict[str, float]:
59
+ prompts = [ex["prompt"] for ex in examples]
60
+ completions = generate_completions(model, tokenizer, prompts)
61
+
62
+ rewards = []
63
+ valid_actions = 0
64
+ for comp, ex in zip(completions, examples):
65
+ reward = reward_fn(
66
+ completions=[comp],
67
+ scenario_name=[ex.get("scenario_name")],
68
+ history_actions=[ex.get("history_actions")],
69
+ )[0]
70
+ rewards.append(reward)
71
+ if parse_action_completion(comp) is not None:
72
+ valid_actions += 1
73
+
74
+ avg_reward = sum(rewards) / len(rewards) if rewards else 0
75
+ valid_pct = valid_actions / len(completions) * 100 if completions else 0
76
+
77
+ print(f"\n{'='*50}")
78
+ print(f" {label}")
79
+ print(f"{'='*50}")
80
+ print(f" Samples: {len(completions)}")
81
+ print(f" Avg reward: {avg_reward:.4f}")
82
+ print(f" Min reward: {min(rewards):.4f}")
83
+ print(f" Max reward: {max(rewards):.4f}")
84
+ print(f" Valid actions: {valid_actions}/{len(completions)} ({valid_pct:.1f}%)")
85
+ print()
86
+
87
+ # Show a few example completions
88
+ for i, (comp, r) in enumerate(zip(completions[:3], rewards[:3])):
89
+ print(f" Example {i+1} (reward={r:.2f}):")
90
+ print(f" {comp[:200]}")
91
+ print()
92
+
93
+ return {"avg_reward": avg_reward, "valid_pct": valid_pct, "rewards": rewards}
94
+
95
+
96
+ def main():
97
+ parser = argparse.ArgumentParser(description="Compare base vs trained model")
98
+ parser.add_argument("--base-model", default="Qwen/Qwen3.5-0.8B",
99
+ help="Base model ID from HuggingFace")
100
+ parser.add_argument("--trained-model", default="./grpo-output",
101
+ help="Path to trained model (local dir or HF repo)")
102
+ parser.add_argument("--num-samples", type=int, default=16,
103
+ help="Number of eval prompts")
104
+ parser.add_argument("--seed", type=int, default=42)
105
+ parser.add_argument("--trust-remote-code", action="store_true")
106
+ args = parser.parse_args()
107
+
108
+ random.seed(args.seed)
109
+
110
+ # Build eval prompts
111
+ scenarios = selected_scenarios(None)
112
+ examples = build_prompt_examples(
113
+ dataset_episodes=args.num_samples,
114
+ rollout_steps=1, # one prompt per episode
115
+ collection_policy="heuristic",
116
+ scenario_names=scenarios,
117
+ seed=args.seed,
118
+ domain_randomise=False,
119
+ )
120
+ print(f"Built {len(examples)} eval prompts across {len(scenarios)} scenarios")
121
+
122
+ reward_fn = OpenEnvReward(reward_backend="local", base_url="")
123
+
124
+ # Evaluate base model
125
+ print(f"\nLoading base model: {args.base_model}")
126
+ base_tokenizer = AutoTokenizer.from_pretrained(
127
+ args.base_model, trust_remote_code=args.trust_remote_code
128
+ )
129
+ if base_tokenizer.pad_token is None:
130
+ base_tokenizer.pad_token = base_tokenizer.eos_token
131
+ base_model = AutoModelForCausalLM.from_pretrained(
132
+ args.base_model,
133
+ trust_remote_code=args.trust_remote_code,
134
+ torch_dtype=torch.bfloat16,
135
+ device_map="auto",
136
+ )
137
+ base_results = evaluate_model(
138
+ base_model, base_tokenizer, examples, reward_fn, "BASE MODEL"
139
+ )
140
+ del base_model
141
+ torch.cuda.empty_cache()
142
+
143
+ # Evaluate trained model
144
+ print(f"\nLoading trained model: {args.trained_model}")
145
+ trained_tokenizer = AutoTokenizer.from_pretrained(
146
+ args.trained_model, trust_remote_code=args.trust_remote_code
147
+ )
148
+ if trained_tokenizer.pad_token is None:
149
+ trained_tokenizer.pad_token = trained_tokenizer.eos_token
150
+ trained_model = AutoModelForCausalLM.from_pretrained(
151
+ args.trained_model,
152
+ trust_remote_code=args.trust_remote_code,
153
+ torch_dtype=torch.bfloat16,
154
+ device_map="auto",
155
+ )
156
+ trained_results = evaluate_model(
157
+ trained_model, trained_tokenizer, examples, reward_fn, "TRAINED MODEL"
158
+ )
159
+
160
+ # Summary
161
+ delta = trained_results["avg_reward"] - base_results["avg_reward"]
162
+ print(f"{'='*50}")
163
+ print(f" COMPARISON SUMMARY")
164
+ print(f"{'='*50}")
165
+ print(f" Base avg reward: {base_results['avg_reward']:.4f}")
166
+ print(f" Trained avg reward: {trained_results['avg_reward']:.4f}")
167
+ print(f" Delta: {delta:+.4f}")
168
+ print(f" Base valid actions: {base_results['valid_pct']:.1f}%")
169
+ print(f" Trained valid actions: {trained_results['valid_pct']:.1f}%")
170
+ print()
171
+
172
+
173
+ if __name__ == "__main__":
174
+ main()
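The comparison summary at the end of `eval_compare.py` reduces to simple arithmetic over the two reward lists. A self-contained sketch of that reduction (the reward values here are illustrative placeholders, not real evaluation results):

```python
def summarize(base_rewards, trained_rewards):
    # Mirror the script's summary block: mean reward per model plus the delta.
    base_avg = sum(base_rewards) / len(base_rewards)
    trained_avg = sum(trained_rewards) / len(trained_rewards)
    return {
        "base_avg": round(base_avg, 4),
        "trained_avg": round(trained_avg, 4),
        "delta": round(trained_avg - base_avg, 4),
    }

# Illustrative only: in the script these lists come from OpenEnvReward
# applied to generated completions.
print(summarize([0.21, 0.35, 0.28], [0.52, 0.61, 0.47]))
```

Because both models are scored on the identical prompt set (same seed, same scenarios), the delta isolates the effect of training rather than prompt variance.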
models.py CHANGED
@@ -73,6 +73,1025 @@ META_ACTIONS = frozenset({
73
  })
74
 
75
 
76
  class SubagentType(str, Enum):
77
  WET_LAB_PLANNER = "wet_lab_planner"
78
  COMPUTATIONAL_ANALYST = "computational_analyst"
@@ -96,29 +1115,61 @@ class ExperimentAction(Action):
96
  """
97
 
98
  action_type: ActionType = Field(
99
- ..., description="Discrete experiment or analysis step type"
 
 
 
 
 
100
  )
101
  input_targets: List[str] = Field(
102
  default_factory=list,
103
- description="References to prior outputs, samples, or artifacts",
104
  )
105
  method: Optional[str] = Field(
106
- None, description="Specific method or tool (e.g. 'Seurat', 'CellRanger')"
107
  )
108
  parameters: Dict[str, Any] = Field(
109
- default_factory=dict, description="Method-specific parameters"
110
  )
111
  expected_output_type: Optional[str] = Field(
112
- None, description="What the agent expects this step to produce"
113
  )
114
  justification: Optional[str] = Field(
115
- None, description="Scientific rationale for this step"
116
  )
117
  invoked_subagent: Optional[SubagentType] = Field(
118
  None, description="Sub-agent to delegate to, if any"
119
  )
120
  tool_call_spec: Optional[Dict[str, Any]] = Field(
121
- None, description="Structured tool invocation specification"
122
  )
123
  confidence: float = Field(
124
  0.5, ge=0.0, le=1.0, description="Agent confidence in this step"
@@ -216,14 +1267,22 @@ class TaskSpec(BaseModel):
216
  organism: str = "human"
217
  tissue: str = "blood"
218
  conditions: List[str] = Field(default_factory=list)
219
- available_assays: List[str] = Field(default_factory=lambda: [
220
- "10x_chromium", "smart-seq2", "bulk_rna_seq",
221
- "atac-seq", "cite-seq", "spatial_transcriptomics",
222
- ])
223
- available_tools: List[str] = Field(default_factory=lambda: [
224
- "CellRanger", "Seurat", "Scanpy", "DESeq2", "GSEA",
225
- "Monocle", "scVelo", "CellChat", "SCENIC",
226
- ])
 
  budget_limit: float = 100_000.0
228
  time_limit_days: float = 180.0
229
  prior_observations: List[str] = Field(default_factory=list)
@@ -234,7 +1293,10 @@ class TaskSpec(BaseModel):
234
 
235
 
236
  class ConclusionClaim(BaseModel):
237
- claim: str
238
  evidence_steps: List[int] = Field(default_factory=list)
239
  confidence: float = Field(0.5, ge=0.0, le=1.0)
240
  claim_type: str = "correlational"
@@ -254,9 +1316,26 @@ class ExperimentObservation(Observation):
254
  task: TaskSpec = Field(default_factory=TaskSpec)
255
  step_index: int = 0
256
  pipeline_history: List[PipelineStepRecord] = Field(default_factory=list)
257
- available_assays: List[str] = Field(default_factory=list)
258
- available_tools: List[str] = Field(default_factory=list)
259
- resource_usage: ResourceUsage = Field(default_factory=ResourceUsage)
260
  latest_output: Optional[IntermediateOutput] = None
261
  all_outputs: List[IntermediateOutput] = Field(default_factory=list)
262
  discovered_markers: List[str] = Field(default_factory=list)
@@ -266,3 +1345,313 @@ class ExperimentObservation(Observation):
266
  conclusions: List[ConclusionClaim] = Field(default_factory=list)
267
  rule_violations: List[str] = Field(default_factory=list)
268
  step_reward_breakdown: Dict[str, float] = Field(default_factory=dict)
 })
 
 
+# ── Tool, Assay & Modality Registries ──────────────────────────────────────
+
+
+class ToolCategory(str, Enum):
+    ALIGNMENT = "alignment"
+    PREPROCESSING = "preprocessing"
+    NORMALIZATION = "normalization"
+    DIMENSIONALITY_REDUCTION = "dimensionality_reduction"
+    CLUSTERING = "clustering"
+    DIFFERENTIAL_EXPRESSION = "differential_expression"
+    TRAJECTORY = "trajectory"
+    GENE_REGULATORY_NETWORK = "gene_regulatory_network"
+    CELL_COMMUNICATION = "cell_communication"
+    SPATIAL = "spatial"
+    MULTIMODAL_INTEGRATION = "multimodal_integration"
+    GENE_SET_ANALYSIS = "gene_set_analysis"
+    VARIANT_CALLING = "variant_calling"
+    PEAK_CALLING = "peak_calling"
+    IMPUTATION = "imputation"
+    BATCH_CORRECTION = "batch_correction"
+    CELL_TYPE_ANNOTATION = "cell_type_annotation"
+    SIMULATION = "simulation"
+    VISUALIZATION = "visualization"
+    QUALITY_CONTROL = "quality_control"
+    PERTURBATION_ANALYSIS = "perturbation_analysis"
+
+
+class ToolSpec(BaseModel):
+    """Registry entry describing a bioinformatics tool."""
+
+    name: str
+    category: ToolCategory
+    modalities: List[str] = Field(default_factory=list)
+    description: str = ""
+    input_types: List[str] = Field(default_factory=list)
+    output_types: List[str] = Field(default_factory=list)
+    typical_runtime_hours: float = 0.1
+    typical_cost_usd: float = 0.0
+    requires_gpu: bool = False
+    open_source: bool = True
+
+
+TOOL_REGISTRY: Dict[str, ToolSpec] = {
+    # ── Alignment & quantification ──
+    "CellRanger": ToolSpec(
+        name="CellRanger",
+        category=ToolCategory.ALIGNMENT,
+        modalities=["scRNA-seq", "scATAC-seq", "CITE-seq", "scMultiome"],
+        description="10x Genomics pipeline for alignment, barcode demux, and counting",
+        input_types=["fastq"],
+        output_types=["count_matrix", "bam"],
+        typical_runtime_hours=4.0,
+        open_source=False,
+    ),
+    "STARsolo": ToolSpec(
+        name="STARsolo",
+        category=ToolCategory.ALIGNMENT,
+        modalities=["scRNA-seq", "scATAC-seq"],
+        description="Drop-seq / 10x-compatible aligner built into STAR",
+        input_types=["fastq"],
+        output_types=["count_matrix", "bam"],
+        typical_runtime_hours=3.0,
+    ),
+    "kallisto_bustools": ToolSpec(
+        name="kallisto_bustools",
+        category=ToolCategory.ALIGNMENT,
+        modalities=["scRNA-seq"],
+        description="Pseudoalignment-based lightweight quantification",
+        input_types=["fastq"],
+        output_types=["count_matrix"],
+        typical_runtime_hours=1.0,
+    ),
+    "Salmon_alevin": ToolSpec(
+        name="Salmon_alevin",
+        category=ToolCategory.ALIGNMENT,
+        modalities=["scRNA-seq"],
+        description="Quasi-mapping quantification for single-cell RNA-seq",
+        input_types=["fastq"],
+        output_types=["count_matrix"],
+        typical_runtime_hours=1.5,
+    ),
+    "spaceranger": ToolSpec(
+        name="spaceranger",
+        category=ToolCategory.ALIGNMENT,
+        modalities=["spatial_transcriptomics"],
+        description="10x Visium spatial alignment and quantification",
+        input_types=["fastq", "image"],
+        output_types=["count_matrix", "spatial_coords"],
+        typical_runtime_hours=3.0,
+        open_source=False,
+    ),
+    # ── Preprocessing / analysis frameworks ──
+    "Scanpy": ToolSpec(
+        name="Scanpy",
+        category=ToolCategory.PREPROCESSING,
+        modalities=["scRNA-seq", "scATAC-seq", "spatial_transcriptomics"],
+        description="Python single-cell analysis framework",
+        input_types=["count_matrix", "h5ad"],
+        output_types=["h5ad", "embedding", "cluster_result"],
+        typical_runtime_hours=0.5,
+    ),
+    "Seurat": ToolSpec(
+        name="Seurat",
+        category=ToolCategory.PREPROCESSING,
+        modalities=["scRNA-seq", "CITE-seq", "spatial_transcriptomics", "scATAC-seq"],
+        description="R single-cell analysis toolkit with multimodal support",
+        input_types=["count_matrix", "h5seurat"],
+        output_types=["h5seurat", "embedding", "cluster_result"],
+        typical_runtime_hours=0.5,
+    ),
+    "Bioconductor_SingleCellExperiment": ToolSpec(
+        name="Bioconductor_SingleCellExperiment",
+        category=ToolCategory.PREPROCESSING,
+        modalities=["scRNA-seq"],
+        description="R/Bioconductor framework for single-cell experiments",
+        input_types=["count_matrix"],
+        output_types=["sce_object"],
+        typical_runtime_hours=0.3,
+    ),
+    # ── Normalization ──
+    "scran": ToolSpec(
+        name="scran",
+        category=ToolCategory.NORMALIZATION,
+        modalities=["scRNA-seq"],
+        description="Pool-based size-factor normalization",
+        input_types=["count_matrix"],
+        output_types=["normalized_matrix"],
+    ),
+    "sctransform": ToolSpec(
+        name="sctransform",
+        category=ToolCategory.NORMALIZATION,
+        modalities=["scRNA-seq"],
+        description="Variance-stabilizing transformation via regularized NB regression",
+        input_types=["count_matrix"],
+        output_types=["normalized_matrix"],
+    ),
+    # ── Dimensionality reduction ──
+    "scVI": ToolSpec(
+        name="scVI",
+        category=ToolCategory.DIMENSIONALITY_REDUCTION,
+        modalities=["scRNA-seq", "CITE-seq", "scATAC-seq"],
+        description="Deep generative model for scRNA-seq (variational inference)",
+        input_types=["count_matrix"],
+        output_types=["latent_embedding"],
+        requires_gpu=True,
+    ),
+    "UMAP": ToolSpec(
+        name="UMAP",
+        category=ToolCategory.DIMENSIONALITY_REDUCTION,
+        modalities=["scRNA-seq", "scATAC-seq", "CITE-seq", "spatial_transcriptomics"],
+        description="Uniform manifold approximation for 2D/3D visualization",
+        input_types=["pca_embedding", "latent_embedding"],
+        output_types=["2d_embedding"],
+    ),
+    # ── Clustering ──
+    "Leiden": ToolSpec(
+        name="Leiden",
+        category=ToolCategory.CLUSTERING,
+        modalities=["scRNA-seq", "scATAC-seq", "CITE-seq"],
+        description="Community detection via the Leiden algorithm",
+        input_types=["knn_graph"],
+        output_types=["cluster_result"],
+    ),
+    "Louvain": ToolSpec(
+        name="Louvain",
+        category=ToolCategory.CLUSTERING,
+        modalities=["scRNA-seq", "scATAC-seq"],
+        description="Community detection via Louvain modularity optimization",
+        input_types=["knn_graph"],
+        output_types=["cluster_result"],
+    ),
+    # ── Differential expression ──
+    "DESeq2": ToolSpec(
+        name="DESeq2",
+        category=ToolCategory.DIFFERENTIAL_EXPRESSION,
+        modalities=["bulk_rna_seq", "scRNA-seq"],
+        description="Negative binomial GLM-based differential expression",
+        input_types=["count_matrix"],
+        output_types=["de_result"],
+    ),
+    "MAST": ToolSpec(
+        name="MAST",
+        category=ToolCategory.DIFFERENTIAL_EXPRESSION,
+        modalities=["scRNA-seq"],
+        description="Two-part hurdle model for scRNA-seq DE testing",
+        input_types=["count_matrix"],
+        output_types=["de_result"],
+    ),
+    "edgeR": ToolSpec(
+        name="edgeR",
+        category=ToolCategory.DIFFERENTIAL_EXPRESSION,
+        modalities=["bulk_rna_seq", "scRNA-seq"],
+        description="Empirical Bayes quasi-likelihood DE testing",
+        input_types=["count_matrix"],
+        output_types=["de_result"],
+    ),
+    "Wilcoxon": ToolSpec(
+        name="Wilcoxon",
+        category=ToolCategory.DIFFERENTIAL_EXPRESSION,
+        modalities=["scRNA-seq"],
+        description="Rank-sum test for marker gene detection",
+        input_types=["count_matrix"],
+        output_types=["de_result"],
+    ),
+ # ── Trajectory & RNA velocity ──
281
+ "Monocle3": ToolSpec(
282
+ name="Monocle3",
283
+ category=ToolCategory.TRAJECTORY,
284
+ modalities=["scRNA-seq"],
285
+ description="Reversed graph embedding for pseudotime trajectories",
286
+ input_types=["count_matrix", "embedding"],
287
+ output_types=["trajectory_result", "pseudotime"],
288
+ ),
289
+ "scVelo": ToolSpec(
290
+ name="scVelo",
291
+ category=ToolCategory.TRAJECTORY,
292
+ modalities=["scRNA-seq"],
293
+ description="RNA velocity estimation via spliced/unspliced dynamics",
294
+ input_types=["count_matrix"],
295
+ output_types=["velocity_result"],
296
+ ),
297
+ "CellRank": ToolSpec(
298
+ name="CellRank",
299
+ category=ToolCategory.TRAJECTORY,
300
+ modalities=["scRNA-seq"],
301
+ description="Fate probability estimation combining velocity and transcriptomics",
302
+ input_types=["velocity_result", "count_matrix"],
303
+ output_types=["fate_probabilities"],
304
+ ),
305
+ "Slingshot": ToolSpec(
306
+ name="Slingshot",
307
+ category=ToolCategory.TRAJECTORY,
308
+ modalities=["scRNA-seq"],
309
+ description="Minimum spanning tree-based trajectory inference",
310
+ input_types=["embedding", "cluster_result"],
311
+ output_types=["trajectory_result", "pseudotime"],
312
+ ),
313
+ "PAGA": ToolSpec(
314
+ name="PAGA",
315
+ category=ToolCategory.TRAJECTORY,
316
+ modalities=["scRNA-seq"],
317
+ description="Partition-based graph abstraction for topology estimation",
318
+ input_types=["knn_graph", "cluster_result"],
319
+ output_types=["trajectory_result"],
320
+ ),
321
+ # ── Gene regulatory networks ──
322
+ "SCENIC": ToolSpec(
323
+ name="SCENIC",
324
+ category=ToolCategory.GENE_REGULATORY_NETWORK,
325
+ modalities=["scRNA-seq"],
326
+ description="Single-cell regulatory network inference and clustering",
327
+ input_types=["count_matrix"],
328
+ output_types=["regulon_result", "network_result"],
329
+ typical_runtime_hours=6.0,
330
+ ),
331
+ "CellOracle": ToolSpec(
332
+ name="CellOracle",
333
+ category=ToolCategory.GENE_REGULATORY_NETWORK,
334
+ modalities=["scRNA-seq", "scATAC-seq", "scMultiome"],
335
+ description="GRN-based in-silico perturbation prediction",
336
+ input_types=["count_matrix", "peak_matrix"],
337
+ output_types=["network_result", "perturbation_prediction"],
338
+ typical_runtime_hours=4.0,
339
+ ),
+    # ── Cell-cell communication ──
+    "CellChat": ToolSpec(
+        name="CellChat",
+        category=ToolCategory.CELL_COMMUNICATION,
+        modalities=["scRNA-seq", "spatial_transcriptomics"],
+        description="Ligand-receptor interaction inference with communication patterns",
+        input_types=["count_matrix", "cluster_result"],
+        output_types=["communication_result"],
+    ),
+    "NicheNet": ToolSpec(
+        name="NicheNet",
+        category=ToolCategory.CELL_COMMUNICATION,
+        modalities=["scRNA-seq"],
+        description="Ligand-target link prediction using prior knowledge",
+        input_types=["count_matrix", "de_result"],
+        output_types=["communication_result"],
+    ),
+    "LIANA": ToolSpec(
+        name="LIANA",
+        category=ToolCategory.CELL_COMMUNICATION,
+        modalities=["scRNA-seq", "spatial_transcriptomics"],
+        description="Framework unifying multiple ligand-receptor methods",
+        input_types=["count_matrix", "cluster_result"],
+        output_types=["communication_result"],
+    ),
+    # ── Spatial analysis ──
+    "squidpy": ToolSpec(
+        name="squidpy",
+        category=ToolCategory.SPATIAL,
+        modalities=["spatial_transcriptomics"],
+        description="Spatial omics analysis (neighborhood, co-occurrence, image features)",
+        input_types=["count_matrix", "spatial_coords"],
+        output_types=["spatial_result"],
+    ),
+    "cell2location": ToolSpec(
+        name="cell2location",
+        category=ToolCategory.SPATIAL,
+        modalities=["spatial_transcriptomics"],
+        description="Spatial deconvolution mapping cell types to tissue locations",
+        input_types=["count_matrix", "spatial_coords", "reference_h5ad"],
+        output_types=["deconvolution_result"],
+        requires_gpu=True,
+    ),
+    "BANKSY": ToolSpec(
+        name="BANKSY",
+        category=ToolCategory.SPATIAL,
+        modalities=["spatial_transcriptomics"],
+        description="Spatially-aware clustering combining cell and neighbor features",
+        input_types=["count_matrix", "spatial_coords"],
+        output_types=["cluster_result"],
+    ),
+    # ── Multimodal integration ──
+    "Harmony": ToolSpec(
+        name="Harmony",
+        category=ToolCategory.BATCH_CORRECTION,
+        modalities=["scRNA-seq", "scATAC-seq", "CITE-seq"],
+        description="Fast iterative batch correction on PCA embeddings",
+        input_types=["pca_embedding"],
+        output_types=["corrected_embedding"],
+    ),
+    "scanorama": ToolSpec(
+        name="scanorama",
+        category=ToolCategory.BATCH_CORRECTION,
+        modalities=["scRNA-seq"],
+        description="Panoramic stitching of scRNA-seq batches",
+        input_types=["count_matrix"],
+        output_types=["corrected_embedding", "corrected_matrix"],
+    ),
+    "BBKNN": ToolSpec(
+        name="BBKNN",
+        category=ToolCategory.BATCH_CORRECTION,
+        modalities=["scRNA-seq"],
+        description="Batch-balanced KNN graph construction",
+        input_types=["pca_embedding"],
+        output_types=["knn_graph"],
+    ),
+    "WNN": ToolSpec(
+        name="WNN",
+        category=ToolCategory.MULTIMODAL_INTEGRATION,
+        modalities=["CITE-seq", "scMultiome"],
+        description="Weighted nearest neighbors for multimodal integration (Seurat v4+)",
+        input_types=["rna_embedding", "protein_embedding"],
+        output_types=["multimodal_embedding"],
+    ),
+    "MOFA+": ToolSpec(
+        name="MOFA+",
+        category=ToolCategory.MULTIMODAL_INTEGRATION,
+        modalities=["scMultiome", "CITE-seq"],
+        description="Multi-omics factor analysis for unsupervised integration",
+        input_types=["count_matrix", "peak_matrix"],
+        output_types=["factor_result"],
+    ),
+    "ArchR": ToolSpec(
+        name="ArchR",
+        category=ToolCategory.PREPROCESSING,
+        modalities=["scATAC-seq", "scMultiome"],
+        description="Full-featured scATAC-seq analysis framework in R",
+        input_types=["fragments", "bam"],
+        output_types=["peak_matrix", "gene_activity_matrix"],
+        typical_runtime_hours=2.0,
+    ),
+    "Signac": ToolSpec(
+        name="Signac",
+        category=ToolCategory.PREPROCESSING,
+        modalities=["scATAC-seq", "scMultiome"],
+        description="Seurat extension for chromatin accessibility analysis",
+        input_types=["fragments", "peak_matrix"],
+        output_types=["peak_matrix", "motif_result"],
+    ),
+    "chromVAR": ToolSpec(
+        name="chromVAR",
+        category=ToolCategory.PEAK_CALLING,
+        modalities=["scATAC-seq", "scMultiome"],
+        description="TF motif accessibility deviation scoring",
+        input_types=["peak_matrix"],
+        output_types=["motif_deviation_scores"],
+    ),
+    # ── Gene set / pathway analysis ──
+    "GSEA": ToolSpec(
+        name="GSEA",
+        category=ToolCategory.GENE_SET_ANALYSIS,
+        modalities=["bulk_rna_seq", "scRNA-seq"],
+        description="Gene Set Enrichment Analysis (preranked or phenotype-based)",
+        input_types=["de_result", "ranked_gene_list"],
+        output_types=["pathway_result"],
+    ),
+    "clusterProfiler": ToolSpec(
+        name="clusterProfiler",
+        category=ToolCategory.GENE_SET_ANALYSIS,
+        modalities=["bulk_rna_seq", "scRNA-seq"],
+        description="ORA & GSEA with GO, KEGG, Reactome, and custom gene sets",
+        input_types=["de_result", "gene_list"],
+        output_types=["pathway_result"],
+    ),
+    "decoupleR": ToolSpec(
+        name="decoupleR",
+        category=ToolCategory.GENE_SET_ANALYSIS,
+        modalities=["scRNA-seq", "bulk_rna_seq", "spatial_transcriptomics"],
+        description="Unified framework for functional activity inference (TF, pathway)",
+        input_types=["count_matrix", "de_result"],
+        output_types=["activity_scores"],
+    ),
+    # ── Cell type annotation ──
+    "celltypist": ToolSpec(
+        name="celltypist",
+        category=ToolCategory.CELL_TYPE_ANNOTATION,
+        modalities=["scRNA-seq"],
+        description="Automated cell type classification with pre-trained models",
+        input_types=["count_matrix"],
+        output_types=["annotation_result"],
+    ),
+    "SingleR": ToolSpec(
+        name="SingleR",
+        category=ToolCategory.CELL_TYPE_ANNOTATION,
+        modalities=["scRNA-seq"],
+        description="Reference-based cell type annotation using correlation",
+        input_types=["count_matrix", "reference_dataset"],
+        output_types=["annotation_result"],
+    ),
+    "scArches": ToolSpec(
+        name="scArches",
+        category=ToolCategory.CELL_TYPE_ANNOTATION,
+        modalities=["scRNA-seq", "scATAC-seq", "CITE-seq"],
+        description="Reference mapping and label transfer via deep learning",
+        input_types=["count_matrix", "reference_model"],
+        output_types=["annotation_result", "latent_embedding"],
+        requires_gpu=True,
+    ),
+    # ── Imputation ──
+    "MAGIC": ToolSpec(
+        name="MAGIC",
+        category=ToolCategory.IMPUTATION,
+        modalities=["scRNA-seq"],
+        description="Markov affinity-based graph imputation of dropout zeros",
+        input_types=["count_matrix"],
+        output_types=["imputed_matrix"],
+    ),
+    # ── Perturbation analysis ──
+    "MILO": ToolSpec(
+        name="MILO",
+        category=ToolCategory.PERTURBATION_ANALYSIS,
+        modalities=["scRNA-seq"],
+        description="Differential abundance testing on KNN graph neighborhoods",
+        input_types=["count_matrix", "knn_graph"],
+        output_types=["da_result"],
+    ),
+    "Mixscape": ToolSpec(
+        name="Mixscape",
+        category=ToolCategory.PERTURBATION_ANALYSIS,
+        modalities=["Perturb-seq", "CROP-seq"],
+        description="Seurat extension for CRISPR screen perturbation analysis",
+        input_types=["count_matrix", "guide_assignments"],
+        output_types=["perturbation_result"],
+    ),
+    "MIMOSCA": ToolSpec(
+        name="MIMOSCA",
+        category=ToolCategory.PERTURBATION_ANALYSIS,
+        modalities=["Perturb-seq", "CROP-seq"],
+        description="Multi-input multi-output single-cell analysis for screens",
+        input_types=["count_matrix", "guide_assignments"],
+        output_types=["perturbation_result"],
+    ),
+    # ── Quality control ──
+    "scrublet": ToolSpec(
+        name="scrublet",
+        category=ToolCategory.QUALITY_CONTROL,
+        modalities=["scRNA-seq"],
+        description="Computational doublet detection via synthetic doublets",
+        input_types=["count_matrix"],
+        output_types=["doublet_scores"],
+    ),
+    "DoubletFinder": ToolSpec(
+        name="DoubletFinder",
+        category=ToolCategory.QUALITY_CONTROL,
+        modalities=["scRNA-seq"],
+        description="Artificial nearest-neighbor doublet detection",
+        input_types=["count_matrix"],
+        output_types=["doublet_scores"],
+    ),
+    "SoupX": ToolSpec(
+        name="SoupX",
+        category=ToolCategory.QUALITY_CONTROL,
+        modalities=["scRNA-seq"],
+        description="Ambient RNA contamination estimation and removal",
+        input_types=["count_matrix", "raw_count_matrix"],
+        output_types=["corrected_matrix"],
+    ),
+    "DecontX": ToolSpec(
+        name="DecontX",
+        category=ToolCategory.QUALITY_CONTROL,
+        modalities=["scRNA-seq"],
+        description="Bayesian ambient RNA decontamination",
+        input_types=["count_matrix"],
+        output_types=["corrected_matrix"],
+    ),
+    # ── Simulation ──
+    "Splatter": ToolSpec(
+        name="Splatter",
+        category=ToolCategory.SIMULATION,
+        modalities=["scRNA-seq"],
+        description="Flexible scRNA-seq data simulation framework",
+        input_types=["simulation_params"],
+        output_types=["simulated_count_matrix"],
+    ),
+}
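A note for readers of this diff: `TOOL_REGISTRY` is intended to be filtered by category, modality, and resource flags. A minimal standalone sketch of that lookup pattern (plain dataclasses instead of the pydantic models above, and only two illustrative entries rather than the full registry):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ToolSpec:
    # Trimmed-down stand-in for the pydantic ToolSpec in this diff
    name: str
    category: str
    modalities: List[str] = field(default_factory=list)
    requires_gpu: bool = False

TOOL_REGISTRY: Dict[str, ToolSpec] = {
    "DESeq2": ToolSpec("DESeq2", "differential_expression", ["bulk_rna_seq", "scRNA-seq"]),
    "scVI": ToolSpec("scVI", "dimensionality_reduction", ["scRNA-seq"], requires_gpu=True),
}

def tools_for(category: str, modality: str, gpu_available: bool = True) -> List[str]:
    """Names of registered tools matching a category/modality, honoring GPU limits."""
    return [
        spec.name
        for spec in TOOL_REGISTRY.values()
        if spec.category == category
        and modality in spec.modalities
        and (gpu_available or not spec.requires_gpu)
    ]

print(tools_for("differential_expression", "scRNA-seq"))  # → ['DESeq2']
```

The `tools_for` helper is hypothetical (it does not appear in this diff); it only illustrates how a simulator or agent could narrow the registry to tools that are valid for the current step.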
+
+
+class Modality(str, Enum):
+    SCRNA_SEQ = "scRNA-seq"
+    SCATAC_SEQ = "scATAC-seq"
+    CITE_SEQ = "CITE-seq"
+    SPATIAL_TRANSCRIPTOMICS = "spatial_transcriptomics"
+    BULK_RNA_SEQ = "bulk_rna_seq"
+    SCRNA_MULTIOME = "scMultiome"
+    PERTURB_SEQ = "Perturb-seq"
+    CROP_SEQ = "CROP-seq"
+    SMART_SEQ2 = "Smart-seq2"
+    SLIDE_SEQ = "Slide-seq"
+    MERFISH = "MERFISH"
+    SEQFISH = "seqFISH"
+    PATCH_SEQ = "Patch-seq"
+    SHARE_SEQ = "SHARE-seq"
+    SNARE_SEQ = "SNARE-seq"
+    SC_HI_C = "scHi-C"
+    SCBS_SEQ = "scBS-seq"
+    SCNMT_SEQ = "scNMT-seq"
+
+
+class ModalitySpec(BaseModel):
+    """Registry entry for a single-cell or bulk assay modality."""
+
+    name: str
+    modality: Modality
+    measurement: str = ""
+    resolution: str = "single-cell"
+    multiplexable: bool = False
+    typical_cells: str = "1k-20k"
+    typical_cost_per_sample_usd: float = 5000.0
+    compatible_tools: List[str] = Field(default_factory=list)
+    description: str = ""
+
+
+MODALITY_REGISTRY: Dict[str, ModalitySpec] = {
+    "scRNA-seq": ModalitySpec(
+        name="scRNA-seq",
+        modality=Modality.SCRNA_SEQ,
+        measurement="mRNA transcripts",
+        typical_cells="5k-20k",
+        typical_cost_per_sample_usd=5000.0,
+        compatible_tools=[
+            "CellRanger", "STARsolo", "kallisto_bustools", "Scanpy", "Seurat",
+            "scVI", "Leiden", "DESeq2", "MAST", "Monocle3", "scVelo", "SCENIC",
+            "CellChat", "GSEA", "celltypist", "scrublet",
+        ],
+        description="Droplet-based single-cell RNA sequencing (e.g. 10x Chromium)",
+    ),
+    "scATAC-seq": ModalitySpec(
+        name="scATAC-seq",
+        modality=Modality.SCATAC_SEQ,
+        measurement="open chromatin regions",
+        typical_cells="5k-15k",
+        typical_cost_per_sample_usd=6000.0,
+        compatible_tools=[
+            "CellRanger", "ArchR", "Signac", "chromVAR", "Scanpy", "Leiden",
+        ],
+        description="Single-cell Assay for Transposase-Accessible Chromatin",
+    ),
+    "CITE-seq": ModalitySpec(
+        name="CITE-seq",
+        modality=Modality.CITE_SEQ,
+        measurement="mRNA + surface proteins (ADT)",
+        multiplexable=True,
+        typical_cells="5k-20k",
+        typical_cost_per_sample_usd=8000.0,
+        compatible_tools=[
+            "CellRanger", "Seurat", "WNN", "MOFA+", "Scanpy", "Leiden",
+        ],
+        description="Cellular Indexing of Transcriptomes and Epitopes by Sequencing",
+    ),
+    "spatial_transcriptomics": ModalitySpec(
+        name="spatial_transcriptomics",
+        modality=Modality.SPATIAL_TRANSCRIPTOMICS,
+        measurement="spatially resolved transcripts",
+        resolution="spot (55µm) or subcellular",
+        typical_cells="1k-10k spots",
+        typical_cost_per_sample_usd=7000.0,
+        compatible_tools=[
+            "spaceranger", "squidpy", "cell2location", "BANKSY", "Scanpy", "Seurat",
+        ],
+        description="Spatially resolved transcriptomics (Visium, MERFISH, Slide-seq, etc.)",
+    ),
+    "bulk_rna_seq": ModalitySpec(
+        name="bulk_rna_seq",
+        modality=Modality.BULK_RNA_SEQ,
+        measurement="aggregate mRNA across cells",
+        resolution="bulk",
+        typical_cells="N/A",
+        typical_cost_per_sample_usd=500.0,
+        compatible_tools=["DESeq2", "edgeR", "GSEA", "clusterProfiler"],
+        description="Standard bulk RNA sequencing",
+    ),
+    "scMultiome": ModalitySpec(
+        name="scMultiome",
+        modality=Modality.SCRNA_MULTIOME,
+        measurement="mRNA + open chromatin (joint)",
+        typical_cells="5k-15k",
+        typical_cost_per_sample_usd=10000.0,
+        compatible_tools=[
+            "CellRanger", "ArchR", "Signac", "Seurat", "MOFA+", "CellOracle",
+        ],
+        description="10x Multiome (joint scRNA + scATAC from same cell)",
+    ),
+    "Perturb-seq": ModalitySpec(
+        name="Perturb-seq",
+        modality=Modality.PERTURB_SEQ,
+        measurement="mRNA + CRISPR guide assignment",
+        multiplexable=True,
+        typical_cells="10k-100k",
+        typical_cost_per_sample_usd=15000.0,
+        compatible_tools=[
+            "CellRanger", "Scanpy", "Seurat", "Mixscape", "MIMOSCA",
+        ],
+        description="Pooled CRISPR screens with single-cell RNA readout",
+    ),
+    "CROP-seq": ModalitySpec(
+        name="CROP-seq",
+        modality=Modality.CROP_SEQ,
+        measurement="mRNA + CRISPR guide assignment",
+        multiplexable=True,
+        typical_cells="10k-50k",
+        typical_cost_per_sample_usd=12000.0,
+        compatible_tools=[
+            "CellRanger", "Scanpy", "Seurat", "Mixscape", "MIMOSCA",
+        ],
+        description="CRISPR dropout screen with single-cell RNA readout",
+    ),
+    "Smart-seq2": ModalitySpec(
+        name="Smart-seq2",
+        modality=Modality.SMART_SEQ2,
+        measurement="full-length mRNA transcripts",
+        typical_cells="100-1000",
+        typical_cost_per_sample_usd=10000.0,
+        compatible_tools=["Scanpy", "Seurat", "DESeq2", "MAST", "Monocle3"],
+        description="Plate-based full-length scRNA-seq with high sensitivity",
+    ),
+    "MERFISH": ModalitySpec(
+        name="MERFISH",
+        modality=Modality.MERFISH,
+        measurement="in situ mRNA (imaging-based)",
+        resolution="subcellular",
+        typical_cells="10k-1M",
+        typical_cost_per_sample_usd=20000.0,
+        compatible_tools=["squidpy", "Scanpy", "BANKSY"],
+        description="Multiplexed Error-Robust FISH for spatial transcriptomics",
+    ),
+    "Slide-seq": ModalitySpec(
+        name="Slide-seq",
+        modality=Modality.SLIDE_SEQ,
+        measurement="spatially resolved mRNA (bead array)",
+        resolution="10µm",
+        typical_cells="10k-50k beads",
+        typical_cost_per_sample_usd=8000.0,
+        compatible_tools=["squidpy", "cell2location", "Scanpy"],
+        description="Near-cellular spatial transcriptomics on bead arrays",
+    ),
+    "Patch-seq": ModalitySpec(
+        name="Patch-seq",
+        modality=Modality.PATCH_SEQ,
+        measurement="mRNA + electrophysiology + morphology",
+        typical_cells="10-500",
+        typical_cost_per_sample_usd=50000.0,
+        compatible_tools=["Scanpy", "Seurat"],
+        description="Combined patch-clamp electrophysiology and scRNA-seq",
+    ),
+    "scHi-C": ModalitySpec(
+        name="scHi-C",
+        modality=Modality.SC_HI_C,
+        measurement="3D chromatin contacts",
+        typical_cells="1k-10k",
+        typical_cost_per_sample_usd=15000.0,
+        compatible_tools=["Scanpy"],
+        description="Single-cell chromosome conformation capture",
+    ),
+    "scBS-seq": ModalitySpec(
+        name="scBS-seq",
+        modality=Modality.SCBS_SEQ,
+        measurement="DNA methylation (CpG)",
+        typical_cells="100-5k",
+        typical_cost_per_sample_usd=12000.0,
+        compatible_tools=["Scanpy"],
+        description="Single-cell bisulfite sequencing for DNA methylation",
+    ),
+    "scNMT-seq": ModalitySpec(
+        name="scNMT-seq",
+        modality=Modality.SCNMT_SEQ,
+        measurement="nucleosome + methylation + transcription (joint)",
+        typical_cells="100-1k",
+        typical_cost_per_sample_usd=25000.0,
+        compatible_tools=["MOFA+", "Scanpy"],
+        description="Joint single-cell nucleosome, methylation, and transcription",
+    ),
+}
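Because `compatible_tools` entries are plain strings, the tool and modality registries can silently drift apart. A small consistency check of the kind the module could run at import time, sketched standalone here with toy stand-in registries (the `dangling_tool_refs` helper is hypothetical, not part of this diff):

```python
from typing import Dict, List, Set

# Toy stand-ins for TOOL_REGISTRY keys and MODALITY_REGISTRY.compatible_tools
TOOL_NAMES: Set[str] = {"Scanpy", "Seurat", "DESeq2"}
MODALITY_COMPAT: Dict[str, List[str]] = {
    "scRNA-seq": ["Scanpy", "Seurat", "DESeq2"],
    "bulk_rna_seq": ["DESeq2", "edgeR"],  # "edgeR" deliberately unregistered here
}

def dangling_tool_refs(
    compat: Dict[str, List[str]], tools: Set[str]
) -> Dict[str, List[str]]:
    """Map each modality to compatible_tools entries missing from the tool registry."""
    missing: Dict[str, List[str]] = {}
    for modality, tool_list in compat.items():
        unknown = [t for t in tool_list if t not in tools]
        if unknown:
            missing[modality] = unknown
    return missing

print(dangling_tool_refs(MODALITY_COMPAT, TOOL_NAMES))  # → {'bulk_rna_seq': ['edgeR']}
```

An empty result means every cross-registry reference resolves; anything else points at a typo or a tool that was removed without updating the modality entries.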
+
+
+class AssayCategory(str, Enum):
+    SEQUENCING = "sequencing"
+    IMAGING = "imaging"
+    PERTURBATION = "perturbation"
+    FUNCTIONAL = "functional"
+    EPIGENOMICS = "epigenomics"
+    PROTEOMICS = "proteomics"
+    METABOLOMICS = "metabolomics"
+
+
+class AssaySpec(BaseModel):
+    """Registry entry for a laboratory assay or protocol."""
+
+    name: str
+    category: AssayCategory
+    modalities: List[str] = Field(default_factory=list)
+    description: str = ""
+    typical_duration_days: float = 1.0
+    typical_cost_usd: float = 1000.0
+    requires_live_cells: bool = False
+    requires_fresh_tissue: bool = False
+    throughput: str = "medium"
+    outputs: List[str] = Field(default_factory=list)
+
+
+ASSAY_REGISTRY: Dict[str, AssaySpec] = {
+    "10x_chromium": AssaySpec(
+        name="10x_chromium",
+        category=AssayCategory.SEQUENCING,
+        modalities=["scRNA-seq", "scATAC-seq", "CITE-seq", "scMultiome"],
+        description="10x Genomics Chromium droplet-based single-cell partitioning",
+        typical_duration_days=2.0,
+        typical_cost_usd=5000.0,
+        requires_live_cells=True,
+        throughput="high (500-20k cells)",
+        outputs=["fastq", "count_matrix"],
+    ),
+    "smart-seq2": AssaySpec(
+        name="smart-seq2",
+        category=AssayCategory.SEQUENCING,
+        modalities=["Smart-seq2"],
+        description="Plate-based full-length cDNA scRNA-seq",
+        typical_duration_days=3.0,
+        typical_cost_usd=10000.0,
+        requires_live_cells=True,
+        throughput="low (96-384 cells)",
+        outputs=["fastq", "count_matrix"],
+    ),
+    "smart-seq3": AssaySpec(
+        name="smart-seq3",
+        category=AssayCategory.SEQUENCING,
+        modalities=["Smart-seq2"],
+        description="Improved full-length scRNA-seq with UMIs",
+        typical_duration_days=3.0,
+        typical_cost_usd=10000.0,
+        requires_live_cells=True,
+        throughput="low (96-384 cells)",
+        outputs=["fastq", "count_matrix"],
+    ),
+    "bulk_rna_seq": AssaySpec(
+        name="bulk_rna_seq",
+        category=AssayCategory.SEQUENCING,
+        modalities=["bulk_rna_seq"],
+        description="Standard bulk RNA sequencing with poly-A or ribo-depletion",
+        typical_duration_days=3.0,
+        typical_cost_usd=500.0,
+        throughput="high",
+        outputs=["fastq", "count_matrix"],
+    ),
+    "atac-seq": AssaySpec(
+        name="atac-seq",
+        category=AssayCategory.EPIGENOMICS,
+        modalities=["scATAC-seq"],
+        description="Assay for Transposase-Accessible Chromatin using sequencing",
+        typical_duration_days=2.0,
+        typical_cost_usd=6000.0,
+        requires_live_cells=True,
+        outputs=["fastq", "fragments", "peak_matrix"],
+    ),
+    "cite-seq": AssaySpec(
+        name="cite-seq",
+        category=AssayCategory.PROTEOMICS,
+        modalities=["CITE-seq"],
+        description="Simultaneous RNA + surface protein via DNA-barcoded antibodies",
+        typical_duration_days=2.0,
+        typical_cost_usd=8000.0,
+        requires_live_cells=True,
+        throughput="high (5k-20k cells)",
+        outputs=["fastq", "count_matrix", "adt_matrix"],
+    ),
+    "10x_multiome": AssaySpec(
+        name="10x_multiome",
+        category=AssayCategory.SEQUENCING,
+        modalities=["scMultiome"],
+        description="Joint scRNA-seq + scATAC-seq from the same cell",
+        typical_duration_days=2.0,
+        typical_cost_usd=10000.0,
+        requires_live_cells=True,
+        throughput="high (5k-15k cells)",
+        outputs=["fastq", "count_matrix", "fragments"],
+    ),
+    "visium": AssaySpec(
+        name="visium",
+        category=AssayCategory.SEQUENCING,
+        modalities=["spatial_transcriptomics"],
+        description="10x Visium spatially barcoded capture on tissue sections",
+        typical_duration_days=3.0,
+        typical_cost_usd=7000.0,
+        requires_fresh_tissue=True,
+        throughput="medium (1k-5k spots)",
+        outputs=["fastq", "count_matrix", "spatial_coords", "image"],
+    ),
+    "visium_hd": AssaySpec(
+        name="visium_hd",
+        category=AssayCategory.SEQUENCING,
+        modalities=["spatial_transcriptomics"],
+        description="High-definition Visium with 2µm bin resolution",
+        typical_duration_days=3.0,
+        typical_cost_usd=10000.0,
+        requires_fresh_tissue=True,
+        throughput="high",
+        outputs=["fastq", "count_matrix", "spatial_coords", "image"],
+    ),
+    "merfish": AssaySpec(
+        name="merfish",
+        category=AssayCategory.IMAGING,
+        modalities=["MERFISH"],
+        description="Multiplexed Error-Robust FISH imaging-based spatial",
+        typical_duration_days=5.0,
+        typical_cost_usd=20000.0,
+        requires_fresh_tissue=True,
+        throughput="high (100-1000 genes, millions of transcripts)",
+        outputs=["transcript_coords", "cell_segmentation"],
+    ),
+    "seqfish_plus": AssaySpec(
+        name="seqfish_plus",
+        category=AssayCategory.IMAGING,
+        modalities=["seqFISH"],
+        description="Sequential FISH for imaging-based spatial transcriptomics",
+        typical_duration_days=5.0,
+        typical_cost_usd=15000.0,
+        requires_fresh_tissue=True,
+        outputs=["transcript_coords"],
+    ),
+    "slide-seq": AssaySpec(
+        name="slide-seq",
+        category=AssayCategory.SEQUENCING,
+        modalities=["Slide-seq"],
+        description="Near-cellular spatial transcriptomics on bead arrays",
+        typical_duration_days=3.0,
+        typical_cost_usd=8000.0,
+        requires_fresh_tissue=True,
+        outputs=["count_matrix", "spatial_coords"],
+    ),
+    "perturb-seq": AssaySpec(
+        name="perturb-seq",
+        category=AssayCategory.PERTURBATION,
+        modalities=["Perturb-seq"],
+        description="Pooled CRISPR screen + scRNA-seq readout",
+        typical_duration_days=14.0,
+        typical_cost_usd=15000.0,
+        requires_live_cells=True,
+        throughput="high (10k-100k cells)",
+        outputs=["fastq", "count_matrix", "guide_assignments"],
+    ),
+    "crop-seq": AssaySpec(
+        name="crop-seq",
+        category=AssayCategory.PERTURBATION,
+        modalities=["CROP-seq"],
+        description="CRISPR dropout screening with scRNA-seq readout",
+        typical_duration_days=14.0,
+        typical_cost_usd=12000.0,
+        requires_live_cells=True,
+        throughput="high (10k-50k cells)",
+        outputs=["fastq", "count_matrix", "guide_assignments"],
+    ),
+    "patch-seq": AssaySpec(
+        name="patch-seq",
+        category=AssayCategory.FUNCTIONAL,
+        modalities=["Patch-seq"],
+        description="Patch-clamp electrophysiology + scRNA-seq on same neuron",
+        typical_duration_days=7.0,
+        typical_cost_usd=50000.0,
+        requires_live_cells=True,
+        throughput="very low (10-100 cells)",
+        outputs=["fastq", "count_matrix", "ephys_trace", "morphology"],
+    ),
+    "sc_hi_c": AssaySpec(
+        name="sc_hi_c",
973
+ category=AssayCategory.EPIGENOMICS,
974
+ modalities=["scHi-C"],
975
+ description="Single-cell chromosome conformation capture",
976
+ typical_duration_days=5.0,
977
+ typical_cost_usd=15000.0,
978
+ outputs=["contact_matrix"],
979
+ ),
980
+ "sc_bisulfite": AssaySpec(
981
+ name="sc_bisulfite",
982
+ category=AssayCategory.EPIGENOMICS,
983
+ modalities=["scBS-seq"],
984
+ description="Single-cell bisulfite sequencing for DNA methylation profiling",
985
+ typical_duration_days=5.0,
986
+ typical_cost_usd=12000.0,
987
+ outputs=["methylation_matrix"],
988
+ ),
989
+ "sc_nmt_seq": AssaySpec(
990
+ name="sc_nmt_seq",
991
+ category=AssayCategory.EPIGENOMICS,
992
+ modalities=["scNMT-seq"],
993
+ description="Joint nucleosome occupancy, methylation, and transcription",
994
+ typical_duration_days=7.0,
995
+ typical_cost_usd=25000.0,
996
+ requires_live_cells=True,
997
+ throughput="low (100-1k cells)",
998
+ outputs=["count_matrix", "methylation_matrix", "accessibility_matrix"],
999
+ ),
1000
+ "flow_cytometry": AssaySpec(
1001
+ name="flow_cytometry",
1002
+ category=AssayCategory.FUNCTIONAL,
1003
+ modalities=[],
1004
+ description="Fluorescence-based cell sorting and phenotyping",
1005
+ typical_duration_days=1.0,
1006
+ typical_cost_usd=500.0,
1007
+ requires_live_cells=True,
1008
+ throughput="very high (millions of cells)",
1009
+ outputs=["cell_counts", "sorted_cells"],
1010
+ ),
1011
+ "mass_cytometry_CyTOF": AssaySpec(
1012
+ name="mass_cytometry_CyTOF",
1013
+ category=AssayCategory.PROTEOMICS,
1014
+ modalities=[],
1015
+ description="Mass-tag cytometry for 40+ protein markers per cell",
1016
+ typical_duration_days=2.0,
1017
+ typical_cost_usd=3000.0,
1018
+ requires_live_cells=True,
1019
+ throughput="high (100k-1M cells)",
1020
+ outputs=["protein_expression_matrix"],
1021
+ ),
1022
+ "western_blot": AssaySpec(
1023
+ name="western_blot",
1024
+ category=AssayCategory.PROTEOMICS,
1025
+ modalities=[],
1026
+ description="Protein detection and semi-quantification by size separation",
1027
+ typical_duration_days=2.0,
1028
+ typical_cost_usd=200.0,
1029
+ outputs=["band_image", "relative_quantification"],
1030
+ ),
1031
+ "qPCR": AssaySpec(
1032
+ name="qPCR",
1033
+ category=AssayCategory.FUNCTIONAL,
1034
+ modalities=[],
1035
+ description="Quantitative PCR for targeted gene expression validation",
1036
+ typical_duration_days=1.0,
1037
+ typical_cost_usd=100.0,
1038
+ throughput="low (target genes)",
1039
+ outputs=["ct_values", "fold_change"],
1040
+ ),
1041
+ "immunofluorescence": AssaySpec(
1042
+ name="immunofluorescence",
1043
+ category=AssayCategory.IMAGING,
1044
+ modalities=[],
1045
+ description="Antibody-based fluorescence imaging of proteins in situ",
1046
+ typical_duration_days=2.0,
1047
+ typical_cost_usd=500.0,
1048
+ outputs=["fluorescence_image"],
1049
+ ),
1050
+ "elisa": AssaySpec(
1051
+ name="elisa",
1052
+ category=AssayCategory.PROTEOMICS,
1053
+ modalities=[],
1054
+ description="Enzyme-linked immunosorbent assay for secreted protein quantification",
1055
+ typical_duration_days=1.0,
1056
+ typical_cost_usd=300.0,
1057
+ throughput="medium (96-384 well)",
1058
+ outputs=["protein_concentration"],
1059
+ ),
1060
+ "cell_viability_assay": AssaySpec(
1061
+ name="cell_viability_assay",
1062
+ category=AssayCategory.FUNCTIONAL,
1063
+ modalities=[],
1064
+ description="MTT/CellTiter-Glo viability and proliferation measurement",
1065
+ typical_duration_days=1.0,
1066
+ typical_cost_usd=200.0,
1067
+ requires_live_cells=True,
1068
+ throughput="high (96-384 well)",
1069
+ outputs=["viability_scores"],
1070
+ ),
1071
+ }
1072
+
1073
+
1074
+ # ── Registry helper functions ──────────────────────────────────────────────
1075
+
1076
+
1077
+ def tools_for_modality(modality: str) -> List[ToolSpec]:
1078
+ """Return all registered tools compatible with a given modality."""
1079
+ return [t for t in TOOL_REGISTRY.values() if modality in t.modalities]
1080
+
1081
+
1082
+ def assays_for_modality(modality: str) -> List[AssaySpec]:
1083
+ """Return all registered assays that produce a given modality."""
1084
+ return [a for a in ASSAY_REGISTRY.values() if modality in a.modalities]
1085
+
1086
+
1087
+ def tools_by_category(category: ToolCategory) -> List[ToolSpec]:
1088
+ """Return all registered tools in a given category."""
1089
+ return [t for t in TOOL_REGISTRY.values() if t.category == category]
1090
+
1091
+
1092
+ # ── Sub-agents ───────────────────────────────────────────────────���─────────
1093
+
1094
+
1095
  class SubagentType(str, Enum):
1096
  WET_LAB_PLANNER = "wet_lab_planner"
1097
  COMPUTATIONAL_ANALYST = "computational_analyst"
 
     """

     action_type: ActionType = Field(
+        ...,
+        description=(
+            "Discrete simulator step type. The environment enforces scientific "
+            "prerequisites between steps, so actions should follow a valid "
+            "pipeline order."
+        ),
     )
     input_targets: List[str] = Field(
         default_factory=list,
+        description=(
+            "Optional references to prior samples, outputs, or artifacts that "
+            "this step consumes."
+        ),
     )
     method: Optional[str] = Field(
+        None,
+        description=(
+            "Optional named tool or protocol (for example 'Seurat' or "
+            "'CellRanger'). Prefer methods compatible with the current "
+            "modality and available tool list because tool choice can change "
+            "runtime, cost, and scientific fit."
+        ),
     )
     parameters: Dict[str, Any] = Field(
+        default_factory=dict,
+        description=(
+            "Action-specific settings such as comparison labels, perturbation "
+            "targets, or analysis options. Use only parameters that materially "
+            "change the scientific step."
+        ),
     )
     expected_output_type: Optional[str] = Field(
+        None,
+        description=(
+            "Optional expected artifact or summary that should result from the "
+            "step, such as a count matrix, QC report, DE table, or validation "
+            "result."
+        ),
     )
     justification: Optional[str] = Field(
+        None,
+        description=(
+            "Short scientific rationale explaining why this is the right next "
+            "step in the current environment state."
+        ),
     )
     invoked_subagent: Optional[SubagentType] = Field(
         None, description="Sub-agent to delegate to, if any"
     )
     tool_call_spec: Optional[Dict[str, Any]] = Field(
+        None,
+        description=(
+            "Optional structured tool invocation payload when the action needs "
+            "a more explicit tool execution plan."
+        ),
     )
     confidence: float = Field(
         0.5, ge=0.0, le=1.0, description="Agent confidence in this step"

  organism: str = "human"
1268
  tissue: str = "blood"
1269
  conditions: List[str] = Field(default_factory=list)
1270
+ available_assays: List[str] = Field(
1271
+ default_factory=lambda: list(ASSAY_REGISTRY.keys()),
1272
+ description=(
1273
+ "Assays that are scientifically compatible with this task's "
1274
+ "modality. These are the relevant assay choices for the episode, "
1275
+ "not an unrestricted catalog."
1276
+ ),
1277
+ )
1278
+ available_tools: List[str] = Field(
1279
+ default_factory=lambda: list(TOOL_REGISTRY.keys()),
1280
+ description=(
1281
+ "Tools filtered to those compatible with the current task "
1282
+ "modality. The agent should treat this list as the preferred tool "
1283
+ "set for the episode."
1284
+ ),
1285
+ )
1286
  budget_limit: float = 100_000.0
1287
  time_limit_days: float = 180.0
1288
  prior_observations: List[str] = Field(default_factory=list)
 


 class ConclusionClaim(BaseModel):
+    claim: str = ""
+    top_markers: List[str] = Field(default_factory=list)
+    causal_mechanisms: List[str] = Field(default_factory=list)
+    predicted_pathways: Dict[str, float] = Field(default_factory=dict)
     evidence_steps: List[int] = Field(default_factory=list)
     confidence: float = Field(0.5, ge=0.0, le=1.0)
     claim_type: str = "correlational"

     task: TaskSpec = Field(default_factory=TaskSpec)
     step_index: int = 0
     pipeline_history: List[PipelineStepRecord] = Field(default_factory=list)
+    available_assays: List[str] = Field(
+        default_factory=list,
+        description=(
+            "Episode-specific assay choices already filtered to the current "
+            "modality and task context."
+        ),
+    )
+    available_tools: List[str] = Field(
+        default_factory=list,
+        description=(
+            "Episode-specific compatible tools. These are the methods the "
+            "agent should prefer instead of inventing incompatible tools."
+        ),
+    )
+    resource_usage: ResourceUsage = Field(
+        default_factory=ResourceUsage,
+        description=(
+            "Running budget, time, and compute usage after previous actions."
+        ),
+    )
     latest_output: Optional[IntermediateOutput] = None
     all_outputs: List[IntermediateOutput] = Field(default_factory=list)
     discovered_markers: List[str] = Field(default_factory=list)

     conclusions: List[ConclusionClaim] = Field(default_factory=list)
     rule_violations: List[str] = Field(default_factory=list)
     step_reward_breakdown: Dict[str, float] = Field(default_factory=dict)
+
+
+AGENT_ACTION_GUIDANCE: Dict[ActionType, str] = {
+    ActionType.COLLECT_SAMPLE: (
+        "Wet-lab entry point. One successful collection usually provides enough "
+        "material to continue unless the output shows poor yield or quality."
+    ),
+    ActionType.SELECT_COHORT: (
+        "Use when subject stratification is part of the scientific question "
+        "before downstream experimental work."
+    ),
+    ActionType.PREPARE_LIBRARY: (
+        "Requires collected samples and converts biological material into "
+        "sequence-ready libraries."
+    ),
+    ActionType.CULTURE_CELLS: (
+        "Requires collected samples and adds substantial time; use only when "
+        "live-cell expansion or later perturbation is needed."
+    ),
+    ActionType.PERTURB_GENE: (
+        "Requires samples. Use for causal tests, not as a default discovery "
+        "step."
+    ),
+    ActionType.PERTURB_COMPOUND: (
+        "Requires samples. Best for mechanistic follow-up or treatment "
+        "response questions."
+    ),
+    ActionType.SEQUENCE_CELLS: (
+        "Requires prepared libraries and produces the raw sequencing-derived "
+        "artifacts used by downstream QC and analysis."
+    ),
+    ActionType.RUN_QC: (
+        "Requires sequencing and returns summarized quality metrics such as "
+        "doublets, mitochondrial fraction, and ambient RNA."
+    ),
+    ActionType.FILTER_DATA: (
+        "Requires QC and removes poor-quality cells, changing downstream cell "
+        "counts and data retention."
+    ),
+    ActionType.NORMALIZE_DATA: (
+        "Requires filtered data and unlocks clustering, differential "
+        "expression, trajectory, and network analyses."
+    ),
+    ActionType.INTEGRATE_BATCHES: (
+        "Requires normalized data. Use when batch effects are likely to "
+        "confound interpretation; it is not always necessary."
+    ),
+    ActionType.CLUSTER_CELLS: (
+        "Requires normalized data and identifies cell populations or states "
+        "for downstream interpretation."
+    ),
+    ActionType.DIFFERENTIAL_EXPRESSION: (
+        "Requires normalized data and is the main route to candidate genes "
+        "for pathway analysis and marker selection."
+    ),
+    ActionType.TRAJECTORY_ANALYSIS: (
+        "Requires normalized data and is most useful when lineage progression "
+        "or pseudotime is central to the task."
+    ),
+    ActionType.PATHWAY_ENRICHMENT: (
+        "Requires differential expression. Results are less reliable without a "
+        "strong DE gene list."
+    ),
+    ActionType.REGULATORY_NETWORK_INFERENCE: (
+        "Requires normalized data and is most helpful once cell states or "
+        "trajectories are already characterized."
+    ),
+    ActionType.MARKER_SELECTION: (
+        "Requires differential expression and turns candidate genes into a "
+        "short list for validation."
+    ),
+    ActionType.VALIDATE_MARKER: (
+        "Requires discovered markers and is an expensive wet-lab confirmation "
+        "step that should follow strong computational evidence."
+    ),
+    ActionType.DESIGN_FOLLOWUP: (
+        "Use to propose targeted next experiments once remaining uncertainty "
+        "is clear."
+    ),
+    ActionType.REQUEST_SUBAGENT_REVIEW: (
+        "Use for critique or planning support, not as a substitute for "
+        "missing experimental evidence."
+    ),
+    ActionType.SYNTHESIZE_CONCLUSION: (
+        "Use once the evidence is sufficient. Do not spend budget on redundant "
+        "steps just because more actions are possible."
+    ),
+}
+
+AGENT_ENVIRONMENT_RULES: List[str] = [
+    (
+        "Each successful action already returns summarized scientific evidence, "
+        "so repeated sampling or repeated analysis is not the default."
+    ),
+    (
+        "Repeat a step only when the task demands it or when prior outputs show "
+        "poor quality, insufficient yield, unresolved batch effects, or another "
+        "clear failure mode."
+    ),
+    (
+        "The available tool and assay lists are already filtered to the current "
+        "task modality, so prefer them over inventing incompatible methods."
+    ),
+    (
+        "Hard scientific prerequisites are enforced by the environment, so "
+        "invalid pipeline orderings will be blocked."
+    ),
+]
+
+_TOOL_CATEGORY_AGENT_NOTES: Dict[ToolCategory, str] = {
+    ToolCategory.ALIGNMENT: (
+        "Best immediately after sequencing to turn FASTQ-like inputs into "
+        "count-style matrices for downstream analysis."
+    ),
+    ToolCategory.PREPROCESSING: (
+        "Useful for general single-cell data handling before specialized "
+        "downstream analyses."
+    ),
+    ToolCategory.NORMALIZATION: (
+        "Applies after filtering to produce normalized matrices for downstream "
+        "modeling."
+    ),
+    ToolCategory.DIMENSIONALITY_REDUCTION: (
+        "Builds latent embeddings that support clustering or trajectory work."
+    ),
+    ToolCategory.CLUSTERING: (
+        "Best once data are normalized and the goal is to resolve cell states "
+        "or populations."
+    ),
+    ToolCategory.DIFFERENTIAL_EXPRESSION: (
+        "Tests contrasts and produces ranked genes for biological "
+        "interpretation."
+    ),
+    ToolCategory.TRAJECTORY: (
+        "Useful when the task asks about developmental progression, state "
+        "transitions, or pseudotime."
+    ),
+    ToolCategory.GENE_REGULATORY_NETWORK: (
+        "Most useful after normalized data and some cell-state structure are "
+        "already established."
+    ),
+    ToolCategory.GENE_SET_ANALYSIS: (
+        "Best after differential expression to interpret gene lists at the "
+        "pathway level."
+    ),
+    ToolCategory.BATCH_CORRECTION: (
+        "Use when batch effects would confound interpretation; unnecessary use "
+        "adds extra steps."
+    ),
+    ToolCategory.MULTIMODAL_INTEGRATION: (
+        "Useful only when combining modalities or batches is part of the "
+        "scientific question."
+    ),
+    ToolCategory.QUALITY_CONTROL: (
+        "Helps identify low-quality cells or technical artifacts before "
+        "filtering."
+    ),
+    ToolCategory.CELL_TYPE_ANNOTATION: (
+        "Best after clustering when assigning biological identities to groups."
+    ),
+    ToolCategory.PERTURBATION_ANALYSIS: (
+        "Use when perturbations were actually applied and the goal is to model "
+        "their transcriptional effects."
+    ),
+    ToolCategory.SPATIAL: (
+        "Only useful when the modality includes spatial coordinates or tissue "
+        "context."
+    ),
+}
+
+
+def _format_currency(value: float) -> str:
+    return f"${value:,.0f}"
+
+
+def _format_runtime_hours(hours: float) -> str:
+    if hours < 1.0:
+        return f"{int(round(hours * 60))}m"
+    if float(hours).is_integer():
+        return f"{int(hours)}h"
+    return f"{hours:.1f}h"
+
+
+def describe_tool_for_agent(tool_name: str) -> str:
+    """Return a compact environment-aware tool description for prompts."""
+    tool = TOOL_REGISTRY.get(tool_name)
+    if tool is None:
+        return tool_name
+
+    parts = [f"{tool.name}: {tool.description}."]
+    if tool.input_types or tool.output_types:
+        inputs = ", ".join(tool.input_types) or "upstream artifacts"
+        outputs = ", ".join(tool.output_types) or "analysis artifacts"
+        parts.append(f"Consumes {inputs}; yields {outputs}.")
+
+    category_note = _TOOL_CATEGORY_AGENT_NOTES.get(tool.category)
+    if category_note:
+        parts.append(category_note)
+
+    resource_bits: List[str] = []
+    if tool.typical_cost_usd > 0:
+        resource_bits.append(_format_currency(tool.typical_cost_usd))
+    if tool.typical_runtime_hours > 0:
+        resource_bits.append(_format_runtime_hours(tool.typical_runtime_hours))
+    if tool.requires_gpu:
+        resource_bits.append("GPU")
+    if resource_bits:
+        parts.append(f"Typical resources: {', '.join(resource_bits)}.")
+
+    return " ".join(parts)
+
+
+def describe_assay_for_agent(assay_name: str) -> str:
+    """Return a compact environment-aware assay description for prompts."""
+    assay = ASSAY_REGISTRY.get(assay_name)
+    if assay is None:
+        return assay_name
+
+    parts = [f"{assay.name}: {assay.description}."]
+    if assay.outputs:
+        parts.append(f"Produces {', '.join(assay.outputs)}.")
+
+    requirements: List[str] = []
+    if assay.requires_live_cells:
+        requirements.append("live cells")
+    if assay.requires_fresh_tissue:
+        requirements.append("fresh tissue")
+    if requirements:
+        parts.append(f"Requires {' and '.join(requirements)}.")
+
+    parts.append(
+        "Typical resources: "
+        f"{_format_currency(assay.typical_cost_usd)}, "
+        f"{assay.typical_duration_days:.1f}d."
+    )
+    return " ".join(parts)
+
+
+def build_agent_system_prompt() -> str:
+    """Build the shared agent system prompt for training and inference."""
+    lines = [
+        "You are an expert biologist planning a single-cell experiment pipeline.",
+        "",
+        "At each turn you see the experiment state and must pick the next scientifically justified step.",
+        "",
+        "Environment-specific reasoning rules:",
+    ]
+    lines.extend(f" - {rule}" for rule in AGENT_ENVIRONMENT_RULES)
+    lines.append("")
+    lines.append("Action guidance:")
+    lines.extend(
+        f" - {action_type.value}: {AGENT_ACTION_GUIDANCE[action_type]}"
+        for action_type in ActionType
+    )
+    lines.extend([
+        "",
+        "Respond with ONLY valid JSON, nothing else:",
+        '{"action_type": "...", "method": null, "parameters": {}, "justification": "...", "confidence": 0.8}',
+        "",
+        "For synthesize_conclusion, use structured claims:",
+        '{"action_type": "synthesize_conclusion", "parameters": {"claims": [{"top_markers": ["GENE1", "GENE2"], "causal_mechanisms": ["mechanism description"], "predicted_pathways": {"pathway_name": 0.8}, "confidence": 0.8, "claim_type": "causal", "claim": "optional free text"}]}, "justification": "...", "confidence": 0.8}',
+    ])
+    return "\n".join(lines)
+
+
+def build_agent_observation_context(
+    obs: ExperimentObservation,
+    *,
+    max_tools: int = 6,
+    max_assays: int = 3,
+) -> str:
+    """Summarize modality-specific tool and assay context for the agent."""
+    sections: List[str] = []
+
+    modality_spec = MODALITY_REGISTRY.get(obs.task.modality)
+    if modality_spec is not None:
+        sections.append(
+            "Modality context: "
+            f"{modality_spec.name} measures {modality_spec.measurement} at "
+            f"{modality_spec.resolution} resolution; typical scale "
+            f"{modality_spec.typical_cells}."
+        )
+    else:
+        sections.append(f"Modality context: {obs.task.modality}.")
+
+    tool_names = list(dict.fromkeys(obs.available_tools or obs.task.available_tools))
+    if tool_names:
+        sections.append("Available tools (already filtered to this modality):")
+        for tool_name in tool_names[:max_tools]:
+            sections.append(f" - {describe_tool_for_agent(tool_name)}")
+        if len(tool_names) > max_tools:
+            remainder = ", ".join(tool_names[max_tools:max_tools + 6])
+            sections.append(
+                " - Additional compatible tools not shown in full: "
+                f"{remainder}"
+            )
+
+    assay_names = list(dict.fromkeys(obs.available_assays or obs.task.available_assays))
+    if assay_names:
+        sections.append("Available assays:")
+        for assay_name in assay_names[:max_assays]:
+            sections.append(f" - {describe_assay_for_agent(assay_name)}")
+        if len(assay_names) > max_assays:
+            remainder = ", ".join(assay_names[max_assays:max_assays + 4])
+            sections.append(
+                " - Additional compatible assays not shown in full: "
+                f"{remainder}"
+            )
+
+    return "\n".join(sections)
my_env/README.md ADDED
@@ -0,0 +1,255 @@
+---
+title: My Env Environment Server
+emoji: 🎤
+colorFrom: gray
+colorTo: indigo
+sdk: docker
+pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
+---
+
+# My Env Environment
+
+A simple test environment that echoes back messages. Perfect for testing the env APIs as well as demonstrating environment usage patterns.
+
+## Quick Start
+
+The simplest way to use the My Env environment is through the `MyEnv` class:
+
+```python
+from my_env import MyAction, MyEnv
+
+try:
+    # Create environment from Docker image
+    my_envenv = MyEnv.from_docker_image("my_env-env:latest")
+
+    # Reset
+    result = my_envenv.reset()
+    print(f"Reset: {result.observation.echoed_message}")
+
+    # Send multiple messages
+    messages = ["Hello, World!", "Testing echo", "Final message"]
+
+    for msg in messages:
+        result = my_envenv.step(MyAction(message=msg))
+        print(f"Sent: '{msg}'")
+        print(f"  → Echoed: '{result.observation.echoed_message}'")
+        print(f"  → Length: {result.observation.message_length}")
+        print(f"  → Reward: {result.reward}")
+
+finally:
+    # Always clean up
+    my_envenv.close()
+```
+
+That's it! The `MyEnv.from_docker_image()` method handles:
+- Starting the Docker container
+- Waiting for the server to be ready
+- Connecting to the environment
+- Container cleanup when you call `close()`
+
+## Building the Docker Image
+
+Before using the environment, you need to build the Docker image:
+
+```bash
+# From project root
+docker build -t my_env-env:latest -f server/Dockerfile .
+```
+
+## Deploying to Hugging Face Spaces
+
+You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
+
+```bash
+# From the environment directory (where openenv.yaml is located)
+openenv push
+
+# Or specify options
+openenv push --namespace my-org --private
+```
+
+The `openenv push` command will:
+1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
+2. Prepare a custom build for a Hugging Face Docker Space (enables the web interface)
+3. Upload to Hugging Face (ensuring you're logged in)
+
+### Prerequisites
+
+- Authenticate with Hugging Face: The command will prompt for login if not already authenticated
+
+### Options
+
+- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
+- `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
+- `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM)
+- `--private`: Deploy the space as private (default: public)
+
+### Examples
+
+```bash
+# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
+openenv push
+
+# Push to a specific repository
+openenv push --repo-id my-org/my-env
+
+# Push with a custom base image
+openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
+
+# Push as a private space
+openenv push --private
+
+# Combine options
+openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
+```
+
+After deployment, your space will be available at:
+`https://huggingface.co/spaces/<repo-id>`
+
+The deployed space includes:
+- **Web Interface** at `/web` - Interactive UI for exploring the environment
+- **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
+- **Health Check** at `/health` - Container health monitoring
+- **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
118
+
119
+ ## Environment Details
120
+
121
+ ### Action
122
+ **MyAction**: Contains a single field
123
+ - `message` (str) - The message to echo back
124
+
125
+ ### Observation
126
+ **MyObservation**: Contains the echo response and metadata
127
+ - `echoed_message` (str) - The message echoed back
128
+ - `message_length` (int) - Length of the message
129
+ - `reward` (float) - Reward based on message length (length × 0.1)
130
+ - `done` (bool) - Always False for echo environment
131
+ - `metadata` (dict) - Additional info like step count
132
+
133
+ ### Reward
134
+ The reward is calculated as: `message_length × 0.1`
135
+ - "Hi" → reward: 0.2
136
+ - "Hello, World!" → reward: 1.3
137
+ - Empty message → reward: 0.0
138
+
139
+ ## Advanced Usage
140
+
141
+ ### Connecting to an Existing Server
142
+
143
+ If you already have a My Env environment server running, you can connect directly:
144
+
145
+ ```python
146
+ from my_env import MyEnv
147
+
148
+ # Connect to existing server
149
+ my_envenv = MyEnv(base_url="<ENV_HTTP_URL_HERE>")
150
+
151
+ # Use as normal
152
+ result = my_envenv.reset()
153
+ result = my_envenv.step(MyAction(message="Hello!"))
154
+ ```
155
+
156
+ Note: When connecting to an existing server, `my_envenv.close()` will NOT stop the server.
157
+
158
+ ### Using the Context Manager
159
+
160
+ The client supports context manager usage for automatic connection management:
161
+
162
+ ```python
163
+ from my_env import MyAction, MyEnv
164
+
165
+ # Connect with context manager (auto-connects and closes)
166
+ with MyEnv(base_url="http://localhost:8000") as env:
167
+ result = env.reset()
168
+ print(f"Reset: {result.observation.echoed_message}")
169
+ # Multiple steps with low latency
170
+ for msg in ["Hello", "World", "!"]:
171
+ result = env.step(MyAction(message=msg))
172
+ print(f"Echoed: {result.observation.echoed_message}")
173
+ ```
174
+
175
+ The client uses WebSocket connections for:
176
+ - **Lower latency**: No HTTP connection overhead per request
177
+ - **Persistent session**: Server maintains your environment state
178
+ - **Efficient for episodes**: Better for many sequential steps
179
+
180
+ ### Concurrent WebSocket Sessions
181
+
182
+ The server supports multiple concurrent WebSocket connections. To enable this,
183
+ modify `server/app.py` to use factory mode:
184
+
185
+ ```python
186
+ # In server/app.py - use factory mode for concurrent sessions
187
+ app = create_app(
188
+ MyEnvironment, # Pass class, not instance
189
+ MyAction,
190
+ MyObservation,
191
+ max_concurrent_envs=4, # Allow 4 concurrent sessions
192
+ )
193
+ ```
194
+
195
+ Then multiple clients can connect simultaneously:
196
+
197
+ ```python
198
+ from my_env import MyAction, MyEnv
199
+ from concurrent.futures import ThreadPoolExecutor
200
+
201
+ def run_episode(client_id: int):
202
+ with MyEnv(base_url="http://localhost:8000") as env:
203
+ result = env.reset()
204
+ for i in range(10):
205
+ result = env.step(MyAction(message=f"Client {client_id}, step {i}"))
206
+ return client_id, result.observation.message_length
207
+
208
+ # Run 4 episodes concurrently
209
+ with ThreadPoolExecutor(max_workers=4) as executor:
210
+ results = list(executor.map(run_episode, range(4)))
211
+ ```
212
+
213
+ ## Development & Testing
214
+
215
+ ### Direct Environment Testing
216
+
217
+ Test the environment logic directly without starting the HTTP server:
218
+
219
+ ```bash
220
+ # From the server directory
221
+ python3 server/my_env_environment.py
222
+ ```
223
+
224
+ This verifies that:
225
+ - Environment resets correctly
226
+ - Step executes actions properly
227
+ - State tracking works
228
+ - Rewards are calculated correctly
229
+
230
+ ### Running Locally
231
+
232
+ Run the server locally for development:
233
+
234
+ ```bash
235
+ uvicorn server.app:app --reload
236
+ ```
237
+
238
+ ## Project Structure
239
+
240
+ ```
241
+ my_env/
242
+ ├── .dockerignore # Docker build exclusions
243
+ ├── __init__.py # Module exports
244
+ ├── README.md # This file
245
+ ├── openenv.yaml # OpenEnv manifest
246
+ ├── pyproject.toml # Project metadata and dependencies
247
+ ├── uv.lock # Locked dependencies (generated)
248
+ ├── client.py # MyEnv client
249
+ ├── models.py # Action and Observation models
250
+ └── server/
251
+ ├── __init__.py # Server module exports
252
+ ├── my_env_environment.py # Core environment logic
253
+ ├── app.py # FastAPI application (HTTP + WebSocket endpoints)
254
+ └── Dockerfile # Container image definition
255
+ ```
my_env/__init__.py ADDED
@@ -0,0 +1,16 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """My Env Environment."""
+
+ from .client import MyEnv
+ from .models import MyAction, MyObservation
+
+ __all__ = [
+     "MyAction",
+     "MyObservation",
+     "MyEnv",
+ ]
my_env/client.py ADDED
@@ -0,0 +1,99 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """My Env Environment Client."""
+
+ from typing import Dict
+
+ from openenv.core.client_types import StepResult
+ from openenv.core.env_server.types import State
+ from openenv.core import EnvClient
+
+ from .models import MyAction, MyObservation
+
+
+ class MyEnv(
+     EnvClient[MyAction, MyObservation]
+ ):
+     """
+     Client for the My Env Environment.
+
+     This client maintains a persistent WebSocket connection to the environment server,
+     enabling efficient multi-step interactions with lower latency.
+     Each client instance has its own dedicated environment session on the server.
+
+     Example:
+         >>> # Connect to a running server
+         >>> with MyEnv(base_url="http://localhost:8000") as client:
+         ...     result = client.reset()
+         ...     print(result.observation.echoed_message)
+         ...
+         ...     result = client.step(MyAction(message="Hello!"))
+         ...     print(result.observation.echoed_message)
+
+     Example with Docker:
+         >>> # Automatically start container and connect
+         >>> client = MyEnv.from_docker_image("my_env-env:latest")
+         >>> try:
+         ...     result = client.reset()
+         ...     result = client.step(MyAction(message="Test"))
+         ... finally:
+         ...     client.close()
+     """
+
+     def _step_payload(self, action: MyAction) -> Dict:
+         """
+         Convert MyAction to JSON payload for step message.
+
+         Args:
+             action: MyAction instance
+
+         Returns:
+             Dictionary representation suitable for JSON encoding
+         """
+         return {
+             "message": action.message,
+         }
+
+     def _parse_result(self, payload: Dict) -> StepResult[MyObservation]:
+         """
+         Parse server response into StepResult[MyObservation].
+
+         Args:
+             payload: JSON response data from server
+
+         Returns:
+             StepResult with MyObservation
+         """
+         obs_data = payload.get("observation", {})
+         observation = MyObservation(
+             echoed_message=obs_data.get("echoed_message", ""),
+             message_length=obs_data.get("message_length", 0),
+             done=payload.get("done", False),
+             reward=payload.get("reward"),
+             metadata=obs_data.get("metadata", {}),
+         )
+
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict) -> State:
+         """
+         Parse server response into State object.
+
+         Args:
+             payload: JSON response from state request
+
+         Returns:
+             State object with episode_id and step_count
+         """
+         return State(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+         )
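The `_step_payload` / `_parse_result` pair above defines a simple JSON wire format. A stdlib-only sketch of the round trip (the dict shapes mirror the methods above; the variable names and the sample reply values are illustrative):

```python
import json

# What _step_payload would send for MyAction(message="Hello")
step_payload = {"message": "Hello"}

# A server reply as it would arrive over the socket, serialized as JSON
raw_reply = json.dumps({
    "observation": {"echoed_message": "Hello", "message_length": 5, "metadata": {}},
    "reward": 0.5,
    "done": False,
})

# What _parse_result would extract, using .get() defaults for missing keys
payload = json.loads(raw_reply)
obs = payload.get("observation", {})
result = {
    "echoed_message": obs.get("echoed_message", ""),
    "message_length": obs.get("message_length", 0),
    "reward": payload.get("reward"),
    "done": payload.get("done", False),
}
assert result["echoed_message"] == "Hello" and result["reward"] == 0.5
```

The `.get()` defaults matter: a reply missing `observation` still parses into a valid, empty observation instead of raising.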
my_env/models.py ADDED
@@ -0,0 +1,28 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ Data models for the My Env Environment.
+
+ The my_env environment is a simple test environment that echoes back messages.
+ """
+
+ from pydantic import Field
+
+ from openenv.core.env_server.types import Action, Observation
+
+
+ class MyAction(Action):
+     """Action for the My Env environment - just a message to echo."""
+
+     message: str = Field(..., description="Message to echo back")
+
+
+ class MyObservation(Observation):
+     """Observation from the My Env environment - the echoed message."""
+
+     echoed_message: str = Field(default="", description="The echoed message")
+     message_length: int = Field(default=0, description="Length of the echoed message")
my_env/openenv.yaml ADDED
@@ -0,0 +1,7 @@
+ spec_version: 1
+ name: my_env
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+
my_env/pyproject.toml ADDED
@@ -0,0 +1,45 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-my_env"
+ version = "0.1.0"
+ description = "My Env environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+     # install from github:
+     # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+     "openenv-core[core]>=0.2.0",
+     # Environment-specific dependencies
+     # Add all dependencies needed for your environment here
+     # Examples:
+     # "numpy>=1.19.0",
+     # "torch>=2.0.0",
+     # "gymnasium>=0.29.0",
+     # "openspiel>=1.0.0",
+     # "smolagents>=1.22.0,<2",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ # Server entry point - enables running via: uv run --project . server
+ # or: python -m my_env.server.app
+ server = "my_env.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["my_env", "my_env.server"]
+ package-dir = { "my_env" = ".", "my_env.server" = "server" }
my_env/server/Dockerfile ADDED
@@ -0,0 +1,80 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ # Multi-stage build using openenv-base
+ # This Dockerfile is flexible and works for both:
+ #   - In-repo environments (with local OpenEnv sources)
+ #   - Standalone environments (with openenv from PyPI/Git)
+ # The build script (openenv build) handles context detection and sets appropriate build args.
+
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+ FROM ${BASE_IMAGE} AS builder
+
+ WORKDIR /app
+
+ # Ensure git is available (required for installing dependencies from VCS)
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git && \
+     rm -rf /var/lib/apt/lists/*
+
+ # Build argument to control whether we're building standalone or in-repo
+ ARG BUILD_MODE=in-repo
+ ARG ENV_NAME=my_env
+
+ # Copy environment code (always at root of build context)
+ COPY . /app/env
+
+ # For in-repo builds, openenv is already vendored in the build context
+ # For standalone builds, openenv will be installed via pyproject.toml
+ WORKDIR /app/env
+
+ # Ensure uv is available (for local builds where base image lacks it)
+ RUN if ! command -v uv >/dev/null 2>&1; then \
+         curl -LsSf https://astral.sh/uv/install.sh | sh && \
+         mv /root/.local/bin/uv /usr/local/bin/uv && \
+         mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+     fi
+
+ # Install dependencies using uv sync
+ # If uv.lock exists, use it; otherwise resolve on the fly
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-install-project --no-editable; \
+     else \
+         uv sync --no-install-project --no-editable; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-editable; \
+     else \
+         uv sync --no-editable; \
+     fi
+
+ # Final runtime stage
+ FROM ${BASE_IMAGE}
+
+ WORKDIR /app
+
+ # Copy the virtual environment from builder
+ COPY --from=builder /app/env/.venv /app/.venv
+
+ # Copy the environment code
+ COPY --from=builder /app/env /app/env
+
+ # Set PATH to use the virtual environment
+ ENV PATH="/app/.venv/bin:$PATH"
+
+ # Set PYTHONPATH so imports work correctly
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+     CMD curl -f http://localhost:8000/health || exit 1
+
+ # Run the FastAPI server
+ # The module path is constructed to work with the /app/env structure
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
my_env/server/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """My Env environment server components."""
+
+ from .my_env_environment import MyEnvironment
+
+ __all__ = ["MyEnvironment"]
my_env/server/app.py ADDED
@@ -0,0 +1,81 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the My Env Environment.
+
+ This module creates an HTTP server that exposes the MyEnvironment
+ over HTTP and WebSocket endpoints, compatible with EnvClient.
+
+ Endpoints:
+     - POST /reset: Reset the environment
+     - POST /step: Execute an action
+     - GET /state: Get current environment state
+     - GET /schema: Get action/observation schemas
+     - WS /ws: WebSocket endpoint for persistent sessions
+
+ Usage:
+     # Development (with auto-reload):
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+
+     # Production:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+
+     # Or run directly:
+     python -m server.app
+ """
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv is required for the web interface. Install dependencies with 'uv sync'."
+     ) from e
+
+ # Import from local models.py (PYTHONPATH includes /app/env in Docker)
+ from models import MyAction, MyObservation
+
+ from .my_env_environment import MyEnvironment
+
+
+ # Create the app with web interface and README integration
+ app = create_app(
+     MyEnvironment,
+     MyAction,
+     MyObservation,
+     env_name="my_env",
+     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+ )
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """
+     Entry point for direct execution via uv run or python -m.
+
+     This function enables running the server without Docker:
+         uv run --project . server
+         uv run --project . server --port 8001
+         python -m my_env.server.app
+
+     Args:
+         host: Host address to bind to (default: "0.0.0.0")
+         port: Port number to listen on (default: 8000)
+
+     For production deployments, consider using uvicorn directly with
+     multiple workers:
+         uvicorn my_env.server.app:app --workers 4
+     """
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     import argparse
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--port", type=int, default=8000)
+     args = parser.parse_args()
+     main(port=args.port)
my_env/server/my_env_environment.py ADDED
@@ -0,0 +1,101 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ My Env Environment Implementation.
+
+ A simple test environment that echoes back messages sent to it.
+ Perfect for testing HTTP server infrastructure.
+ """
+
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import State
+
+ from models import MyAction, MyObservation
+
+
+ class MyEnvironment(Environment):
+     """
+     A simple echo environment that echoes back messages.
+
+     This environment is designed for testing the HTTP server infrastructure.
+     It maintains minimal state and simply echoes back whatever message it receives.
+
+     Example:
+         >>> env = MyEnvironment()
+         >>> obs = env.reset()
+         >>> print(obs.echoed_message)  # "My Env environment ready!"
+         >>>
+         >>> obs = env.step(MyAction(message="Hello"))
+         >>> print(obs.echoed_message)  # "Hello"
+         >>> print(obs.message_length)  # 5
+     """
+
+     # Enable concurrent WebSocket sessions.
+     # Set to True if your environment isolates state between instances.
+     # When True, multiple WebSocket clients can connect simultaneously, each
+     # getting their own environment instance (when using factory mode in app.py).
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+
+     def __init__(self):
+         """Initialize the my_env environment."""
+         self._state = State(episode_id=str(uuid4()), step_count=0)
+         self._reset_count = 0
+
+     def reset(self) -> MyObservation:
+         """
+         Reset the environment.
+
+         Returns:
+             MyObservation with a ready message
+         """
+         self._state = State(episode_id=str(uuid4()), step_count=0)
+         self._reset_count += 1
+
+         return MyObservation(
+             echoed_message="My Env environment ready!",
+             message_length=0,
+             done=False,
+             reward=0.0,
+         )
+
+     def step(self, action: MyAction) -> MyObservation:  # type: ignore[override]
+         """
+         Execute a step in the environment by echoing the message.
+
+         Args:
+             action: MyAction containing the message to echo
+
+         Returns:
+             MyObservation with the echoed message and its length
+         """
+         self._state.step_count += 1
+
+         message = action.message
+         length = len(message)
+
+         # Simple reward: longer messages get higher rewards
+         reward = length * 0.1
+
+         return MyObservation(
+             echoed_message=message,
+             message_length=length,
+             done=False,
+             reward=reward,
+             metadata={"original_message": message, "step": self._state.step_count},
+         )
+
+     @property
+     def state(self) -> State:
+         """
+         Get the current environment state.
+
+         Returns:
+             Current State with episode_id and step_count
+         """
+         return self._state
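All mutable state above lives on `self`, so two instances never share data; that is what makes `SUPPORTS_CONCURRENT_SESSIONS = True` safe under factory mode. A stand-in illustration of that isolation (stdlib only; the `Counter` class is hypothetical, not part of the repo):

```python
class Counter:
    """Stand-in with the same per-instance state pattern as MyEnvironment."""

    def __init__(self):
        self.step_count = 0  # per-instance, like self._state

    def step(self) -> int:
        self.step_count += 1
        return self.step_count


# Two "sessions": each gets its own instance, so steps don't interfere
a, b = Counter(), Counter()
a.step()
a.step()
b.step()
assert (a.step_count, b.step_count) == (2, 1)
```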
my_env/server/requirements.txt ADDED
@@ -0,0 +1,6 @@
+ openenv[core]>=0.2.0
+ fastapi>=0.115.0
+ uvicorn>=0.24.0
+
+
+
pyproject.toml CHANGED
@@ -21,9 +21,6 @@ dependencies = [
  ]

  [project.optional-dependencies]
- train = [
-     "gymnasium>=0.29.0",
- ]
  bio = [
      "biopython>=1.84",
      "gseapy>=1.1.3",
@@ -32,7 +29,16 @@ bio = [
  dev = [
      "pytest>=8.0.0",
      "pytest-cov>=4.0.0",
-     "gymnasium>=0.29.0",
+ ]
+ train = [
+     "accelerate>=1.13.0",
+     "bitsandbytes>=0.45.0",
+     "datasets>=4.6.1",
+     "matplotlib>=3.10.8",
+     "peft>=0.15.0",
+     "torch>=2.10.0",
+     "transformers>=5.3.0",
+     "trl>=0.29.0",
  ]

  [project.scripts]
run_agent.py CHANGED
@@ -1,292 +1,978 @@
- """Run the bio-experiment environment with Qwen3.5-2B as the planning agent."""
-
- from __future__ import annotations
-
- import json
- import re
- import sys
- import time
- from typing import Any, Dict, List, Optional
-
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
-
- from models import ActionType, ExperimentAction, ExperimentObservation
- from server.hackathon_environment import BioExperimentEnvironment
-
- MODEL_ID = "Qwen/Qwen3.5-0.8B"
- MAX_EPISODE_STEPS = 12
- PIPELINE_TASK = "image-text-to-text"
- USE_PIPELINE = True
-
- ACTION_TYPES = [a.value for a in ActionType]
-
- SYSTEM_PROMPT = """\
- You are an expert biologist planning a single-cell experiment pipeline.
-
- At each turn you see the experiment state and must pick the next step.
-
- Action types (in typical order):
- collect_sample, prepare_library, sequence_cells, run_qc, filter_data,
- normalize_data, cluster_cells, differential_expression,
- pathway_enrichment, marker_selection, validate_marker, synthesize_conclusion
-
- Other actions: select_cohort, culture_cells, perturb_gene, perturb_compound,
- integrate_batches, trajectory_analysis, regulatory_network_inference,
- design_followup_experiment, request_subagent_review
-
- Respond with ONLY valid JSON, nothing else:
- {"action_type": "...", "method": null, "parameters": {}, "justification": "...", "confidence": 0.8}
- """
-
-
- def format_observation(obs: ExperimentObservation) -> str:
-     parts = [
-         f"TASK: {obs.task.problem_statement}",
-         f"Organism: {obs.task.organism} | Tissue: {obs.task.tissue}",
-         f"Conditions: {', '.join(obs.task.conditions) or 'N/A'}",
-         f"Step: {obs.step_index} | Budget: ${obs.resource_usage.budget_remaining:,.0f} | Time: {obs.resource_usage.time_remaining_days:.0f}d",
-     ]
-     if obs.pipeline_history:
-         last5 = obs.pipeline_history[-5:]
-         parts.append("History:")
-         for h in last5:
-             tag = "OK" if h.success else "FAIL"
-             parts.append(f"  [{tag}] {h.action_type.value}: {h.output_summary[:80]}")
-     if obs.rule_violations:
-         parts.append(f"VIOLATIONS: {obs.rule_violations}")
-     if obs.discovered_markers:
-         parts.append(f"Markers: {obs.discovered_markers[:5]}")
-     return "\n".join(parts)
-
-
- def parse_action(text: str) -> Optional[ExperimentAction]:
-     match = re.search(r"\{[^{}]*\}", text, re.DOTALL)
-     if not match:
-         return None
-     try:
-         d = json.loads(match.group())
-     except json.JSONDecodeError:
-         return None
-
-     action_type = d.get("action_type")
-     if action_type not in ACTION_TYPES:
-         return None
-
-     return ExperimentAction(
-         action_type=ActionType(action_type),
-         method=d.get("method"),
-         parameters=d.get("parameters") or {},
-         justification=d.get("justification"),
-         confidence=min(1.0, max(0.0, float(d.get("confidence", 0.5)))),
-     )
-
-
- FALLBACK_SEQUENCE = [
-     ActionType.COLLECT_SAMPLE,
-     ActionType.PREPARE_LIBRARY,
-     ActionType.SEQUENCE_CELLS,
-     ActionType.RUN_QC,
-     ActionType.FILTER_DATA,
-     ActionType.NORMALIZE_DATA,
-     ActionType.CLUSTER_CELLS,
-     ActionType.DIFFERENTIAL_EXPRESSION,
-     ActionType.PATHWAY_ENRICHMENT,
-     ActionType.MARKER_SELECTION,
-     ActionType.SYNTHESIZE_CONCLUSION,
- ]
-
-
- def fallback_action(step: int) -> ExperimentAction:
-     idx = min(step, len(FALLBACK_SEQUENCE) - 1)
-     return ExperimentAction(
-         action_type=FALLBACK_SEQUENCE[idx],
-         justification="fallback",
-         confidence=0.3,
-     )
-
-
- def log(msg: str) -> None:
-     print(msg, flush=True)
-
-
- def build_observation_prompt(obs: ExperimentObservation) -> str:
-     return format_observation(obs)
-
-
- def run_with_pipeline(pipe, prompt: str) -> str:
-     attempts = [
-         {"text": prompt},
-         {"text": prompt, "image": None},
-         {"image": prompt},
-     ]
-
-     for payload in attempts:
-         try:
-             result = pipe(payload, max_new_tokens=220)
-             if isinstance(result, list) and result:
-                 result = result[0]
-             if isinstance(result, dict):
-                 text = result.get("generated_text") or result.get("text") or result.get("answer")
-             elif isinstance(result, str):
-                 text = result
-             else:
-                 text = ""
-             if isinstance(text, str) and text.strip():
-                 return text.strip()
-         except Exception:
-             continue
-
-     return ""
-
-
- def main():
-     tokenizer = None
-     model = None
-     eos_ids: List[int] = []
-     active_pipeline = None
-
-     if USE_PIPELINE:
-         log(f"Loading pipeline ({PIPELINE_TASK}) for {MODEL_ID} ...")
-         try:
-             active_pipeline = pipeline(
-                 PIPELINE_TASK,
-                 model=MODEL_ID,
-                 trust_remote_code=True,
-                 torch_dtype=torch.bfloat16,
-             )
-             log("Pipeline loaded.")
-         except Exception as exc:
-             log(f"Pipeline load failed ({exc}), falling back to tokenizer+model.")
-
-     if active_pipeline is None:
-         log(f"Loading tokenizer for {MODEL_ID} ...")
-         tokenizer = AutoTokenizer.from_pretrained(
-             MODEL_ID, trust_remote_code=True,
-         )
-         log("Tokenizer loaded. Loading model (this downloads ~4 GB on first run) ...")
-
-         model = AutoModelForCausalLM.from_pretrained(
-             MODEL_ID,
-             torch_dtype=torch.bfloat16,
-             device_map="auto",
-             trust_remote_code=True,
-         )
-         log(f"Model loaded. Device: {model.device}")
-
-         if tokenizer.eos_token_id is not None:
-             eos_ids.append(tokenizer.eos_token_id)
-         extra = tokenizer.convert_tokens_to_ids(["<|im_end|>", "<|endoftext|>"])
-         for tid in extra:
-             if isinstance(tid, int) and tid not in eos_ids:
-                 eos_ids.append(tid)
-         log(f"EOS token ids: {eos_ids}")
-
-     env = BioExperimentEnvironment()
-     obs = env.reset()
-
-     log("\n" + "=" * 70)
-     log(f"TASK: {obs.task.problem_statement}")
-     log(f"Conditions: {obs.task.conditions}")
-     log(f"Budget: ${obs.task.budget_limit:,.0f} | Time: {obs.task.time_limit_days:.0f} days")
-     log("=" * 70)
-
-     cumulative_reward = 0.0
-
-     for step in range(MAX_EPISODE_STEPS):
-         user_msg = build_observation_prompt(obs)
-
-         messages = [
-             {"role": "system", "content": SYSTEM_PROMPT},
-             {"role": "user", "content": user_msg},
-         ]
-
-         if tokenizer is None:
-             # Pipeline path usually ignores chat templates.
-             prompt = f"{SYSTEM_PROMPT}\n\n{user_msg}"
-         else:
-             try:
-                 prompt = tokenizer.apply_chat_template(
-                     messages,
-                     tokenize=False,
-                     add_generation_prompt=True,
-                     enable_thinking=False,
-                 )
-             except TypeError:
-                 prompt = tokenizer.apply_chat_template(
-                     messages,
-                     tokenize=False,
-                     add_generation_prompt=True,
-                 )
-
-         t0 = time.time()
-         if active_pipeline is not None:
-             response = run_with_pipeline(active_pipeline, prompt)
-             if not response:
-                 response = format_observation(obs)
-         else:
-             assert tokenizer is not None and model is not None
-             inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-             n_input = inputs["input_ids"].shape[1]
-             with torch.no_grad():
-                 output_ids = model.generate(
-                     **inputs,
-                     max_new_tokens=200,
-                     do_sample=True,
-                     temperature=0.7,
-                     top_p=0.8,
-                     top_k=20,
-                     repetition_penalty=1.3,
-                     eos_token_id=eos_ids if eos_ids else None,
-                 )
-             new_tokens = output_ids[0][n_input:]
-             response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
-         gen_time = time.time() - t0
-
-         action = parse_action(response)
-         used_fallback = False
-         if action is None:
-             log(f"\n  [!] Parse failed, using fallback. Raw: {response[:150]}")
-             action = fallback_action(step)
-             used_fallback = True
-
-         tag = " [FALLBACK]" if used_fallback else ""
-         log(f"\nStep {step + 1}: {action.action_type.value}{tag} ({gen_time:.1f}s)")
-         if action.justification:
-             log(f"  Rationale: {action.justification}")
-
-         obs = env.step(action)
-
-         if obs.latest_output:
-             lo = obs.latest_output
-             status = "OK" if lo.success else "FAIL"
-             log(f"  [{status}] {lo.summary}")
-             if lo.warnings:
-                 log(f"  Warnings: {lo.warnings}")
-
-         step_reward = obs.reward
-         cumulative_reward += step_reward
-         log(f"  Reward: {step_reward:+.3f} (cum: {cumulative_reward:+.3f})")
-         log(f"  Budget: ${obs.resource_usage.budget_remaining:,.0f} | Time: {obs.resource_usage.time_remaining_days:.0f}d")
-
-         if obs.rule_violations:
-             log(f"  Violations: {obs.rule_violations}")
-
-         if obs.done:
-             break
-
-     log(f"\n{'=' * 70}")
-     log("EPISODE COMPLETE" if obs.done else f"MAX STEPS ({MAX_EPISODE_STEPS})")
-     log(f"  Steps: {obs.step_index}")
-     log(f"  Total reward: {cumulative_reward:+.3f}")
-     log(f"  Budget used: ${obs.resource_usage.budget_used:,.0f}")
-     log(f"  Time used: {obs.resource_usage.time_used_days:.0f} days")
-     if obs.conclusions:
-         log("  Conclusions:")
-         for c in obs.conclusions:
-             log(f"    [{c.claim_type}, conf={c.confidence:.2f}] {c.claim}")
-     log("=" * 70)
-
-
- if __name__ == "__main__":
-     main()
+ """Run the bio-experiment environment with Qwen3.5-2B as the planning agent."""
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import re
+ import time
+ from pathlib import Path
+ from typing import Any, Dict, List, Optional
+
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+
+ from models import (
+     ActionType,
+     ExperimentAction,
+     ExperimentObservation,
+     build_agent_observation_context,
+     build_agent_system_prompt,
+ )
+ from server.hackathon_environment import BioExperimentEnvironment
+
+ DASHBOARD_STATE_PATH = Path(__file__).parent / "_dashboard_state.json"
+ DASHBOARD_CMD_PATH = Path(__file__).parent / "_dashboard_cmd.json"
+
+ USE_PIPELINE = os.getenv("RUN_AGENT_USE_PIPELINE", "0").strip().lower() not in {"0", "false", "off"}
+
+
+ def _parse_thinking_flag() -> bool:
+     import sys
+
+     if "--no-thinking" in sys.argv:
+         return False
+     if "--thinking" in sys.argv:
+         return True
+     return os.getenv("RUN_AGENT_ENABLE_THINKING", "1").strip().lower() not in {"0", "false", "off"}
+
+
+ ENABLE_THINKING = _parse_thinking_flag()
+
+ MODEL_ID = "Qwen/Qwen3.5-2B"
+ MAX_EPISODE_STEPS = int(os.getenv("RUN_AGENT_MAX_EPISODE_STEPS", "20"))
+ PIPELINE_TASK = "text-generation"
+
+ ACTION_TYPES = [a.value for a in ActionType]
+ ACTION_TYPE_ALIASES = {
+     "collect_samples": ActionType.COLLECT_SAMPLE.value,
+     "collect_sample_from_bone_marrow": ActionType.COLLECT_SAMPLE.value,
+     "collect_samples_from_bone_marrow": ActionType.COLLECT_SAMPLE.value,
+     "prepare_sc_library": ActionType.PREPARE_LIBRARY.value,
+     "sequence_single_cells": ActionType.SEQUENCE_CELLS.value,
+     "qc": ActionType.RUN_QC.value,
+     "run_quality_control": ActionType.RUN_QC.value,
+     "cluster": ActionType.CLUSTER_CELLS.value,
+     "de_analysis": ActionType.DIFFERENTIAL_EXPRESSION.value,
+     "differential_expression_analysis": ActionType.DIFFERENTIAL_EXPRESSION.value,
+     "trajectory_inference": ActionType.TRAJECTORY_ANALYSIS.value,
+     "infer_trajectory": ActionType.TRAJECTORY_ANALYSIS.value,
+     "network_inference": ActionType.REGULATORY_NETWORK_INFERENCE.value,
+     "select_markers": ActionType.MARKER_SELECTION.value,
+     "final_conclusion": ActionType.SYNTHESIZE_CONCLUSION.value,
+ }
+
+ SYSTEM_PROMPT = build_agent_system_prompt()
+
+ STANDARD_PIPELINE_ORDER = [
+     ActionType.COLLECT_SAMPLE,
+     ActionType.SELECT_COHORT,
+     ActionType.PREPARE_LIBRARY,
+     ActionType.SEQUENCE_CELLS,
+     ActionType.RUN_QC,
+     ActionType.FILTER_DATA,
+     ActionType.NORMALIZE_DATA,
+     ActionType.INTEGRATE_BATCHES,
+     ActionType.CLUSTER_CELLS,
+     ActionType.DIFFERENTIAL_EXPRESSION,
+     ActionType.PATHWAY_ENRICHMENT,
+     ActionType.MARKER_SELECTION,
+     ActionType.TRAJECTORY_ANALYSIS,
+     ActionType.REGULATORY_NETWORK_INFERENCE,
+     ActionType.SYNTHESIZE_CONCLUSION,
+ ]
+
+ MODEL_RESPONSE_PREVIEW_CHARS = int(
+     os.getenv("RUN_AGENT_MODEL_RESPONSE_PREVIEW_CHARS", "240")
+ )
+
+
+ def compact_preview(value: Any, max_chars: int = 160) -> str:
+     try:
+         text = json.dumps(value, ensure_ascii=True, sort_keys=True)
+     except TypeError:
+         text = str(value)
+     text = re.sub(r"\s+", " ", text).strip()
+     if len(text) <= max_chars:
+         return text
+     return text[: max_chars - 3] + "..."
+
+
+ def format_observation(obs: ExperimentObservation) -> str:
+     parts = [
+         f"TASK: {obs.task.problem_statement}",
+         f"Organism: {obs.task.organism} | Tissue: {obs.task.tissue}",
+         f"Conditions: {', '.join(obs.task.conditions) or 'N/A'}",
+         f"Step: {obs.step_index} | Budget: ${obs.resource_usage.budget_remaining:,.0f} | Time: {obs.resource_usage.time_remaining_days:.0f}d",
+     ]
+     context = build_agent_observation_context(obs, max_tools=5, max_assays=2)
+     if context:
+         parts.append(context)
+     if obs.pipeline_history:
+         last5 = obs.pipeline_history[-5:]
+         parts.append("Recent history:")
+         for h in last5:
+             tag = "OK" if h.success else "FAIL"
+             line = f"  [{tag}] {h.action_type.value}"
+             if h.method:
+                 line += f" ({h.method})"
+             line += f": {h.output_summary[:80]}"
+             parts.append(line)
+
+     completed = {h.action_type for h in obs.pipeline_history if h.success}
+     if completed:
+         parts.append(f"Completed steps (do NOT repeat): {', '.join(sorted(a.value for a in completed))}")
+     remaining = [a.value for a in STANDARD_PIPELINE_ORDER if a not in completed]
+     if remaining:
+         parts.append(f"Remaining steps (choose one): {', '.join(remaining)}")
+
+     if obs.latest_output and obs.latest_output.data:
+         parts.append(
+             f"Latest data: {compact_preview(obs.latest_output.data, 200)}"
+         )
+     if obs.rule_violations:
+         parts.append(f"VIOLATIONS: {obs.rule_violations}")
+     if obs.discovered_markers:
+         parts.append(f"Markers found so far: {obs.discovered_markers[:5]}")
+
+     parts.append(
+         'Output ONLY a single JSON object with these exact keys, no comments, no extra text:\n'
+         '{"action_type": "<one of the remaining steps>", "method": null, "parameters": {}, "justification": "<why>", "confidence": 0.8}'
+     )
+     return "\n".join(parts)
+
+
+ def _repair_truncated_json(text: str) -> Optional[str]:
+     """Try to repair JSON truncated mid-value (common with small LLMs)."""
+     s = text.strip()
+     if not s.startswith("{"):
+         return None
147
+
148
+ # Drop dangling partial keys or empty key/value stubs at the tail.
149
+ s = re.sub(r',\s*"[^"\n]*$', '', s)
150
+ s = re.sub(r',\s*"[^"\n]*"\s*:\s*$', '', s)
151
+
152
+ in_string = False
153
+ escape = False
154
+ for ch in s:
155
+ if escape:
156
+ escape = False
157
+ continue
158
+ if ch == "\\":
159
+ escape = True
160
+ continue
161
+ if ch == '"':
162
+ in_string = not in_string
163
+
164
+ if in_string:
165
+ s += '"'
166
+
167
+ open_braces = s.count("{") - s.count("}")
168
+ open_brackets = s.count("[") - s.count("]")
169
+ s += "]" * max(0, open_brackets)
170
+ s += "}" * max(0, open_braces)
171
+
172
+ try:
173
+ obj = json.loads(s)
174
+ if isinstance(obj, dict):
175
+ return s
176
+ except json.JSONDecodeError:
177
+ pass
178
+
179
+ s = re.sub(r',\s*([}\]])', r'\1', s)
180
+ try:
181
+ obj = json.loads(s)
182
+ if isinstance(obj, dict):
183
+ return s
184
+ except json.JSONDecodeError:
185
+ pass
186
+ return None
187
+
188
+
189
+ def _normalize_jsonish_text(text: str) -> str:
190
+ """Normalize common near-JSON artifacts emitted by small local models."""
191
+ text = _strip_js_comments(text)
192
+ text = re.sub(r'(?<=:\s)\bNone\b', 'null', text)
193
+ text = re.sub(r'(?<=:\s)\bTrue\b', 'true', text)
194
+ text = re.sub(r'(?<=:\s)\bFalse\b', 'false', text)
195
+ text = re.sub(r'"([^"\n]+?):"\s*,', r'"\1": "",', text)
196
+ return text
197
+
198
+
199
+ def _strip_js_comments(text: str) -> str:
200
+ """Remove // and /* */ comments that small LLMs inject into JSON."""
201
+ text = re.sub(r'//[^\n]*', '', text)
202
+ text = re.sub(r'/\*.*?\*/', '', text, flags=re.DOTALL)
203
+ return text
204
+
205
+
206
+ def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
207
+ stripped = _normalize_jsonish_text(text).strip()
208
+ fence_prefix = "```"
209
+ if stripped.startswith(fence_prefix) and stripped.endswith(fence_prefix):
210
+ lines = stripped.splitlines()
211
+ if len(lines) >= 3:
212
+ stripped = "\n".join(lines[1:-1]).strip()
213
+
214
+ candidates: List[str] = [stripped]
215
+ start = stripped.find("{")
216
+ while start != -1:
217
+ depth = 0
218
+ for idx in range(start, len(stripped)):
219
+ char = stripped[idx]
220
+ if char == "{":
221
+ depth += 1
222
+ elif char == "}":
223
+ depth -= 1
224
+ if depth == 0:
225
+ candidates.append(stripped[start:idx + 1])
226
+ break
227
+ start = stripped.find("{", start + 1)
228
+
229
+ first_brace = stripped.find("{")
230
+ if first_brace != -1:
231
+ repaired = _repair_truncated_json(stripped[first_brace:])
232
+ if repaired is not None:
233
+ candidates.append(repaired)
234
+
235
+ candidates.sort(key=len, reverse=True)
236
+
237
+ for candidate in candidates:
238
+ try:
239
+ parsed = json.loads(candidate)
240
+ except json.JSONDecodeError:
241
+ continue
242
+ if isinstance(parsed, dict):
243
+ return parsed
244
+
245
+ return None
246
+
247
+
248
+ def _edit_distance(a: str, b: str) -> int:
249
+ if len(a) < len(b):
250
+ return _edit_distance(b, a)
251
+ if not b:
252
+ return len(a)
253
+ prev = list(range(len(b) + 1))
254
+ for i, ca in enumerate(a):
255
+ curr = [i + 1]
256
+ for j, cb in enumerate(b):
257
+ curr.append(min(prev[j + 1] + 1, curr[j] + 1, prev[j] + (ca != cb)))
258
+ prev = curr
259
+ return prev[-1]
260
+
261
+
262
+ def get_payload_value(payload: Dict[str, Any], *names: str) -> Any:
263
+ for name in names:
264
+ if name in payload:
265
+ return payload[name]
266
+
267
+ lowered = {
268
+ str(key).lower(): value
269
+ for key, value in payload.items()
270
+ }
271
+ for name in names:
272
+ if name.lower() in lowered:
273
+ return lowered[name.lower()]
274
+
275
+ for key, value in lowered.items():
276
+ for name in names:
277
+ threshold = max(2, len(name) // 3)
278
+ if _edit_distance(key, name.lower()) <= threshold:
279
+ return value
280
+ return None
281
+
282
+
283
+ def normalize_optional_string(value: Any) -> Optional[str]:
284
+ if value is None or isinstance(value, bool):
285
+ return None
286
+ if isinstance(value, str):
287
+ value = value.strip()
288
+ return value or None
289
+ if isinstance(value, (int, float)):
290
+ return str(value)
291
+ return compact_preview(value, 80)
292
+
293
+
294
+ def normalize_action_type(raw_action_type: Any) -> Optional[str]:
295
+ if not isinstance(raw_action_type, str):
296
+ return None
297
+
298
+ candidate = raw_action_type.strip().lower()
299
+ if candidate in ACTION_TYPES:
300
+ return candidate
301
+ if candidate in ACTION_TYPE_ALIASES:
302
+ return ACTION_TYPE_ALIASES[candidate]
303
+
304
+ candidate = re.sub(r"[^a-z0-9]+", "_", candidate).strip("_")
305
+ if candidate in ACTION_TYPES:
306
+ return candidate
307
+ if candidate in ACTION_TYPE_ALIASES:
308
+ return ACTION_TYPE_ALIASES[candidate]
309
+
310
+ heuristics = [
311
+ (("collect", "sample"), ActionType.COLLECT_SAMPLE.value),
312
+ (("library",), ActionType.PREPARE_LIBRARY.value),
313
+ (("sequence",), ActionType.SEQUENCE_CELLS.value),
314
+ (("qc",), ActionType.RUN_QC.value),
315
+ (("quality", "control"), ActionType.RUN_QC.value),
316
+ (("filter",), ActionType.FILTER_DATA.value),
317
+ (("normal",), ActionType.NORMALIZE_DATA.value),
318
+ (("integrat", "batch"), ActionType.INTEGRATE_BATCHES.value),
319
+ (("cluster",), ActionType.CLUSTER_CELLS.value),
320
+ (("differential", "expression"), ActionType.DIFFERENTIAL_EXPRESSION.value),
321
+ (("pathway",), ActionType.PATHWAY_ENRICHMENT.value),
322
+ (("trajectory",), ActionType.TRAJECTORY_ANALYSIS.value),
323
+ (("network",), ActionType.REGULATORY_NETWORK_INFERENCE.value),
324
+ (("marker",), ActionType.MARKER_SELECTION.value),
325
+ (("validat", "marker"), ActionType.VALIDATE_MARKER.value),
326
+ (("followup",), ActionType.DESIGN_FOLLOWUP.value),
327
+ (("review",), ActionType.REQUEST_SUBAGENT_REVIEW.value),
328
+ (("conclusion",), ActionType.SYNTHESIZE_CONCLUSION.value),
329
+ ]
330
+ for fragments, normalized in heuristics:
331
+ if all(fragment in candidate for fragment in fragments):
332
+ return normalized
333
+ return None
334
+
335
+
336
+ def should_block_failed_reattempt(
337
+ history: List[Any], action_type: ActionType
338
+ ) -> bool:
339
+ last_failed_idx = None
340
+ last_success_idx = None
341
+
342
+ for idx, record in enumerate(history):
343
+ if record.action_type != action_type:
344
+ continue
345
+ if record.success:
346
+ last_success_idx = idx
347
+ else:
348
+ last_failed_idx = idx
349
+
350
+ if last_failed_idx is None:
351
+ return False
352
+
353
+ # Allow retry after the same action has already succeeded once, or after the
354
+ # pipeline made progress with a different successful step since the failure.
355
+ if last_success_idx is not None and last_success_idx > last_failed_idx:
356
+ return False
357
+ for record in history[last_failed_idx + 1:]:
358
+ if record.success and record.action_type != action_type:
359
+ return False
360
+ return True
361
+
362
+
363
+ def parse_action(text: str) -> Optional[ExperimentAction]:
364
+ d = extract_json_object(text)
365
+ if d is not None:
366
+ action_type = normalize_action_type(get_payload_value(d, "action_type"))
367
+ if action_type is None:
368
+ return None
369
+
370
+ parameters = get_payload_value(d, "parameters", "params") or {}
371
+ if not isinstance(parameters, dict):
372
+ parameters = {}
373
+
374
+ confidence = get_payload_value(d, "confidence")
375
+ if confidence is None:
376
+ confidence = 0.5
377
+ try:
378
+ confidence = float(confidence)
379
+ except (TypeError, ValueError):
380
+ confidence = 0.5
381
+
382
+ justification = get_payload_value(
383
+ d, "justification", "reasoning", "rationale", "reason"
384
+ )
385
+ if justification is not None and not isinstance(justification, str):
386
+ justification = compact_preview(justification, 200)
387
+ method = normalize_optional_string(get_payload_value(d, "method"))
388
+
389
+ return ExperimentAction(
390
+ action_type=ActionType(action_type),
391
+ method=method,
392
+ parameters=parameters,
393
+ justification=justification,
394
+ confidence=min(1.0, max(0.0, confidence)),
395
+ )
396
+
397
+ action_match = re.search(
398
+ r'["\']action_type["\']\s*:\s*["\']([^"\']+)',
399
+ text,
400
+ re.IGNORECASE,
401
+ )
402
+ if not action_match:
403
+ return None
404
+
405
+ action_type = normalize_action_type(action_match.group(1))
406
+ if action_type is None:
407
+ return None
408
+
409
+ method_match = re.search(
410
+ r'["\']method["\']\s*:\s*("((?:[^"\\]|\\.)*)"|null|none|true|false|-?\d+(?:\.\d+)?)',
411
+ text,
412
+ re.IGNORECASE,
413
+ )
414
+ confidence_match = re.search(
415
+ r'["\']confidence["\']\s*:\s*([0-9]*\.?[0-9]+)',
416
+ text,
417
+ re.IGNORECASE,
418
+ )
419
+ justification_match = re.search(
420
+ r'["\'](?:justif\w*|reasoning|rationale|reason)["\']\s*:\s*"((?:[^"\\]|\\.)*)',
421
+ text,
422
+ re.DOTALL | re.IGNORECASE,
423
+ )
424
+
425
+ confidence = 0.5
426
+ if confidence_match:
427
+ try:
428
+ confidence = float(confidence_match.group(1))
429
+ except ValueError:
430
+ confidence = 0.5
431
+
432
+ justification = None
433
+ if justification_match:
434
+ try:
435
+ justification = json.loads(f'"{justification_match.group(1)}"')
436
+ except json.JSONDecodeError:
437
+ justification = justification_match.group(1)
438
+
439
+ method = None
440
+ if method_match:
441
+ raw_method = method_match.group(1)
442
+ if raw_method.startswith('"') and raw_method.endswith('"'):
443
+ try:
444
+ method = json.loads(raw_method)
445
+ except json.JSONDecodeError:
446
+ method = raw_method.strip('"')
447
+ elif raw_method.lower() not in {"null", "none", "true", "false"}:
448
+ method = raw_method
449
+ method = normalize_optional_string(method)
450
+
451
+ return ExperimentAction(
452
+ action_type=ActionType(action_type),
453
+ method=method,
454
+ parameters={},
455
+ justification=justification,
456
+ confidence=min(1.0, max(0.0, confidence)),
457
+ )
458
+
459
+
460
+ def should_force_terminal_conclusion(
461
+ action: ExperimentAction,
462
+ completed_types: set[ActionType],
463
+ ) -> bool:
464
+ meta_repeatables = {
465
+ ActionType.DESIGN_FOLLOWUP,
466
+ ActionType.REQUEST_SUBAGENT_REVIEW,
467
+ }
468
+ return (
469
+ action.action_type in meta_repeatables
470
+ and action.action_type in completed_types
471
+ and ActionType.SYNTHESIZE_CONCLUSION not in completed_types
472
+ )
473
+
474
+
475
+
476
+ def write_dashboard_state(
477
+ env: BioExperimentEnvironment,
478
+ obs: ExperimentObservation,
479
+ *,
480
+ step: int,
481
+ cumulative_reward: float,
482
+ model_response: str = "",
483
+ model_thinking: str = "",
484
+ action: Optional[ExperimentAction] = None,
485
+ gen_time: float = 0.0,
486
+ episode_done: bool = False,
487
+ ) -> None:
488
+ """Serialise the full world state (observable + latent) for the dashboard."""
489
+ latent = env._latent
490
+ snapshot: Dict[str, Any] = {
491
+ "timestamp": time.time(),
492
+ "step": step,
493
+ "episode_done": episode_done,
494
+ "cumulative_reward": cumulative_reward,
495
+ "gen_time_s": round(gen_time, 2),
496
+ "model_response_raw": model_response[:600],
497
+ "model_thinking": model_thinking[:800],
498
+ "thinking_enabled": ENABLE_THINKING,
499
+ }
500
+
501
+ snapshot["task"] = {
502
+ "problem_statement": obs.task.problem_statement,
503
+ "organism": obs.task.organism,
504
+ "tissue": obs.task.tissue,
505
+ "modality": obs.task.modality,
506
+ "conditions": obs.task.conditions,
507
+ "budget_limit": obs.task.budget_limit,
508
+ "time_limit_days": obs.task.time_limit_days,
509
+ }
510
+
511
+ snapshot["resources"] = {
512
+ "budget_used": round(obs.resource_usage.budget_used, 2),
513
+ "budget_remaining": round(obs.resource_usage.budget_remaining, 2),
514
+ "time_used_days": round(obs.resource_usage.time_used_days, 1),
515
+ "time_remaining_days": round(obs.resource_usage.time_remaining_days, 1),
516
+ "samples_consumed": obs.resource_usage.samples_consumed,
517
+ "compute_hours_used": round(obs.resource_usage.compute_hours_used, 2),
518
+ }
519
+
520
+ snapshot["pipeline_history"] = [
521
+ {
522
+ "step_index": h.step_index,
523
+ "action_type": h.action_type.value,
524
+ "method": h.method,
525
+ "output_summary": h.output_summary[:120],
526
+ "success": h.success,
527
+ "quality_score": round(h.quality_score, 3),
528
+ "resource_cost": round(h.resource_cost, 2),
529
+ "time_cost_days": round(h.time_cost_days, 1),
530
+ }
531
+ for h in obs.pipeline_history
532
+ ]
533
+
534
+ if action:
535
+ snapshot["current_action"] = {
536
+ "action_type": action.action_type.value,
537
+ "method": action.method,
538
+ "parameters": action.parameters,
539
+ "justification": action.justification,
540
+ "confidence": action.confidence,
541
+ }
542
+
543
+ if obs.latest_output:
544
+ lo = obs.latest_output
545
+ snapshot["latest_output"] = {
546
+ "summary": lo.summary,
547
+ "success": lo.success,
548
+ "quality_score": round(lo.quality_score, 3),
549
+ "uncertainty": round(lo.uncertainty, 3),
550
+ "warnings": lo.warnings,
551
+ "data_preview": compact_preview(lo.data, 300) if lo.data else None,
552
+ }
553
+
554
+ snapshot["discovered_markers"] = obs.discovered_markers[:20]
555
+ snapshot["candidate_mechanisms"] = obs.candidate_mechanisms[:20]
556
+ snapshot["rule_violations"] = obs.rule_violations
557
+ snapshot["uncertainty_summary"] = {
558
+ k: round(v, 3) for k, v in obs.uncertainty_summary.items()
559
+ }
560
+ snapshot["reward_breakdown"] = {
561
+ k: round(v, 4) for k, v in obs.step_reward_breakdown.items()
562
+ }
563
+
564
+ if obs.conclusions:
565
+ snapshot["conclusions"] = [
566
+ {
567
+ "claim": c.claim,
568
+ "claim_type": c.claim_type,
569
+ "confidence": c.confidence,
570
+ "top_markers": c.top_markers,
571
+ "causal_mechanisms": c.causal_mechanisms,
572
+ "predicted_pathways": c.predicted_pathways,
573
+ }
574
+ for c in obs.conclusions
575
+ ]
576
+
577
+ if latent:
578
+ bio = latent.biology
579
+ snapshot["latent"] = {
580
+ "cell_populations": [
581
+ {
582
+ "name": cp.name,
583
+ "proportion": round(cp.proportion, 3),
584
+ "marker_genes": cp.marker_genes[:8],
585
+ "state": cp.state,
586
+ }
587
+ for cp in bio.cell_populations
588
+ ],
589
+ "true_markers": bio.true_markers,
590
+ "causal_mechanisms": bio.causal_mechanisms,
591
+ "true_pathways": {
592
+ k: round(v, 3) for k, v in list(bio.true_pathways.items())[:15]
593
+ },
594
+ "true_de_genes_count": sum(
595
+ len(genes) for genes in bio.true_de_genes.values()
596
+ ),
597
+ "true_regulatory_network_size": sum(
598
+ len(targets) for targets in bio.true_regulatory_network.values()
599
+ ),
600
+ "confounders": bio.confounders,
601
+ "n_true_cells": bio.n_true_cells,
602
+ "technical": {
603
+ "ambient_rna_fraction": latent.technical.ambient_rna_fraction,
604
+ "doublet_rate": latent.technical.doublet_rate,
605
+ "dropout_rate": latent.technical.dropout_rate,
606
+ "sample_quality": latent.technical.sample_quality,
607
+ "library_complexity": latent.technical.library_complexity,
608
+ "capture_efficiency": latent.technical.capture_efficiency,
609
+ },
610
+ "progress": latent.progress.model_dump(),
611
+ "hidden_failure_conditions": latent.hidden_failure_conditions,
612
+ }
613
+
614
+ try:
615
+ DASHBOARD_STATE_PATH.write_text(
616
+ json.dumps(snapshot, indent=2, default=str), encoding="utf-8"
617
+ )
618
+ except Exception:
619
+ pass
620
+
621
+
622
+ def log(msg: str) -> None:
623
+ print(msg, flush=True)
624
+
625
+
626
+ def build_observation_prompt(obs: ExperimentObservation) -> str:
627
+ return format_observation(obs)
628
+
629
+
630
+ def run_with_pipeline(pipe, prompt: str) -> str:
631
+ try:
632
+ _pipe_max = 2048 if ENABLE_THINKING else 300
633
+ result = pipe(prompt, max_new_tokens=_pipe_max, return_full_text=False)
634
+ except Exception:
635
+ return ""
636
+
637
+ if isinstance(result, list) and result:
638
+ result = result[0]
639
+ if isinstance(result, dict):
640
+ text = result.get("generated_text") or result.get("text") or result.get("answer")
641
+ elif isinstance(result, str):
642
+ text = result
643
+ else:
644
+ text = ""
645
+ return text.strip() if isinstance(text, str) else ""
646
+
647
+
648
+ def resolve_torch_runtime() -> Dict[str, Any]:
649
+ use_cuda = torch.cuda.is_available()
650
+ bf16 = bool(getattr(torch.cuda, "is_bf16_supported", lambda: False)()) if use_cuda else False
651
+ dtype = torch.bfloat16 if bf16 else (
652
+ torch.float16 if use_cuda else torch.float32
653
+ )
654
+ return {
655
+ "use_cuda": use_cuda,
656
+ "device": "cuda:0" if use_cuda else "cpu",
657
+ "dtype": dtype,
658
+ "device_map": "auto" if use_cuda else None,
659
+ "device_name": torch.cuda.get_device_name(0) if use_cuda else "cpu",
660
+ }
661
+
662
+
663
+ def main():
664
+ tokenizer = None
665
+ model = None
666
+ eos_ids: List[int] = []
667
+ active_pipeline = None
668
+
669
+ runtime = resolve_torch_runtime()
670
+ log(
671
+ f"Using local model runtime: device={runtime['device']} "
672
+ f"name={runtime['device_name']} dtype={runtime['dtype']}"
673
+ )
674
+
675
+ if USE_PIPELINE:
676
+ log(f"Loading pipeline ({PIPELINE_TASK}) for {MODEL_ID} ...")
677
+ try:
678
+ active_pipeline = pipeline(
679
+ PIPELINE_TASK,
680
+ model=MODEL_ID,
681
+ trust_remote_code=True,
682
+ dtype=runtime["dtype"],
683
+ device=0 if runtime["use_cuda"] else -1,
684
+ )
685
+ log("Pipeline loaded.")
686
+ except Exception as exc:
687
+ log(f"Pipeline load failed ({exc}), falling back to tokenizer+model.")
688
+
689
+ if active_pipeline is None:
690
+ log(f"Loading tokenizer for {MODEL_ID} ...")
691
+ tokenizer = AutoTokenizer.from_pretrained(
692
+ MODEL_ID, trust_remote_code=True,
693
+ )
694
+ log("Tokenizer loaded. Loading model (this may download files on first run) ...")
695
+
696
+ model = AutoModelForCausalLM.from_pretrained(
697
+ MODEL_ID,
698
+ dtype=runtime["dtype"],
699
+ device_map=runtime["device_map"],
700
+ trust_remote_code=True,
701
+ )
702
+ log(f"Model loaded. Device: {model.device}")
703
+
704
+ if tokenizer.eos_token_id is not None:
705
+ eos_ids.append(tokenizer.eos_token_id)
706
+ extra = tokenizer.convert_tokens_to_ids(["<|im_end|>", "<|endoftext|>"])
707
+ for tid in extra:
708
+ if isinstance(tid, int) and tid not in eos_ids:
709
+ eos_ids.append(tid)
710
+ log(f"EOS token ids: {eos_ids}")
711
+
712
+ def check_dashboard_command() -> Optional[Dict[str, Any]]:
713
+ """Read and consume a command file written by the dashboard."""
714
+ try:
715
+ raw = DASHBOARD_CMD_PATH.read_text(encoding="utf-8")
716
+ DASHBOARD_CMD_PATH.unlink(missing_ok=True)
717
+ return json.loads(raw)
718
+ except (FileNotFoundError, json.JSONDecodeError):
719
+ return None
720
+
721
+ def run_episode(
722
+ scenario_name: Optional[str] = None,
723
+ custom_ground_truth: Optional[Dict[str, Any]] = None,
724
+ ):
725
+ env = BioExperimentEnvironment(scenario_name=scenario_name)
726
+ obs = env.reset()
727
+
728
+ if custom_ground_truth and env._latent:
729
+ gt = custom_ground_truth
730
+ bio = env._latent.biology
731
+ if gt.get("true_markers"):
732
+ bio.true_markers = gt["true_markers"]
733
+ if gt.get("causal_mechanisms"):
734
+ bio.causal_mechanisms = gt["causal_mechanisms"]
735
+ if gt.get("true_pathways"):
736
+ bio.true_pathways = {
737
+ k: float(v) for k, v in gt["true_pathways"].items()
738
+ }
739
+
740
+ log("\n" + "=" * 70)
741
+ log(f"TASK: {obs.task.problem_statement}")
742
+ log(f"Conditions: {obs.task.conditions}")
743
+ log(f"Budget: ${obs.task.budget_limit:,.0f} | Time: {obs.task.time_limit_days:.0f} days")
744
+ if ENABLE_THINKING:
745
+ log("Reasoning mode: ENABLED")
746
+ log("=" * 70)
747
+
748
+ cumulative_reward = 0.0
749
+ write_dashboard_state(env, obs, step=0, cumulative_reward=0.0)
750
+
751
+ for step in range(MAX_EPISODE_STEPS):
752
+ cmd = check_dashboard_command()
753
+ if cmd and cmd.get("action") == "restart":
754
+ log("\n[DASHBOARD] Restart requested — ending episode early.")
755
+ break
756
+
757
+ user_msg = build_observation_prompt(obs)
758
+
759
+ messages = [
760
+ {"role": "system", "content": SYSTEM_PROMPT},
761
+ {"role": "user", "content": user_msg},
762
+ ]
763
+
764
+ if active_pipeline is not None:
765
+ prompt = f"{SYSTEM_PROMPT}\n\n{user_msg}"
766
+ else:
767
+ try:
768
+ prompt = tokenizer.apply_chat_template(
769
+ messages,
770
+ tokenize=False,
771
+ add_generation_prompt=True,
772
+ enable_thinking=ENABLE_THINKING,
773
+ )
774
+ except TypeError:
775
+ prompt = tokenizer.apply_chat_template(
776
+ messages,
777
+ tokenize=False,
778
+ add_generation_prompt=True,
779
+ )
780
+
781
+ t0 = time.time()
782
+ if active_pipeline is not None:
783
+ response = run_with_pipeline(active_pipeline, prompt)
784
+ if not response:
785
+ response = format_observation(obs)
786
+ else:
787
+ assert tokenizer is not None and model is not None
788
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
789
+ n_input = inputs["input_ids"].shape[1]
790
+ max_new = 2048 if ENABLE_THINKING else 300
791
+ with torch.no_grad():
792
+ output_ids = model.generate(
793
+ **inputs,
794
+ max_new_tokens=max_new,
795
+ do_sample=True,
796
+ temperature=0.7,
797
+ top_p=0.8,
798
+ top_k=20,
799
+ repetition_penalty=1.3,
800
+ eos_token_id=eos_ids if eos_ids else None,
801
+ )
802
+ new_tokens = output_ids[0][n_input:]
803
+ response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
804
+ gen_time = time.time() - t0
805
+
806
+ thinking = ""
807
+ if ENABLE_THINKING:
808
+ think_match = re.search(
809
+ r"<think>(.*?)</think>", response, re.DOTALL
810
+ )
811
+ if think_match:
812
+ thinking = think_match.group(1).strip()
813
+ response = response[think_match.end():].strip()
814
+ elif response.startswith("<think>"):
815
+ parts = response.split("</think>", 1)
816
+ if len(parts) == 2:
817
+ thinking = parts[0].replace("<think>", "").strip()
818
+ response = parts[1].strip()
819
+
820
+ is_last_step = (step == MAX_EPISODE_STEPS - 1)
821
+
822
+ action = parse_action(response)
823
+ if action is None:
824
+ if is_last_step:
825
+ log("\n [!] Parse failed on final step — forcing synthesize_conclusion.")
826
+ action = ExperimentAction(
827
+ action_type=ActionType.SYNTHESIZE_CONCLUSION,
828
+ justification="forced terminal conclusion",
829
+ confidence=0.5,
830
+ )
831
+ else:
832
+ log(f"\n [!] Parse failed, skipping step. Raw: {response[:150]}")
833
+ continue
834
+
835
+ completed_types = {
836
+ r.action_type for r in obs.pipeline_history if r.success
837
+ }
838
+ failed_types = {
839
+ r.action_type
840
+ for r in obs.pipeline_history
841
+ if not r.success
842
+ }
843
+
844
+ if should_force_terminal_conclusion(action, completed_types):
845
+ log(
846
+ f"\n [!] repeated completed meta step {action.action_type.value} "
847
+ f"— forcing synthesize_conclusion."
848
+ )
849
+ action = ExperimentAction(
850
+ action_type=ActionType.SYNTHESIZE_CONCLUSION,
851
+ justification="repeated completed meta step forced terminal conclusion",
852
+ confidence=action.confidence,
853
+ )
854
+ completed_types = {
855
+ r.action_type for r in obs.pipeline_history if r.success
856
+ }
857
+
858
+ skip_reason = None
859
+ if action.action_type in completed_types:
860
+ skip_reason = (
861
+ f"blocked repeat of completed step {action.action_type.value}"
862
+ )
863
+ elif action.action_type in failed_types:
864
+ if should_block_failed_reattempt(
865
+ obs.pipeline_history, action.action_type
866
+ ):
867
+ skip_reason = (
868
+ f"blocked re-attempt of failed step {action.action_type.value}"
869
+ )
870
+
871
+ if skip_reason:
872
+ if is_last_step:
873
+ log(f"\n [!] {skip_reason} on final step — forcing synthesize_conclusion.")
874
+ action = ExperimentAction(
875
+ action_type=ActionType.SYNTHESIZE_CONCLUSION,
876
+ justification="forced terminal conclusion",
877
+ confidence=0.5,
878
+ )
879
+ else:
880
+ log(f"\n [!] {skip_reason}, skipping step.")
881
+ continue
882
+
883
+ if is_last_step and action.action_type != ActionType.SYNTHESIZE_CONCLUSION:
884
+ log(f"\n [!] Final step — overriding {action.action_type.value} with synthesize_conclusion.")
885
+ action = ExperimentAction(
886
+ action_type=ActionType.SYNTHESIZE_CONCLUSION,
887
+ justification="forced terminal conclusion",
888
+ confidence=action.confidence,
889
+ )
890
+
891
+ log(f"\nStep {step + 1}: {action.action_type.value} ({gen_time:.1f}s)")
892
+ if thinking:
893
+ log(f" Thinking: {thinking[:200]}")
894
+ if action.justification:
895
+ log(f" Rationale: {action.justification}")
896
+ else:
897
+ log(" Rationale: [model did not provide one]")
898
+ if action.parameters:
899
+ log(f" Parameters: {compact_preview(action.parameters, 200)}")
900
+ elif not action.justification and response:
901
+ log(
902
+ f" Model response: "
903
+ f"{compact_preview(response, MODEL_RESPONSE_PREVIEW_CHARS)}"
904
+ )
905
+
906
+ obs = env.step(action)
907
+
908
+ if obs.latest_output:
909
+ lo = obs.latest_output
910
+ status = "OK" if lo.success else "FAIL"
911
+ log(f" [{status}] {lo.summary}")
912
+ if lo.warnings:
913
+ log(f" Warnings: {lo.warnings}")
914
+
915
+ step_reward = obs.reward
916
+ cumulative_reward += step_reward
917
+ log(f" Reward: {step_reward:+.3f} (cum: {cumulative_reward:+.3f})")
918
+ log(f" Budget: ${obs.resource_usage.budget_remaining:,.0f} | Time: {obs.resource_usage.time_remaining_days:.0f}d")
919
+
920
+ write_dashboard_state(
921
+ env, obs,
922
+ step=step + 1,
923
+ cumulative_reward=cumulative_reward,
924
+ model_response=response,
925
+ model_thinking=thinking,
926
+ action=action,
927
+ gen_time=gen_time,
928
+ episode_done=obs.done,
929
+ )
930
+
931
+ if obs.rule_violations:
932
+ log(f" Violations: {obs.rule_violations}")
933
+
934
+ if obs.done:
935
+ break
936
+
937
+ log(f"\n{'=' * 70}")
938
+ log("EPISODE COMPLETE" if obs.done else f"MAX STEPS ({MAX_EPISODE_STEPS})")
939
+ log(f" Steps: {obs.step_index}")
940
+ log(f" Total reward: {cumulative_reward:+.3f}")
941
+ log(f" Budget used: ${obs.resource_usage.budget_used:,.0f}")
942
+ log(f" Time used: {obs.resource_usage.time_used_days:.0f} days")
943
+ if obs.conclusions:
944
+ log(" Conclusions:")
945
+ for c in obs.conclusions:
946
+ log(f" [{c.claim_type}, conf={c.confidence:.2f}] {c.claim}")
947
+ if c.top_markers:
948
+ log(f" Markers: {c.top_markers}")
949
+ if c.causal_mechanisms:
950
+ log(f" Mechanisms: {c.causal_mechanisms}")
951
+ if c.predicted_pathways:
952
+ log(f" Pathways: {c.predicted_pathways}")
953
+ log("=" * 70)
954
+
955
+ DASHBOARD_CMD_PATH.unlink(missing_ok=True)
956
+ run_episode()
957
+
958
+ while True:
959
+ log("\nWaiting for dashboard command (restart / new task) ...")
960
+ while True:
961
+ cmd = check_dashboard_command()
962
+ if cmd:
963
+ break
964
+ time.sleep(1.0)
965
+
966
+ action_type = cmd.get("action", "restart")
967
+ if action_type == "quit":
968
+ log("Quit requested.")
969
+ break
970
+
971
+ scenario = cmd.get("scenario_name")
972
+ ground_truth = cmd.get("ground_truth")
973
+ log(f"\n[DASHBOARD] {action_type} — scenario={scenario}")
974
+ run_episode(scenario_name=scenario, custom_ground_truth=ground_truth)
975
+
976
+
977
+ if __name__ == "__main__":
978
+ main()
server/app.py CHANGED
@@ -6,8 +6,12 @@ Endpoints:
 - GET /state: Get current environment state
 - GET /schema: Get action/observation schemas
 - WS /ws: WebSocket endpoint for persistent sessions
+- GET /: Demo UI
 """
 
+import os
+from pathlib import Path
+
 try:
     from openenv.core.env_server.http_server import create_app
 except Exception as e:  # pragma: no cover
@@ -16,6 +20,7 @@ except Exception as e:  # pragma: no cover
         "Install dependencies with 'uv sync'"
     ) from e
 
+from fastapi.responses import HTMLResponse
 from models import ExperimentAction, ExperimentObservation
 from .hackathon_environment import BioExperimentEnvironment
 
@@ -24,12 +29,24 @@ app = create_app(
     ExperimentAction,
     ExperimentObservation,
     env_name="bio_experiment",
-    max_concurrent_envs=1,
+    max_concurrent_envs=int(os.environ.get("MAX_ENVS", "4")),
 )
 
+# Serve demo UI at root
+DEMO_HTML = Path(__file__).resolve().parent.parent / "demo.html"
+
+
+@app.get("/", response_class=HTMLResponse)
+async def demo_ui():
+    if DEMO_HTML.exists():
+        return HTMLResponse(content=DEMO_HTML.read_text(), status_code=200)
+    return HTMLResponse(content="<h1>BioEnv API</h1><p>Visit /docs for API documentation.</p>", status_code=200)
+
 
-def main(host: str = "0.0.0.0", port: int = 8000):
+def main(host: str = "0.0.0.0", port: int | None = None):
     import uvicorn
+    if port is None:
+        port = int(os.environ.get("PORT", "8000"))
     uvicorn.run(app, host=host, port=port)
 
 
@@ -37,9 +54,6 @@ if __name__ == "__main__":
     import argparse
     parser = argparse.ArgumentParser()
     parser.add_argument("--host", default="0.0.0.0")
-    parser.add_argument("--port", type=int, default=8000)
+    parser.add_argument("--port", type=int, default=None)
     args = parser.parse_args()
-    if args.host == "0.0.0.0" and args.port == 8000:
-        main()
-    else:
-        main(host=args.host, port=args.port)
+    main(host=args.host, port=args.port)
server/biology/__init__.py ADDED
File without changes
server/biology/gene_index.py ADDED
@@ -0,0 +1,225 @@
+"""Pathway-aware gene similarity index for structured reward scoring.
+
+Uses gseapy pathway libraries (KEGG + Reactome) to build binary pathway
+membership vectors per gene, enabling cosine-similarity-based set scoring
+instead of substring matching.
+
+Mechanism comparison uses sentence-transformers for semantic similarity.
+"""
+
+from __future__ import annotations
+
+import logging
+from functools import lru_cache
+from typing import Dict, List, Optional, Tuple
+
+import numpy as np
+
+logger = logging.getLogger(__name__)
+
+_PATHWAY_SETS: Optional[Dict[str, List[str]]] = None
+_PATHWAY_NAMES: Optional[List[str]] = None
+_GENE_TO_PATHWAY_IDX: Optional[Dict[str, List[int]]] = None
+_N_PATHWAYS: int = 0
+
+_SENTENCE_MODEL = None
+
+
+def _ensure_pathway_index() -> None:
+    """Lazily build the inverted gene→pathway index on first use."""
+    global _PATHWAY_SETS, _PATHWAY_NAMES, _GENE_TO_PATHWAY_IDX, _N_PATHWAYS
+
+    if _PATHWAY_NAMES is not None:
+        return
+
+    try:
+        import gseapy as gp
+    except ImportError:
+        logger.warning("gseapy not installed; pathway scoring will use fallback.")
+        _PATHWAY_SETS = {}
+        _PATHWAY_NAMES = []
+        _GENE_TO_PATHWAY_IDX = {}
+        _N_PATHWAYS = 0
+        return
+
+    combined: Dict[str, List[str]] = {}
+    for lib_name in ("KEGG_2021_Human", "Reactome_2022"):
+        try:
+            combined.update(gp.get_library(lib_name))
+        except Exception as exc:
+            logger.warning("Failed to load %s: %s", lib_name, exc)
+
+    _PATHWAY_SETS = combined
+    _PATHWAY_NAMES = sorted(combined.keys())
+    _N_PATHWAYS = len(_PATHWAY_NAMES)
+
+    inv: Dict[str, List[int]] = {}
+    for idx, pw_name in enumerate(_PATHWAY_NAMES):
+        for gene in combined[pw_name]:
+            gene_upper = gene.upper().strip()
+            inv.setdefault(gene_upper, []).append(idx)
+
+    _GENE_TO_PATHWAY_IDX = inv
+    logger.info(
+        "Pathway index built: %d pathways, %d genes indexed.",
+        _N_PATHWAYS, len(inv),
+    )
+
+
+def _ensure_sentence_model():
+    """Lazily load the sentence-transformer model."""
+    global _SENTENCE_MODEL
+    if _SENTENCE_MODEL is not None:
+        return
+
+    try:
+        from sentence_transformers import SentenceTransformer
+        _SENTENCE_MODEL = SentenceTransformer("all-MiniLM-L6-v2")
+    except ImportError:
+        logger.warning(
+            "sentence-transformers not installed; mechanism scoring will use fallback."
+        )
+        _SENTENCE_MODEL = None
+
+
+def gene_vector(gene: str) -> np.ndarray:
+    """L2-normalised binary pathway membership vector for *gene*."""
+    _ensure_pathway_index()
+    vec = np.zeros(_N_PATHWAYS, dtype=np.float32)
+    indices = _GENE_TO_PATHWAY_IDX.get(gene.upper().strip(), [])
+    if indices:
+        vec[indices] = 1.0
+    norm = np.linalg.norm(vec)
+    if norm > 0:
+        vec /= norm
+    return vec
+
+
+def pathway_similarity(g1: str, g2: str) -> float:
+    """Cosine similarity between two genes in pathway space."""
+    v1 = gene_vector(g1)
+    v2 = gene_vector(g2)
+    dot = float(np.dot(v1, v2))
+    return max(0.0, min(1.0, dot))
+
+
+def marker_set_score(
+    predicted: List[str],
+    truth: List[str],
+    sigma: float = 0.3,
+) -> float:
+    """Pathway-weighted Gaussian set similarity for marker genes.
+
+    For each true marker, finds the best-matching predicted gene by
+    pathway cosine similarity, then applies a Gaussian kernel:
+        score_i = exp(-d^2 / (2 * sigma^2))  where d = 1 - sim
+    Returns the mean score over all true markers.
+    """
+    if not truth:
+        return 0.0
+    if not predicted:
+        return 0.0
+
+    _ensure_pathway_index()
+
+    if _N_PATHWAYS == 0:
+        return _fallback_marker_score(predicted, truth)
+
+    pred_vecs = [gene_vector(g) for g in predicted]
+    scores: List[float] = []
+
+    for true_gene in truth:
+        tv = gene_vector(true_gene)
+        best_sim = 0.0
+        for pv in pred_vecs:
+            sim = float(np.dot(tv, pv))
+            if sim > best_sim:
+                best_sim = sim
+        d = 1.0 - best_sim
+        scores.append(float(np.exp(-(d ** 2) / (2.0 * sigma ** 2))))
+
+    return sum(scores) / len(scores)
+
+
+def _fallback_marker_score(predicted: List[str], truth: List[str]) -> float:
+    """Exact-match fallback when pathway data is unavailable."""
+    pred_set = {g.upper().strip() for g in predicted}
+    hits = sum(1 for g in truth if g.upper().strip() in pred_set)
+    return hits / len(truth) if truth else 0.0
+
+
+def mechanism_set_score(predicted: List[str], truth: List[str]) -> float:
+    """Sentence-transformer semantic similarity for mechanism strings.
+
+    For each truth mechanism, finds the best-matching predicted mechanism
+    by cosine similarity and returns the mean of best matches.
+    """
+    if not truth:
+        return 0.0
+    if not predicted:
+        return 0.0
+
+    _ensure_sentence_model()
+
+    if _SENTENCE_MODEL is None:
+        return _fallback_mechanism_score(predicted, truth)
+
+    pred_embs = _SENTENCE_MODEL.encode(predicted, convert_to_numpy=True)
+    truth_embs = _SENTENCE_MODEL.encode(truth, convert_to_numpy=True)
+
+    pred_norms = pred_embs / (
+        np.linalg.norm(pred_embs, axis=1, keepdims=True) + 1e-9
+    )
+    truth_norms = truth_embs / (
+        np.linalg.norm(truth_embs, axis=1, keepdims=True) + 1e-9
+    )
+
+    sim_matrix = truth_norms @ pred_norms.T
+    best_per_truth = sim_matrix.max(axis=1)
+    return float(np.mean(np.clip(best_per_truth, 0.0, 1.0)))
+
+
+def _fallback_mechanism_score(predicted: List[str], truth: List[str]) -> float:
+    """Token-overlap fallback when sentence-transformers is unavailable."""
+    scores: List[float] = []
+    for t in truth:
+        t_tokens = set(t.lower().split())
+        best = 0.0
+        for p in predicted:
+            p_tokens = set(p.lower().split())
+            union = t_tokens | p_tokens
+            if union:
+                overlap = len(t_tokens & p_tokens) / len(union)
+                best = max(best, overlap)
+        scores.append(best)
+    return sum(scores) / len(scores) if scores else 0.0
+
+
+def score_pathways(
+    predicted: Dict[str, float],
+    truth: Dict[str, float],
+) -> float:
+    """Score predicted pathway activations against ground truth.
+
+    Uses normalised key matching with activity-level weighting.
+    """
+    if not truth:
+        return 0.0
+    if not predicted:
+        return 0.0
+
+    pred_norm = {k.lower().strip(): v for k, v in predicted.items()}
+    total_weight = 0.0
+    weighted_score = 0.0
+
+    for pw, true_activity in truth.items():
+        pw_key = pw.lower().strip()
+        weight = true_activity
+        total_weight += weight
+        if pw_key in pred_norm:
+            pred_activity = pred_norm[pw_key]
+            diff = abs(pred_activity - true_activity)
+            match_score = max(0.0, 1.0 - diff)
+            weighted_score += weight * match_score
+
+    return weighted_score / total_weight if total_weight > 0 else 0.0
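
The Gaussian kernel in `marker_set_score` above turns a best-match cosine similarity into a per-marker score. A stdlib-only sketch of just that kernel (the helper name `gaussian_set_score` is ours for illustration, not part of the commit):

```python
import math


def gaussian_set_score(best_sims, sigma=0.3):
    """Mean Gaussian-kernel score over best-match similarities.

    Mirrors the kernel used in marker_set_score:
        score_i = exp(-d^2 / (2 * sigma^2))  with d = 1 - sim
    so a perfect match (sim = 1.0) scores 1.0, while an unrelated
    gene (sim = 0.0) scores exp(-1 / (2 * 0.09)) ≈ 0.0039.
    """
    scores = [
        math.exp(-((1.0 - s) ** 2) / (2.0 * sigma ** 2))
        for s in best_sims
    ]
    return sum(scores) / len(scores)
```

With the default `sigma = 0.3`, partial pathway overlap still earns partial credit, but genes with no shared pathways contribute almost nothing to the mean.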
server/hackathon_environment.py CHANGED
@@ -28,7 +28,7 @@ from server.rules.engine import RuleEngine
 from server.rewards.reward import RewardBreakdown, RewardComputer
 from server.simulator.latent_state import FullLatentState
 from server.simulator.noise import NoiseModel
-from server.simulator.transition import ACTION_COSTS, TransitionEngine
+from server.simulator.transition import ACTION_COSTS, TransitionEngine, compute_action_cost
 from server.tasks.generator import TaskGenerator

@@ -70,8 +70,8 @@ class BioExperimentEnvironment(Environment):

     # ── Environment interface ───────────────────────────────────────────

-    def reset(self) -> ExperimentObservation:
-        seed = hash(uuid4()) % (2**31)
+    def reset(self, seed: Optional[int] = None) -> ExperimentObservation:
+        seed = seed if seed is not None else hash(uuid4()) % (2**31)
         self._noise.reseed(seed)
         self._state = State(episode_id=str(uuid4()), step_count=0)

@@ -116,7 +116,7 @@
             action, prev_state, self._latent, result.output, hard_v, soft_v,
         )

-        cost_budget, cost_time = ACTION_COSTS.get(action.action_type, (0, 0))
+        cost_budget, cost_time = compute_action_cost(action)
         self._history.append(PipelineStepRecord(
             step_index=self._state.step_count,
             action_type=action.action_type,

@@ -143,7 +143,11 @@
         terminal_rb = RewardBreakdown()
         if done:
             terminal_rb = self._rewards.terminal_reward(
-                self._latent, self._conclusions, self._task.success_criteria,
+                self._latent,
+                self._conclusions,
+                self._task.success_criteria,
+                discovered_markers=self._discovered_markers,
+                candidate_mechanisms=self._candidate_mechanisms,
             )

         total_reward = step_rb.total + terminal_rb.total

@@ -158,6 +162,7 @@
             latest_output=result.output,
             rule_violations=hard_v + soft_v,
             reward_breakdown=breakdown,
+            metadata_extra={"reward_breakdown": breakdown},
         )

     @property

@@ -179,10 +184,18 @@
         latest_output: Optional[IntermediateOutput] = None,
         rule_violations: Optional[List[str]] = None,
         reward_breakdown: Optional[Dict[str, float]] = None,
+        metadata_extra: Optional[Dict[str, Any]] = None,
     ) -> ExperimentObservation:
         assert self._task is not None
         assert self._latent is not None
         res = self._latent.resources
+        meta: Dict[str, Any] = {
+            "episode_id": self._state.episode_id,
+            "step": self._state.step_count,
+            "cumulative_reward": self._cumulative_reward,
+        }
+        if metadata_extra:
+            meta.update(metadata_extra)
         return ExperimentObservation(
             task=self._task,
             step_index=self._state.step_count,

@@ -205,14 +218,10 @@
             subagent_outputs=list(self._subagent_outputs),
             conclusions=list(self._conclusions),
             rule_violations=rule_violations or [],
-            step_reward_breakdown=reward_breakdown or {},
+            step_reward_breakdown={},
             done=done,
             reward=reward,
-            metadata={
-                "episode_id": self._state.episode_id,
-                "step": self._state.step_count,
-                "cumulative_reward": self._cumulative_reward,
-            },
+            metadata=meta,
         )

     def _compute_uncertainty_summary(self) -> Dict[str, float]:

@@ -228,12 +237,22 @@
     ) -> None:
         if action.action_type == ActionType.MARKER_SELECTION:
             markers = output.data.get("markers", [])
-            self._discovered_markers.extend(markers)
+            existing = set(self._discovered_markers)
+            for m in markers:
+                if m not in existing:
+                    self._discovered_markers.append(m)
+                    existing.add(m)
         if action.action_type == ActionType.REGULATORY_NETWORK_INFERENCE:
             regs = output.data.get("top_regulators", [])
-            self._candidate_mechanisms.extend(regs)
+            existing = set(self._candidate_mechanisms)
+            for r in regs:
+                if r not in existing:
+                    self._candidate_mechanisms.append(r)
+                    existing.add(r)
         if action.action_type == ActionType.PATHWAY_ENRICHMENT:
             pathways = output.data.get("top_pathways", [])
-            self._candidate_mechanisms.extend(
-                [p["pathway"] for p in pathways if isinstance(p, dict)]
-            )
+            existing = set(self._candidate_mechanisms)
+            for p in pathways:
+                if isinstance(p, dict) and p["pathway"] not in existing:
+                    self._candidate_mechanisms.append(p["pathway"])
+                    existing.add(p["pathway"])
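
The three branches of `_record_discoveries` in this file share one pattern: append new items while preserving first-seen order and skipping duplicates. As a standalone sketch (the helper name `extend_unique` is illustrative, not in the repo):

```python
def extend_unique(seq, new_items):
    """Append items from new_items that are not already in seq.

    Preserves first-seen order, unlike converting to a set, which
    matters when downstream scoring treats earlier discoveries as
    higher-priority. O(1) membership checks via a shadow set.
    """
    existing = set(seq)
    for item in new_items:
        if item not in existing:
            seq.append(item)
            existing.add(item)
    return seq
```

Using a plain `set` for `self._discovered_markers` would lose discovery order, which is why the diff keeps a list plus a shadow set.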
server/requirements.txt CHANGED
@@ -1,6 +1,9 @@
-openenv[core]>=0.2.0
+openenv-core[core]>=0.2.0
 fastapi>=0.115.0
 uvicorn>=0.24.0
-
-
-
+numpy>=1.24.0
+scipy>=1.10.0
+pydantic>=2.0.0
+gseapy>=1.1.0
+sentence-transformers>=3.0.0
+scikit-learn>=1.4.0
server/rewards/reward.py CHANGED
@@ -15,7 +15,7 @@ Potential-based shaping

 The final step reward is:
     R_t = r_validity + r_ordering + r_info_gain + r_efficiency
-        + r_novelty + r_penalty + γ[φ(s_{t+1}) − φ(s_t)]
+        + r_novelty + r_penalty + [φ(s_{t+1}) − φ(s_t)]

 The terminal reward adds:
     R_T += r_terminal

@@ -32,9 +32,15 @@ from models import (
     ExperimentAction,
     IntermediateOutput,
     META_ACTIONS,
+    TOOL_REGISTRY,
     WET_LAB_ACTIONS,
 )

+from server.biology.gene_index import (
+    marker_set_score,
+    mechanism_set_score,
+    score_pathways,
+)
 from server.simulator.latent_state import FullLatentState

@@ -84,20 +90,16 @@ class RewardComputer:

     Parameters
     ----------
-    gamma : float
-        Discount factor for potential-based shaping (default 0.99).
     efficiency_weight : float
         Relative importance of resource efficiency.
     """

     def __init__(
         self,
-        gamma: float = 0.99,
         efficiency_weight: float = 0.3,
         info_gain_weight: float = 0.4,
         validity_weight: float = 0.3,
     ):
-        self.gamma = gamma
         self.w_eff = efficiency_weight
         self.w_ig = info_gain_weight
         self.w_val = validity_weight

@@ -124,11 +126,19 @@

         rb.validity = self.w_val * (1.0 if output.success else 0.0)

-        # ordering bonus: +0.2 if the step was a natural next step
-        rb.ordering = 0.2 * self._ordering_score(action, prev_state)
+        ordering_score = self._ordering_score(action, prev_state)
+        rb.ordering = 0.2 * ordering_score
+        if ordering_score < 0:
+            rb.penalty += ordering_score * 0.3

         # information gain proxy: quality × (1 - uncertainty)
         rb.info_gain = self.w_ig * output.quality_score * (1.0 - output.uncertainty)
+        if action.action_type in META_ACTIONS and not (
+            prev_state.progress.de_performed
+            or prev_state.progress.cells_clustered
+        ):
+            # Meta actions before substantive analysis should not dominate reward.
+            rb.info_gain *= 0.2

         # efficiency: normalised cost relative to budget
         budget_frac = (

@@ -141,13 +151,25 @@
         if not soft_violations:
             rb.novelty = 0.1

+        # tool-modality fit bonus/penalty
+        tool_fit = self._tool_fit_score(action, prev_state)
+        rb.components["tool_fit"] = tool_fit
+        rb.validity += 0.15 * tool_fit
+
         # penalties
         rb.penalty = -0.15 * len(soft_violations)
-
-        # potential-based shaping
+        if action.action_type in META_ACTIONS and not (
+            prev_state.progress.de_performed
+            or prev_state.progress.cells_clustered
+        ):
+            rb.penalty -= 0.25
+            rb.components["premature_meta_action_penalty"] = -0.25
+
+        # potential-based shaping (γ=1 so it doesn't depend on the
+        # training algorithm's discount factor)
         phi_prev = self._potential(prev_state)
         phi_next = self._potential(next_state)
-        rb.shaping = self.gamma * phi_next - phi_prev
+        rb.shaping = phi_next - phi_prev

         return rb

@@ -158,8 +180,12 @@
         state: FullLatentState,
         conclusions: List[ConclusionClaim],
         task_success_criteria: List[str],
+        discovered_markers: Optional[List[str]] = None,
+        candidate_mechanisms: Optional[List[str]] = None,
     ) -> RewardBreakdown:
         rb = RewardBreakdown()
+        discovered_markers = discovered_markers or []
+        candidate_mechanisms = candidate_mechanisms or []

         # pipeline completeness (0-1)
         completeness = self._completeness(state)

@@ -183,11 +209,22 @@
         overconf = self._overconfidence_penalty(state, conclusions)
         rb.components["overconfidence_penalty"] = overconf

+        discovery_alignment = self._discovery_alignment(
+            state,
+            discovered_markers,
+            candidate_mechanisms,
+        )
+        discovery_error_penalty = -2.5 * (1.0 - discovery_alignment)
+        rb.components["discovery_alignment"] = discovery_alignment
+        rb.components["discovery_error_penalty"] = discovery_error_penalty
+
+        eff_bonus = (budget_eff + time_eff) / 2.0 if completeness >= 0.3 else 0.0
         rb.terminal = (
             3.0 * completeness
             + 4.0 * calibration
-            + 1.0 * (budget_eff + time_eff) / 2.0
+            + 1.0 * eff_bonus
             + overconf
+            + discovery_error_penalty
         )
         return rb

@@ -196,7 +233,7 @@
     def _ordering_score(
         self, action: ExperimentAction, s: FullLatentState
     ) -> float:
-        """Heuristic: 1.0 if this step naturally follows the current progress."""
+        """Heuristic: 1.0 if natural next, 0.3 if acceptable, -1.0 if premature."""
        at = action.action_type
        p = s.progress
        NATURAL_NEXT = {

@@ -215,10 +252,26 @@
                 p.de_performed or p.cells_clustered
             ) and not p.conclusion_reached,
         }
-        return 1.0 if NATURAL_NEXT.get(at, False) else 0.3
+        if NATURAL_NEXT.get(at, False):
+            return 1.0
+
+        has_evidence = any([
+            p.cells_clustered, p.de_performed, p.trajectories_inferred,
+            p.pathways_analyzed, p.networks_inferred, p.markers_discovered,
+        ])
+        if at in META_ACTIONS and not has_evidence:
+            return -1.0
+
+        return 0.3

     def _potential(self, s: FullLatentState) -> float:
-        """Progress potential φ(s) — counts completed milestones."""
+        """Progress potential φ(s) — counts completed milestones.
+
+        Returns 0.0 at terminal states so that the shaping signal
+        telescopes correctly over the episode.
+        """
+        if s.progress.conclusion_reached:
+            return 0.0
         p = s.progress
         milestones = [
             p.samples_collected,

@@ -252,9 +305,38 @@
     def _calibration(
         self, s: FullLatentState, conclusions: List[ConclusionClaim]
     ) -> float:
+        """Structured set-similarity calibration against hidden ground truth.
+
+        Uses pathway-weighted Gaussian similarity for markers, semantic
+        similarity for mechanisms, and activity-weighted matching for pathways.
+        Falls back to legacy substring matching when structured fields are empty.
+        """
         if not conclusions:
             return 0.0

+        pred_markers = [g for c in conclusions for g in c.top_markers]
+        pred_mechs = [m for c in conclusions for m in c.causal_mechanisms]
+        pred_pathways = {
+            p: v for c in conclusions for p, v in c.predicted_pathways.items()
+        }
+
+        has_structured = bool(pred_markers or pred_mechs or pred_pathways)
+
+        if has_structured:
+            m_score = marker_set_score(pred_markers, s.biology.true_markers)
+            mech_score = mechanism_set_score(
+                pred_mechs, s.biology.causal_mechanisms
+            )
+            pw_score = score_pathways(pred_pathways, s.biology.true_pathways)
+            return 0.50 * m_score + 0.35 * mech_score + 0.15 * pw_score
+
+        return self._legacy_calibration(s, conclusions)
+
+    @staticmethod
+    def _legacy_calibration(
+        s: FullLatentState, conclusions: List[ConclusionClaim]
+    ) -> float:
+        """Substring-based calibration kept for backward compatibility."""
         true_mechanisms = set(s.biology.causal_mechanisms)
         true_markers = set(s.biology.true_markers)
         score = 0.0

@@ -270,16 +352,121 @@
             score -= 0.3
         return max(0.0, min(1.0, score / max(n, 1)))

+    _METHOD_TO_TOOL: Dict[str, str] = {
+        "scanpy.pp.calculate_qc_metrics": "Scanpy",
+        "scanpy.pp.filter_cells": "Scanpy",
+        "scanpy.pp.filter_genes": "Scanpy",
+        "scanpy.pp.normalize_total": "Scanpy",
+        "scanpy.pp.log1p": "Scanpy",
+        "scanpy.pp.highly_variable_genes": "Scanpy",
+        "scanpy.pp.neighbors": "Scanpy",
+        "scanpy.tl.leiden": "Leiden",
+        "scanpy.tl.louvain": "Louvain",
+        "scanpy.tl.rank_genes_groups": "Scanpy",
+        "scanpy.tl.paga": "PAGA",
+        "scanpy.tl.umap": "UMAP",
+        "gseapy.prerank": "Scanpy",
+        "gseapy.gsea": "Scanpy",
+        "10x_chromium": "CellRanger",
+        "NovaSeq": "CellRanger",
+    }
+
+    @staticmethod
+    def _tool_fit_score(
+        action: ExperimentAction, s: FullLatentState
+    ) -> float:
+        """Score how well the chosen tool matches the task modality.
+
+        Returns +1.0 for a perfect match, 0.0 if no tool specified,
+        -1.0 for a known tool used on an incompatible modality.
+        """
+        method = action.method
+        if not method:
+            return 0.0
+        resolved = RewardComputer._METHOD_TO_TOOL.get(method, method)
+        tool_spec = TOOL_REGISTRY.get(resolved)
+        if tool_spec is None:
+            return -0.5
+        modality = getattr(s, "task_modality", None)
+        if not modality or not tool_spec.modalities:
+            return 0.0
+        if modality in tool_spec.modalities:
+            return 1.0
+        return -1.0
+
     def _overconfidence_penalty(
         self, s: FullLatentState, conclusions: List[ConclusionClaim]
     ) -> float:
-        """Penalise high-confidence claims that disagree with ground truth."""
+        """Penalise high-confidence claims that disagree with ground truth.
+
+        Checks structured fields (top_markers, causal_mechanisms) first;
+        falls back to claim substring matching for backward compatibility.
+        """
         penalty = 0.0
-        true_set = set(
-            m.lower() for m in s.biology.causal_mechanisms + s.biology.true_markers
-        )
+        true_markers_lower = {m.lower() for m in s.biology.true_markers}
+        true_mechs_lower = {m.lower() for m in s.biology.causal_mechanisms}
+        true_set = true_markers_lower | true_mechs_lower
+
         for c in conclusions:
-            is_correct = any(t in c.claim.lower() for t in true_set)
-            if c.confidence > 0.8 and not is_correct:
+            if c.confidence <= 0.8:
+                continue
+
+            has_structured = bool(c.top_markers or c.causal_mechanisms)
+            if has_structured:
+                marker_hit = any(
+                    g.upper().strip() in {m.upper() for m in s.biology.true_markers}
+                    for g in c.top_markers
+                )
+                mech_hit = any(
+                    any(kw in m.lower() for kw in t.lower().split())
+                    for m in c.causal_mechanisms
+                    for t in s.biology.causal_mechanisms
+                )
+                is_correct = marker_hit or mech_hit
+            else:
+                is_correct = any(t in c.claim.lower() for t in true_set)
+
+            if not is_correct:
                 penalty -= 0.5 * c.confidence
+
         return penalty
+
+    def _discovery_alignment(
+        self,
+        s: FullLatentState,
+        discovered_markers: List[str],
+        candidate_mechanisms: List[str],
+    ) -> float:
+        """Symmetric end-of-episode similarity for discovered biology.
+
+        Forward scoring measures recall against hidden truth. Reverse scoring
+        measures how well the agent's discoveries map back onto real biology,
+        which penalizes extra hallucinated markers or mechanisms.
+        """
+        components: List[float] = []
+
+        if s.biology.true_markers or discovered_markers:
+            marker_recall = marker_set_score(
+                discovered_markers,
+                s.biology.true_markers,
+            )
+            marker_precision = marker_set_score(
+                s.biology.true_markers,
+                discovered_markers,
+            )
+            components.append((marker_recall + marker_precision) / 2.0)
+
+        if s.biology.causal_mechanisms or candidate_mechanisms:
+            mechanism_recall = mechanism_set_score(
+                candidate_mechanisms,
+                s.biology.causal_mechanisms,
+            )
+            mechanism_precision = mechanism_set_score(
+                s.biology.causal_mechanisms,
+                candidate_mechanisms,
+            )
+            components.append((mechanism_recall + mechanism_precision) / 2.0)
+
+        if not components:
+            return 1.0
+        return sum(components) / len(components)
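
The diff's shaping change (dropping `self.gamma` and forcing `_potential` to 0 at terminal states) relies on the telescoping property of potential differences. A minimal sketch of why that works (the helper name `shaping_rewards` is ours, not part of the commit):

```python
def shaping_rewards(potentials):
    """Per-step shaping terms φ(s_{t+1}) − φ(s_t) along a trajectory.

    With γ = 1 the sum telescopes: it equals φ(s_T) − φ(s_0),
    so if the terminal potential is forced to 0, the total shaping
    added over an episode is exactly -φ(s_0), independent of the
    path the agent took through the milestones.
    """
    return [b - a for a, b in zip(potentials, potentials[1:])]


# Hypothetical milestone potentials for one episode; the final entry
# is 0.0 because _potential returns 0 once conclusion_reached is set.
phi = [0.0, 1.0, 3.0, 5.0, 0.0]
total_shaping = sum(shaping_rewards(phi))
```

Because the total depends only on the endpoints, the shaping term cannot change which policies are optimal; it only densifies the learning signal along the way.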
server/rules/engine.py CHANGED
@@ -10,7 +10,7 @@ from dataclasses import dataclass
10
  from enum import Enum
11
  from typing import List
12
 
13
- from models import ActionType, ExperimentAction
14
 
15
  from server.simulator.latent_state import FullLatentState
16
 
@@ -10,6 +10,7 @@
 from enum import Enum
 from typing import List
 
+from models import ActionType, ExperimentAction, TOOL_REGISTRY
 
 from server.simulator.latent_state import FullLatentState
 
@@ -32,6 +32,19 @@ class RuleEngine:
     latent state before each action is applied.
     """
 
+    @staticmethod
+    def _has_analysis_evidence(s: FullLatentState) -> bool:
+        p = s.progress
+        return any([
+            p.cells_clustered,
+            p.de_performed,
+            p.trajectories_inferred,
+            p.pathways_analyzed,
+            p.networks_inferred,
+            p.markers_discovered,
+            p.markers_validated,
+        ])
+
     def check(
         self, action: ExperimentAction, state: FullLatentState
     ) -> List[RuleViolation]:
@@ -40,6 +53,7 @@ class RuleEngine:
         violations.extend(self._check_resource_constraints(action, state))
         violations.extend(self._check_redundancy(action, state))
         violations.extend(self._check_causal_validity(action, state))
+        violations.extend(self._check_tool_compatibility(action, state))
        return violations
 
     def hard_violations(self, violations: List[RuleViolation]) -> List[str]:
@@ -106,6 +120,9 @@ class RuleEngine:
             ActionType.CULTURE_CELLS: [
                 ("samples_collected", "Cannot culture without samples"),
             ],
+            ActionType.SYNTHESIZE_CONCLUSION: [
+                ("data_normalized", "Cannot synthesize conclusions before data normalization"),
+            ],
         }
 
         for flag, msg in REQUIRES.get(at, []):
@@ -127,22 +144,22 @@ class RuleEngine:
             vs.append(RuleViolation(
                 rule_id="budget_exhausted",
                 severity=Severity.HARD,
-                message="Budget exhausted no further actions possible",
+                message="Budget exhausted - no further actions possible",
             ))
         if s.resources.time_exhausted:
             vs.append(RuleViolation(
                 rule_id="time_exhausted",
                 severity=Severity.HARD,
-                message="Time limit reached no further actions possible",
+                message="Time limit reached - no further actions possible",
             ))
 
         remaining = s.resources.budget_remaining
-        from server.simulator.transition import ACTION_COSTS
-        cost, _ = ACTION_COSTS.get(action.action_type, (0, 0))
+        from server.simulator.transition import compute_action_cost
+        cost, _ = compute_action_cost(action)
         if cost > remaining and remaining > 0:
             vs.append(RuleViolation(
                 rule_id="budget_insufficient",
-                severity=Severity.SOFT,
+                severity=Severity.HARD,
                 message=f"Action costs ${cost:,.0f} but only ${remaining:,.0f} remains",
             ))
         return vs
@@ -163,13 +180,23 @@ class RuleEngine:
             ActionType.RUN_QC: "qc_performed",
             ActionType.FILTER_DATA: "data_filtered",
             ActionType.NORMALIZE_DATA: "data_normalized",
+            ActionType.CLUSTER_CELLS: "cells_clustered",
+            ActionType.DIFFERENTIAL_EXPRESSION: "de_performed",
+            ActionType.TRAJECTORY_ANALYSIS: "trajectories_inferred",
+            ActionType.PATHWAY_ENRICHMENT: "pathways_analyzed",
+            ActionType.REGULATORY_NETWORK_INFERENCE: "networks_inferred",
+            ActionType.MARKER_SELECTION: "markers_discovered",
+            ActionType.VALIDATE_MARKER: "markers_validated",
+            ActionType.DESIGN_FOLLOWUP: "followup_designed",
+            ActionType.REQUEST_SUBAGENT_REVIEW: "subagent_review_requested",
+            ActionType.SYNTHESIZE_CONCLUSION: "conclusion_reached",
         }
         flag = REDUNDANT.get(at)
         if flag and getattr(p, flag, False):
             vs.append(RuleViolation(
                 rule_id=f"redundant_{at.value}",
-                severity=Severity.SOFT,
-                message=f"Step '{at.value}' already completed — redundant action",
+                severity=Severity.HARD,
+                message=f"Step '{at.value}' already completed — redundant action blocked",
             ))
         return vs
 
@@ -179,12 +206,36 @@ class RuleEngine:
         self, action: ExperimentAction, s: FullLatentState
     ) -> List[RuleViolation]:
         vs: List[RuleViolation] = []
+        has_analysis_evidence = self._has_analysis_evidence(s)
+
+        if action.action_type == ActionType.DESIGN_FOLLOWUP:
+            if not has_analysis_evidence:
+                vs.append(RuleViolation(
+                    rule_id="premature_followup_design",
+                    severity=Severity.HARD,
+                    message=(
+                        "Follow-up design without prior analysis is blocked; "
+                        "complete wet-lab and computational steps first"
+                    ),
+                ))
+
+        if action.action_type == ActionType.REQUEST_SUBAGENT_REVIEW:
+            if not has_analysis_evidence:
+                vs.append(RuleViolation(
+                    rule_id="premature_subagent_review",
+                    severity=Severity.HARD,
+                    message=(
+                        "Subagent review without prior analysis is blocked; "
+                        "generate evidence first"
+                    ),
+                ))
+
         if action.action_type == ActionType.SYNTHESIZE_CONCLUSION:
             if not s.progress.de_performed and not s.progress.cells_clustered:
                 vs.append(RuleViolation(
                     rule_id="premature_conclusion",
-                    severity=Severity.SOFT,
-                    message="Synthesising conclusion without substantive analysis",
+                    severity=Severity.HARD,
+                    message="Cannot synthesise conclusion without substantive analysis",
                 ))
 
         claims = action.parameters.get("claims", [])
@@ -206,3 +257,72 @@ class RuleEngine:
                 message="Pathway enrichment without DE may yield unreliable results",
             ))
         return vs
+
+    # ── tool / modality compatibility ────────────────────────────────────
+
+    _KNOWN_METHODS = {
+        "scanpy.pp.calculate_qc_metrics", "scanpy.pp.filter_cells",
+        "scanpy.pp.filter_genes", "scanpy.pp.normalize_total",
+        "scanpy.pp.log1p", "scanpy.pp.highly_variable_genes",
+        "scanpy.pp.neighbors", "scanpy.tl.leiden", "scanpy.tl.louvain",
+        "scanpy.tl.rank_genes_groups", "scanpy.tl.paga", "scanpy.tl.umap",
+        "gseapy.prerank", "gseapy.gsea", "10x_chromium", "NovaSeq",
+    }
+    _METHOD_TO_TOOL = {
+        "scanpy.pp.calculate_qc_metrics": "Scanpy",
+        "scanpy.pp.filter_cells": "Scanpy",
+        "scanpy.pp.filter_genes": "Scanpy",
+        "scanpy.pp.normalize_total": "Scanpy",
+        "scanpy.pp.log1p": "Scanpy",
+        "scanpy.pp.highly_variable_genes": "Scanpy",
+        "scanpy.pp.neighbors": "Scanpy",
+        "scanpy.tl.leiden": "Leiden",
+        "scanpy.tl.louvain": "Louvain",
+        "scanpy.tl.rank_genes_groups": "Scanpy",
+        "scanpy.tl.paga": "PAGA",
+        "scanpy.tl.umap": "UMAP",
+        "gseapy.prerank": "Scanpy",
+        "gseapy.gsea": "Scanpy",
+        "10x_chromium": "CellRanger",
+        "NovaSeq": "CellRanger",
+    }
+
+    def _check_tool_compatibility(
+        self, action: ExperimentAction, s: FullLatentState
+    ) -> List[RuleViolation]:
+        """Warn when the chosen tool is incompatible with the task modality."""
+        vs: List[RuleViolation] = []
+        method = action.method
+        if not method:
+            return vs
+
+        resolved = self._METHOD_TO_TOOL.get(method, method)
+        tool_spec = TOOL_REGISTRY.get(resolved)
+        if tool_spec is None and method not in self._KNOWN_METHODS:
+            vs.append(RuleViolation(
+                rule_id="unknown_tool",
+                severity=Severity.SOFT,
+                message=f"Tool '{method}' is not in the registry — results may be unreliable",
+            ))
+            return vs
+        if tool_spec is None:
+            return vs
+
+        # Check modality compatibility (modality lives on the task, which is
+        # stored in the latent state's associated TaskSpec — but the latent
+        # state doesn't carry the TaskSpec directly. We can still check via
+        # the action's own context or fall back gracefully).
+        task_modality = getattr(s, "task_modality", None)
+        if task_modality and tool_spec.modalities:
+            if task_modality not in tool_spec.modalities:
+                vs.append(RuleViolation(
+                    rule_id="tool_modality_mismatch",
+                    severity=Severity.SOFT,
+                    message=(
+                        f"Tool '{method}' is designed for "
+                        f"{', '.join(tool_spec.modalities)} but task modality "
+                        f"is '{task_modality}'"
+                    ),
+                ))
+
+        return vs
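
The prerequisite and redundancy gates in this diff reduce to dictionary lookups against boolean progress flags. A minimal standalone sketch of that pattern, using plain `Enum`/`dataclass` stand-ins (not the environment's real `ActionType`, `ExperimentProgress`, or `RuleViolation` models):

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    RUN_QC = "run_qc"
    NORMALIZE_DATA = "normalize_data"
    CLUSTER_CELLS = "cluster_cells"


@dataclass
class Progress:
    qc_performed: bool = False
    data_normalized: bool = False
    cells_clustered: bool = False


# Maps an action to the progress flag that marks it "already done".
REDUNDANT = {
    ActionType.RUN_QC: "qc_performed",
    ActionType.NORMALIZE_DATA: "data_normalized",
    ActionType.CLUSTER_CELLS: "cells_clustered",
}


def redundancy_violations(action_type: ActionType, progress: Progress) -> list:
    """Return hard-violation rule ids for steps already completed."""
    flag = REDUNDANT.get(action_type)
    if flag and getattr(progress, flag, False):
        return [f"redundant_{action_type.value}"]
    return []
```

The same shape serves the prerequisite check in reverse: there, the flag must already be `True` before the action is allowed.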
server/simulator/latent_state.py CHANGED
@@ -88,8 +88,11 @@ class ExperimentProgress(BaseModel):
     networks_inferred: bool = False
     markers_discovered: bool = False
     markers_validated: bool = False
+    followup_designed: bool = False
+    subagent_review_requested: bool = False
     conclusion_reached: bool = False
 
+    n_cells_sequenced: Optional[int] = None
     n_cells_after_filter: Optional[int] = None
     n_clusters_found: Optional[int] = None
     n_de_genes_found: Optional[int] = None
@@ -139,5 +142,12 @@ class FullLatentState(BaseModel):
     mechanism_confidence: Dict[str, float] = Field(default_factory=dict)
     discovered_de_genes: List[str] = Field(default_factory=list)
     discovered_clusters: List[str] = Field(default_factory=list)
+    task_modality: str = "scRNA-seq"
     step_count: int = 0
     rng_seed: int = 42
+
+    # Transient fields for passing sampled values from the transition engine
+    # to the output generator within a single step (not serialized).
+    last_retain_frac: Optional[float] = Field(None, exclude=True)
+    last_n_clusters: Optional[int] = Field(None, exclude=True)
+    last_perturbation_efficiency: Optional[float] = Field(None, exclude=True)
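
The transient `last_*` fields rely on pydantic's `Field(..., exclude=True)` to stay out of serialized state. A stdlib-only sketch of the same export behaviour, with a `dataclass` stand-in for the real `FullLatentState` and an explicit transient-field set (names mirror the diff but the helper is illustrative):

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional

# Fields sampled during one step and never exposed or persisted.
TRANSIENT = {"last_retain_frac", "last_n_clusters", "last_perturbation_efficiency"}


@dataclass
class LatentStateSketch:
    task_modality: str = "scRNA-seq"
    step_count: int = 0
    last_retain_frac: Optional[float] = None
    last_n_clusters: Optional[int] = None
    last_perturbation_efficiency: Optional[float] = None


def serialize(state: LatentStateSketch) -> Dict[str, Any]:
    """Export persistent fields only, mimicking Field(..., exclude=True)."""
    return {
        f.name: getattr(state, f.name)
        for f in fields(state)
        if f.name not in TRANSIENT
    }
```

This keeps per-step samples consistent between the transition engine and the output generator without leaking them into observations or checkpoints.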
server/simulator/noise.py CHANGED
@@ -30,7 +30,11 @@ class NoiseModel:
     ) -> Dict[str, float]:
         noisy: Dict[str, float] = {}
         for gene, value in true_values.items():
-            if self.rng.random() < dropout_rate:
+            # Dropout probability is inversely proportional to expression
+            # magnitude: lowly expressed genes drop out much more readily,
+            # matching the zero-inflation pattern in real scRNA-seq data.
+            p_drop = dropout_rate / (1.0 + abs(value))
+            if self.rng.random() < p_drop:
                 noisy[gene] = 0.0
             else:
                 sigma = noise_level * abs(value) + 0.1
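
The new dropout rule is a one-line function of expression magnitude. A self-contained sketch, substituting the stdlib `random.Random` for the simulator's numpy RNG (an assumption made for self-containment):

```python
import random
from typing import Dict


def dropout_probability(dropout_rate: float, value: float) -> float:
    # Higher expression -> lower dropout chance, mimicking zero inflation.
    return dropout_rate / (1.0 + abs(value))


def apply_dropout(true_values: Dict[str, float], dropout_rate: float,
                  rng: random.Random) -> Dict[str, float]:
    """Zero out each gene with an expression-dependent probability."""
    return {
        gene: 0.0 if rng.random() < dropout_probability(dropout_rate, v) else v
        for gene, v in true_values.items()
    }
```

With a base rate of 0.3, a silent gene (value 0.0) drops out with probability 0.3, while a strongly expressed gene (value 2.0) drops out with probability 0.1.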
server/simulator/output_generator.py CHANGED
@@ -14,6 +14,15 @@ from models import (
 from .latent_state import FullLatentState
 from .noise import NoiseModel
 
+# Pool of common transcription factors used to generate realistic false-positive
+# regulators, so the agent cannot trivially distinguish true vs. false hits by
+# gene-name format alone.
+_NOISE_TFS: List[str] = [
+    "NR3C1", "KLF4", "EGR1", "IRF1", "FOSL2", "JUN", "FOS", "ATF3",
+    "NFKB1", "RELA", "SP1", "MYC", "MAX", "E2F1", "CTCF", "YY1",
+    "TP53", "STAT5A", "SMAD3", "TCF7L2", "NFE2L2", "HIF1A", "CREB1",
+]
+
 
 class OutputGenerator:
     """Creates structured ``IntermediateOutput`` objects conditioned on the
@@ -91,7 +100,13 @@ class OutputGenerator:
         self, action: ExperimentAction, s: FullLatentState, idx: int
     ) -> IntermediateOutput:
         days = action.parameters.get("days", 7)
-        viability = self.noise.sample_qc_metric(0.92, 0.05, 0.5, 1.0)
+        # Viability decays with culture duration: each day adds ~0.5%
+        # cumulative stress, reflecting senescence, media depletion, and
+        # passaging artefacts common in primary cell cultures.
+        decay = 0.005 * days
+        viability = self.noise.sample_qc_metric(
+            max(0.50, 0.95 - decay), 0.05, 0.30, 1.0
+        )
         return IntermediateOutput(
             output_type=OutputType.CULTURE_RESULT,
             step_index=idx,
@@ -101,20 +116,54 @@
             artifacts_available=["cultured_cells"],
         )
 
-    def _perturb(
+    def _perturb_gene(
         self, action: ExperimentAction, s: FullLatentState, idx: int
     ) -> IntermediateOutput:
+        """Genetic perturbation (CRISPR/RNAi): high on-target efficiency,
+        binary effect, non-trivial off-target risk."""
         target = action.parameters.get("target", "unknown")
-        efficiency = self.noise.sample_qc_metric(0.75, 0.15, 0.0, 1.0)
+        efficiency = s.last_perturbation_efficiency if s.last_perturbation_efficiency is not None else self.noise.sample_qc_metric(0.80, 0.12, 0.0, 1.0)
+        off_target_risk = self.noise.sample_qc_metric(0.10, 0.05, 0.0, 0.5)
         return IntermediateOutput(
             output_type=OutputType.PERTURBATION_RESULT,
             step_index=idx,
             quality_score=efficiency,
-            summary=f"Perturbation of {target} (efficiency={efficiency:.2f})",
+            summary=(
+                f"Genetic perturbation of {target} "
+                f"(efficiency={efficiency:.2f}, off-target risk={off_target_risk:.2f})"
+            ),
+            data={
+                "target": target,
+                "efficiency": efficiency,
+                "type": action.action_type.value,
+                "off_target_risk": off_target_risk,
+            },
+            artifacts_available=["perturbed_cells"],
+        )
+
+    def _perturb_compound(
+        self, action: ExperimentAction, s: FullLatentState, idx: int
+    ) -> IntermediateOutput:
+        """Small-molecule perturbation: dose-dependent, partial on-target
+        activity, systemic effects possible."""
+        target = action.parameters.get("target", "unknown")
+        dose_um = action.parameters.get("dose_uM", 1.0)
+        efficiency = s.last_perturbation_efficiency if s.last_perturbation_efficiency is not None else self.noise.sample_qc_metric(0.70, 0.15, 0.0, 1.0)
+        on_target_frac = self.noise.sample_qc_metric(0.75, 0.10, 0.0, 1.0)
+        return IntermediateOutput(
+            output_type=OutputType.PERTURBATION_RESULT,
+            step_index=idx,
+            quality_score=efficiency * on_target_frac,
+            summary=(
+                f"Compound perturbation targeting {target} at {dose_um} µM "
+                f"(efficiency={efficiency:.2f}, on-target={on_target_frac:.2f})"
+            ),
             data={
                 "target": target,
                 "efficiency": efficiency,
                 "type": action.action_type.value,
+                "dose_uM": dose_um,
+                "on_target_fraction": on_target_frac,
             },
             artifacts_available=["perturbed_cells"],
         )
@@ -122,11 +171,18 @@
     def _sequence_cells(
         self, action: ExperimentAction, s: FullLatentState, idx: int
     ) -> IntermediateOutput:
+        import math
         depth = s.technical.sequencing_depth_factor
-        n_cells = self.noise.sample_count(
+        n_cells = s.progress.n_cells_sequenced or self.noise.sample_count(
             s.biology.n_true_cells * s.technical.capture_efficiency
         )
-        n_genes = self.noise.sample_count(18_000)
+        # Gene detection saturates with sequencing depth: follows a
+        # 1 - exp(-k) saturation curve, scaled by library complexity.
+        max_genes = 20_000
+        saturation_arg = depth * s.technical.library_complexity * 0.8
+        n_genes = self.noise.sample_count(
+            int(max_genes * (1.0 - math.exp(-saturation_arg)))
+        )
         median_umi = self.noise.sample_count(int(3000 * depth))
         quality = self.noise.quality_degradation(
             s.technical.sample_quality,
@@ -157,7 +213,18 @@
         doublet_frac = self.noise.sample_qc_metric(
             s.technical.doublet_rate, 0.01, 0.0, 0.2
         )
-        mito_frac = self.noise.sample_qc_metric(0.05, 0.02, 0.0, 0.3)
+        # Mitochondrial fraction reflects cellular stress: activated,
+        # inflammatory, or pro-fibrotic populations have elevated mito
+        # transcription compared to quiescent/resting cells.
+        _stressed_states = {"activated", "stressed", "pro-fibrotic", "inflammatory"}
+        has_stressed_cells = any(
+            p.state in _stressed_states for p in s.biology.cell_populations
+        )
+        # Means are kept close (0.09 vs 0.06) with a wider SD (0.03) so the
+        # mito fraction is informative but not a near-perfect oracle for
+        # stressed-cell presence.
+        mito_mean = 0.09 if has_stressed_cells else 0.06
+        mito_frac = self.noise.sample_qc_metric(mito_mean, 0.03, 0.0, 0.3)
         ambient_frac = self.noise.sample_qc_metric(
             s.technical.ambient_rna_fraction, 0.01, 0.0, 0.2
         )
@@ -186,9 +253,9 @@
     def _filter_data(
         self, action: ExperimentAction, s: FullLatentState, idx: int
     ) -> IntermediateOutput:
-        retain_frac = self.noise.sample_qc_metric(0.85, 0.05, 0.5, 1.0)
-        n_before = s.biology.n_true_cells
-        n_after = max(100, int(n_before * retain_frac))
+        retain_frac = s.last_retain_frac if s.last_retain_frac is not None else self.noise.sample_qc_metric(0.85, 0.05, 0.5, 1.0)
+        n_before = s.progress.n_cells_sequenced or s.biology.n_true_cells
+        n_after = s.progress.n_cells_after_filter or max(100, int(n_before * retain_frac))
         return IntermediateOutput(
             output_type=OutputType.COUNT_MATRIX_SUMMARY,
             step_index=idx,
@@ -238,14 +305,15 @@
     ) -> IntermediateOutput:
         n_true = len(s.biology.cell_populations) or 5
         quality = self.noise.quality_degradation(0.8, [0.95])
-        n_clusters = self.noise.sample_cluster_count(n_true, quality)
+        n_clusters = s.last_n_clusters if s.last_n_clusters is not None else self.noise.sample_cluster_count(n_true, quality)
         cluster_names = [f"cluster_{i}" for i in range(n_clusters)]
-        sizes = self._random_partition(s.biology.n_true_cells, n_clusters)
+        n_cells = s.progress.n_cells_after_filter or s.biology.n_true_cells
+        sizes = self._partition_by_population(n_cells, n_clusters, s.biology.cell_populations)
         return IntermediateOutput(
             output_type=OutputType.CLUSTER_RESULT,
             step_index=idx,
             quality_score=quality,
-            summary=f"Found {n_clusters} clusters (ground-truth populations: {n_true})",
+            summary=f"Found {n_clusters} clusters",
             data={
                 "n_clusters": n_clusters,
                 "cluster_names": cluster_names,
@@ -260,10 +328,22 @@
         self, action: ExperimentAction, s: FullLatentState, idx: int
     ) -> IntermediateOutput:
         comparison = action.parameters.get("comparison", "disease_vs_healthy")
+        # Fall back to the first available comparison key if the requested one
+        # is absent, rather than silently returning an empty effect dict.
+        if comparison not in s.biology.true_de_genes and s.biology.true_de_genes:
+            comparison = next(iter(s.biology.true_de_genes))
         true_effects = s.biology.true_de_genes.get(comparison, {})
 
         n_cells = s.progress.n_cells_after_filter or s.biology.n_true_cells
-        noise_level = s.technical.dropout_rate + 0.1 * (1.0 - s.technical.sample_quality)
+        batch_noise = (
+            sum(s.technical.batch_effects.values())
+            / max(len(s.technical.batch_effects), 1)
+        )
+        noise_level = (
+            s.technical.dropout_rate
+            + 0.1 * (1.0 - s.technical.sample_quality)
+            + 0.5 * batch_noise
+        )
         observed = self.noise.sample_effect_sizes(true_effects, n_cells, noise_level)
 
         fp_genes = self.noise.generate_false_positives(5000, 0.002 + noise_level * 0.01)
@@ -299,10 +379,16 @@
         quality = self.noise.quality_degradation(0.7 if has_trajectory else 0.3, [0.9])
         summary_data: Dict[str, Any] = {"method": action.method or "monocle3"}
         if has_trajectory:
+            true_n_lineages = s.biology.true_trajectory.get("n_lineages", 1)
+            true_branching = s.biology.true_trajectory.get("branching", False)
+            # Perturb lineage count by ±1 and flip the branching flag with 20%
+            # probability so the output is informative but not an exact oracle.
+            noisy_n_lineages = max(1, true_n_lineages + int(self.noise.rng.choice([-1, 0, 0, 1])))
+            noisy_branching = true_branching if not self.noise.coin_flip(0.20) else not true_branching
             summary_data.update({
-                "n_lineages": s.biology.true_trajectory.get("n_lineages", 1),
+                "n_lineages": noisy_n_lineages,
                 "pseudotime_range": [0.0, 1.0],
-                "branching_detected": s.biology.true_trajectory.get("branching", False),
+                "branching_detected": noisy_branching,
             })
         else:
             summary_data["n_lineages"] = self.noise.sample_count(1) + 1
@@ -323,24 +409,38 @@
         self, action: ExperimentAction, s: FullLatentState, idx: int
     ) -> IntermediateOutput:
         true_pathways = s.biology.true_pathways
-        noise_level = 0.15
+        # Pathway enrichment quality is tightly coupled to the quality of the
+        # preceding DE step: more DE genes found → better gene-set coverage →
+        # lower noise and fewer spurious pathway hits.
+        de_genes_found = s.progress.n_de_genes_found or 0
+        de_was_run = s.progress.de_performed
+        if de_was_run and de_genes_found > 0:
+            # Noise shrinks as the DE gene list grows (more signal in input).
+            noise_level = max(0.05, 0.25 - 0.001 * min(de_genes_found, 200))
+            n_fp_mean = max(1, int(5 - de_genes_found / 50))
+        else:
+            # Without a DE step, enrichment is unreliable.
+            noise_level = 0.40
+            n_fp_mean = 8
+
         observed: Dict[str, float] = {}
         for pw, activity in true_pathways.items():
             observed[pw] = activity + float(self.noise.rng.normal(0, noise_level))
 
-        for i in range(self.noise.sample_count(2)):
+        for i in range(self.noise.sample_count(n_fp_mean)):
            observed[f"FP_PATHWAY_{i}"] = float(self.noise.rng.uniform(0.3, 0.6))
 
         top = sorted(observed.items(), key=lambda kv: kv[1], reverse=True)[:15]
+        base_quality = 0.80 if de_was_run else 0.45
         return IntermediateOutput(
            output_type=OutputType.PATHWAY_RESULT,
            step_index=idx,
-            quality_score=self.noise.quality_degradation(0.8, [0.95]),
+            quality_score=self.noise.quality_degradation(base_quality, [0.95]),
            summary=f"Pathway enrichment: {len(top)} significant pathways",
            data={
                "method": action.method or "GSEA",
                "top_pathways": [
-                    {"pathway": p, "score": round(s, 3)} for p, s in top
+                    {"pathway": p, "score": round(sc, 3)} for p, sc in top
                ],
            },
            uncertainty=noise_level,
@@ -353,6 +453,25 @@
         true_net = s.biology.true_regulatory_network
         n_edges_true = sum(len(v) for v in true_net.values())
         noise_edges = self.noise.sample_count(max(5, int(n_edges_true * 0.3)))
+
+        true_tfs = list(true_net.keys())
+        # Drop ~25% of true regulators (false-negative rate).
+        fn_set = set(self.noise.generate_false_negatives(true_tfs, 0.25))
+        observed_tfs = [tf for tf in true_tfs if tf not in fn_set]
+        # Inject realistic false-positive TFs drawn from a background pool so
+        # the agent cannot distinguish true from false hits by name format.
+        fp_candidates = [t for t in _NOISE_TFS if t not in set(true_tfs)]
+        n_fp = self.noise.sample_count(max(2, int(len(true_tfs) * 0.5) + 2))
+        if fp_candidates and n_fp > 0:
+            chosen = self.noise.rng.choice(
+                fp_candidates,
+                size=min(n_fp, len(fp_candidates)),
+                replace=False,
+            )
+            observed_tfs.extend(chosen.tolist())
+        # Shuffle so rank order does not reveal true-vs-false identity.
+        observed_tfs = self.noise.shuffle_ranking(observed_tfs, 0.5)
+
         return IntermediateOutput(
             output_type=OutputType.NETWORK_RESULT,
             step_index=idx,
@@ -362,7 +481,7 @@
                 "method": action.method or "SCENIC",
                 "n_regulons": len(true_net) + self.noise.sample_count(3),
                 "n_edges": n_edges_true + noise_edges,
-                "top_regulators": list(true_net.keys())[:10],
+                "top_regulators": observed_tfs[:10],
             },
             uncertainty=0.35,
             artifacts_available=["regulon_table", "grn_adjacency"],
@@ -407,8 +526,11 @@
                 "marker": marker,
                 "validated": validated,
                 "assay": action.method or "qPCR",
+                # Means are kept close (0.85 vs 0.45) with a wide SD (0.4)
+                # so the effect size is correlated with, but not a near-perfect
+                # oracle for, true marker membership.
                 "effect_size": self.noise.sample_qc_metric(
-                    1.5 if is_true else 0.2, 0.3, -0.5, 5.0
+                    0.85 if is_true else 0.45, 0.4, -0.5, 5.0
                 ),
             },
             artifacts_available=["validation_data"],
@@ -417,22 +539,54 @@
     def _design_followup(
         self, action: ExperimentAction, s: FullLatentState, idx: int
     ) -> IntermediateOutput:
+        evidence_signals = sum([
+            int(s.progress.cells_clustered),
+            int(s.progress.de_performed),
+            int(s.progress.trajectories_inferred),
+            int(s.progress.pathways_analyzed),
+            int(s.progress.networks_inferred),
+            int(s.progress.markers_discovered),
+            int(s.progress.markers_validated),
+        ])
         return IntermediateOutput(
             output_type=OutputType.FOLLOWUP_DESIGN,
             step_index=idx,
-            summary="Follow-up experiment design proposed",
-            data={"proposal": action.parameters},
+            quality_score=min(0.75, 0.2 + 0.08 * evidence_signals),
+            summary=(
+                f"Follow-up experiment design proposed "
+                f"(evidence_signals={evidence_signals})"
+            ),
+            data={
+                "proposal": action.parameters,
+                "evidence_signals": evidence_signals,
+            },
+            uncertainty=max(0.25, 0.8 - 0.08 * evidence_signals),
             artifacts_available=["followup_proposal"],
         )
 
     def _subagent_review(
         self, action: ExperimentAction, s: FullLatentState, idx: int
     ) -> IntermediateOutput:
+        evidence_signals = sum([
+            int(s.progress.cells_clustered),
+            int(s.progress.de_performed),
+            int(s.progress.trajectories_inferred),
+            int(s.progress.pathways_analyzed),
+            int(s.progress.networks_inferred),
+            int(s.progress.markers_discovered),
+            int(s.progress.markers_validated),
+        ])
         return IntermediateOutput(
             output_type=OutputType.SUBAGENT_REPORT,
             step_index=idx,
+            quality_score=min(0.7, 0.15 + 0.07 * evidence_signals),
             summary=f"Subagent review ({action.invoked_subagent or 'general'})",
-            data={"subagent": action.invoked_subagent, "notes": "Review complete."},
+            data={
+                "subagent": action.invoked_subagent,
+                "notes": "Review complete.",
+                "evidence_signals": evidence_signals,
+            },
+            uncertainty=max(0.3, 0.85 - 0.08 * evidence_signals),
             artifacts_available=["subagent_report"],
         )
 
@@ -469,14 +623,46 @@
             sizes[0] += diff
         return sizes
 
+    def _partition_by_population(
+        self,
+        total: int,
+        k: int,
+        populations: list,
+    ) -> List[int]:
+        """Partition cells into k clusters using true population proportions
+        as Dirichlet concentration parameters, so majority cell types produce
+        larger clusters rather than uniformly random sizes."""
+        if k <= 0:
+            return []
+        if populations:
+            # Use true proportions as Dirichlet alpha — larger proportions
+            # concentrate probability mass, yielding realistic size ratios.
+            raw = [max(p.proportion, 1e-3) for p in populations]
+            # Align alpha length to k: repeat/truncate as needed.
+            if len(raw) >= k:
+                alpha = raw[:k]
+            else:
+                alpha = raw + [sum(raw) / len(raw)] * (k - len(raw))
+            # Scale alpha so the total magnitude is proportional to k,
+            # giving reasonable Dirichlet variance.
 
 _HANDLERS = {
     ActionType.COLLECT_SAMPLE: OutputGenerator._collect_sample,
     ActionType.SELECT_COHORT: OutputGenerator._select_cohort,
     ActionType.PREPARE_LIBRARY: OutputGenerator._prepare_library,
     ActionType.CULTURE_CELLS: OutputGenerator._culture_cells,
-    ActionType.PERTURB_GENE: OutputGenerator._perturb,
-    ActionType.PERTURB_COMPOUND: OutputGenerator._perturb,
     ActionType.SEQUENCE_CELLS: OutputGenerator._sequence_cells,
     ActionType.RUN_QC: OutputGenerator._run_qc,
     ActionType.FILTER_DATA: OutputGenerator._filter_data,
648
+ scale = k / max(sum(alpha), 1e-6)
649
+ alpha = [a * scale for a in alpha]
650
+ else:
651
+ alpha = [1.0] * k
652
+ fracs = self.noise.rng.dirichlet(alpha=alpha)
653
+ sizes = [max(1, int(total * f)) for f in fracs]
654
+ diff = total - sum(sizes)
655
+ sizes[0] += diff
656
+ return sizes
657
+
658
 
659
  _HANDLERS = {
660
  ActionType.COLLECT_SAMPLE: OutputGenerator._collect_sample,
661
  ActionType.SELECT_COHORT: OutputGenerator._select_cohort,
662
  ActionType.PREPARE_LIBRARY: OutputGenerator._prepare_library,
663
  ActionType.CULTURE_CELLS: OutputGenerator._culture_cells,
664
+ ActionType.PERTURB_GENE: OutputGenerator._perturb_gene,
665
+ ActionType.PERTURB_COMPOUND: OutputGenerator._perturb_compound,
666
  ActionType.SEQUENCE_CELLS: OutputGenerator._sequence_cells,
667
  ActionType.RUN_QC: OutputGenerator._run_qc,
668
  ActionType.FILTER_DATA: OutputGenerator._filter_data,
server/simulator/transition.py CHANGED
@@ -15,6 +15,7 @@ from models import (
     ExperimentAction,
     IntermediateOutput,
     OutputType,
+    TOOL_REGISTRY,
 )
 
 from .latent_state import FullLatentState
@@ -22,7 +23,8 @@ from .noise import NoiseModel
 from .output_generator import OutputGenerator
 
 
-ACTION_COSTS: Dict[ActionType, Tuple[float, float]] = {
+# Fallback costs per ActionType when the agent doesn't specify a known tool.
+_BASE_ACTION_COSTS: Dict[ActionType, Tuple[float, float]] = {
     ActionType.COLLECT_SAMPLE: (5_000, 7.0),
     ActionType.SELECT_COHORT: (  500, 1.0),
     ActionType.PREPARE_LIBRARY: (8_000, 3.0),
@@ -41,11 +43,30 @@ ACTION_COSTS: Dict[ActionType, Tuple[float, float]] = {
     ActionType.REGULATORY_NETWORK_INFERENCE: (  300, 1.0),
     ActionType.MARKER_SELECTION: (  100, 0.5),
     ActionType.VALIDATE_MARKER: (5_000, 14.0),
-    ActionType.DESIGN_FOLLOWUP: (    0, 0.5),
-    ActionType.REQUEST_SUBAGENT_REVIEW: (    0, 0.25),
+    ActionType.DESIGN_FOLLOWUP: (  100, 0.5),
+    ActionType.REQUEST_SUBAGENT_REVIEW: (   50, 0.25),
     ActionType.SYNTHESIZE_CONCLUSION: (    0, 0.5),
 }
 
+# Kept as public alias so existing imports (e.g. hackathon_environment) still work.
+ACTION_COSTS = _BASE_ACTION_COSTS
+
+
+def compute_action_cost(action: ExperimentAction) -> Tuple[float, float]:
+    """Return (budget_cost, time_cost_days) for an action.
+
+    If the action specifies a ``method`` that exists in ``TOOL_REGISTRY``,
+    the tool's ``typical_cost_usd`` and ``typical_runtime_hours`` are used
+    (converted to days). Otherwise we fall back to the per-ActionType base
+    cost table.
+    """
+    tool_spec = TOOL_REGISTRY.get(action.method or "")
+    if tool_spec is not None:
+        budget = tool_spec.typical_cost_usd
+        time_days = tool_spec.typical_runtime_hours / 24.0
+        return (budget, time_days)
+    return _BASE_ACTION_COSTS.get(action.action_type, (0.0, 0.0))
+
 
 @dataclass
 class TransitionResult:
@@ -138,9 +159,7 @@
     def _apply_resource_cost(
         self, s: FullLatentState, action: ExperimentAction
     ) -> None:
-        budget_cost, time_cost = ACTION_COSTS.get(
-            action.action_type, (0.0, 0.0)
-        )
+        budget_cost, time_cost = compute_action_cost(action)
         s.resources.budget_used += budget_cost
         s.resources.time_used_days += time_cost
         if action.action_type in {
@@ -176,6 +195,8 @@
             ActionType.REGULATORY_NETWORK_INFERENCE: "networks_inferred",
             ActionType.MARKER_SELECTION: "markers_discovered",
             ActionType.VALIDATE_MARKER: "markers_validated",
+            ActionType.DESIGN_FOLLOWUP: "followup_designed",
+            ActionType.REQUEST_SUBAGENT_REVIEW: "subagent_review_requested",
             ActionType.SYNTHESIZE_CONCLUSION: "conclusion_reached",
         }
         flag = _MAP.get(at)
@@ -188,16 +209,43 @@
 
         if at == ActionType.SEQUENCE_CELLS:
             s.resources.sequencing_lanes_used += 1
+            p.n_cells_sequenced = self.noise.sample_count(
+                s.biology.n_true_cells * s.technical.capture_efficiency
+            )
+
+        if at in {ActionType.PERTURB_GENE, ActionType.PERTURB_COMPOUND}:
+            self._apply_perturbation_effects(s, action)
 
         if at == ActionType.FILTER_DATA:
             retain = self.noise.sample_qc_metric(0.85, 0.05, 0.5, 1.0)
-            p.n_cells_after_filter = max(
-                100, int(s.biology.n_true_cells * retain)
-            )
+            base = p.n_cells_sequenced or s.biology.n_true_cells
+            p.n_cells_after_filter = max(100, int(base * retain))
+            s.last_retain_frac = retain
 
         if at == ActionType.CLUSTER_CELLS:
             n_true = len(s.biology.cell_populations) or 5
             p.n_clusters_found = self.noise.sample_cluster_count(n_true, 0.8)
+            s.last_n_clusters = p.n_clusters_found
+
+    def _apply_perturbation_effects(
+        self, s: FullLatentState, action: ExperimentAction
+    ) -> None:
+        """Fold perturbation-specific gene effects into true_de_genes so
+        downstream DE analysis reflects the perturbed biology."""
+        target = action.parameters.get("target", "")
+        effects = s.biology.perturbation_effects.get(target, {})
+        if not effects:
+            return
+        # Efficiency drawn from the same distribution as the output handler
+        # so latent state and observable output are coherent.
+        if action.action_type == ActionType.PERTURB_GENE:
+            efficiency = self.noise.sample_qc_metric(0.80, 0.12, 0.0, 1.0)
+        else:
+            efficiency = self.noise.sample_qc_metric(0.70, 0.15, 0.0, 1.0)
+        s.last_perturbation_efficiency = efficiency
+        for gene_map in s.biology.true_de_genes.values():
+            for gene, delta in effects.items():
+                gene_map[gene] = gene_map.get(gene, 0.0) + delta * efficiency
 
     def _propagate_artifacts(
         self,
@@ -208,6 +256,7 @@
         if action.action_type == ActionType.DIFFERENTIAL_EXPRESSION:
             top = output.data.get("top_genes", [])
             s.discovered_de_genes = [g["gene"] for g in top[:20]]
+            s.progress.n_de_genes_found = output.data.get("n_significant", 0)
 
         if action.action_type == ActionType.CLUSTER_CELLS:
             s.discovered_clusters = output.data.get("cluster_names", [])
server/tasks/bio_palette.py ADDED
@@ -0,0 +1,692 @@
+"""Curated biological building blocks for procedural scenario generation.
+
+Provides tissue-specific cell types, disease profiles, pathway libraries,
+regulatory network templates, and perturbation effect profiles. The
+procedural generator composes these into complete ``Scenario`` objects
+with fully populated ``LatentBiologicalState``.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Optional, Tuple
+
+
+# ── Cell type templates ─────────────────────────────────────────────────────
+
+
+@dataclass
+class CellTypeTemplate:
+    name: str
+    marker_genes: List[str]
+    proportion_range: Tuple[float, float] = (0.05, 0.30)
+    states: List[str] = field(default_factory=lambda: ["quiescent"])
+    disease_responsive: bool = False
+    response_range: Tuple[float, float] = (0.5, 1.5)
+
+
+TISSUE_CELL_TYPES: Dict[str, List[CellTypeTemplate]] = {
+    "heart": [
+        CellTypeTemplate("cardiomyocyte", ["TNNT2", "MYH7", "ACTC1"], (0.25, 0.40), ["contractile", "stressed"]),
+        CellTypeTemplate("cardiac_fibroblast", ["COL1A1", "DCN", "LUM"], (0.15, 0.30), ["quiescent", "activated"], True, (1.1, 1.8)),
+        CellTypeTemplate("endothelial", ["PECAM1", "VWF", "CDH5"], (0.10, 0.20), ["quiescent"]),
+        CellTypeTemplate("macrophage", ["CD68", "CD163", "CSF1R"], (0.05, 0.15), ["quiescent", "activated", "inflammatory"], True, (1.2, 2.0)),
+        CellTypeTemplate("smooth_muscle", ["ACTA2", "MYH11", "TAGLN"], (0.08, 0.18), ["quiescent"]),
+        CellTypeTemplate("pericyte", ["PDGFRB", "RGS5", "NOTCH3"], (0.03, 0.10), ["quiescent"]),
+    ],
+    "lung": [
+        CellTypeTemplate("AT2", ["SFTPC", "SFTPB", "ABCA3"], (0.15, 0.25), ["normal", "stressed"]),
+        CellTypeTemplate("AT1", ["AGER", "PDPN", "CAV1"], (0.10, 0.18), ["normal"]),
+        CellTypeTemplate("alveolar_macrophage", ["MARCO", "FABP4", "MCEMP1"], (0.10, 0.20), ["resident", "activated"]),
+        CellTypeTemplate("fibroblast", ["COL1A1", "COL3A1", "POSTN"], (0.12, 0.25), ["quiescent", "activated"], True, (1.2, 2.0)),
+        CellTypeTemplate("endothelial", ["PECAM1", "CLDN5", "VWF"], (0.08, 0.15), ["quiescent"]),
+        CellTypeTemplate("T_cell", ["CD3D", "CD3E", "IL7R"], (0.08, 0.18), ["quiescent", "activated"]),
+        CellTypeTemplate("ciliated", ["FOXJ1", "DNAH5", "TPPP3"], (0.05, 0.12), ["normal"]),
+    ],
+    "brain": [
+        CellTypeTemplate("excitatory_neuron", ["SLC17A7", "CAMK2A", "NRGN"], (0.25, 0.40), ["normal", "stressed"]),
+        CellTypeTemplate("inhibitory_neuron", ["GAD1", "GAD2", "SLC32A1"], (0.12, 0.22), ["normal"]),
+        CellTypeTemplate("astrocyte", ["GFAP", "AQP4", "SLC1A3"], (0.10, 0.20), ["quiescent", "activated"], True, (1.2, 1.8)),
+        CellTypeTemplate("microglia", ["CX3CR1", "P2RY12", "TMEM119"], (0.05, 0.12), ["homeostatic", "activated", "inflammatory"], True, (1.3, 2.5)),
+        CellTypeTemplate("oligodendrocyte", ["MBP", "PLP1", "MOG"], (0.10, 0.18), ["myelinating"]),
+        CellTypeTemplate("OPC", ["PDGFRA", "CSPG4", "OLIG2"], (0.03, 0.08), ["progenitor"]),
+        CellTypeTemplate("endothelial", ["CLDN5", "FLT1", "PECAM1"], (0.03, 0.08), ["quiescent"]),
+    ],
+    "liver": [
+        CellTypeTemplate("hepatocyte", ["ALB", "APOB", "CYP3A4"], (0.55, 0.70), ["normal", "stressed"]),
+        CellTypeTemplate("cholangiocyte", ["KRT19", "KRT7", "EPCAM"], (0.05, 0.10), ["normal"]),
+        CellTypeTemplate("kupffer_cell", ["CD68", "CLEC4F", "MARCO"], (0.08, 0.15), ["quiescent", "activated", "inflammatory"], True, (1.2, 2.0)),
+        CellTypeTemplate("stellate_cell", ["ACTA2", "LRAT", "PDGFRB"], (0.05, 0.12), ["quiescent", "activated"], True, (1.3, 2.0)),
+        CellTypeTemplate("endothelial", ["PECAM1", "LYVE1", "STAB2"], (0.05, 0.10), ["quiescent"]),
+        CellTypeTemplate("NK_cell", ["NKG7", "GNLY", "KLRD1"], (0.03, 0.08), ["quiescent", "activated"]),
+    ],
+    "bone_marrow": [
+        CellTypeTemplate("HSC", ["CD34", "KIT", "THY1"], (0.03, 0.08), ["stem"]),
+        CellTypeTemplate("CMP", ["CD34", "FLT3"], (0.08, 0.15), ["progenitor"]),
+        CellTypeTemplate("GMP", ["CSF3R", "CEBPA"], (0.08, 0.15), ["progenitor"]),
+        CellTypeTemplate("MEP", ["GATA1", "KLF1"], (0.06, 0.12), ["progenitor"]),
+        CellTypeTemplate("erythrocyte", ["HBA1", "HBB", "GYPA"], (0.15, 0.25), ["mature"]),
+        CellTypeTemplate("neutrophil", ["ELANE", "MPO", "CTSG"], (0.12, 0.22), ["mature"]),
+        CellTypeTemplate("monocyte", ["CD14", "CSF1R", "FCGR3A"], (0.10, 0.18), ["mature"]),
+        CellTypeTemplate("megakaryocyte", ["ITGA2B", "GP1BA", "PF4"], (0.05, 0.12), ["mature"]),
+    ],
+    "kidney": [
+        CellTypeTemplate("proximal_tubule", ["SLC34A1", "LRP2", "CUBN"], (0.30, 0.45), ["normal", "stressed"]),
+        CellTypeTemplate("distal_tubule", ["SLC12A3", "CALB1"], (0.10, 0.18), ["normal"]),
+        CellTypeTemplate("collecting_duct", ["AQP2", "FXYD4"], (0.08, 0.15), ["normal"]),
+        CellTypeTemplate("podocyte", ["NPHS1", "NPHS2", "WT1"], (0.05, 0.10), ["normal", "stressed"]),
+        CellTypeTemplate("endothelial", ["PECAM1", "EMCN", "FLT1"], (0.05, 0.12), ["quiescent"]),
+        CellTypeTemplate("macrophage", ["CD68", "CD163", "CSF1R"], (0.05, 0.10), ["quiescent", "inflammatory"], True, (1.3, 2.0)),
+        CellTypeTemplate("fibroblast", ["COL1A1", "PDGFRA", "DCN"], (0.05, 0.12), ["quiescent", "activated"], True, (1.2, 1.8)),
+    ],
+    "colon": [
+        CellTypeTemplate("colonocyte", ["CA2", "AQP8", "SLC26A3"], (0.25, 0.40), ["normal", "stressed"]),
+        CellTypeTemplate("goblet_cell", ["MUC2", "TFF3", "FCGBP"], (0.10, 0.18), ["secretory"]),
+        CellTypeTemplate("stem_cell", ["LGR5", "ASCL2", "OLFM4"], (0.05, 0.10), ["stem"]),
+        CellTypeTemplate("T_cell", ["CD3D", "CD3E", "IL7R"], (0.10, 0.18), ["quiescent", "activated"]),
+        CellTypeTemplate("macrophage", ["CD68", "CD163", "CSF1R"], (0.05, 0.12), ["quiescent", "inflammatory"], True, (1.3, 2.0)),
+        CellTypeTemplate("fibroblast", ["COL1A1", "COL3A1", "VIM"], (0.08, 0.15), ["quiescent", "activated"], True, (1.2, 1.8)),
+        CellTypeTemplate("endothelial", ["PECAM1", "VWF", "CDH5"], (0.05, 0.10), ["quiescent"]),
+    ],
+    "pancreas": [
+        CellTypeTemplate("beta_cell", ["INS", "MAFA", "NKX6-1"], (0.25, 0.40), ["normal", "stressed"], True, (0.4, 0.8)),
+        CellTypeTemplate("alpha_cell", ["GCG", "ARX", "IRX2"], (0.15, 0.25), ["normal"]),
+        CellTypeTemplate("delta_cell", ["SST", "HHEX"], (0.05, 0.10), ["normal"]),
+        CellTypeTemplate("ductal", ["KRT19", "SOX9", "CFTR"], (0.10, 0.18), ["normal"]),
+        CellTypeTemplate("acinar", ["PRSS1", "CPA1", "CELA3A"], (0.10, 0.20), ["normal"]),
+        CellTypeTemplate("stellate", ["ACTA2", "PDGFRA", "COL1A1"], (0.05, 0.10), ["quiescent", "activated"], True, (1.2, 1.8)),
+        CellTypeTemplate("macrophage", ["CD68", "CD163"], (0.03, 0.08), ["quiescent", "inflammatory"]),
+    ],
+    "skin": [
+        CellTypeTemplate("keratinocyte", ["KRT14", "KRT5", "KRT1"], (0.40, 0.55), ["basal", "differentiated"]),
+        CellTypeTemplate("melanocyte", ["MLANA", "PMEL", "TYR"], (0.05, 0.10), ["normal", "activated"]),
+        CellTypeTemplate("fibroblast", ["COL1A1", "COL3A1", "DCN"], (0.10, 0.20), ["quiescent", "activated"]),
+        CellTypeTemplate("T_cell", ["CD3D", "CD3E", "IL7R"], (0.08, 0.15), ["quiescent", "activated"]),
+        CellTypeTemplate("macrophage", ["CD68", "CD163", "CSF1R"], (0.05, 0.10), ["quiescent", "inflammatory"]),
+        CellTypeTemplate("endothelial", ["PECAM1", "VWF"], (0.05, 0.10), ["quiescent"]),
+    ],
+    "breast": [
+        CellTypeTemplate("luminal_epithelial", ["KRT8", "KRT18", "EPCAM"], (0.25, 0.40), ["normal", "stressed"]),
+        CellTypeTemplate("basal_epithelial", ["KRT14", "KRT5", "TP63"], (0.10, 0.20), ["normal"]),
+        CellTypeTemplate("fibroblast", ["COL1A1", "COL3A1", "FAP"], (0.10, 0.20), ["quiescent", "activated"], True, (1.2, 1.8)),
+        CellTypeTemplate("T_cell", ["CD3D", "CD3E", "CD8A"], (0.08, 0.15), ["quiescent", "activated", "exhausted"]),
+        CellTypeTemplate("macrophage", ["CD68", "CD163", "CSF1R"], (0.05, 0.12), ["quiescent", "inflammatory"], True, (1.3, 2.0)),
+        CellTypeTemplate("endothelial", ["PECAM1", "VWF", "CDH5"], (0.05, 0.10), ["quiescent"]),
+    ],
+    "synovium": [
+        CellTypeTemplate("fibroblast", ["COL1A1", "FAP", "THY1"], (0.20, 0.30), ["quiescent", "activated"], True, (1.2, 1.8)),
+        CellTypeTemplate("CD4_T_cell", ["CD3D", "CD4", "IL7R"], (0.12, 0.22), ["quiescent", "activated"]),
+        CellTypeTemplate("CD8_T_cell", ["CD3D", "CD8A", "GZMB"], (0.08, 0.15), ["quiescent", "activated"]),
+        CellTypeTemplate("macrophage", ["CD68", "CD163", "MARCO"], (0.10, 0.18), ["quiescent", "inflammatory"], True, (1.3, 2.0)),
+        CellTypeTemplate("B_cell", ["CD19", "MS4A1", "CD79A"], (0.05, 0.12), ["quiescent"]),
+        CellTypeTemplate("endothelial", ["PECAM1", "VWF"], (0.05, 0.10), ["quiescent"]),
+        CellTypeTemplate("mast_cell", ["KIT", "TPSAB1", "CPA3"], (0.03, 0.08), ["quiescent"]),
+    ],
+    "aorta": [
+        CellTypeTemplate("smooth_muscle", ["ACTA2", "MYH11", "TAGLN"], (0.30, 0.45), ["contractile", "synthetic"], True, (0.6, 0.9)),
+        CellTypeTemplate("endothelial", ["PECAM1", "VWF", "CDH5"], (0.15, 0.25), ["quiescent", "activated"]),
+        CellTypeTemplate("macrophage", ["CD68", "CD163", "TREM2"], (0.08, 0.15), ["quiescent", "inflammatory"], True, (1.5, 2.5)),
+        CellTypeTemplate("fibroblast", ["COL1A1", "LUM", "DCN"], (0.08, 0.15), ["quiescent", "activated"]),
+        CellTypeTemplate("T_cell", ["CD3D", "CD3E", "IL7R"], (0.05, 0.12), ["quiescent", "activated"]),
+        CellTypeTemplate("dendritic_cell", ["FCER1A", "CD1C", "CLEC10A"], (0.03, 0.08), ["quiescent"]),
+    ],
+    "blood": [
+        CellTypeTemplate("CD4_T_cell", ["CD3D", "CD4", "IL7R"], (0.15, 0.25), ["quiescent", "activated"]),
+        CellTypeTemplate("CD8_T_cell", ["CD3D", "CD8A", "GZMB"], (0.10, 0.18), ["quiescent", "activated"]),
+        CellTypeTemplate("B_cell", ["CD19", "MS4A1", "CD79A"], (0.08, 0.15), ["quiescent"]),
+        CellTypeTemplate("NK_cell", ["NKG7", "GNLY", "KLRD1"], (0.05, 0.12), ["quiescent", "activated"]),
+        CellTypeTemplate("monocyte", ["CD14", "CSF1R", "FCGR3A"], (0.15, 0.25), ["classical", "non_classical"]),
+        CellTypeTemplate("neutrophil", ["ELANE", "MPO", "CTSG"], (0.10, 0.20), ["mature"]),
+        CellTypeTemplate("platelet", ["ITGA2B", "GP1BA", "PF4"], (0.03, 0.08), ["normal"]),
+    ],
+    "spleen": [
+        CellTypeTemplate("B_cell", ["CD19", "MS4A1", "CD79A"], (0.20, 0.35), ["quiescent", "activated"]),
+        CellTypeTemplate("T_cell", ["CD3D", "CD3E", "IL7R"], (0.15, 0.25), ["quiescent", "activated"]),
+        CellTypeTemplate("macrophage", ["CD68", "CD163", "CSF1R"], (0.10, 0.18), ["quiescent", "inflammatory"]),
+        CellTypeTemplate("dendritic_cell", ["FCER1A", "CD1C", "CLEC10A"], (0.05, 0.10), ["quiescent"]),
+        CellTypeTemplate("NK_cell", ["NKG7", "GNLY", "KLRD1"], (0.05, 0.12), ["quiescent"]),
+        CellTypeTemplate("endothelial", ["PECAM1", "STAB2"], (0.05, 0.10), ["quiescent"]),
+    ],
+    "thymus": [
+        CellTypeTemplate("double_negative_T", ["CD3D", "PTCRA"], (0.10, 0.18), ["progenitor"]),
+        CellTypeTemplate("double_positive_T", ["CD3D", "CD4", "CD8A"], (0.30, 0.45), ["progenitor"]),
+        CellTypeTemplate("CD4_SP", ["CD3D", "CD4", "IL7R"], (0.10, 0.18), ["mature"]),
+        CellTypeTemplate("CD8_SP", ["CD3D", "CD8A", "CD8B"], (0.08, 0.15), ["mature"]),
+        CellTypeTemplate("thymic_epithelial", ["FOXN1", "KRT5", "KRT8"], (0.05, 0.12), ["cortical", "medullary"]),
+        CellTypeTemplate("dendritic_cell", ["FCER1A", "CD1C"], (0.03, 0.08), ["quiescent"]),
+        CellTypeTemplate("macrophage", ["CD68", "CD163"], (0.03, 0.08), ["quiescent"]),
+    ],
+}
+
+
+# ── Disease profiles ────────────────────────────────────────────────────────
+
+
+@dataclass
+class DiseaseProfile:
+    name: str
+    display_name: str
+    tissue: str
+    condition_name: str
+    de_genes: Dict[str, Tuple[float, float]]
+    pathways: Dict[str, Tuple[float, float]]
+    markers: List[str]
+    mechanism_templates: List[str]
+    responding_cell_types: List[str] = field(default_factory=list)
+    hidden_failure_templates: List[str] = field(default_factory=list)
+
+
+DISEASE_PROFILES: Dict[str, DiseaseProfile] = {
+    "dilated_cardiomyopathy": DiseaseProfile(
+        name="dilated_cardiomyopathy",
+        display_name="dilated cardiomyopathy",
+        tissue="heart",
+        condition_name="dilated_cardiomyopathy",
+        de_genes={
+            "NPPA": (1.8, 3.5), "NPPB": (2.0, 4.0), "MYH7": (1.0, 2.5),
+            "COL1A1": (1.0, 2.2), "COL3A1": (0.8, 1.8), "POSTN": (1.5, 3.0),
+            "CCL2": (0.8, 1.8), "IL6": (0.5, 1.5), "TGFB1": (0.7, 1.6),
+            "ANKRD1": (1.5, 3.0), "XIRP2": (-2.0, -0.8), "MYL2": (-1.5, -0.5),
+        },
+        pathways={
+            "cardiac_muscle_contraction": (0.3, 0.6),
+            "extracellular_matrix_organisation": (0.7, 0.95),
+            "inflammatory_response": (0.5, 0.8),
+            "TGF_beta_signalling": (0.6, 0.85),
+            "apoptosis": (0.4, 0.65),
+        },
+        markers=["NPPA", "NPPB", "POSTN", "COL1A1"],
+        mechanism_templates=[
+            "TGF-beta-driven fibrosis",
+            "inflammatory macrophage infiltration",
+        ],
+        responding_cell_types=["cardiac_fibroblast", "macrophage"],
+    ),
+    "IPF": DiseaseProfile(
+        name="IPF",
+        display_name="idiopathic pulmonary fibrosis",
+        tissue="lung",
+        condition_name="IPF",
+        de_genes={
+            "SPP1": (2.0, 4.0), "MERTK": (0.8, 2.0), "MMP9": (1.0, 2.5),
+            "TREM2": (0.8, 2.0), "COL1A1": (1.5, 3.0), "COL3A1": (1.2, 2.5),
+            "POSTN": (1.5, 3.5), "SFTPC": (-2.0, -0.8), "AGER": (-2.5, -1.0),
+        },
+        pathways={
+            "extracellular_matrix_organisation": (0.75, 0.95),
+            "integrin_signalling": (0.6, 0.85),
+            "macrophage_activation": (0.65, 0.9),
+            "Wnt_signalling": (0.4, 0.7),
+        },
+        markers=["SPP1", "MERTK", "POSTN", "MMP9"],
+        mechanism_templates=[
+            "SPP1+ macrophage-driven fibroblast activation",
+            "integrin-mediated SPP1 signalling in fibrosis",
+        ],
+        responding_cell_types=["fibroblast", "alveolar_macrophage"],
+    ),
+    "Alzheimer": DiseaseProfile(
+        name="Alzheimer",
+        display_name="Alzheimer's disease",
+        tissue="brain",
+        condition_name="Alzheimer",
+        de_genes={
+            "TREM2": (1.0, 2.5), "APOE": (1.2, 2.8), "CLU": (0.8, 2.0),
+            "C1QA": (1.0, 2.2), "C1QB": (0.9, 2.0), "GFAP": (1.5, 3.0),
+            "AQP4": (0.6, 1.5), "SLC17A7": (-1.5, -0.5), "NRGN": (-2.0, -0.8),
+            "SNAP25": (-1.2, -0.4),
+        },
+        pathways={
+            "complement_cascade": (0.7, 0.9),
+            "neuroinflammation": (0.65, 0.85),
+            "amyloid_processing": (0.6, 0.8),
+            "synaptic_signalling": (0.3, 0.5),
+            "lipid_metabolism": (0.5, 0.7),
+        },
+        markers=["TREM2", "APOE", "GFAP", "C1QA"],
+        mechanism_templates=[
+            "TREM2-mediated microglial activation in amyloid clearance",
+            "complement-driven synaptic pruning",
+            "reactive astrogliosis amplifying neuroinflammation",
+        ],
+        responding_cell_types=["microglia", "astrocyte"],
+    ),
+    "colorectal_cancer": DiseaseProfile(
+        name="colorectal_cancer",
+        display_name="colorectal cancer",
+        tissue="colon",
+        condition_name="colorectal_cancer",
+        de_genes={
+            "MYC": (1.5, 3.0), "KRAS": (0.8, 1.8), "TP53": (-1.5, -0.5),
+            "APC": (-1.8, -0.8), "CDH1": (-1.2, -0.4), "VIM": (1.0, 2.5),
+            "MKI67": (1.5, 3.0), "CD44": (1.0, 2.2), "LGR5": (0.8, 2.0),
+        },
+        pathways={
+            "Wnt_signalling": (0.75, 0.95),
+            "cell_cycle": (0.7, 0.9),
+            "EMT": (0.6, 0.85),
+            "p53_signalling": (0.3, 0.5),
+            "MAPK_signalling": (0.55, 0.75),
+        },
+        markers=["MYC", "CD44", "VIM", "MKI67"],
+        mechanism_templates=[
+            "Wnt/beta-catenin-driven tumour stem cell expansion",
+            "epithelial-mesenchymal transition promoting invasion",
+        ],
+        responding_cell_types=["stem_cell", "macrophage", "fibroblast"],
+    ),
+    "type2_diabetes": DiseaseProfile(
+        name="type2_diabetes",
+        display_name="type 2 diabetes",
+        tissue="pancreas",
+        condition_name="type2_diabetes",
+        de_genes={
+            "INS": (-2.0, -0.8), "MAFA": (-1.5, -0.5), "PDX1": (-1.2, -0.4),
+            "UCN3": (-1.8, -0.6), "GCG": (0.8, 2.0), "ARX": (0.5, 1.5),
+            "IAPP": (0.6, 1.8), "TXNIP": (1.0, 2.5), "DDIT3": (0.8, 2.0),
+        },
+        pathways={
+            "insulin_signalling": (0.3, 0.5),
+            "ER_stress_response": (0.7, 0.9),
+            "oxidative_stress": (0.6, 0.8),
+            "glucagon_signalling": (0.6, 0.8),
+            "apoptosis": (0.5, 0.7),
+        },
+        markers=["INS", "TXNIP", "IAPP", "DDIT3"],
+        mechanism_templates=[
+            "ER stress-induced beta cell apoptosis",
+            "glucotoxicity-driven beta cell dedifferentiation",
+        ],
+        responding_cell_types=["beta_cell", "stellate"],
+    ),
+    "AML": DiseaseProfile(
+        name="AML",
+        display_name="acute myeloid leukemia",
+        tissue="bone_marrow",
+        condition_name="AML",
+        de_genes={
+            "FLT3": (1.5, 3.0), "NPM1": (0.8, 2.0), "IDH2": (0.6, 1.5),
+            "RUNX1": (-1.5, -0.5), "CEBPA": (-1.2, -0.4), "KIT": (1.0, 2.5),
+            "WT1": (1.2, 2.8), "MYC": (0.8, 2.0),
+        },
+        pathways={
+            "hematopoietic_cell_lineage": (0.3, 0.5),
+            "MAPK_signalling": (0.7, 0.9),
+            "PI3K_AKT_signalling": (0.65, 0.85),
+            "cell_cycle": (0.7, 0.9),
+            "apoptosis": (0.3, 0.5),
+        },
+        markers=["FLT3", "NPM1", "WT1", "KIT"],
+        mechanism_templates=[
+            "FLT3-ITD-driven proliferative advantage",
+            "myeloid differentiation block via RUNX1 loss",
+        ],
+        responding_cell_types=["HSC", "GMP"],
+    ),
+    "rheumatoid_arthritis": DiseaseProfile(
+        name="rheumatoid_arthritis",
+        display_name="rheumatoid arthritis",
+        tissue="synovium",
+        condition_name="rheumatoid_arthritis",
+        de_genes={
+            "IFNG": (1.0, 2.5), "TBX21": (0.8, 1.8), "IL17A": (1.0, 2.2),
+            "RORC": (0.6, 1.5), "TNF": (1.2, 2.5), "IL6": (1.0, 2.2),
+            "MMP3": (1.5, 3.0), "MMP1": (1.2, 2.5), "CXCL13": (1.0, 2.5),
+        },
+        pathways={
+            "JAK_STAT_signalling": (0.7, 0.9),
+            "TNF_signalling": (0.7, 0.9),
+            "Th17_differentiation": (0.6, 0.8),
+            "NF_kB_signalling": (0.65, 0.85),
+            "matrix_metalloproteinase_activity": (0.7, 0.9),
+        },
+        markers=["TNF", "IL6", "MMP3", "CXCL13"],
+        mechanism_templates=[
+            "TNF/NF-kB-driven synovial inflammation",
+            "Th17-mediated cartilage destruction via MMPs",
+        ],
+        responding_cell_types=["fibroblast", "macrophage", "CD4_T_cell"],
+    ),
+    "hepatocellular_carcinoma": DiseaseProfile(
+        name="hepatocellular_carcinoma",
+        display_name="hepatocellular carcinoma",
+        tissue="liver",
+        condition_name="HCC",
+        de_genes={
+            "GPC3": (2.0, 4.0), "AFP": (1.5, 3.5), "EPCAM": (1.0, 2.5),
+            "MYC": (1.0, 2.5), "VEGFA": (1.2, 2.8), "MKI67": (1.5, 3.0),
+            "ALB": (-2.0, -0.8), "CYP3A4": (-1.8, -0.6), "APOB": (-1.5, -0.5),
+        },
+        pathways={
+            "Wnt_signalling": (0.7, 0.9),
+            "cell_cycle": (0.75, 0.95),
+            "angiogenesis": (0.6, 0.8),
+            "PI3K_AKT_signalling": (0.65, 0.85),
+            "p53_signalling": (0.3, 0.5),
+        },
+        markers=["GPC3", "AFP", "VEGFA", "MKI67"],
+        mechanism_templates=[
+            "Wnt/beta-catenin-driven hepatocyte dedifferentiation",
+            "VEGF-mediated tumour angiogenesis",
+        ],
+        responding_cell_types=["kupffer_cell", "stellate_cell"],
+        hidden_failure_templates=[
+            "Tumour heterogeneity may confound DE in mixed biopsies",
+        ],
+    ),
+    "atherosclerosis": DiseaseProfile(
+        name="atherosclerosis",
+        display_name="atherosclerosis",
+        tissue="aorta",
+        condition_name="atherosclerosis",
+        de_genes={
+            "TREM2": (1.5, 3.0), "CD9": (1.0, 2.2), "LGALS3": (1.2, 2.5),
+            "APOE": (0.8, 2.0), "MMP9": (1.0, 2.5), "IL1B": (0.8, 2.0),
+            "ACTA2": (-1.5, -0.5), "MYH11": (-2.0, -0.8), "CNN1": (-1.5, -0.5),
+        },
+        pathways={
+            "lipid_metabolism": (0.7, 0.9),
+            "inflammatory_response": (0.65, 0.85),
+            "foam_cell_formation": (0.7, 0.9),
+            "smooth_muscle_contraction": (0.3, 0.5),
+            "complement_cascade": (0.5, 0.7),
+        },
+        markers=["TREM2", "LGALS3", "MMP9", "CD9"],
+        mechanism_templates=[
+            "TREM2+ macrophage-driven foam cell formation",
+            "smooth muscle cell phenotypic switching in plaque",
+        ],
+        responding_cell_types=["macrophage", "smooth_muscle"],
+    ),
+    "breast_cancer": DiseaseProfile(
+        name="breast_cancer",
+        display_name="breast cancer",
+        tissue="breast",
+        condition_name="breast_cancer",
+        de_genes={
+            "ERBB2": (1.5, 3.5), "ESR1": (-1.5, 1.5), "MKI67": (1.5, 3.0),
+            "MYC": (1.0, 2.5), "CDH1": (-1.5, -0.3), "VIM": (0.8, 2.2),
+            "CD274": (0.8, 2.0), "FOXP3": (0.6, 1.5), "GZMB": (0.8, 2.0),
+        },
+        pathways={
+            "cell_cycle": (0.7, 0.9),
+            "PI3K_AKT_signalling": (0.65, 0.85),
+            "EMT": (0.55, 0.8),
+            "immune_checkpoint": (0.5, 0.75),
+            "estrogen_signalling": (0.3, 0.7),
+        },
+        markers=["ERBB2", "MKI67", "CD274", "VIM"],
+        mechanism_templates=[
+            "ERBB2-driven proliferative signalling",
+            "immune evasion via PD-L1 upregulation",
+        ],
+        responding_cell_types=["macrophage", "T_cell", "fibroblast"],
+    ),
+    "multiple_sclerosis": DiseaseProfile(
+        name="multiple_sclerosis",
+        display_name="multiple sclerosis",
+        tissue="brain",
+        condition_name="multiple_sclerosis",
+        de_genes={
+            "CD68": (1.0, 2.5), "CXCL10": (1.2, 2.8), "STAT1": (0.8, 2.0),
+            "IRF1": (0.8, 1.8), "MBP": (-2.0, -0.8), "PLP1": (-1.8, -0.6),
+            "MOG": (-1.5, -0.5), "GFAP": (1.0, 2.5), "C3": (0.8, 2.0),
+        },
+        pathways={
+            "interferon_signalling": (0.7, 0.9),
+            "neuroinflammation": (0.7, 0.9),
+            "complement_cascade": (0.6, 0.8),
+            "myelination": (0.2, 0.4),
+            "T_cell_activation": (0.6, 0.8),
+        },
+        markers=["CXCL10", "STAT1", "GFAP", "C3"],
+        mechanism_templates=[
+            "interferon-driven microglial activation in demyelination",
+            "complement-mediated oligodendrocyte damage",
+        ],
+        responding_cell_types=["microglia", "astrocyte"],
+    ),
+    "diabetic_nephropathy": DiseaseProfile(
+        name="diabetic_nephropathy",
+        display_name="diabetic nephropathy",
+        tissue="kidney",
+        condition_name="diabetic_nephropathy",
454
+ de_genes={
455
+ "HAVCR1": (1.5, 3.0), "LCN2": (1.2, 2.8), "COL4A1": (1.0, 2.5),
456
+ "VEGFA": (0.8, 2.0), "NPHS1": (-1.8, -0.6), "NPHS2": (-1.5, -0.5),
457
+ "WT1": (-1.2, -0.4), "TGFB1": (1.0, 2.2), "FN1": (1.2, 2.5),
458
+ },
459
+ pathways={
460
+ "TGF_beta_signalling": (0.7, 0.9),
461
+ "extracellular_matrix_organisation": (0.7, 0.9),
462
+ "oxidative_stress": (0.6, 0.8),
463
+ "VEGF_signalling": (0.5, 0.7),
464
+ "apoptosis": (0.5, 0.7),
465
+ },
466
+ markers=["HAVCR1", "LCN2", "TGFB1", "FN1"],
467
+ mechanism_templates=[
468
+ "TGF-beta-driven glomerular fibrosis",
469
+ "podocyte loss via oxidative stress",
470
+ ],
471
+ responding_cell_types=["fibroblast", "macrophage"],
472
+ ),
473
+ "melanoma": DiseaseProfile(
474
+ name="melanoma",
475
+ display_name="melanoma",
476
+ tissue="skin",
477
+ condition_name="melanoma",
478
+ de_genes={
479
+ "MLANA": (1.5, 3.0), "PMEL": (1.2, 2.5), "SOX10": (1.0, 2.2),
480
+ "MKI67": (1.5, 3.0), "CD274": (0.8, 2.0), "PDCD1": (0.8, 2.0),
481
+ "GZMB": (0.8, 2.0), "HAVCR2": (0.6, 1.5), "LAG3": (0.6, 1.5),
482
+ },
483
+ pathways={
484
+ "MAPK_signalling": (0.7, 0.9),
485
+ "immune_checkpoint": (0.6, 0.85),
486
+ "cell_cycle": (0.7, 0.9),
487
+ "melanogenesis": (0.5, 0.7),
488
+ "T_cell_exhaustion": (0.55, 0.8),
489
+ },
490
+ markers=["MLANA", "CD274", "GZMB", "MKI67"],
491
+ mechanism_templates=[
492
+ "MAPK-driven melanocyte proliferation",
493
+ "T cell exhaustion via immune checkpoint upregulation",
494
+ ],
495
+ responding_cell_types=["T_cell", "macrophage"],
496
+ ),
497
+ }
498
+
499
+
500
+ # ── Pathway library ─────────────────────────────────────────────────────────
501
+
502
+ PATHWAY_LIBRARY: Dict[str, List[str]] = {
503
+ "TGF_beta_signalling": ["TGFB1", "TGFB2", "SMAD2", "SMAD3", "SMAD4", "ACVR1"],
504
+ "Wnt_signalling": ["WNT3A", "CTNNB1", "APC", "AXIN2", "LGR5", "TCF7L2"],
505
+ "MAPK_signalling": ["KRAS", "BRAF", "MAP2K1", "MAPK1", "MAPK3", "FOS", "JUN"],
506
+ "JAK_STAT_signalling": ["JAK1", "JAK2", "STAT1", "STAT3", "STAT5A", "SOCS1", "SOCS3"],
507
+ "PI3K_AKT_signalling": ["PIK3CA", "AKT1", "MTOR", "PTEN", "TSC2"],
508
+ "NF_kB_signalling": ["NFKB1", "RELA", "IKBKB", "TNF", "IL1B"],
509
+ "cell_cycle": ["CDK4", "CDK6", "CCND1", "CCNE1", "RB1", "E2F1", "MKI67"],
510
+ "apoptosis": ["BCL2", "BAX", "BAK1", "CASP3", "CASP9", "TP53", "BID"],
511
+ "inflammatory_response": ["TNF", "IL6", "IL1B", "CCL2", "CXCL8", "NFKB1"],
512
+ "extracellular_matrix_organisation": ["COL1A1", "COL3A1", "FN1", "POSTN", "MMP2", "MMP9", "TIMP1"],
513
+ "complement_cascade": ["C1QA", "C1QB", "C3", "C4A", "C5", "CFB"],
514
+ "neuroinflammation": ["TREM2", "CX3CR1", "P2RY12", "IL1B", "TNF", "C1QA"],
515
+ "synaptic_signalling": ["SLC17A7", "GRIA1", "GRIN1", "DLG4", "SNAP25", "SYP"],
516
+ "hematopoietic_cell_lineage": ["CD34", "KIT", "FLT3", "GATA1", "CEBPA", "SPI1"],
517
+ "insulin_signalling": ["INS", "INSR", "IRS1", "PIK3CA", "AKT1", "SLC2A4"],
518
+ "ER_stress_response": ["DDIT3", "ATF4", "XBP1", "HSPA5", "EIF2AK3"],
519
+ "oxidative_stress": ["SOD1", "SOD2", "CAT", "GPX1", "NFE2L2", "HMOX1"],
520
+ "angiogenesis": ["VEGFA", "VEGFB", "KDR", "FLT1", "ANGPT1", "ANGPT2"],
521
+ "EMT": ["CDH1", "CDH2", "VIM", "SNAI1", "SNAI2", "TWIST1", "ZEB1"],
522
+ "immune_checkpoint": ["CD274", "PDCD1", "CTLA4", "HAVCR2", "LAG3", "TIGIT"],
523
+ "T_cell_activation": ["CD3D", "CD28", "LCK", "ZAP70", "IL2", "IFNG"],
524
+ "T_cell_exhaustion": ["PDCD1", "HAVCR2", "LAG3", "TIGIT", "TOX", "ENTPD1"],
525
+ "TNF_signalling": ["TNF", "TNFRSF1A", "TRADD", "RIPK1", "NFKB1", "CASP8"],
526
+ "Th17_differentiation": ["IL17A", "IL17F", "RORC", "IL23R", "CCR6", "STAT3"],
527
+ "interferon_signalling": ["IFNG", "IFNB1", "STAT1", "IRF1", "IRF7", "MX1", "OAS1"],
528
+ "lipid_metabolism": ["APOE", "APOB", "LDLR", "HMGCR", "ABCA1", "PPARG"],
529
+ "myelination": ["MBP", "PLP1", "MOG", "MAG", "OLIG2", "SOX10"],
530
+ "foam_cell_formation": ["CD36", "MSR1", "ABCA1", "APOE", "LGALS3", "TREM2"],
531
+ "smooth_muscle_contraction": ["ACTA2", "MYH11", "TAGLN", "CNN1", "MYLK"],
532
+ "glucagon_signalling": ["GCG", "GCGR", "CREB1", "PCK1", "G6PC"],
533
+ "matrix_metalloproteinase_activity": ["MMP1", "MMP2", "MMP3", "MMP9", "TIMP1", "TIMP2"],
534
+ "estrogen_signalling": ["ESR1", "ESR2", "PGR", "GREB1", "TFF1"],
535
+ "melanogenesis": ["MITF", "TYR", "TYRP1", "DCT", "MLANA", "PMEL"],
536
+ "VEGF_signalling": ["VEGFA", "VEGFB", "KDR", "FLT1", "NRP1"],
537
+ }
538
+
539
+
540
+ # ── Regulatory network templates ────────────────────────────────────────────
541
+
542
+ REGULATORY_TEMPLATES: Dict[str, Dict[str, List[str]]] = {
543
+ "erythroid": {
544
+ "GATA1": ["KLF1", "HBB", "HBA1", "GYPA", "ALAS2"],
545
+ "KLF1": ["HBB", "HBA1", "SLC4A1"],
546
+ },
547
+ "myeloid": {
548
+ "CEBPA": ["CSF3R", "ELANE", "MPO", "CTSG"],
549
+ "SPI1": ["CSF1R", "CD14", "FCGR3A", "CD68"],
550
+ },
551
+ "lymphoid": {
552
+ "TCF7": ["CD3D", "CD3E", "IL7R", "LEF1"],
553
+ "PAX5": ["CD19", "MS4A1", "CD79A"],
554
+ },
555
+ "fibrotic": {
556
+ "SMAD3": ["COL1A1", "COL3A1", "FN1", "POSTN"],
557
+ "TGFB1": ["ACTA2", "COL1A1", "CTGF"],
558
+ },
559
+ "inflammatory": {
560
+ "NFKB1": ["TNF", "IL6", "IL1B", "CCL2", "CXCL8"],
561
+ "STAT1": ["IRF1", "CXCL10", "MX1", "OAS1"],
562
+ },
563
+ "stem_cell": {
564
+ "RUNX1": ["CD34", "KIT", "FLT3"],
565
+ "MYC": ["CDK4", "CCND1", "E2F1"],
566
+ },
567
+ "neuronal": {
568
+ "NEUROD1": ["SLC17A7", "NRGN", "SNAP25"],
569
+ "DLX1": ["GAD1", "GAD2", "SLC32A1"],
570
+ },
571
+ }
572
+
573
+
574
+ # ── Perturbation templates ──────────────────────────────────────────────────
575
+
576
+ @dataclass
577
+ class PerturbationTemplate:
578
+ name: str
579
+ target_pathway: str
580
+ gene_effects: Dict[str, float]
581
+ description: str
582
+
583
+
584
+ PERTURBATION_TEMPLATES: Dict[str, PerturbationTemplate] = {
585
+ "JAK_inhibitor": PerturbationTemplate(
586
+ name="JAK_inhibitor",
587
+ target_pathway="JAK_STAT_signalling",
588
+ gene_effects={"STAT1": -0.8, "STAT3": -0.7, "IFNG": -1.5, "IL17A": -1.3, "SOCS1": 1.2},
589
+ description="JAK inhibitor treatment",
590
+ ),
591
+ "anti_TNF": PerturbationTemplate(
592
+ name="anti_TNF",
593
+ target_pathway="TNF_signalling",
594
+ gene_effects={"TNF": -1.5, "IL6": -1.0, "IL1B": -0.8, "MMP3": -1.2, "SOCS3": 0.8},
595
+ description="anti-TNF biologic therapy",
596
+ ),
597
+ "PD1_blockade": PerturbationTemplate(
598
+ name="PD1_blockade",
599
+ target_pathway="immune_checkpoint",
600
+ gene_effects={"PDCD1": -1.0, "GZMB": 1.5, "IFNG": 1.2, "PRF1": 1.0, "TNF": 0.8},
601
+ description="anti-PD-1 immune checkpoint blockade",
602
+ ),
603
+ "BRAF_inhibitor": PerturbationTemplate(
604
+ name="BRAF_inhibitor",
605
+ target_pathway="MAPK_signalling",
606
+ gene_effects={"BRAF": -0.5, "MAPK1": -1.0, "MKI67": -1.5, "CCND1": -1.2, "FOS": -0.8},
607
+ description="BRAF inhibitor treatment",
608
+ ),
609
+ "TGFb_inhibitor": PerturbationTemplate(
610
+ name="TGFb_inhibitor",
611
+ target_pathway="TGF_beta_signalling",
612
+ gene_effects={"TGFB1": -0.8, "COL1A1": -1.2, "COL3A1": -1.0, "POSTN": -1.5, "ACTA2": -0.8},
613
+ description="TGF-beta pathway inhibitor",
614
+ ),
615
+ "mTOR_inhibitor": PerturbationTemplate(
616
+ name="mTOR_inhibitor",
617
+ target_pathway="PI3K_AKT_signalling",
618
+ gene_effects={"MTOR": -0.8, "AKT1": -0.6, "MKI67": -1.2, "CCND1": -1.0, "HIF1A": -0.7},
619
+ description="mTOR inhibitor treatment",
620
+ ),
621
+ "CRISPR_TP53_KO": PerturbationTemplate(
622
+ name="CRISPR_TP53_KO",
623
+ target_pathway="p53_signalling",
624
+ gene_effects={"TP53": -2.0, "BAX": -1.0, "CDKN1A": -1.5, "MDM2": -0.8, "MKI67": 1.0},
625
+ description="CRISPR knockout of TP53",
626
+ ),
627
+ }
628
+
629
+
630
+ # ── Trajectory templates ────────────────────────────────────────────────────
631
+
632
+ @dataclass
633
+ class TrajectoryTemplate:
634
+ """Template for a developmental trajectory through cell populations."""
635
+ root_population: str
636
+ branches: List[List[str]]
637
+ n_lineages: int
638
+ tissue: str
639
+
640
+
641
+ TRAJECTORY_TEMPLATES: List[TrajectoryTemplate] = [
642
+ TrajectoryTemplate(
643
+ root_population="HSC",
644
+ branches=[
645
+ ["HSC", "CMP", "GMP", "neutrophil"],
646
+ ["HSC", "CMP", "GMP", "monocyte"],
647
+ ["HSC", "MEP", "erythrocyte"],
648
+ ["HSC", "MEP", "megakaryocyte"],
649
+ ],
650
+ n_lineages=3,
651
+ tissue="bone_marrow",
652
+ ),
653
+ TrajectoryTemplate(
654
+ root_population="double_negative_T",
655
+ branches=[
656
+ ["double_negative_T", "double_positive_T", "CD4_SP"],
657
+ ["double_negative_T", "double_positive_T", "CD8_SP"],
658
+ ],
659
+ n_lineages=2,
660
+ tissue="thymus",
661
+ ),
662
+ TrajectoryTemplate(
663
+ root_population="stem_cell",
664
+ branches=[
665
+ ["stem_cell", "colonocyte"],
666
+ ["stem_cell", "goblet_cell"],
667
+ ],
668
+ n_lineages=2,
669
+ tissue="colon",
670
+ ),
671
+ TrajectoryTemplate(
672
+ root_population="OPC",
673
+ branches=[
674
+ ["OPC", "oligodendrocyte"],
675
+ ],
676
+ n_lineages=1,
677
+ tissue="brain",
678
+ ),
679
+ ]
680
+
681
+
682
+ # ── Hidden failure condition templates ──────────────────────────────────────
683
+
684
+ HIDDEN_FAILURE_TEMPLATES: List[str] = [
685
+ "High ambient RNA may confound DE in low-abundance transcripts",
686
+ "Strong batch effects between conditions may inflate false positives",
687
+ "Low cell viability in disease samples reduces statistical power",
688
+ "Doublet contamination in dense populations obscures rare cell types",
689
+ "Sample degradation during processing introduces 3' bias artefacts",
690
+ "Dissociation-induced gene expression changes confound stress signatures",
691
+ "Unbalanced sample sizes between conditions reduce DE sensitivity",
692
+ ]
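The disease profiles above store each DE gene as a `(lo, hi)` log2 fold-change range rather than a point value, so every generated episode draws a fresh concrete effect size. A minimal, self-contained sketch of that sampling pattern (using the stdlib `random` module and a hypothetical two-gene excerpt for illustration; the repository itself uses `numpy.random.Generator`):

```python
import random

# Hypothetical excerpt of a profile: gene -> (lo, hi) log2 fold-change range.
de_gene_ranges = {
    "GPC3": (2.0, 4.0),   # up-regulated in the hypothetical disease
    "ALB": (-2.0, -0.8),  # down-regulated
}

def sample_effects(ranges, seed=0):
    """Draw one concrete log2FC per gene, uniformly from its range."""
    rng = random.Random(seed)
    return {gene: round(rng.uniform(lo, hi), 3) for gene, (lo, hi) in ranges.items()}

effects = sample_effects(de_gene_ranges, seed=1)
```

Because the draw is seeded, the same seed always reproduces the same latent effect sizes, which is what lets an episode be replayed exactly.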
server/tasks/generator.py CHANGED
@@ -12,7 +12,7 @@ from typing import List, Optional, Tuple
 
 import numpy as np
 
-from models import TaskSpec
+from models import TaskSpec, tools_for_modality, assays_for_modality
 
 from server.simulator.latent_state import (
     CellPopulation,
@@ -24,6 +24,7 @@ from server.simulator.latent_state import (
     TechnicalState,
 )
 from .scenarios import SCENARIO_LIBRARY, Scenario
+from .procedural_generator import generate_procedural_scenarios
 
 
 class TaskGenerator:
@@ -34,7 +35,10 @@ class TaskGenerator:
         scenarios: Optional[List[Scenario]] = None,
         domain_randomise: bool = True,
     ):
-        self.scenarios = scenarios or SCENARIO_LIBRARY
+        if scenarios is not None:
+            self.scenarios = scenarios
+        else:
+            self.scenarios = list(SCENARIO_LIBRARY) + generate_procedural_scenarios(n=20, seed=42)
         self.domain_randomise = domain_randomise
 
     def generate(
@@ -58,6 +62,14 @@ class TaskGenerator:
         if self.domain_randomise:
             self._randomise(rng, task, biology, technical)
 
+        # Filter available tools/assays to those compatible with the modality.
+        compatible_tools = [t.name for t in tools_for_modality(task.modality)]
+        compatible_assays = [a.name for a in assays_for_modality(task.modality)]
+        if compatible_tools:
+            task.available_tools = compatible_tools
+        if compatible_assays:
+            task.available_assays = compatible_assays
+
         latent = FullLatentState(
             biology=biology,
             technical=technical,
@@ -67,6 +79,7 @@ class TaskGenerator:
                 time_limit_days=task.time_limit_days,
             ),
            hidden_failure_conditions=list(scenario.hidden_failure_conditions),
+            task_modality=task.modality,
             rng_seed=seed or 0,
         )
         return task, latent
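The change to `TaskGenerator.__init__` above distinguishes between "no scenarios supplied" and "an explicitly empty list supplied", which `scenarios or SCENARIO_LIBRARY` conflated. A small sketch of that design choice, with placeholder lists standing in for the real scenario objects:

```python
# Illustrative stand-ins for the real curated and procedural pools.
CURATED = ["cardiac_disease_de", "hematopoiesis_trajectory"]
PROCEDURAL = ["proc_melanoma_de_7", "proc_ra_perturbation_11"]

def resolve_scenarios(scenarios=None):
    """Use the caller's list verbatim if given (even if empty);
    otherwise fall back to curated + procedural pools."""
    if scenarios is not None:
        return list(scenarios)
    return CURATED + PROCEDURAL
```

With the old truthiness-based fallback, passing `[]` would silently re-enable the default pool; the explicit `is not None` check preserves the caller's intent.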
server/tasks/procedural_generator.py ADDED
@@ -0,0 +1,501 @@
+"""Procedural scenario generator.
+
+Composes biologically coherent ``Scenario`` objects from the curated
+palette in ``bio_palette``, producing fully populated
+``LatentBiologicalState`` instances that drive every simulator tool
+(clustering, DE, pathway enrichment, trajectory, regulatory networks,
+marker validation) with realistic intermediate outputs.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import Any, Dict, List, Optional, Tuple
+
+import numpy as np
+
+from models import TaskSpec
+
+from server.simulator.latent_state import (
+    CellPopulation,
+    LatentBiologicalState,
+    TechnicalState,
+)
+
+from .bio_palette import (
+    DISEASE_PROFILES,
+    HIDDEN_FAILURE_TEMPLATES,
+    PATHWAY_LIBRARY,
+    PERTURBATION_TEMPLATES,
+    REGULATORY_TEMPLATES,
+    TISSUE_CELL_TYPES,
+    TRAJECTORY_TEMPLATES,
+    CellTypeTemplate,
+    DiseaseProfile,
+)
+from .scenarios import Scenario
+
+logger = logging.getLogger(__name__)
+
+SCENARIO_TYPES = ("de", "trajectory", "perturbation", "biomarker")
+
+_DIFFICULTY_PARAMS = {
+    "easy": {
+        "n_pops": (4, 5),
+        "de_scale": (1.2, 1.6),
+        "noise_dropout": (0.05, 0.10),
+        "noise_doublet": (0.03, 0.06),
+        "noise_ambient": (0.02, 0.05),
+        "noise_batch_strength": (0.05, 0.12),
+        "n_batches": (1, 2),
+        "budget_range": (70_000, 100_000),
+        "time_range": (100, 150),
+        "sample_quality": (0.85, 0.95),
+        "include_trajectory": False,
+        "include_perturbation": False,
+        "include_network": False,
+        "include_failure_conditions": False,
+    },
+    "medium": {
+        "n_pops": (5, 7),
+        "de_scale": (0.9, 1.3),
+        "noise_dropout": (0.08, 0.14),
+        "noise_doublet": (0.04, 0.08),
+        "noise_ambient": (0.03, 0.07),
+        "noise_batch_strength": (0.08, 0.18),
+        "n_batches": (1, 3),
+        "budget_range": (80_000, 120_000),
+        "time_range": (120, 180),
+        "sample_quality": (0.78, 0.92),
+        "include_trajectory": True,
+        "include_perturbation": False,
+        "include_network": True,
+        "include_failure_conditions": False,
+    },
+    "hard": {
+        "n_pops": (6, 8),
+        "de_scale": (0.6, 1.0),
+        "noise_dropout": (0.10, 0.20),
+        "noise_doublet": (0.06, 0.12),
+        "noise_ambient": (0.05, 0.10),
+        "noise_batch_strength": (0.12, 0.25),
+        "n_batches": (2, 4),
+        "budget_range": (90_000, 140_000),
+        "time_range": (140, 200),
+        "sample_quality": (0.65, 0.85),
+        "include_trajectory": True,
+        "include_perturbation": True,
+        "include_network": True,
+        "include_failure_conditions": True,
+    },
+}
+
+
+def generate_scenario(
+    seed: int,
+    difficulty: str = "medium",
+    scenario_type: Optional[str] = None,
+) -> Scenario:
+    """Generate a single procedural scenario with complete latent state.
+
+    Parameters
+    ----------
+    seed
+        RNG seed for reproducibility.
+    difficulty
+        One of ``"easy"``, ``"medium"``, ``"hard"``.
+    scenario_type
+        One of ``"de"``, ``"trajectory"``, ``"perturbation"``,
+        ``"biomarker"``, or ``None`` for random selection.
+    """
+    rng = np.random.default_rng(seed)
+    params = _DIFFICULTY_PARAMS[difficulty]
+
+    if scenario_type is None:
+        scenario_type = rng.choice(SCENARIO_TYPES)
+
+    disease_key = rng.choice(list(DISEASE_PROFILES.keys()))
+    disease = DISEASE_PROFILES[disease_key]
+    tissue = disease.tissue
+
+    cell_templates = TISSUE_CELL_TYPES.get(tissue, [])
+    if not cell_templates:
+        tissue = rng.choice(list(TISSUE_CELL_TYPES.keys()))
+        cell_templates = TISSUE_CELL_TYPES[tissue]
+
+    populations = _sample_populations(rng, cell_templates, disease, params)
+    de_genes = _build_de_genes(rng, disease, params)
+    pathways = _build_pathways(rng, disease)
+    markers = _derive_markers(rng, de_genes, disease)
+    mechanisms = list(disease.mechanism_templates)
+    n_cells = int(rng.integers(8_000, 22_000))
+
+    trajectory = None
+    if scenario_type == "trajectory" or (
+        params["include_trajectory"] and rng.random() < 0.4
+    ):
+        trajectory = _build_trajectory(rng, tissue, populations)
+
+    reg_network: Dict[str, List[str]] = {}
+    if scenario_type == "trajectory" or (
+        params["include_network"] and rng.random() < 0.5
+    ):
+        reg_network = _build_regulatory_network(rng, tissue, populations)
+
+    perturbation_effects: Dict[str, Dict[str, float]] = {}
+    if scenario_type == "perturbation" or (
+        params["include_perturbation"] and rng.random() < 0.5
+    ):
+        perturbation_effects = _build_perturbation(rng, disease)
+
+    technical = _build_technical(rng, params)
+
+    hidden_failures: List[str] = []
+    if params["include_failure_conditions"] and rng.random() < 0.6:
+        n_failures = int(rng.integers(1, 3))
+        indices = rng.choice(
+            len(HIDDEN_FAILURE_TEMPLATES),
+            size=min(n_failures, len(HIDDEN_FAILURE_TEMPLATES)),
+            replace=False,
+        )
+        hidden_failures = [HIDDEN_FAILURE_TEMPLATES[i] for i in indices]
+
+    task = _build_task(rng, disease, tissue, scenario_type, params, perturbation_effects)
+
+    biology = LatentBiologicalState(
+        cell_populations=populations,
+        true_de_genes=de_genes,
+        true_pathways=pathways,
+        true_trajectory=trajectory,
+        true_regulatory_network=reg_network,
+        perturbation_effects=perturbation_effects,
+        true_markers=markers,
+        causal_mechanisms=mechanisms,
+        n_true_cells=n_cells,
+    )
+
+    name = f"proc_{disease.name}_{scenario_type}_{seed}"
+
+    tags = [scenario_type, "scRNA-seq", tissue, disease.name, difficulty]
+
+    return Scenario(
+        name=name,
+        task=task,
+        biology=biology,
+        technical=technical,
+        hidden_failure_conditions=hidden_failures,
+        difficulty=difficulty,
+        tags=tags,
+    )
+
+
+def generate_procedural_scenarios(
+    n: int = 20,
+    seed: int = 42,
+) -> List[Scenario]:
+    """Pre-generate a pool of procedural scenarios across difficulties."""
+    rng = np.random.default_rng(seed)
+    scenarios: List[Scenario] = []
+    difficulties = ["easy", "medium", "hard"]
+
+    for i in range(n):
+        diff = difficulties[i % len(difficulties)]
+        child_seed = int(rng.integers(0, 2**31))
+        scenario = generate_scenario(
+            seed=child_seed,
+            difficulty=diff,
+            scenario_type=None,
+        )
+        scenarios.append(scenario)
+
+    logger.info("Generated %d procedural scenarios.", len(scenarios))
+    return scenarios
+
+
+# ── Internal builders ───────────────────────────────────────────────────────
+
+
+def _sample_populations(
+    rng: np.random.Generator,
+    templates: List[CellTypeTemplate],
+    disease: DiseaseProfile,
+    params: dict,
+) -> List[CellPopulation]:
+    lo, hi = params["n_pops"]
+    n_pops = int(rng.integers(lo, hi + 1))
+    n_pops = min(n_pops, len(templates))
+
+    indices = rng.choice(len(templates), size=n_pops, replace=False)
+    selected = [templates[i] for i in sorted(indices)]
+
+    responding_names = set(disease.responding_cell_types)
+
+    populations: List[CellPopulation] = []
+    for tmpl in selected:
+        prop = float(rng.uniform(*tmpl.proportion_range))
+        state = rng.choice(tmpl.states)
+
+        condition_response: Dict[str, float] = {}
+        if tmpl.disease_responsive and tmpl.name in responding_names:
+            condition_response[disease.condition_name] = float(
+                rng.uniform(*tmpl.response_range)
+            )
+
+        populations.append(CellPopulation(
+            name=tmpl.name,
+            proportion=prop,
+            marker_genes=list(tmpl.marker_genes),
+            state=state,
+            condition_response=condition_response,
+        ))
+
+    total = sum(p.proportion for p in populations)
+    if total > 0:
+        for p in populations:
+            p.proportion = round(p.proportion / total, 4)
+
+    return populations
+
+
+def _build_de_genes(
+    rng: np.random.Generator,
+    disease: DiseaseProfile,
+    params: dict,
+) -> Dict[str, Dict[str, float]]:
+    comparison = f"{disease.condition_name}_vs_healthy"
+    scale_lo, scale_hi = params["de_scale"]
+
+    effects: Dict[str, float] = {}
+    for gene, (lo, hi) in disease.de_genes.items():
+        base = float(rng.uniform(lo, hi))
+        scale = float(rng.uniform(scale_lo, scale_hi))
+        effects[gene] = round(base * scale, 3)
+
+    return {comparison: effects}
+
+
+def _build_pathways(
+    rng: np.random.Generator,
+    disease: DiseaseProfile,
+) -> Dict[str, float]:
+    pathways: Dict[str, float] = {}
+    for pw, (lo, hi) in disease.pathways.items():
+        pathways[pw] = round(float(rng.uniform(lo, hi)), 3)
+    return pathways
+
+
+def _derive_markers(
+    rng: np.random.Generator,
+    de_genes: Dict[str, Dict[str, float]],
+    disease: DiseaseProfile,
+) -> List[str]:
+    markers = list(disease.markers)
+
+    all_effects: Dict[str, float] = {}
+    for effects in de_genes.values():
+        all_effects.update(effects)
+
+    for gene in markers:
+        if gene not in all_effects:
+            all_effects[gene] = float(rng.uniform(1.0, 2.5))
+            for comp_effects in de_genes.values():
+                comp_effects[gene] = all_effects[gene]
+
+    n_markers = min(len(markers), int(rng.integers(3, 7)))
+    return markers[:n_markers]
+
+
+def _build_trajectory(
+    rng: np.random.Generator,
+    tissue: str,
+    populations: List[CellPopulation],
+) -> Optional[Dict[str, Any]]:
+    pop_names = {p.name for p in populations}
+
+    for tmpl in TRAJECTORY_TEMPLATES:
+        if tmpl.tissue == tissue:
+            valid_branches = [
+                branch for branch in tmpl.branches
+                if all(node in pop_names for node in branch)
+            ]
+            if valid_branches:
+                return {
+                    "root": tmpl.root_population,
+                    "n_lineages": len(valid_branches),
+                    "branching": len(valid_branches) > 1,
+                    "branches": valid_branches,
+                }
+
+    if len(populations) >= 3:
+        root = populations[0].name
+        branches = [[root, p.name] for p in populations[1:]]
+        selected = branches[:int(rng.integers(2, min(4, len(branches)) + 1))]
+        return {
+            "root": root,
+            "n_lineages": len(selected),
+            "branching": len(selected) > 1,
+            "branches": selected,
+        }
+
+    return None
+
+
+def _build_regulatory_network(
+    rng: np.random.Generator,
+    tissue: str,
+    populations: List[CellPopulation],
+) -> Dict[str, List[str]]:
+    all_genes = set()
+    for p in populations:
+        all_genes.update(p.marker_genes)
+
+    network: Dict[str, List[str]] = {}
+
+    tissue_to_programs = {
+        "bone_marrow": ["erythroid", "myeloid", "stem_cell"],
+        "thymus": ["lymphoid"],
+        "blood": ["lymphoid", "myeloid"],
+        "spleen": ["lymphoid"],
+        "brain": ["neuronal", "inflammatory"],
+        "heart": ["fibrotic", "inflammatory"],
+        "lung": ["fibrotic", "inflammatory"],
+        "liver": ["fibrotic", "inflammatory"],
+        "kidney": ["fibrotic", "inflammatory"],
+        "colon": ["inflammatory", "stem_cell"],
+        "pancreas": ["inflammatory"],
+        "skin": ["inflammatory"],
+        "breast": ["inflammatory"],
+        "synovium": ["inflammatory", "lymphoid"],
+        "aorta": ["inflammatory"],
+    }
+
+    programs = tissue_to_programs.get(tissue, ["inflammatory"])
+    for prog_name in programs:
+        prog = REGULATORY_TEMPLATES.get(prog_name, {})
+        for tf, targets in prog.items():
+            network[tf] = list(targets)
+
+    if not network:
+        for p in populations[:2]:
+            if len(p.marker_genes) >= 2:
+                tf = p.marker_genes[0]
+                network[tf] = p.marker_genes[1:]
+
+    return network
+
+
+def _build_perturbation(
+    rng: np.random.Generator,
+    disease: DiseaseProfile,
+) -> Dict[str, Dict[str, float]]:
+    disease_pathways = set(disease.pathways.keys())
+
+    matching = [
+        (name, tmpl) for name, tmpl in PERTURBATION_TEMPLATES.items()
+        if tmpl.target_pathway in disease_pathways
+    ]
+
+    if matching:
+        name, tmpl = matching[int(rng.integers(0, len(matching)))]
+    else:
+        name = rng.choice(list(PERTURBATION_TEMPLATES.keys()))
+        tmpl = PERTURBATION_TEMPLATES[name]
+
+    scaled: Dict[str, float] = {}
+    for gene, effect in tmpl.gene_effects.items():
+        scale = float(rng.uniform(0.7, 1.3))
+        scaled[gene] = round(effect * scale, 3)
+
+    return {name: scaled}
+
+
+def _build_technical(
+    rng: np.random.Generator,
+    params: dict,
+) -> TechnicalState:
+    n_batches = int(rng.integers(*params["n_batches"]))
+    batch_effects: Dict[str, float] = {}
+    for i in range(max(1, n_batches)):
+        strength = float(rng.uniform(*params["noise_batch_strength"]))
+        batch_effects[f"batch_{i}"] = round(strength, 3)
+
+    return TechnicalState(
+        batch_effects=batch_effects,
+        dropout_rate=round(float(rng.uniform(*params["noise_dropout"])), 3),
+        doublet_rate=round(float(rng.uniform(*params["noise_doublet"])), 3),
+        ambient_rna_fraction=round(float(rng.uniform(*params["noise_ambient"])), 3),
+        sample_quality=round(float(rng.uniform(*params["sample_quality"])), 3),
+    )
+
+
+def _build_task(
+    rng: np.random.Generator,
+    disease: DiseaseProfile,
+    tissue: str,
+    scenario_type: str,
+    params: dict,
+    perturbation_effects: Dict[str, Dict[str, float]],
+) -> TaskSpec:
+    budget = float(rng.integers(*params["budget_range"]))
+    time_days = float(rng.integers(*params["time_range"]))
+
+    if scenario_type == "de":
+        problem = (
+            f"Identify differentially expressed genes between "
+            f"{disease.display_name} and healthy {tissue} tissue "
+            f"using single-cell RNA sequencing."
+        )
+        criteria = [
+            f"Identify DE genes between {disease.condition_name} and healthy",
+            "Validate at least one candidate marker",
+        ]
+    elif scenario_type == "trajectory":
+        problem = (
+            f"Infer the developmental trajectory of cell populations "
+            f"in {tissue} tissue in the context of {disease.display_name}."
+        )
+        criteria = [
+            "Reconstruct branching lineage structure",
+            "Identify key transcription factors driving fate decisions",
+        ]
+    elif scenario_type == "perturbation":
+        pert_name = next(iter(perturbation_effects), "treatment")
+        pert_tmpl = PERTURBATION_TEMPLATES.get(pert_name)
+        pert_desc = pert_tmpl.description if pert_tmpl else pert_name
+        problem = (
+            f"Determine the effect of {pert_desc} on cell states "
+            f"in {tissue} tissue affected by {disease.display_name}."
+        )
+        criteria = [
+            "Quantify shift in cell activation states",
+            f"Identify pathways modulated by {pert_name}",
+            "Propose validation strategy",
+        ]
+    else:
+        top_marker = disease.markers[0] if disease.markers else "candidate"
+        problem = (
+            f"Validate candidate biomarker {top_marker} for "
+            f"{disease.display_name} in {tissue} tissue using "
+            f"single-cell RNA sequencing."
+        )
+        criteria = [
+            f"Validate {top_marker} as a disease marker",
+            "Confirm expression specificity across cell types",
+        ]
+
+    conditions = ["healthy", disease.condition_name]
+    if scenario_type == "perturbation" and perturbation_effects:
+        pert_name = next(iter(perturbation_effects))
+        conditions = [f"untreated_{disease.condition_name}", f"{pert_name}_treated"]
+
+    return TaskSpec(
+        problem_statement=problem,
+        modality="scRNA-seq",
+        organism="human",
+        tissue=tissue,
+        conditions=conditions,
+        budget_limit=budget,
+        time_limit_days=time_days,
+        success_criteria=criteria,
+    )
server/tasks/scenarios.py CHANGED
@@ -353,8 +353,8 @@ SCENARIO_LIBRARY: List[Scenario] = [
         budget_limit=90_000.0,
         time_limit_days=150.0,
         prior_observations=[
-            "SPP1 identified as top DE gene in prior pilot study",
-            "SPP1+ macrophages enriched in fibrotic regions",
+            "A macrophage subpopulation shows elevated expression in IPF tissue relative to controls",
+            "Pro-fibrotic macrophage enrichment has been observed in fibrotic regions by spatial profiling",
         ],
         success_criteria=[
             "Validate SPP1 as a marker for pro-fibrotic macrophages",
@@ -452,3 +452,5 @@ SCENARIO_LIBRARY: List[Scenario] = [
         ),
     ),
 ]
+
+
tests/test_environment.py CHANGED
@@ -56,6 +56,17 @@ class TestEnvironmentLifecycle:
         assert obs.latest_output is not None
         assert obs.latest_output.success is False
 
+    def test_premature_followup_design_is_flagged(self):
+        env = BioExperimentEnvironment()
+        env.reset()
+        obs = env.step(ExperimentAction(
+            action_type=ActionType.DESIGN_FOLLOWUP,
+            parameters={"assay": "qPCR"},
+        ))
+        assert obs.latest_output is not None
+        assert obs.latest_output.success is True
+        assert any("follow-up design" in msg.lower() for msg in obs.rule_violations)
+
     def test_conclusion_ends_episode(self):
         env = BioExperimentEnvironment()
         env.reset()
tests/test_rewards.py CHANGED
@@ -61,6 +61,29 @@ class TestStepReward:
         )
         assert rb.total < 0
 
+    def test_premature_meta_action_gets_penalized(self):
+        rc = RewardComputer()
+        prev, nxt = _states(
+            prev_flags={"data_normalized": True},
+            next_flags={"followup_designed": True},
+            budget_used=2_000,
+        )
+        output = IntermediateOutput(
+            output_type=OutputType.FOLLOWUP_DESIGN,
+            step_index=2,
+            quality_score=1.0,
+            uncertainty=0.0,
+        )
+        rb = rc.step_reward(
+            ExperimentAction(action_type=ActionType.DESIGN_FOLLOWUP),
+            prev,
+            nxt,
+            output,
+            [],
+            [],
+        )
+        assert rb.components.get("premature_meta_action_penalty", 0.0) < 0.0
+
 
 class TestTerminalReward:
     def test_correct_conclusion_rewarded(self):
@@ -103,3 +126,42 @@ class TestTerminalReward:
         ]
         rb = rc.terminal_reward(state, claims, [])
         assert rb.components.get("overconfidence_penalty", 0) < 0
+
+    def test_discovery_error_penalizes_wrong_markers_and_mechanisms(self):
+        rc = RewardComputer()
+        state = FullLatentState(
+            biology=LatentBiologicalState(
+                true_markers=["NPPA", "NPPB"],
+                causal_mechanisms=["TGF-beta-driven fibrosis"],
+            ),
+            progress=ExperimentProgress(
+                samples_collected=True,
+                cells_sequenced=True,
+                qc_performed=True,
+                data_filtered=True,
+                data_normalized=True,
+                de_performed=True,
+                markers_discovered=True,
+                conclusion_reached=True,
+            ),
+            resources=ResourceState(budget_total=100_000, budget_used=40_000),
+        )
+
+        aligned = rc.terminal_reward(
+            state,
+            [],
+            [],
+            discovered_markers=["NPPA", "NPPB"],
+            candidate_mechanisms=["TGF-beta-driven fibrosis"],
+        )
+        misaligned = rc.terminal_reward(
+            state,
+            [],
+            [],
+            discovered_markers=["WRONG1", "WRONG2"],
+            candidate_mechanisms=["unrelated inflammatory process"],
+        )
+
+        assert aligned.components["discovery_alignment"] > misaligned.components["discovery_alignment"]
+        assert aligned.components["discovery_error_penalty"] > misaligned.components["discovery_error_penalty"]
+        assert aligned.terminal > misaligned.terminal
tests/test_rules.py CHANGED
@@ -66,6 +66,36 @@ class TestRedundancy:
         assert not hard
         assert any("redundant" in m.lower() for m in soft)
 
+    def test_repeated_followup_design_is_soft(self):
+        engine = RuleEngine()
+        violations = engine.check(
+            ExperimentAction(action_type=ActionType.DESIGN_FOLLOWUP),
+            _state(followup_designed=True, de_performed=True),
+        )
+        hard = engine.hard_violations(violations)
+        soft = engine.soft_violations(violations)
+        assert not hard
+        assert any("redundant" in m.lower() for m in soft)
+
+
+class TestMetaActionTiming:
+    def test_followup_design_without_analysis_is_soft(self):
+        engine = RuleEngine()
+        violations = engine.check(
+            ExperimentAction(action_type=ActionType.DESIGN_FOLLOWUP),
+            _state(),
+        )
+        soft = engine.soft_violations(violations)
+        assert any("follow-up design" in m.lower() for m in soft)
+
+    def test_subagent_review_without_analysis_is_soft(self):
+        engine = RuleEngine()
+        violations = engine.check(
+            ExperimentAction(action_type=ActionType.REQUEST_SUBAGENT_REVIEW),
+            _state(),
+        )
+        soft = engine.soft_violations(violations)
+        assert any("subagent review" in m.lower() for m in soft)
 
 class TestResourceConstraints:
     def test_exhausted_budget_blocked(self):
tests/test_run_agent.py ADDED
@@ -0,0 +1,36 @@
+"""Tests for run_agent parser and fallback helpers."""
+
+from models import ActionType, ExperimentAction
+from run_agent import fallback_action, parse_action
+from server.hackathon_environment import BioExperimentEnvironment
+
+
+def test_parse_action_accepts_reasoning_variant():
+    action = parse_action(
+        '{"action_type":"run_qc","parameters":{},"Reasoning":"check quality","confidence":0.8}'
+    )
+    assert action is not None
+    assert action.action_type == ActionType.RUN_QC
+    assert action.justification == "check quality"
+
+
+def test_parse_action_accepts_justifyement_typo():
+    action = parse_action(
+        '{"action_type":"collect_sample","parameters":{},"justifyement":"typo key","confidence":0.7}'
+    )
+    assert action is not None
+    assert action.action_type == ActionType.COLLECT_SAMPLE
+    assert action.justification == "typo key"
+
+
+def test_fallback_uses_observation_progress_not_step_index():
+    env = BioExperimentEnvironment(scenario_name="cardiac_disease_de", domain_randomise=False)
+    obs = env.reset(seed=0)
+    for action_type in (
+        ActionType.COLLECT_SAMPLE,
+        ActionType.PREPARE_LIBRARY,
+        ActionType.SEQUENCE_CELLS,
+    ):
+        obs = env.step(ExperimentAction(action_type=action_type))
+    action = fallback_action(obs)
+    assert action.action_type == ActionType.RUN_QC
tests/test_training_script.py ADDED
@@ -0,0 +1,123 @@
+"""Tests for GRPO training helpers."""
+
+from pathlib import Path
+
+from models import ActionType
+from training_script import (
+    INVALID_ACTION_PENALTY,
+    OpenEnvReward,
+    available_numeric_log_keys,
+    build_prompt_examples,
+    completion_to_text,
+    parse_action_completion,
+    save_training_plots,
+    select_metric_key,
+    select_reward_key,
+)
+
+
+def test_completion_to_text_from_chat_messages():
+    completion = [
+        {"role": "assistant", "content": '{"action_type":"collect_sample"}'}
+    ]
+    assert completion_to_text(completion) == '{"action_type":"collect_sample"}'
+
+
+def test_parse_action_completion_roundtrip():
+    action = parse_action_completion(
+        '{"action_type":"run_qc","method":"scanpy.pp.calculate_qc_metrics",'
+        '"parameters":{"min_genes":200},"confidence":0.8}'
+    )
+    assert action is not None
+    assert action.action_type == ActionType.RUN_QC
+    assert action.method == "scanpy.pp.calculate_qc_metrics"
+    assert action.parameters["min_genes"] == 200
+    assert action.confidence == 0.8
+
+
+def test_parse_action_completion_accepts_reasoning_alias():
+    action = parse_action_completion(
+        '{"action_type":"run_qc","reasoning":"Measure quality before filtering."}'
+    )
+    assert action is not None
+    assert action.justification == "Measure quality before filtering."
+
+
+def test_build_prompt_examples_contains_reference_action():
+    examples = build_prompt_examples(
+        dataset_episodes=1,
+        rollout_steps=2,
+        collection_policy="heuristic",
+        scenario_names=["cardiac_disease_de"],
+        seed=0,
+        domain_randomise=False,
+    )
+    assert len(examples) == 2
+    assert examples[0]["scenario_name"] == "cardiac_disease_de"
+    assert '"action_type": "collect_sample"' in examples[0]["reference_action"]
+
+
+def test_openenv_reward_penalizes_invalid_completion():
+    reward_fn = OpenEnvReward(
+        reward_backend="local",
+        base_url="http://localhost:8000",
+    )
+    rewards = reward_fn(
+        completions=[[{"role": "assistant", "content": "not valid json"}]],
+        scenario_name=["cardiac_disease_de"],
+        history_actions=["[]"],
+    )
+    assert rewards == [INVALID_ACTION_PENALTY]
+
+
+def test_openenv_reward_scores_valid_completion_locally():
+    examples = build_prompt_examples(
+        dataset_episodes=1,
+        rollout_steps=1,
+        collection_policy="heuristic",
+        scenario_names=["cardiac_disease_de"],
+        seed=0,
+        domain_randomise=False,
+    )
+    reward_fn = OpenEnvReward(
+        reward_backend="local",
+        base_url="http://localhost:8000",
+    )
+    sample = examples[0]
+    rewards = reward_fn(
+        completions=[[{"role": "assistant", "content": sample["reference_action"]}]],
+        scenario_name=[sample["scenario_name"]],
+        history_actions=[sample["history_actions"]],
+    )
+    assert len(rewards) == 1
+    assert rewards[0] > 0.0
+
+
+def test_log_key_selection_prefers_reward_and_metric_keys():
+    log_history = [
+        {"step": 1, "loss": 1.2, "rewards/open_env_reward": 0.4, "objective/kl": 0.05},
+        {"step": 2, "loss": 1.0, "rewards/open_env_reward": 0.6, "objective/kl": 0.04},
+    ]
+    assert available_numeric_log_keys(log_history) == [
+        "loss",
+        "objective/kl",
+        "rewards/open_env_reward",
+    ]
+    reward_key = select_reward_key(log_history)
+    assert reward_key == "rewards/open_env_reward"
+    assert select_metric_key(log_history, reward_key=reward_key) == "objective/kl"
+
+
+def test_save_training_plots_writes_expected_files(tmp_path):
+    log_history = [
+        {"step": 1, "loss": 1.2, "reward": 0.4, "grad_norm": 0.8},
+        {"step": 2, "loss": 0.9, "reward": 0.7, "grad_norm": 0.5},
+    ]
+    plot_paths = save_training_plots(log_history, tmp_path, metric_key="grad_norm")
+
+    assert set(plot_paths) == {"loss", "reward", "metric", "dashboard"}
+    for plot_path in plot_paths.values():
+        assert Path(plot_path).exists()
+
+    manifest_path = tmp_path / "training_plot_manifest.json"
+    assert manifest_path.exists()
training/__init__.py CHANGED
@@ -1,9 +1,7 @@
 from .evaluation import EvaluationSuite
-from .gym_wrapper import BioExperimentGymEnv
 from .trajectory import Trajectory, TrajectoryDataset
 
 __all__ = [
-    "BioExperimentGymEnv",
     "EvaluationSuite",
     "PaperBenchmarkResult",
     "Trajectory",
training/evaluation.py CHANGED
@@ -118,7 +118,7 @@ class EvaluationSuite:
         for t in ds.trajectories:
             violations = sum(
                 1 for s in t.steps
-                if not s.observation.get("rule_violations") == []
+                if s.observation.get("rule_violations", []) != []
                 and s.observation.get("rule_violations") is not None
             )
             if violations == 0:
@@ -146,7 +146,8 @@ class EvaluationSuite:
             at = s.action.get("action_type")
             if at:
                 all_types.add(at)
-        return len(all_types)
+        from models import ActionType
+        return len(all_types) / max(len(ActionType), 1)
 
     @staticmethod
     def _mean_conclusion_confidence(ds: TrajectoryDataset) -> float:
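The second hunk normalises the distinct-action count by the size of the action vocabulary, so the diversity metric lands in [0, 1] instead of growing with the enum. A minimal illustration with a stand-in enum (the member names here are assumptions, not the full `models.ActionType`):

```python
from enum import Enum


class ActionType(Enum):
    # Stand-in for models.ActionType; the real enum has more members.
    COLLECT_SAMPLE = "collect_sample"
    RUN_QC = "run_qc"
    NORMALIZE_DATA = "normalize_data"
    SYNTHESIZE_CONCLUSION = "synthesize_conclusion"


# Distinct action types seen across an episode's steps.
seen = {"collect_sample", "run_qc"}

# Normalised diversity: 2 distinct types out of a 4-member vocabulary.
diversity = len(seen) / max(len(ActionType), 1)
# diversity == 0.5
```

The `max(..., 1)` guard only matters for an empty enum, but it keeps the division safe.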
training/literature_benchmark.py CHANGED
@@ -148,7 +148,6 @@ def run_paper_benchmark(
                 tool_call_spec=_tool_context(
                     obs.task,
                     libraries=["biopython"],
-                    include_expected_findings=True,
                 ),
             )
         )
@@ -353,19 +352,7 @@ def infer_conclusion_claims(obs: ExperimentObservation) -> List[ConclusionClaim]
             evidence_steps=_evidence_steps(obs, {OutputType.NETWORK_RESULT}),
         ))
 
-    if claims:
-        return claims
-
-    # Fallback: preserve the strongest expected findings verbatim if the
-    # heuristic extractors do not recover enough signal from the episode.
-    return [
-        ConclusionClaim(
-            claim=finding.finding,
-            confidence=0.65,
-            claim_type=finding.category,
-        )
-        for finding in obs.task.expected_findings[:3]
-    ]
+    return claims
 
 
 def compare_expected_findings(
@@ -444,11 +431,11 @@ def _default_comparison_name(task: TaskSpec) -> str:
 
 
 def _preferred_marker(task: TaskSpec) -> str:
-    for finding in task.expected_findings:
-        for keyword in finding.keywords:
-            if keyword.isupper():
-                return keyword
-    return "SPP1"
+    """Derive a candidate marker from the problem statement, not expected findings."""
+    tokens = [t for t in TOKEN_RE.findall(task.problem_statement) if t.isupper() and len(t) >= 3]
+    if tokens:
+        return tokens[0]
+    return "unknown"
 
 
 def _latest_output_data(
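The rewritten `_preferred_marker` relies on a `TOKEN_RE` defined elsewhere in the module. Assuming it is a plain alphanumeric word tokenizer, the gene-symbol heuristic can be reproduced standalone like this (the regex and function name here are assumptions for illustration):

```python
import re

# Assumed tokenizer: runs of letters and digits. The module's actual
# TOKEN_RE may differ.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def preferred_marker(problem_statement: str) -> str:
    """Pick the first all-caps token of length >= 3 as a candidate gene symbol."""
    tokens = [
        t for t in TOKEN_RE.findall(problem_statement)
        if t.isupper() and len(t) >= 3
    ]
    return tokens[0] if tokens else "unknown"


preferred_marker("Validate candidate biomarker SPP1 for IPF in lung tissue")
# -> "SPP1"
```

Note that `"SPP1".isupper()` is True because digits are ignored as long as every cased character is uppercase, which is why mixed letter-digit gene symbols pass the filter.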
training/rollout_collection.py ADDED
@@ -0,0 +1,219 @@
+"""Collect trajectories with direct OpenEnv environment access."""
+
+from __future__ import annotations
+
+import argparse
+import random
+import uuid
+from pathlib import Path
+from typing import Dict, List, Sequence
+
+from models import ActionType, ExperimentAction
+from server.hackathon_environment import BioExperimentEnvironment
+from training.evaluation import EvaluationSuite
+from training.trajectory import Trajectory, TrajectoryDataset
+
+
+HEURISTIC_SEQUENCE = [
+    ActionType.COLLECT_SAMPLE,
+    ActionType.PREPARE_LIBRARY,
+    ActionType.SEQUENCE_CELLS,
+    ActionType.RUN_QC,
+    ActionType.FILTER_DATA,
+    ActionType.NORMALIZE_DATA,
+    ActionType.CLUSTER_CELLS,
+    ActionType.TRAJECTORY_ANALYSIS,
+    ActionType.MARKER_SELECTION,
+    ActionType.SYNTHESIZE_CONCLUSION,
+]
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(
+        description="Run rollout episodes and persist trajectories."
+    )
+    parser.add_argument("--episodes", type=int, default=10, help="Number of episodes.")
+    parser.add_argument(
+        "--policy",
+        choices=["random", "heuristic"],
+        default="heuristic",
+        help="Policy to use for rollouts.",
+    )
+    parser.add_argument(
+        "--max-steps",
+        type=int,
+        default=None,
+        help="Optional hard cutoff per episode (defaults to env limit).",
+    )
+    parser.add_argument(
+        "--output-dir",
+        default="training/rollouts",
+        help="Directory for JSON trajectory outputs.",
+    )
+    parser.add_argument("--seed", type=int, default=None, help="RNG seed.")
+    return parser.parse_args()
+
+
+def heuristic_next_action(history: Sequence[ActionType], step_index: int) -> ActionType:
+    seen = set(history)
+    for action in HEURISTIC_SEQUENCE:
+        if action not in seen:
+            return action
+    if step_index >= 2 and ActionType.VALIDATE_MARKER not in seen:
+        return ActionType.VALIDATE_MARKER
+    if ActionType.SYNTHESIZE_CONCLUSION in seen:
+        return ActionType.SYNTHESIZE_CONCLUSION
+    return ActionType.SYNTHESIZE_CONCLUSION
+
+
+def pick_action(policy: str, step_index: int, history: Sequence[ActionType]) -> ActionType:
+    if policy == "random":
+        return random.choice(list(ActionType))
+    return heuristic_next_action(history, step_index)
+
+
+def default_comparison_name(conditions: Sequence[str]) -> str:
+    normalized = {condition.lower() for condition in conditions}
+    if {"healthy", "ipf"} <= normalized:
+        return "IPF_vs_healthy"
+    if any("treated" in condition for condition in normalized) and any(
+        "untreated" in condition for condition in normalized
+    ):
+        return "treated_vs_untreated"
+    if any("healthy" in condition for condition in normalized):
+        return "disease_vs_healthy"
+    return "disease_vs_healthy"
+
+
+def build_experiment_action(
+    action_type: ActionType,
+    discovered_markers: Sequence[str],
+    conditions: Sequence[str],
+) -> ExperimentAction:
+    method = None
+    parameters: Dict[str, object] = {}
+
+    if action_type == ActionType.COLLECT_SAMPLE:
+        parameters = {"n_samples": 6}
+    elif action_type == ActionType.PREPARE_LIBRARY:
+        method = "10x_chromium"
+    elif action_type == ActionType.RUN_QC:
+        method = "scanpy.pp.calculate_qc_metrics"
+    elif action_type == ActionType.FILTER_DATA:
+        method = "scanpy.pp.filter_cells"
+    elif action_type == ActionType.NORMALIZE_DATA:
+        method = "scanpy.pp.normalize_total"
+    elif action_type == ActionType.CLUSTER_CELLS:
+        method = "scanpy.tl.leiden"
+    elif action_type == ActionType.DIFFERENTIAL_EXPRESSION:
+        method = "scanpy.tl.rank_genes_groups"
+        parameters = {"comparison": default_comparison_name(conditions)}
+    elif action_type == ActionType.TRAJECTORY_ANALYSIS:
+        method = "scanpy.tl.dpt"
+    elif action_type == ActionType.MARKER_SELECTION:
+        method = "scanpy.tl.rank_genes_groups"
+    elif action_type == ActionType.VALIDATE_MARKER:
+        method = "qPCR"
+        parameters = {"marker": discovered_markers[0] if discovered_markers else "SPP1"}
+    elif action_type == ActionType.SYNTHESIZE_CONCLUSION:
+        parameters = {"claims": []}
+
+    return ExperimentAction(
+        action_type=action_type,
+        method=method,
+        parameters=parameters,
+        confidence=0.75,
+    )
+
+
+def run_episode(
+    env: BioExperimentEnvironment,
+    episode_id: str,
+    policy: str,
+    max_steps: int | None = None,
+) -> Trajectory:
+    structured_obs = env.reset()
+    traj = Trajectory(
+        episode_id=episode_id,
+        task=structured_obs.task.model_dump(),
+        metadata={
+            "task_problem": structured_obs.task.problem_statement,
+            "policy": policy,
+        },
+    )
+
+    done = structured_obs.done
+    step_num = 0
+    while not done:
+        if max_steps is not None and step_num >= max_steps:
+            break
+
+        history = [rec.action_type for rec in structured_obs.pipeline_history]
+        action_type = pick_action(policy, step_num, history)
+        experiment_action = build_experiment_action(
+            action_type=action_type,
+            discovered_markers=structured_obs.discovered_markers,
+            conditions=structured_obs.task.conditions,
+        )
+
+        structured_obs = env.step(experiment_action)
+        reward = structured_obs.reward
+        done = structured_obs.done
+        step_num += 1
+
+        traj.add_step(
+            action=experiment_action,
+            observation=structured_obs,
+            reward=reward,
+            done=done,
+            reward_breakdown=structured_obs.step_reward_breakdown,
+        )
+
+        print(
+            f"  step={structured_obs.step_index:02d} "
+            f"action={action_type.value:>28} "
+            f"reward={reward:+.3f}"
+        )
+
+    return traj
+
+
+def main() -> None:
+    args = parse_args()
+    if args.seed is not None:
+        random.seed(args.seed)
+
+    out_dir = Path(args.output_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    env = BioExperimentEnvironment()
+    trajectories: List[Trajectory] = []
+
+    print(
+        f"Starting rollout collection: episodes={args.episodes}, policy={args.policy}"
+    )
+    for ep in range(args.episodes):
+        print(f"Episode {ep + 1}/{args.episodes}")
+        traj = run_episode(
+            env=env,
+            episode_id=str(uuid.uuid4()),
+            policy=args.policy,
+            max_steps=args.max_steps,
+        )
+        traj.save(out_dir / f"{traj.episode_id}.json")
+        trajectories.append(traj)
+
+    dataset = TrajectoryDataset(trajectories)
+    stats = EvaluationSuite.online_metrics(trajectories)
+
+    print("\nRun complete.")
+    print(f"Saved trajectories to: {out_dir}")
+    print("Online metrics:")
+    for metric in stats:
+        print(f"  - {metric.name}: {metric.value:.4f}")
+
+    print(f"Summary: {dataset.summary()}")
+
+
+if __name__ == "__main__":
+    main()
training_script.py ADDED
@@ -0,0 +1,1250 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Train a planner with TRL GRPO and OpenEnv rewards."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import argparse
6
+ import json
7
+ import random
8
+ import re
9
+ from numbers import Real
10
+ from pathlib import Path
11
+ from typing import Any, Dict, List, Optional, Sequence, Tuple
12
+
13
+ from client import BioExperimentEnv
14
+ from models import (
15
+ ActionType,
16
+ ExperimentAction,
17
+ ExperimentObservation,
18
+ build_agent_observation_context,
19
+ build_agent_system_prompt,
20
+ )
21
+ from server.hackathon_environment import BioExperimentEnvironment
22
+ from server.tasks.scenarios import SCENARIO_LIBRARY
23
+
24
+ DEFAULT_MODEL_ID = "Qwen/Qwen3.5-0.8B"
25
+ DEFAULT_OUTPUT_DIR = "training/grpo-output"
26
+ DEFAULT_BASE_URL = "http://localhost:8000"
27
+ INVALID_ACTION_PENALTY = -2.0
28
+ ENVIRONMENT_ERROR_PENALTY = -4.0
29
+
30
+ SYSTEM_PROMPT = build_agent_system_prompt()
31
+
32
+ HEURISTIC_SEQUENCE = [
33
+ ActionType.COLLECT_SAMPLE,
34
+ ActionType.PREPARE_LIBRARY,
35
+ ActionType.SEQUENCE_CELLS,
36
+ ActionType.RUN_QC,
37
+ ActionType.FILTER_DATA,
38
+ ActionType.NORMALIZE_DATA,
39
+ ActionType.CLUSTER_CELLS,
40
+ ActionType.DIFFERENTIAL_EXPRESSION,
41
+ ActionType.PATHWAY_ENRICHMENT,
42
+ ActionType.MARKER_SELECTION,
43
+ ActionType.SYNTHESIZE_CONCLUSION,
44
+ ]
45
+
46
+ VALID_ACTION_TYPES = {action.value for action in ActionType}
47
+
48
+
49
+ def compact_preview(value: Any, max_chars: int = 160) -> str:
50
+ try:
51
+ text = json.dumps(value, ensure_ascii=True, sort_keys=True)
52
+ except TypeError:
53
+ text = str(value)
54
+ text = re.sub(r"\s+", " ", text).strip()
55
+ if len(text) <= max_chars:
56
+ return text
57
+ return text[: max_chars - 3] + "..."
58
+
59
+
60
+ def _edit_distance(a: str, b: str) -> int:
61
+ if len(a) < len(b):
62
+ return _edit_distance(b, a)
63
+ if not b:
64
+ return len(a)
65
+ prev = list(range(len(b) + 1))
66
+ for i, ca in enumerate(a):
67
+ curr = [i + 1]
68
+ for j, cb in enumerate(b):
69
+ curr.append(min(prev[j + 1] + 1, curr[j] + 1, prev[j] + (ca != cb)))
70
+ prev = curr
71
+ return prev[-1]
72
+
73
+
74
+ def get_payload_value(payload: Dict[str, Any], *names: str) -> Any:
75
+ for name in names:
76
+ if name in payload:
77
+ return payload[name]
78
+
79
+ lowered = {
80
+ str(key).lower(): value
81
+ for key, value in payload.items()
82
+ }
83
+ for name in names:
84
+ if name.lower() in lowered:
85
+ return lowered[name.lower()]
86
+
87
+ for key, value in lowered.items():
88
+ for name in names:
89
+ threshold = max(2, len(name) // 3)
90
+ if _edit_distance(key, name.lower()) <= threshold:
91
+ return value
92
+ return None
93
+
94
+
95
+ def build_argument_parser() -> argparse.ArgumentParser:
96
+ parser = argparse.ArgumentParser(
97
+ description="Train a GRPO policy against the OpenEnv bio experiment environment."
98
+ )
99
+ parser.add_argument("--model-id", default=DEFAULT_MODEL_ID)
100
+ parser.add_argument("--output-dir", default=DEFAULT_OUTPUT_DIR)
101
+ parser.add_argument("--dataset-episodes", type=int, default=8)
102
+ parser.add_argument("--rollout-steps", type=int, default=6)
103
+ parser.add_argument(
104
+ "--collection-policy",
105
+ choices=["random", "heuristic"],
106
+ default="heuristic",
107
+ help="Policy used to build prompt states for GRPO training.",
108
+ )
109
+ parser.add_argument(
110
+ "--reward-backend",
111
+ choices=["local", "remote"],
112
+ default="local",
113
+ help="Use local in-process scoring or a live OpenEnv server.",
114
+ )
115
+ parser.add_argument(
116
+ "--base-url",
117
+ default=DEFAULT_BASE_URL,
118
+ help="Base URL for the OpenEnv server when reward-backend=remote.",
119
+ )
120
+ parser.add_argument(
121
+ "--scenario-name",
122
+ action="append",
123
+ default=None,
124
+ help="Repeatable scenario selector. Defaults to all curated scenarios.",
125
+ )
126
+ parser.add_argument(
127
+ "--domain-randomise",
128
+ action="store_true",
129
+ help="Enable domain randomisation while building prompts and local rewards.",
130
+ )
131
+ parser.add_argument("--num-generations", type=int, default=2)
132
+ parser.add_argument("--max-completion-length", type=int, default=220)
133
+ parser.add_argument("--max-prompt-length", type=int, default=768)
134
+ parser.add_argument("--per-device-train-batch-size", type=int, default=2)
135
+ parser.add_argument("--gradient-accumulation-steps", type=int, default=1)
136
+ parser.add_argument("--learning-rate", type=float, default=5e-6)
137
+ parser.add_argument("--num-train-epochs", type=float, default=1.0)
138
+ parser.add_argument("--logging-steps", type=int, default=1)
139
+ parser.add_argument("--save-steps", type=int, default=50)
140
+ parser.add_argument(
141
+ "--plot-metric-key",
142
+ default=None,
143
+ help="Optional extra metric key from trainer log history to plot.",
144
+ )
145
+ parser.add_argument("--seed", type=int, default=0)
146
+ parser.add_argument(
147
+ "--load-model-only",
148
+ action="store_true",
149
+ help="Download and load the selected model and tokenizer, then exit.",
150
+ )
151
+ parser.add_argument(
152
+ "--trust-remote-code",
153
+ action="store_true",
154
+ help="Pass trust_remote_code=True to model/tokenizer loading.",
155
+ )
156
+ parser.add_argument(
157
+ "--dry-run",
158
+ action="store_true",
159
+ help="Build the prompt dataset and smoke-test the reward function without training.",
160
+ )
161
+ parser.add_argument(
162
+ "--push-to-hub",
163
+ type=str,
164
+ default=None,
165
+ help="HuggingFace Hub repo id to push the trained model to (e.g. 'myuser/my-model').",
166
+ )
167
+ return parser
168
+
169
+
170
+ def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
171
+ return build_argument_parser().parse_args(argv)
172
+
173
+
174
+ def make_training_args(**overrides: Any) -> argparse.Namespace:
175
+ """Build an argparse-style namespace for notebooks and scripts."""
176
+ parser = build_argument_parser()
177
+ defaults = vars(parser.parse_args([]))
178
+ unknown = sorted(set(overrides) - set(defaults))
179
+ if unknown:
180
+ raise ValueError(f"Unknown training args: {', '.join(unknown)}")
181
+ defaults.update(overrides)
182
+ return argparse.Namespace(**defaults)
183
+
184
+
185
def format_observation(obs: ExperimentObservation) -> str:
    parts = [
        f"TASK: {obs.task.problem_statement}",
        f"Organism: {obs.task.organism} | Tissue: {obs.task.tissue}",
        f"Conditions: {', '.join(obs.task.conditions) or 'N/A'}",
        (
            "Step: "
            f"{obs.step_index} | Budget: ${obs.resource_usage.budget_remaining:,.0f} "
            f"| Time: {obs.resource_usage.time_remaining_days:.0f}d"
        ),
    ]
    context = build_agent_observation_context(obs, max_tools=5, max_assays=2)
    if context:
        parts.append(context)
    if obs.pipeline_history:
        parts.append("History:")
        for step in obs.pipeline_history[-5:]:
            tag = "OK" if step.success else "FAIL"
            line = f" [{tag}] {step.action_type.value}: {step.output_summary[:100]}"
            if step.parameters:
                line += f" | params={compact_preview(step.parameters, 120)}"
            parts.append(line)
    if obs.latest_output and obs.latest_output.data:
        parts.append(
            f"Latest output data: {compact_preview(obs.latest_output.data, 200)}"
        )
    if obs.rule_violations:
        parts.append(f"Violations: {obs.rule_violations}")
    if obs.discovered_markers:
        parts.append(f"Markers: {obs.discovered_markers[:5]}")
    if obs.candidate_mechanisms:
        parts.append(f"Mechanisms: {obs.candidate_mechanisms[:5]}")
    return "\n".join(parts)


def build_training_prompt(obs: ExperimentObservation) -> str:
    return f"{SYSTEM_PROMPT}\n\n{format_observation(obs)}"


def heuristic_next_action(history: Sequence[ActionType], step_index: int) -> ActionType:
    seen = set(history)
    for action in HEURISTIC_SEQUENCE:
        if action not in seen:
            return action
    if step_index >= 2 and ActionType.VALIDATE_MARKER not in seen:
        return ActionType.VALIDATE_MARKER
    return ActionType.SYNTHESIZE_CONCLUSION


def pick_action(policy: str, step_index: int, history: Sequence[ActionType]) -> ActionType:
    if policy == "random":
        return random.choice(list(ActionType))
    return heuristic_next_action(history, step_index)


def default_comparison_name(conditions: Sequence[str]) -> str:
    normalized = {condition.lower() for condition in conditions}
    if {"healthy", "ipf"} <= normalized:
        return "IPF_vs_healthy"
    if any("treated" in condition for condition in normalized) and any(
        "untreated" in condition for condition in normalized
    ):
        return "treated_vs_untreated"
    return "disease_vs_healthy"


def build_experiment_action(
    action_type: ActionType,
    discovered_markers: Sequence[str],
    conditions: Sequence[str],
) -> ExperimentAction:
    method = None
    parameters: Dict[str, object] = {}
    justification = f"Advance the experiment with {action_type.value}."

    if action_type == ActionType.COLLECT_SAMPLE:
        parameters = {"n_samples": 6}
        justification = "Collect enough samples to start the experiment."
    elif action_type == ActionType.PREPARE_LIBRARY:
        method = "10x_chromium"
        justification = "Prepare a single-cell library for sequencing."
    elif action_type == ActionType.SEQUENCE_CELLS:
        method = "NovaSeq"
        justification = "Generate reads for downstream single-cell analysis."
    elif action_type == ActionType.RUN_QC:
        method = "scanpy.pp.calculate_qc_metrics"
        justification = "Measure technical quality before filtering."
    elif action_type == ActionType.FILTER_DATA:
        method = "scanpy.pp.filter_cells"
        justification = "Remove low-quality cells and technical artifacts."
    elif action_type == ActionType.NORMALIZE_DATA:
        method = "scanpy.pp.normalize_total"
        justification = "Normalize counts for comparable expression profiles."
    elif action_type == ActionType.CLUSTER_CELLS:
        method = "scanpy.tl.leiden"
        justification = "Resolve cell states before interpretation."
    elif action_type == ActionType.DIFFERENTIAL_EXPRESSION:
        method = "scanpy.tl.rank_genes_groups"
        parameters = {"comparison": default_comparison_name(conditions)}
        justification = "Identify genes associated with the phenotype of interest."
    elif action_type == ActionType.TRAJECTORY_ANALYSIS:
        method = "scanpy.tl.dpt"
        justification = "Recover pseudotime and lineage structure."
    elif action_type == ActionType.PATHWAY_ENRICHMENT:
        method = "gseapy.prerank"
        justification = "Translate gene-level changes into pathway programs."
    elif action_type == ActionType.MARKER_SELECTION:
        method = "scanpy.tl.rank_genes_groups"
        justification = "Nominate marker genes for validation."
    elif action_type == ActionType.VALIDATE_MARKER:
        method = "qPCR"
        parameters = {"marker": discovered_markers[0] if discovered_markers else "SPP1"}
        justification = "Validate the strongest discovered marker."
    elif action_type == ActionType.SYNTHESIZE_CONCLUSION:
        top = list(discovered_markers[:5]) if discovered_markers else []
        parameters = {
            "claims": [{
                "top_markers": top,
                "causal_mechanisms": [],
                "predicted_pathways": {},
                "confidence": 0.6,
                "claim_type": "correlational",
                "claim": "",
            }],
        }
        justification = "Summarize the current evidence into a conclusion."

    return ExperimentAction(
        action_type=action_type,
        method=method,
        parameters=parameters,
        justification=justification,
        confidence=0.75,
    )


def selected_scenarios(requested: Optional[Sequence[str]]) -> List[str]:
    from server.tasks.procedural_generator import generate_procedural_scenarios

    all_scenarios = list(SCENARIO_LIBRARY) + generate_procedural_scenarios(n=20, seed=42)
    available = [scenario.name for scenario in all_scenarios]
    if not requested:
        return available
    unknown = sorted(set(requested) - set(available))
    if unknown:
        raise ValueError(f"Unknown scenarios requested: {', '.join(unknown)}")
    return list(requested)


def action_completion_json(action: ExperimentAction) -> str:
    payload = {
        "action_type": action.action_type.value,
        "method": action.method,
        "parameters": action.parameters,
        "justification": action.justification,
        "confidence": action.confidence,
    }
    return json.dumps(payload)


def build_prompt_examples(
    *,
    dataset_episodes: int,
    rollout_steps: int,
    collection_policy: str,
    scenario_names: Sequence[str],
    seed: int,
    domain_randomise: bool,
) -> List[Dict[str, str]]:
    rng = random.Random(seed)
    examples: List[Dict[str, str]] = []
    scenario_cycle = list(scenario_names)
    rng.shuffle(scenario_cycle)

    for episode_idx in range(dataset_episodes):
        scenario_name = scenario_cycle[episode_idx % len(scenario_cycle)]
        env = BioExperimentEnvironment(
            scenario_name=scenario_name,
            domain_randomise=domain_randomise,
        )
        obs = env.reset()
        history_actions: List[ExperimentAction] = []

        for step_idx in range(rollout_steps):
            if obs.done:
                break

            next_action = build_experiment_action(
                action_type=pick_action(
                    collection_policy,
                    step_idx,
                    [action.action_type for action in history_actions],
                ),
                discovered_markers=obs.discovered_markers,
                conditions=obs.task.conditions,
            )
            examples.append({
                "prompt": build_training_prompt(obs),
                "scenario_name": scenario_name,
                "history_actions": json.dumps(
                    [action.model_dump() for action in history_actions]
                ),
                "rng_seed": str(env._latent.rng_seed),
                "reference_action": action_completion_json(next_action),
                "problem_statement": obs.task.problem_statement,
            })

            history_actions.append(next_action)
            obs = env.step(next_action)

    return examples


def completion_to_text(completion: Any) -> str:
    if isinstance(completion, str):
        return completion.strip()
    if isinstance(completion, dict):
        return content_to_text(completion.get("content", ""))
    if isinstance(completion, list):
        for item in reversed(completion):
            if isinstance(item, dict) and "content" in item:
                text = content_to_text(item["content"])
                if text:
                    return text
            if isinstance(item, str) and item.strip():
                return item.strip()
    return str(completion).strip()


def content_to_text(content: Any) -> str:
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        parts: List[str] = []
        for part in content:
            if isinstance(part, str):
                parts.append(part)
            elif isinstance(part, dict):
                if isinstance(part.get("text"), str):
                    parts.append(part["text"])
                elif isinstance(part.get("content"), str):
                    parts.append(part["content"])
        return "".join(parts).strip()
    return str(content).strip()


def _repair_truncated_json(text: str) -> Optional[str]:
    """Try to repair JSON truncated mid-value (common with small LLMs)."""
    s = text.strip()
    if not s.startswith("{"):
        return None

    s = re.sub(r',\s*"[^"\n]*$', '', s)
    s = re.sub(r',\s*"[^"\n]*"\s*:\s*$', '', s)

    in_string = False
    escape = False
    for ch in s:
        if escape:
            escape = False
            continue
        if ch == "\\":
            escape = True
            continue
        if ch == '"':
            in_string = not in_string

    if in_string:
        s += '"'

    open_braces = s.count("{") - s.count("}")
    open_brackets = s.count("[") - s.count("]")
    s += "]" * max(0, open_brackets)
    s += "}" * max(0, open_braces)

    try:
        obj = json.loads(s)
        if isinstance(obj, dict):
            return s
    except json.JSONDecodeError:
        pass

    s = re.sub(r',\s*([}\]])', r'\1', s)
    try:
        obj = json.loads(s)
        if isinstance(obj, dict):
            return s
    except json.JSONDecodeError:
        pass
    return None


def _normalize_jsonish_text(text: str) -> str:
    """Normalize common near-JSON artifacts emitted by small local models."""
    text = _strip_js_comments(text)
    text = re.sub(r'(?<=:\s)\bNone\b', 'null', text)
    text = re.sub(r'(?<=:\s)\bTrue\b', 'true', text)
    text = re.sub(r'(?<=:\s)\bFalse\b', 'false', text)
    text = re.sub(r'"([^"\n]+?):"\s*,', r'"\1": "",', text)
    return text


def _strip_js_comments(text: str) -> str:
    """Remove // and /* */ comments that small LLMs inject into JSON."""
    text = re.sub(r'//[^\n]*', '', text)
    text = re.sub(r'/\*.*?\*/', '', text, flags=re.DOTALL)
    return text


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    stripped = _normalize_jsonish_text(text).strip()
    fence_prefix = "```"
    if stripped.startswith(fence_prefix) and stripped.endswith(fence_prefix):
        lines = stripped.splitlines()
        if len(lines) >= 3:
            stripped = "\n".join(lines[1:-1]).strip()

    candidates: List[str] = [stripped]
    start = stripped.find("{")
    while start != -1:
        depth = 0
        for idx in range(start, len(stripped)):
            char = stripped[idx]
            if char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    candidates.append(stripped[start:idx + 1])
                    break
        start = stripped.find("{", start + 1)

    first_brace = stripped.find("{")
    if first_brace != -1:
        repaired = _repair_truncated_json(stripped[first_brace:])
        if repaired is not None:
            candidates.append(repaired)

    candidates.sort(key=len, reverse=True)

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed

    return None


def normalize_optional_string(value: Any) -> Optional[str]:
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, str):
        value = value.strip()
        return value or None
    if isinstance(value, (int, float)):
        return str(value)
    return compact_preview(value, 80)


def parse_action_completion(text: str) -> Optional[ExperimentAction]:
    payload = extract_json_object(text)
    if payload is not None:
        action_type = get_payload_value(payload, "action_type")
        if action_type not in VALID_ACTION_TYPES:
            return None

        parameters = get_payload_value(payload, "parameters", "params") or {}
        if not isinstance(parameters, dict):
            parameters = {}

        confidence = get_payload_value(payload, "confidence")
        if confidence is None:
            confidence = 0.5
        try:
            confidence = float(confidence)
        except (TypeError, ValueError):
            confidence = 0.5

        justification = get_payload_value(
            payload, "justification", "reasoning", "rationale", "reason"
        )
        if justification is not None and not isinstance(justification, str):
            justification = compact_preview(justification, 200)

        return ExperimentAction(
            action_type=ActionType(action_type),
            method=normalize_optional_string(get_payload_value(payload, "method")),
            parameters=parameters,
            justification=justification,
            confidence=min(1.0, max(0.0, confidence)),
        )

    action_match = re.search(
        r'["\']action_type["\']\s*:\s*["\']([^"\']+)',
        text,
        re.IGNORECASE,
    )
    if not action_match:
        return None

    action_type = action_match.group(1).strip()
    if action_type not in VALID_ACTION_TYPES:
        return None

    method_match = re.search(
        r'["\']method["\']\s*:\s*("((?:[^"\\]|\\.)*)"|null|none|true|false|-?\d+(?:\.\d+)?)',
        text,
        re.IGNORECASE,
    )
    confidence_match = re.search(
        r'["\']confidence["\']\s*:\s*([0-9]*\.?[0-9]+)',
        text,
        re.IGNORECASE,
    )
    justification_match = re.search(
        r'["\'](?:justif\w*|reasoning|rationale|reason)["\']\s*:\s*"((?:[^"\\]|\\.)*)',
        text,
        re.DOTALL | re.IGNORECASE,
    )

    confidence = 0.5
    if confidence_match:
        try:
            confidence = float(confidence_match.group(1))
        except ValueError:
            confidence = 0.5

    justification = None
    if justification_match:
        try:
            justification = json.loads(f'"{justification_match.group(1)}"')
        except json.JSONDecodeError:
            justification = justification_match.group(1)

    method = None
    if method_match:
        raw_method = method_match.group(1)
        if raw_method.startswith('"') and raw_method.endswith('"'):
            try:
                method = json.loads(raw_method)
            except json.JSONDecodeError:
                method = raw_method.strip('"')
        elif raw_method.lower() not in {"null", "none", "true", "false"}:
            method = raw_method

    return ExperimentAction(
        action_type=ActionType(action_type),
        method=normalize_optional_string(method),
        parameters={},
        justification=justification,
        confidence=min(1.0, max(0.0, confidence)),
    )


def decode_history_actions(history_actions: Optional[str]) -> List[ExperimentAction]:
    if not history_actions:
        return []
    raw_actions = json.loads(history_actions)
    return [
        ExperimentAction(**action_payload)
        for action_payload in raw_actions
        if isinstance(action_payload, dict)
    ]


def normalise_column(values: Any, length: int) -> List[Any]:
    if values is None:
        return [None] * length
    if isinstance(values, list):
        if len(values) == length:
            return values
        if len(values) == 1:
            return values * length
        return values[:length] + [None] * max(0, length - len(values))
    return [values] * length


class OpenEnvReward:
    """Reward function compatible with TRL GRPOTrainer."""

    def __init__(
        self,
        *,
        reward_backend: str,
        base_url: str,
        invalid_action_penalty: float = INVALID_ACTION_PENALTY,
        environment_error_penalty: float = ENVIRONMENT_ERROR_PENALTY,
        domain_randomise: bool = False,
    ) -> None:
        self.__name__ = "openenv_reward"
        self.reward_backend = reward_backend
        self.base_url = base_url
        self.invalid_action_penalty = invalid_action_penalty
        self.environment_error_penalty = environment_error_penalty
        self.domain_randomise = domain_randomise

    def __call__(
        self,
        completions: List[Any],
        scenario_name: Optional[List[str]] = None,
        history_actions: Optional[List[str]] = None,
        rng_seed: Optional[List[str]] = None,
        **_: Any,
    ) -> List[float]:
        scenario_names = normalise_column(scenario_name, len(completions))
        history_columns = normalise_column(history_actions, len(completions))
        seed_columns = normalise_column(rng_seed, len(completions))
        rewards: List[float] = []

        for completion, current_scenario, current_history, current_seed in zip(
            completions,
            scenario_names,
            history_columns,
            seed_columns,
        ):
            action = parse_action_completion(completion_to_text(completion))
            if action is None:
                rewards.append(self.invalid_action_penalty)
                continue

            try:
                if self.reward_backend == "remote":
                    reward = self._score_remote(action, current_scenario, current_history)
                else:
                    reward = self._score_local(action, current_scenario, current_history, current_seed)
            except Exception:
                reward = self.environment_error_penalty

            rewards.append(float(reward))

        return rewards

    def _score_local(
        self,
        action: ExperimentAction,
        scenario_name: Optional[str],
        history_actions: Optional[str],
        rng_seed: Optional[str] = None,
    ) -> float:
        env = BioExperimentEnvironment(
            scenario_name=scenario_name,
            domain_randomise=self.domain_randomise,
        )
        seed = int(rng_seed) if rng_seed else None
        obs = env.reset(seed=seed)
        for previous_action in decode_history_actions(history_actions):
            obs = env.step(previous_action)
            if obs.done:
                return float(obs.reward)
        obs = env.step(action)
        return float(obs.reward)

    def _score_remote(
        self,
        action: ExperimentAction,
        scenario_name: Optional[str],
        history_actions: Optional[str],
    ) -> float:
        with BioExperimentEnv(base_url=self.base_url) as env:
            # NOTE: scenario_name is accepted for API parity with _score_local
            # but the OpenEnv HTTP protocol does not yet support passing it
            # through reset(). The server will use its configured default.
            result = env.reset()
            for previous_action in decode_history_actions(history_actions):
                result = env.step(previous_action)
                if result.done:
                    return float(result.reward or 0.0)
            result = env.step(action)
            if result.reward is not None:
                return float(result.reward)
            return float(result.observation.reward)


def is_numeric_log_value(value: Any) -> bool:
    return isinstance(value, Real) and not isinstance(value, bool)


def available_numeric_log_keys(log_history: Sequence[Dict[str, Any]]) -> List[str]:
    keys = {
        key
        for entry in log_history
        if isinstance(entry, dict)
        for key, value in entry.items()
        if key != "step" and is_numeric_log_value(value)
    }
    return sorted(keys)


def extract_log_series(
    log_history: Sequence[Dict[str, Any]],
    key: Optional[str],
) -> List[Tuple[float, float]]:
    if not key:
        return []

    series: List[Tuple[float, float]] = []
    synthetic_step = 0
    for entry in log_history:
        if not isinstance(entry, dict) or key not in entry:
            continue
        value = entry.get(key)
        if not is_numeric_log_value(value):
            continue

        raw_step = entry.get("step")
        if is_numeric_log_value(raw_step):
            step = float(raw_step)
        else:
            synthetic_step += 1
            step = float(synthetic_step)

        series.append((step, float(value)))

    return series


def select_reward_key(log_history: Sequence[Dict[str, Any]]) -> Optional[str]:
    numeric_keys = available_numeric_log_keys(log_history)
    reward_keys = [key for key in numeric_keys if "reward" in key.lower()]
    if not reward_keys:
        return None

    preferred = [
        "reward",
        "mean_reward",
        "reward_mean",
        "rewards/openenv_reward",  # matches OpenEnvReward.__name__
    ]
    lowered = {key.lower(): key for key in reward_keys}
    for key in preferred:
        if key in lowered:
            return lowered[key]

    reward_keys.sort(key=lambda key: ("/" in key, len(key), key))
    return reward_keys[0]


def select_metric_key(
    log_history: Sequence[Dict[str, Any]],
    *,
    reward_key: Optional[str],
    requested_key: Optional[str] = None,
) -> Optional[str]:
    numeric_keys = available_numeric_log_keys(log_history)
    if requested_key:
        if requested_key not in numeric_keys:
            available = ", ".join(numeric_keys) or "none"
            raise ValueError(
                f"Requested plot metric '{requested_key}' was not logged. "
                f"Available numeric keys: {available}"
            )
        return requested_key

    excluded = {
        "epoch",
        "loss",
        "learning_rate",
        "step",
        "total_flos",
        "train_loss",
        "train_runtime",
        "train_samples_per_second",
        "train_steps_per_second",
    }
    if reward_key:
        excluded.add(reward_key)

    preferred = [
        "kl",
        "objective/kl",
        "completion_length",
        "mean_completion_length",
        "grad_norm",
        "entropy",
        "accuracy",
        "learning_rate",
        "epoch",
    ]
    numeric_set = set(numeric_keys)
    for key in preferred:
        if key in numeric_set and key not in excluded:
            return key

    candidates = [
        key for key in numeric_keys
        if key not in excluded and "reward" not in key.lower()
    ]
    if candidates:
        return candidates[0]

    for fallback in ("learning_rate", "epoch"):
        if fallback in numeric_set:
            return fallback

    return None


def save_plot(
    path: Path,
    *,
    series: Sequence[Tuple[float, float]],
    title: str,
    ylabel: str,
) -> None:
    import matplotlib

    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 4.5))
    if series:
        x_values, y_values = zip(*series)
        ax.plot(x_values, y_values, marker="o", linewidth=1.8)
    else:
        ax.text(
            0.5,
            0.5,
            "No logged data available",
            ha="center",
            va="center",
            transform=ax.transAxes,
        )
    ax.set_title(title)
    ax.set_xlabel("Step")
    ax.set_ylabel(ylabel)
    ax.grid(True, alpha=0.3)
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)


def save_training_plots(
    log_history: Sequence[Dict[str, Any]],
    output_dir: str | Path,
    metric_key: Optional[str] = None,
) -> Dict[str, str]:
    import matplotlib

    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    reward_key = select_reward_key(log_history)
    selected_metric_key = select_metric_key(
        log_history,
        reward_key=reward_key,
        requested_key=metric_key,
    )

    loss_series = extract_log_series(log_history, "loss")
    reward_series = extract_log_series(log_history, reward_key)
    metric_series = extract_log_series(log_history, selected_metric_key)

    loss_path = output_path / "training_loss.png"
    reward_path = output_path / "training_reward.png"
    metric_path = output_path / "training_metric.png"
    dashboard_path = output_path / "training_dashboard.png"
    manifest_path = output_path / "training_plot_manifest.json"

    save_plot(loss_path, series=loss_series, title="Training Loss", ylabel="Loss")
    save_plot(
        reward_path,
        series=reward_series,
        title=f"Training Reward ({reward_key or 'not logged'})",
        ylabel="Reward",
    )
    save_plot(
        metric_path,
        series=metric_series,
        title=f"Training Metric ({selected_metric_key or 'not logged'})",
        ylabel=selected_metric_key or "Metric",
    )

    fig, axes = plt.subplots(3, 1, figsize=(10, 12))
    plot_specs = [
        (axes[0], loss_series, "Training Loss", "Loss"),
        (axes[1], reward_series, f"Training Reward ({reward_key or 'not logged'})", "Reward"),
        (
            axes[2],
            metric_series,
            f"Training Metric ({selected_metric_key or 'not logged'})",
            selected_metric_key or "Metric",
        ),
    ]
    for axis, series, title, ylabel in plot_specs:
        if series:
            x_values, y_values = zip(*series)
            axis.plot(x_values, y_values, marker="o", linewidth=1.8)
        else:
            axis.text(
                0.5,
                0.5,
                "No logged data available",
                ha="center",
                va="center",
                transform=axis.transAxes,
            )
        axis.set_title(title)
        axis.set_xlabel("Step")
        axis.set_ylabel(ylabel)
        axis.grid(True, alpha=0.3)
    fig.tight_layout()
    fig.savefig(dashboard_path, dpi=150)
    plt.close(fig)

    manifest = {
        "available_numeric_keys": available_numeric_log_keys(log_history),
        "reward_key": reward_key,
        "metric_key": selected_metric_key,
        "plots": {
            "loss": str(loss_path),
            "reward": str(reward_path),
            "metric": str(metric_path),
            "dashboard": str(dashboard_path),
        },
    }
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest["plots"]


def run_dry_run_preview(
    examples: Sequence[Dict[str, str]],
    reward_fn: OpenEnvReward,
    output_dir: str,
) -> None:
    if not examples:
        raise ValueError("No training prompts were generated for the dry run.")

    sample = examples[0]
    sample_reward = reward_fn(
        completions=[[{"role": "assistant", "content": sample["reference_action"]}]],
        scenario_name=[sample["scenario_name"]],
        history_actions=[sample["history_actions"]],
    )[0]

    print(f"Built {len(examples)} prompt states.")
    print(f"Output directory: {Path(output_dir)}")
    print(f"Sample scenario: {sample['scenario_name']}")
    print(f"Sample reward for reference action: {sample_reward:+.3f}")
    print("\nSample prompt:\n")
    print(sample["prompt"])


def resolve_torch_runtime() -> Dict[str, Any]:
    import torch

    use_cuda = torch.cuda.is_available()
    bf16 = bool(getattr(torch.cuda, "is_bf16_supported", lambda: False)()) if use_cuda else False
    dtype = torch.bfloat16 if bf16 else (
        torch.float16 if use_cuda else torch.float32
    )
    return {
        "use_cuda": use_cuda,
        "device": "cuda:0" if use_cuda else "cpu",
        "dtype": dtype,
        "bf16": bf16,
        "fp16": use_cuda and not bf16,
        "device_name": torch.cuda.get_device_name(0) if use_cuda else "cpu",
    }


def load_model_artifacts(
    model_id: str,
    *,
    trust_remote_code: bool,
):
    from transformers import AutoModelForCausalLM, AutoTokenizer

    runtime = resolve_torch_runtime()
    print(f"Loading tokenizer for {model_id} ...")
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        trust_remote_code=trust_remote_code,
    )
    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
        tokenizer.pad_token = tokenizer.eos_token

    print(f"Loading model for {model_id} ...")
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=trust_remote_code,
        torch_dtype=runtime["dtype"],
    )
    if runtime["use_cuda"]:
        model = model.to(runtime["device"])
    else:
        model = model.to("cpu")
    return tokenizer, model


def generate_action_with_model(
    model: Any,
    tokenizer: Any,
    prompt_or_observation: str | ExperimentObservation,
    *,
    max_new_tokens: int = 220,
    temperature: float = 0.2,
    top_p: float = 0.9,
    do_sample: bool = True,
) -> Dict[str, Any]:
    import torch

    if isinstance(prompt_or_observation, ExperimentObservation):
        prompt = build_training_prompt(prompt_or_observation)
    else:
        prompt = str(prompt_or_observation)

    model_device = getattr(model, "device", None)
    if model_device is None:
        model_device = resolve_torch_runtime()["device"]

    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {key: value.to(model_device) for key, value in inputs.items()}
    prompt_tokens = inputs["input_ids"].shape[1]

    generation_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": do_sample,
        "temperature": temperature,
        "top_p": top_p,
        "pad_token_id": tokenizer.pad_token_id,
    }
    with torch.no_grad():
        output_ids = model.generate(**inputs, **generation_kwargs)

    new_tokens = output_ids[0][prompt_tokens:]
    response_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    action = parse_action_completion(response_text)
    return {
        "prompt": prompt,
        "response_text": response_text,
        "action": action,
    }


def run_training(args: argparse.Namespace) -> Dict[str, Any]:
    random.seed(args.seed)
    runtime = resolve_torch_runtime()

    if args.load_model_only:
        tokenizer, model = load_model_artifacts(
            args.model_id,
            trust_remote_code=args.trust_remote_code,
        )
        device = getattr(model, "device", "unknown")
        print(f"Model ready: {args.model_id}")
        print(f"Tokenizer vocab size: {len(tokenizer)}")
        print(f"Model device: {device}")
        print(f"Runtime device name: {runtime['device_name']}")
        return {
            "args": args,
            "runtime": runtime,
            "tokenizer": tokenizer,
            "model": model,
        }

    scenario_names = selected_scenarios(args.scenario_name)
    examples = build_prompt_examples(
        dataset_episodes=args.dataset_episodes,
        rollout_steps=args.rollout_steps,
        collection_policy=args.collection_policy,
        scenario_names=scenario_names,
        seed=args.seed,
        domain_randomise=args.domain_randomise,
    )
    reward_fn = OpenEnvReward(
        reward_backend=args.reward_backend,
        base_url=args.base_url,
        domain_randomise=args.domain_randomise,
    )

    if args.dry_run:
        run_dry_run_preview(examples, reward_fn, args.output_dir)
        return {
            "args": args,
            "runtime": runtime,
            "scenario_names": scenario_names,
            "examples": examples,
            "reward_fn": reward_fn,
        }

    from datasets import Dataset
    from trl import GRPOConfig, GRPOTrainer

    train_dataset = Dataset.from_list(examples)
    tokenizer, model = load_model_artifacts(
        args.model_id,
        trust_remote_code=args.trust_remote_code,
    )
    config = GRPOConfig(
        output_dir=args.output_dir,
        learning_rate=args.learning_rate,
        per_device_train_batch_size=args.per_device_train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        num_generations=args.num_generations,
        max_completion_length=args.max_completion_length,
        num_train_epochs=args.num_train_epochs,
        logging_steps=args.logging_steps,
        save_steps=args.save_steps,
        bf16=runtime["bf16"],
        fp16=runtime["fp16"],
        report_to="none",
        remove_unused_columns=False,
    )

    print(
        f"Training runtime: device={runtime['device']} "
        f"name={runtime['device_name']} "
        f"dtype={runtime['dtype']}"
    )

    trainer = GRPOTrainer(
        model=model,
        reward_funcs=reward_fn,
        args=config,
        train_dataset=train_dataset,
        processing_class=tokenizer,
    )
    trainer.train()
    trainer.save_model(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)
    if args.push_to_hub:
        from huggingface_hub import HfApi
        api = HfApi()
        api.create_repo(repo_id=args.push_to_hub, repo_type="model", exist_ok=True)
        print(f"Pushing model to HuggingFace Hub: {args.push_to_hub}")
        api.upload_folder(
            folder_path=args.output_dir,
            repo_id=args.push_to_hub,
            repo_type="model",
            create_pr=False,
        )
        print(f"Model pushed to https://huggingface.co/{args.push_to_hub}")
    plot_paths = save_training_plots(
        trainer.state.log_history,
        args.output_dir,
        metric_key=args.plot_metric_key,
    )
    print("Saved training plots:")
    for plot_name, plot_path in plot_paths.items():
        print(f" - {plot_name}: {plot_path}")

    return {
        "args": args,
        "runtime": runtime,
        "scenario_names": scenario_names,
        "examples": examples,
        "reward_fn": reward_fn,
        "train_dataset": train_dataset,
        "tokenizer": tokenizer,
        "model": model,
        "trainer": trainer,
        "plot_paths": plot_paths,
    }


def main() -> None:
    run_training(parse_args())


if __name__ == "__main__":
    main()
uv.lock CHANGED
The diff for this file is too large to render.