---
title: Bio Experiment Environment Server
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - bioinformatics
---

# Bio Experiment Environment

This repository implements an OpenEnv-compatible reinforcement learning environment for planning biological experiment pipelines. The agent never directly observes the true biological state. Instead, it proposes one structured experiment or analysis step at a time, receives a noisy simulated output, and is rewarded for producing valid, informative, efficient, and well-calibrated plans.

The environment is designed as a partially observable Markov decision process (POMDP) with:

  • hidden ground-truth biology
  • hidden technical noise and failure conditions
  • visible task metadata, resource usage, step history, and intermediate outputs
  • dense step-wise reward plus terminal reward for conclusion quality

## How it works

At a high level, each episode looks like this:

  1. reset() picks a biological scenario and seeds the simulator.
  2. The agent receives an ExperimentObservation describing the task and current visible state.
  3. The agent submits an ExperimentAction such as collect_sample, run_qc, or differential_expression.
  4. The rule engine checks whether the action is valid at this point in the pipeline.
  5. The transition engine updates hidden state, spends resources, and asks the output generator to simulate the result.
  6. The reward computer scores the step for validity, ordering, information gain, efficiency, novelty, and penalties.
  7. The environment returns a new observation with updated history, outputs, discoveries, violations, and reward.
  8. The episode ends when the agent synthesizes a conclusion, exhausts resources, or reaches the step limit.

## The core mental model

### Hidden state

The simulator keeps a FullLatentState that the agent never directly sees. It contains:

  • true cell populations and marker genes
  • true DE genes, pathways, trajectories, and regulatory networks
  • technical factors such as dropout, doublets, ambient RNA, and batch effects
  • experiment progress flags
  • remaining budget and time
  • hidden failure conditions

### Visible state

The agent only sees ExperimentObservation, which includes:

  • the current TaskSpec
  • pipeline history
  • available assays and tools
  • resource usage
  • the latest and cumulative intermediate outputs
  • discovered markers and candidate mechanisms
  • rule violations
  • per-step reward breakdown

This separation is what makes the environment a POMDP rather than a fully observed simulator.

## Main building blocks

### models.py

Defines the contracts that all other modules use:

  • ActionType: 21 discrete experiment steps, grouped into three frozensets: WET_LAB_ACTIONS (8), COMPUTATIONAL_ACTIONS (10), and META_ACTIONS (3)
  • SubagentType: 9 sub-agent delegate roles (e.g. wet_lab_planner, computational_analyst, causal_reasoning_agent)
  • ExperimentAction: one structured step chosen by the agent; fields include action_type, method, parameters, justification, confidence (clamped to [0, 1]), invoked_subagent, tool_call_spec, input_targets
  • ExperimentObservation: what the agent can see after each step; includes task, pipeline_history, resource_usage, latest_output, all_outputs, discovered_markers, candidate_mechanisms, conclusions, rule_violations, step_reward_breakdown
  • TaskSpec: the problem statement, organism, tissue, conditions, budget, time limit, assays, tools, paper references, and expected findings
  • IntermediateOutput: the simulated artifact returned by a step; carries output_type, success, quality_score, summary, data, uncertainty, warnings, artifacts_available
  • ConclusionClaim: structured claims used for final synthesis; carries claim, evidence_steps, confidence, claim_type, supporting_data
  • PipelineStepRecord: compact observable record of one past step stored in history
  • ResourceUsage: budget and time tracking visible to the agent

The action vocabulary is intentionally broad enough to mix wet-lab, computational, and meta-planning actions.
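To illustrate the confidence clamping mentioned above, here is a minimal dataclass sketch. It is deliberately simplified: the real ExperimentAction in models.py carries many more fields (method, parameters, justification, invoked_subagent, tool_call_spec, input_targets).

```python
from dataclasses import dataclass

# Simplified sketch of the confidence clamping described above; the real
# ExperimentAction in models.py has many more fields.
@dataclass
class ExperimentActionSketch:
    action_type: str
    confidence: float = 0.5

    def __post_init__(self) -> None:
        # Clamp confidence into [0, 1] regardless of what the agent proposes.
        self.confidence = max(0.0, min(1.0, self.confidence))
```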

### server/tasks/

This is where episodes come from.

  • scenarios.py defines a curated library of four biological scenarios as Scenario dataclass objects, each bundling a TaskSpec, a LatentBiologicalState, a TechnicalState, hidden failure conditions, and tags
  • generator.py turns a scenario into a (TaskSpec, FullLatentState) pair via TaskGenerator.generate(); optional domain randomisation perturbs budget (±30%), time (±20%), technical noise, batch effects, cell proportions, and effect sizes

The four scenarios are:

| Name | Difficulty | Tissue | Problem | Budget | Time |
|---|---|---|---|---|---|
| cardiac_disease_de | easy | heart | Differential expression between healthy and dilated cardiomyopathy cardiomyocytes | $80K | 120 days |
| hematopoiesis_trajectory | medium | bone marrow | Infer HSC → mature lineage trajectory with three branches | $100K | 150 days |
| perturbation_immune | hard | synovial fluid | JAK inhibitor effect on T-cell states in rheumatoid arthritis | $120K | 180 days |
| biomarker_validation_lung | medium | lung | Validate SPP1 as biomarker for pro-fibrotic macrophages in IPF | $90K | 150 days |

Each scenario carries paper references with DOIs, true DE genes with log2FC values, true pathway activities, true regulatory networks, and ground-truth causal mechanisms used for terminal reward calibration.

### server/simulator/

This is the simulator itself.

  • latent_state.py defines FullLatentState, the root aggregate of all hidden state. Key sub-structures are LatentBiologicalState (true DE genes, pathways, gene programs, trajectory, regulatory network, markers, causal mechanisms), TechnicalState (dropout, doublets, ambient RNA, sample quality), ExperimentProgress (18 boolean milestone flags plus counts), and ResourceState (internal budget and time tracking with exhaustion properties)
  • noise.py centralises stochasticity in NoiseModel. All randomness flows through a single seeded numpy.Generator. Methods include add_expression_noise, sample_effect_sizes, sample_p_values, generate_false_positives, generate_false_negatives, quality_degradation, sample_qc_metric, sample_cluster_count, shuffle_ranking, and coin_flip
  • output_generator.py turns an action plus hidden state into a realistic IntermediateOutput. Every action type has a dedicated handler conditioned on the latent state; noise is then injected: dropout in expression data, false positives and false negatives in DE and marker results, over/under-clustering, and pathway contamination
  • transition.py applies action costs from ACTION_COSTS, updates progress flags, calls the output generator, degrades quality on soft violations, propagates discovered DE genes and cluster names back into latent state, and decides whether the episode is done

The output generator does not simply echo the action. It conditions outputs on the hidden state, then injects realistic noise.
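The single-seed design can be sketched with the standard library. This is an illustrative version only: the real NoiseModel in server/simulator/noise.py wraps a seeded numpy.Generator, and the dropout_rate parameter name here is an assumption.

```python
import random

# Illustrative sketch: all stochasticity flows through one seeded generator,
# so the same seed reproduces the same noisy outputs. The real NoiseModel
# in server/simulator/noise.py uses a seeded numpy.Generator instead.
class NoiseModelSketch:
    def __init__(self, seed: int):
        self._rng = random.Random(seed)

    def coin_flip(self, p: float = 0.5) -> bool:
        return self._rng.random() < p

    def add_expression_noise(self, values, dropout_rate: float):
        # Zero out each value with probability dropout_rate (dropout events).
        return [0.0 if self._rng.random() < dropout_rate else v for v in values]
```

Because every method draws from the same seeded generator, two models built with the same seed produce identical noise sequences, which is what makes episodes reproducible.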

### server/rules/engine.py

The rule engine enforces scientific and procedural constraints before each action is applied.

  • hard violations block the action entirely
  • soft violations allow the action, but reduce output quality and add reward penalties

The four rule families are:

  1. Prerequisites (HARD): each computational step requires the appropriate upstream milestone flag. For example, normalize_data requires data_filtered, differential_expression requires data_normalized, and validate_marker requires markers_discovered
  2. Resource constraints (HARD/SOFT): budget or time exhaustion is a hard block; an action cost exceeding the remaining budget (when budget > 0) is a soft warning
  3. Redundancy (SOFT): repeating an already-completed step such as run_qc or normalize_data
  4. Causal validity (SOFT): synthesizing conclusions without prior DE or clustering; making causal claims without validation evidence; running pathway enrichment before DE
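A minimal sketch of the prerequisite family, using the examples above. The real engine in server/rules/engine.py covers all four families; this only shows the hard-prerequisite lookup.

```python
# Hard-prerequisite lookup built from the examples above; the real rule
# engine also handles resource, redundancy, and causal-validity rules.
PREREQUISITES = {
    "normalize_data": "data_filtered",
    "differential_expression": "data_normalized",
    "validate_marker": "markers_discovered",
}

def check_prerequisite(action: str, progress: dict):
    """Return a ('hard', message) violation, or None if the action is allowed."""
    flag = PREREQUISITES.get(action)
    if flag and not progress.get(flag, False):
        return ("hard", f"{action} requires {flag}")
    return None
```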

### server/rewards/reward.py

Rewards are decomposed rather than being a single opaque number.

Per-step reward formula:

```text
R_t = r_validity + r_ordering + r_info_gain + r_efficiency + r_novelty + r_penalty + γ·[φ(s_{t+1}) − φ(s_t)]
```

| Component | Weight | Description |
|---|---|---|
| validity | 0.3 | 1.0 if output succeeded, −1.0 if hard violation |
| ordering | 0.2 | 1.0 if natural next step, 0.3 otherwise |
| info_gain | 0.4 | quality_score × (1 − uncertainty) |
| efficiency | 0.3 | max(0, 1 − 5 × budget_fraction_used) |
| novelty | +0.1 | Bonus when no soft violations |
| penalty | −0.15/violation | Per soft violation |
| shaping | γ = 0.99 | Potential-based over 12 progress milestones |
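The per-step components above can be turned into a small worked function. This is a sketch assuming the weights combine additively as in the formula; argument names are illustrative, not the real signature in server/rewards/reward.py.

```python
GAMMA = 0.99  # discount used in the potential-based shaping term

def step_reward(succeeded: bool, hard_violation: bool, natural_order: bool,
                quality: float, uncertainty: float,
                budget_fraction_used: float, n_soft_violations: int,
                phi_prev: float, phi_next: float) -> float:
    r = 0.3 * (-1.0 if hard_violation else (1.0 if succeeded else 0.0))  # validity
    r += 0.2 * (1.0 if natural_order else 0.3)                           # ordering
    r += 0.4 * quality * (1.0 - uncertainty)                             # info gain
    r += 0.3 * max(0.0, 1.0 - 5.0 * budget_fraction_used)                # efficiency
    r += 0.1 if n_soft_violations == 0 else 0.0                          # novelty
    r += -0.15 * n_soft_violations                                       # penalty
    r += GAMMA * (phi_next - phi_prev)                                   # shaping
    return r
```

For example, a successful, well-ordered step with quality 0.8, uncertainty 0.2, 10% of budget spent, and no violations scores roughly 1.0, while a hard violation is dominated by the −0.3 validity term.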

Terminal reward adds:

| Component | Weight | Description |
|---|---|---|
| Pipeline completeness | 3.0 | Fraction of 7 core milestones completed |
| Calibration | 4.0 | How well conclusions match hidden markers and mechanisms |
| Budget + time efficiency | 1.0 | Average fraction of budget and time remaining |
| Overconfidence penalty | −0.5/claim | For high-confidence claims (> 0.8) that are wrong |
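Sketched as arithmetic (argument names are illustrative; the real computation lives in server/rewards/reward.py):

```python
def terminal_reward(milestones_done: int, calibration: float,
                    budget_left_frac: float, time_left_frac: float,
                    n_wrong_confident_claims: int) -> float:
    r = 3.0 * (milestones_done / 7.0)                     # pipeline completeness
    r += 4.0 * calibration                                # conclusion calibration
    r += 1.0 * (budget_left_frac + time_left_frac) / 2.0  # resource efficiency
    r -= 0.5 * n_wrong_confident_claims                   # overconfidence penalty
    return r
```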

This makes the environment easier to debug, benchmark, and train against.

### server/hackathon_environment.py

This is the orchestration layer that ties everything together.

On reset() it:

  • seeds the noise model
  • generates a task and latent state via TaskGenerator
  • clears history, outputs, discoveries, conclusions, and cumulative reward

On step() it:

  • checks rules
  • calls the transition engine
  • computes reward
  • appends a PipelineStepRecord
  • updates discovered markers and candidate mechanisms
  • stores conclusion claims if the action is synthesize_conclusion
  • builds the next ExperimentObservation

This file is the best place to read if you want the end-to-end control flow.

## What actually happens on one step

Here is the concrete order of operations for env.step(action):

  1. Increment the step counter.
  2. Copy the previous latent state for reward comparison.
  3. Run rule checks and split violations into hard vs soft.
  4. If there is a hard violation, return a failure report without applying the action.
  5. Otherwise deduct budget and time based on ACTION_COSTS.
  6. Update latent progress flags like samples_collected, qc_performed, or de_performed.
  7. Generate a structured simulated output for the chosen action.
  8. If there were soft violations, degrade output quality (×0.5) and attach warnings.
  9. Propagate artifacts back into latent state, such as discovered DE genes or cluster names.
  10. Compute decomposed reward from state transition plus output quality.
  11. If the episode is ending, compute terminal reward from completeness and conclusion calibration.
  12. Return an observation that exposes the visible summary but not the hidden truth.

## Action costs

Each action deducts from the episode's budget and time. Computational steps also accrue compute hours.

| Action | Budget | Time (days) |
|---|---|---|
| sequence_cells | $15,000 | 5 |
| prepare_library | $8,000 | 3 |
| collect_sample | $5,000 | 7 |
| validate_marker | $5,000 | 14 |
| culture_cells | $3,000 | 14 |
| perturb_gene | $2,000 | 3 |
| perturb_compound | $1,000 | 2 |
| select_cohort | $500 | 1 |
| run_qc | $100 | 0.5 |
| integrate_batches | $300 | 1 |
| regulatory_network_inference | $200 | 1 |
| cluster_cells | $150 | 0.5 |
| differential_expression, trajectory_analysis, pathway_enrichment | $100–200 | 0.5–1 |
| filter_data, normalize_data, marker_selection | $50–100 | 0.25–0.5 |
| synthesize_conclusion, design_followup_experiment, request_subagent_review | $0 | 0.25–0.5 |
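A sketch of how these costs would be deducted each step, using a subset of the table above (the real ACTION_COSTS table lives in server/simulator/transition.py):

```python
# Subset of the cost table above: action -> (budget cost in $, time in days).
ACTION_COSTS_SKETCH = {
    "collect_sample": (5000, 7.0),
    "sequence_cells": (15000, 5.0),
    "run_qc": (100, 0.5),
    "synthesize_conclusion": (0, 0.25),
}

def apply_cost(budget: float, days_left: float, action: str):
    """Deduct the action's budget and time cost, returning the new totals."""
    cost, days = ACTION_COSTS_SKETCH[action]
    return budget - cost, days_left - days
```

Starting from the easy scenario's $80K and 120-day envelope, a single collect_sample leaves $75,000 and 113 days.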

## Typical successful pipeline

Most scenarios reward a sensible experiment order similar to:

  1. collect_sample
  2. prepare_library
  3. sequence_cells
  4. run_qc
  5. filter_data
  6. normalize_data
  7. cluster_cells
  8. one or more of: differential_expression, trajectory_analysis, pathway_enrichment, regulatory_network_inference, marker_selection, validate_marker
  9. synthesize_conclusion

The exact best sequence depends on the scenario:

  • trajectory scenarios benefit from trajectory_analysis and regulatory inference
  • biomarker scenarios benefit from DE, marker selection, and validation
  • perturbation scenarios benefit from pathway-level interpretation

## Episode termination

An episode ends when one of the following happens:

  • the agent chooses synthesize_conclusion
  • resources are exhausted
  • the environment reaches MAX_STEPS (currently 30)

## Installation

Dependencies are managed with uv. The package requires Python ≥ 3.10.

H100 Jupyter notebook setup: See H100_JUPYTER_SETUP.md for environment setup on NVIDIA H100 instances with Jupyter.

```bash
# Core environment only
uv sync

# With dev/test tools
uv sync --extra dev

# With training dependencies (TRL, transformers, torch)
uv sync --extra train

# With bioinformatics extras (scanpy, biopython, gseapy)
uv sync --extra bio
```

Key dependency groups from pyproject.toml:

| Group | Key packages |
|---|---|
| core | openenv-core[core]>=0.2.0, numpy, scipy, pydantic>=2.0 |
| train | trl>=0.29, transformers>=5.3, accelerate, datasets, torch, matplotlib |
| bio | scanpy, biopython, gseapy |
| dev | pytest, pytest-cov |

## Interfaces you can use

### 1. In-process environment

Use BioExperimentEnvironment when you want direct Python access with full structured observations:

```python
from models import ActionType, ExperimentAction
from server.hackathon_environment import BioExperimentEnvironment

env = BioExperimentEnvironment(scenario_name="biomarker_validation_lung")
obs = env.reset()

obs = env.step(ExperimentAction(
    action_type=ActionType.COLLECT_SAMPLE,
    parameters={"n_samples": 8},
    justification="Collect enough material for downstream single-cell analysis.",
    confidence=0.8,
))

print(obs.task.problem_statement)
print(obs.latest_output.summary if obs.latest_output else "No output yet")
print(obs.reward)
```

The constructor accepts:

  • scenario_name (Optional[str]): pin to a specific scenario; None picks a scenario at random each episode
  • domain_randomise (bool, default True): perturb scenario parameters for better generalization

### 2. OpenEnv client/server mode

Use the FastAPI app when you want to serve the environment over HTTP and WebSocket:

```bash
uv sync --extra dev
uv run uvicorn server.app:app --reload
```

The server exposes five endpoints:

| Method | Path | Description |
|---|---|---|
| POST | /reset | Start a new episode |
| POST | /step | Execute one action |
| GET | /state | Current environment state |
| GET | /schema | Action/observation JSON schemas |
| WS | /ws | WebSocket for persistent sessions |

Then connect with the client:

```python
from client import BioExperimentEnv
from models import ActionType, ExperimentAction

with BioExperimentEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    result = env.step(ExperimentAction(action_type=ActionType.COLLECT_SAMPLE))
    print(result.observation.latest_output.summary)
```

The environment class supports concurrent sessions, but the bundled server is currently configured with max_concurrent_envs=1 in server/app.py.

### 3. Running a local agent

run_agent.py runs a single interactive episode using a local Hugging Face model:

```bash
uv run python run_agent.py
```

For H100 and other large-GPU workflows, prefer the quantized Unsloth path:

```bash
uv sync --extra train
uv run python run_agent_unsloth.py
```

Configuration is via environment variables:

| Variable | Default | Description |
|---|---|---|
| RUN_AGENT_USE_PIPELINE | 0 | Use HF pipeline() path instead of direct generate |
| RUN_AGENT_MAX_EPISODE_STEPS | 12 | Maximum number of planning steps |

The local model defaults to Qwen/Qwen3.5-0.8B with sampling parameters temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.3. The episode runs up to MAX_EPISODE_STEPS = 12 steps. When action parsing fails, the script falls back to an observation-aware action that respects prerequisites.
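The parsing fallback can be sketched like this. The helper name and progress-flag choices are hypothetical; run_agent.py's real parser and fallback logic are more involved.

```python
import json

def parse_action_type(model_text: str, progress: dict) -> str:
    """Parse the model's JSON action; fall back to a prerequisite-aware default."""
    try:
        return json.loads(model_text)["action_type"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Observation-aware fallback: pick the earliest step not yet completed,
        # so the fallback action never triggers a hard prerequisite violation.
        if not progress.get("samples_collected"):
            return "collect_sample"
        if not progress.get("qc_performed"):
            return "run_qc"
        return "synthesize_conclusion"
```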

PowerShell note: older PowerShell versions do not support &&. Run commands from the target directory directly, or use ; as the command separator.

Windows runtime warnings:

  • If you see HuggingFace symlink-cache warnings, functionality is unaffected; optionally set HF_HUB_DISABLE_SYMLINKS_WARNING=1.
  • If you see flash attention / causal-conv fallback warnings, execution continues with a slower PyTorch path.

### 4. GRPO training

training_script.py follows the TRL GRPO pattern and uses OpenEnv rewards to score generated action JSON against this environment.

```bash
uv sync --extra train
uv run python training_script.py --dry-run
uv run python training_script.py --model-id Qwen/Qwen3.5-0.8B
```

For H100, the preferred entrypoint is training_unsloth.py, which uses Unsloth 4-bit loading plus LoRA for faster quantized GRPO training:

```bash
uv sync --extra train
uv run python training_unsloth.py --dry-run
uv run python training_unsloth.py --model-id Qwen/Qwen3.5-4B
```

Laptop / mid-range GPU (e.g. 12GB VRAM): Use reduced batch size and sequence length to avoid OOM:

```bash
uv sync --extra train
uv pip install unsloth unsloth_zoo --no-deps   # if using training_unsloth.py
uv run python training_unsloth.py \
  --model-id Qwen/Qwen3-4B-Base \
  --output-dir training/grpo-unsloth-qwen3-4b \
  --dataset-episodes 12 \
  --rollout-steps 6 \
  --per-device-train-batch-size 1 \
  --num-generations 2 \
  --gradient-accumulation-steps 4 \
  --max-seq-length 1024 \
  --trust-remote-code
```

If you still hit OOM, try --max-seq-length 768 or --num-generations 1.

PyTorch CUDA: Use the PyTorch index that matches your GPU. For older cards (RTX 20/30/40 series): uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121. For RTX 50 series (Blackwell, sm_120) you need a CUDA 12.8 build:

```bash
uv pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
uv pip install triton-windows   # required by Unsloth on Windows
```

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --model-id | Qwen/Qwen2.5-7B-Instruct | Base model to fine-tune |
| --output-dir | training/grpo-output | Save directory |
| --dataset-episodes | 8 | Rollout episodes for prompt dataset |
| --rollout-steps | 6 | Steps per episode during collection |
| --collection-policy | heuristic | random or heuristic |
| --reward-backend | local | local (in-process) or remote (live server) |
| --base-url | http://localhost:8000 | Server URL for remote backend |
| --scenario-name | all | Repeatable; restricts which scenarios are used |
| --domain-randomise | off | Enable domain randomisation |
| --num-generations | 4 | GRPO generations per prompt |
| --max-completion-length | 160 | Max tokens for model completions |
| --max-prompt-length | 768 | Max tokens for prompts |
| --learning-rate | 5e-6 | AdamW learning rate |
| --dry-run | off | Build data and test reward without training |

By default the reward function reconstructs prompt states locally so the prompt and reward stay aligned. Switch to a live server-backed reward loop with --reward-backend remote --base-url http://localhost:8000.

training_unsloth.py adds H100-oriented options such as --max-seq-length, --disable-4bit, and LoRA settings (--lora-r, --lora-alpha, --lora-dropout). vLLM fast inference is disabled to avoid dependency conflicts.

After training, the script saves plots to the output directory:

  • training_loss.png
  • training_reward.png
  • training_metric.png
  • training_dashboard.png
  • training_plot_manifest.json

Use --plot-metric-key <logged_key> to force a specific extra metric on the third chart; otherwise the script auto-selects a useful logged metric such as KL or gradient norm.

### 5. Rollout collection

training/rollout_collection.py collects direct environment rollouts into trajectory files:

```bash
uv run python -m training.rollout_collection
```

This runs N episodes with a random or heuristic policy, saves JSON trajectories, and prints evaluation metrics.

### 6. Benchmark and scripted agents

  • training/literature_benchmark.py runs paper-aligned action sequences and compares outcomes against curated expected findings
  • training/rollout_collection.py collects direct environment rollouts into trajectory files
  • training_script.py trains a GRPO policy with OpenEnv reward calls
  • training_unsloth.py trains a quantized GRPO policy with Unsloth on H100-class GPUs
  • run_agent.py runs a local language model planner against the environment
  • run_agent_unsloth.py runs the planner with Unsloth 4-bit loading for faster inference
  • training/trajectory.py stores trajectories for offline RL, imitation learning, replay, and evaluation
  • training/evaluation.py computes online, benchmark, expert-review, and fidelity-oriented metrics

## Training utilities

### training/trajectory.py

Provides TrajectoryStep, Trajectory, and TrajectoryDataset for episode serialization.

  • TrajectoryStep stores action, observation, reward, done, reward_breakdown, and an optional latent_snapshot
  • Trajectory accumulates steps with add_step(), computes total_reward, and exposes save(path) / load(path)
  • TrajectoryDataset wraps a list of trajectories with filter_successful(), save_dir(), load_dir(), and summary() (n, success_rate, mean_reward, mean_length, max/min reward)
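Under the API described above, usage might look like the following sketch. The field set is deliberately reduced: the real TrajectoryStep also stores the observation, reward_breakdown, and an optional latent_snapshot.

```python
import json
from dataclasses import dataclass, field, asdict

# Simplified sketch of the trajectory API described above.
@dataclass
class TrajectoryStep:
    action: str
    reward: float
    done: bool

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

    def add_step(self, step: TrajectoryStep) -> None:
        self.steps.append(step)

    @property
    def total_reward(self) -> float:
        return sum(s.reward for s in self.steps)

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump([asdict(s) for s in self.steps], f)

    @classmethod
    def load(cls, path: str) -> "Trajectory":
        traj = cls()
        with open(path) as f:
            traj.steps = [TrajectoryStep(**d) for d in json.load(f)]
        return traj
```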

### training/evaluation.py

EvaluationSuite is a stateless class with four families of @staticmethod methods:

| Family | Method | Metrics |
|---|---|---|
| Online RL | online_metrics(trajectories) | mean_return, median_return, std_return, mean_episode_length, success_rate |
| Offline benchmark | benchmark_metrics(dataset) | pipeline_validity_rate, ordering_score, action_diversity, mean_conclusion_confidence |
| Expert review | expert_review_metrics(...) | Placeholder; averages provided scores |
| Simulator fidelity | simulator_fidelity_metrics(sim, real) | reward_distribution_gap |
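The online family could be approximated as follows. This is a sketch over plain return/length/success lists; the real methods take trajectory objects.

```python
import statistics

def online_metrics(returns, lengths, successes):
    """Aggregate episode statistics, mirroring the metric names above."""
    return {
        "mean_return": statistics.mean(returns),
        "median_return": statistics.median(returns),
        "std_return": statistics.pstdev(returns),
        "mean_episode_length": statistics.mean(lengths),
        "success_rate": sum(successes) / len(successes),
    }
```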

### training/literature_benchmark.py

run_paper_benchmark(problem_statement, scenario_name, domain_randomise) runs a paper-aligned action pipeline and scores against expected_findings using keyword matching. Returns a PaperBenchmarkResult with match_ratio.

## Docker deployment

The server ships with a server/Dockerfile. It uses a multi-stage build based on openenv-base, installs dependencies via uv, and starts uvicorn server.app:app on port 8000.

```bash
docker build -f server/Dockerfile -t bio-experiment-env .
docker run -p 8000:8000 bio-experiment-env
```

The openenv.yaml file configures the deployment for the OpenEnv platform:

```yaml
spec_version: 1
name: hackathon
type: space
runtime: fastapi
app: server.app:app
port: 8000
```

## Why this is useful

This environment is trying to model a realistic scientific planning loop rather than a toy decision problem:

  • actions have prerequisites
  • outputs are noisy and imperfect
  • budget and time matter
  • not every correct-looking answer is well supported
  • final conclusions are scored against hidden ground truth

That makes it suitable for:

  • agent planning benchmarks
  • RL experiments on long-horizon scientific reasoning
  • literature-grounded evaluation
  • comparing structured policies against LLM-driven planners

## Project map

```text
.
├── client.py                     # OpenEnv HTTP/WebSocket client
├── models.py                     # Shared action / observation / task schemas
├── openenv.yaml                  # OpenEnv platform deployment config
├── pyproject.toml                # Package metadata and dependency groups
├── run_agent.py                  # Single-episode interactive agent runner
├── run_agent_unsloth.py          # Quantized Unsloth interactive agent runner
├── server/
│   ├── app.py                    # FastAPI/OpenEnv server entry point
│   ├── Dockerfile                # Multi-stage Docker build
│   ├── hackathon_environment.py  # Main environment orchestration
│   ├── requirements.txt          # Minimal server dependencies
│   ├── rewards/
│   │   └── reward.py             # Decomposed reward model
│   ├── rules/
│   │   └── engine.py             # Biological constraint checking
│   ├── simulator/
│   │   ├── latent_state.py       # Hidden biological, technical, progress, resource state
│   │   ├── noise.py              # Seeded stochastic noise model
│   │   ├── output_generator.py   # Per-action simulated output generation
│   │   └── transition.py         # State transition engine and ACTION_COSTS table
│   ├── subagents/                # Placeholder for future sub-agent integration
│   └── tasks/
│       ├── generator.py          # TaskGenerator with domain randomisation
│       └── scenarios.py          # SCENARIO_LIBRARY with 4 curated scenarios
├── training/
│   ├── evaluation.py             # EvaluationSuite metrics
│   ├── literature_benchmark.py   # Paper-backed benchmark flow
│   ├── rollout_collection.py     # Direct rollout collection helper
│   └── trajectory.py             # Trajectory serialization and dataset utilities
├── training_script.py            # TRL GRPO training entry point
├── training_unsloth.py           # Unsloth quantized GRPO training entry point
└── tests/
    ├── test_environment.py
    ├── test_literature_benchmark.py
    ├── test_models.py
    ├── test_rewards.py
    ├── test_rules.py
    ├── test_simulator.py
    └── test_training_script.py
```

## Quick sanity check

```bash
uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q
```

Those tests verify:

  • reset and step lifecycle
  • valid vs invalid pipeline behavior
  • conclusion termination
  • literature-backed scenario selection
  • benchmark matching for curated expected findings

Run the full suite with coverage:

```bash
uv run pytest tests/ --cov -q
```