Spaces:
Running on CPU Upgrade
title: Bio Experiment Environment Server
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- reinforcement-learning
- bioinformatics
Bio Experiment Environment
This repository implements an OpenEnv-compatible reinforcement learning environment for planning biological experiment pipelines. The agent does not directly see the true biological state. Instead, it proposes one structured experiment or analysis step at a time, receives a noisy simulated output, and is rewarded for valid, informative, efficient, well-calibrated plans.
The environment is designed as a partially observable Markov decision process (POMDP) with:
- hidden ground-truth biology
- hidden technical noise and failure conditions
- visible task metadata, resource usage, step history, and intermediate outputs
- dense step-wise reward plus terminal reward for conclusion quality
How it works
At a high level, each episode looks like this:
reset()picks a biological scenario and seeds the simulator.- The agent receives an
ExperimentObservationdescribing the task and current visible state. - The agent submits an
ExperimentActionsuch ascollect_sample,run_qc, ordifferential_expression. - The rule engine checks whether the action is valid at this point in the pipeline.
- The transition engine updates hidden state, spends resources, and asks the output generator to simulate the result.
- The reward computer scores the step for validity, ordering, information gain, efficiency, novelty, and penalties.
- The environment returns a new observation with updated history, outputs, discoveries, violations, and reward.
- The episode ends when the agent synthesizes a conclusion, exhausts resources, or reaches the step limit.
The core mental model
Hidden state
The simulator keeps a FullLatentState that the agent never directly sees. It contains:
- true cell populations and marker genes
- true DE genes, pathways, trajectories, and regulatory networks
- technical factors such as dropout, doublets, ambient RNA, and batch effects
- experiment progress flags
- remaining budget and time
- hidden failure conditions
Visible state
The agent only sees ExperimentObservation, which includes:
- the current
TaskSpec - pipeline history
- available assays and tools
- resource usage
- the latest and cumulative intermediate outputs
- discovered markers and candidate mechanisms
- rule violations
- per-step reward breakdown
This separation is what makes the environment a POMDP rather than a fully observed simulator.
Main building blocks
models.py
Defines the contracts that all other modules use:
ActionType: 21 discrete experiment steps, grouped into three frozensets βWET_LAB_ACTIONS(8),COMPUTATIONAL_ACTIONS(10), andMETA_ACTIONS(3)SubagentType: 9 sub-agent delegate roles (e.g.wet_lab_planner,computational_analyst,causal_reasoning_agent)ExperimentAction: one structured step chosen by the agent; fields includeaction_type,method,parameters,justification,confidence(clamped to[0, 1]),invoked_subagent,tool_call_spec,input_targetsExperimentObservation: what the agent can see after each step; includestask,pipeline_history,resource_usage,latest_output,all_outputs,discovered_markers,candidate_mechanisms,conclusions,rule_violations,step_reward_breakdownTaskSpec: the problem statement, organism, tissue, conditions, budget, time limit, assays, tools, paper references, and expected findingsIntermediateOutput: the simulated artifact returned by a step; carriesoutput_type,success,quality_score,summary,data,uncertainty,warnings,artifacts_availableConclusionClaim: structured claims used for final synthesis; carriesclaim,evidence_steps,confidence,claim_type,supporting_dataPipelineStepRecord: compact observable record of one past step stored in historyResourceUsage: budget and time tracking visible to the agent
The action vocabulary is intentionally broad enough to mix wet-lab, computational, and meta-planning actions.
server/tasks/
This is where episodes come from.
scenarios.pydefines a curated library of four biological scenarios asScenariodataclass objects, each bundling aTaskSpec, aLatentBiologicalState, aTechnicalState, hidden failure conditions, and tagsgenerator.pyturns a scenario into a(TaskSpec, FullLatentState)pair viaTaskGenerator.generate(); optional domain randomisation perturbs budget (Β±30%), time (Β±20%), technical noise, batch effects, cell proportions, and effect sizes
The four scenarios are:
| Name | Difficulty | Tissue | Problem | Budget | Time |
|---|---|---|---|---|---|
cardiac_disease_de |
easy | heart | Differential expression between healthy and dilated cardiomyopathy cardiomyocytes | $80 K | 120 days |
hematopoiesis_trajectory |
medium | bone marrow | Infer HSC β mature lineage trajectory with three branches | $100 K | 150 days |
perturbation_immune |
hard | synovial fluid | JAK inhibitor effect on T-cell states in rheumatoid arthritis | $120 K | 180 days |
biomarker_validation_lung |
medium | lung | Validate SPP1 as biomarker for pro-fibrotic macrophages in IPF | $90 K | 150 days |
Each scenario carries paper references with DOIs, true DE genes with log2FC values, true pathway activities, true regulatory networks, and ground-truth causal mechanisms used for terminal reward calibration.
server/simulator/
This is the simulator itself.
latent_state.pydefinesFullLatentState, the root aggregate of all hidden state. Key sub-structures areLatentBiologicalState(true DE genes, pathways, gene programs, trajectory, regulatory network, markers, causal mechanisms),TechnicalState(dropout, doublets, ambient RNA, sample quality),ExperimentProgress(18 boolean milestone flags plus counts), andResourceState(internal budget and time tracking with exhaustion properties)noise.pycentralises stochasticity inNoiseModel. All randomness flows through a single seedednumpy.Generator. Methods includeadd_expression_noise,sample_effect_sizes,sample_p_values,generate_false_positives,generate_false_negatives,quality_degradation,sample_qc_metric,sample_cluster_count,shuffle_ranking, andcoin_flipoutput_generator.pyturns an action plus hidden state into a realisticIntermediateOutput. Every action type has a dedicated handler conditioned on the latent state; noise is then injected β dropout in expression data, false positives and false negatives in DE and marker results, over/under-clustering, and pathway contaminationtransition.pyapplies action costs fromACTION_COSTS, updates progress flags, calls the output generator, degrades quality on soft violations, propagates discovered DE genes and cluster names back into latent state, and decides whether the episode is done
The output generator does not simply echo the action. It conditions outputs on the hidden state, then injects realistic noise.
server/rules/engine.py
The rule engine enforces scientific and procedural constraints before each action is applied.
- hard violations block the action entirely
- soft violations allow the action, but reduce output quality and add reward penalties
The four rule families are:
- Prerequisites (HARD) β each computational step requires the appropriate upstream milestone flag. For example:
normalize_datarequiresdata_filtered,differential_expressionrequiresdata_normalized,validate_markerrequiresmarkers_discovered - Resource constraints (HARD/SOFT) β budget or time exhausted is a hard block; action cost exceeding remaining budget (when budget > 0) is a soft warning
- Redundancy (SOFT) β repeating an already-completed step such as
run_qcornormalize_data - Causal validity (SOFT) β synthesizing conclusions without prior DE or clustering; making causal claims without validation evidence; pathway enrichment before DE
server/rewards/reward.py
Rewards are decomposed rather than being a single opaque number.
Per-step reward formula:
R_t = r_validity + r_ordering + r_info_gain + r_efficiency + r_novelty + r_penalty + Ξ³[Ο(s_{t+1}) β Ο(s_t)]
| Component | Weight | Description |
|---|---|---|
validity |
0.3 | 1.0 if output succeeded, β1.0 if hard violation |
ordering |
0.2 | 1.0 if natural next step, 0.3 otherwise |
info_gain |
0.4 | quality_score Γ (1 β uncertainty) |
efficiency |
0.3 | max(0, 1 β 5 Γ budget_fraction_used) |
novelty |
+0.1 | Bonus when no soft violations |
penalty |
β0.15/violation | Per soft violation |
shaping |
Ξ³ = 0.99 | Potential-based over 12 progress milestones |
Terminal reward adds:
| Component | Weight | Description |
|---|---|---|
| Pipeline completeness | 3.0 | Fraction of 7 core milestones completed |
| Calibration | 4.0 | How well conclusions match hidden markers and mechanisms |
| Budget + time efficiency | 1.0 | Average fraction of budget and time remaining |
| Overconfidence penalty | β0.5/claim | For high-confidence claims (> 0.8) that are wrong |
This makes the environment easier to debug, benchmark, and train against.
server/hackathon_environment.py
This is the orchestration layer that ties everything together.
On reset() it:
- seeds the noise model
- generates a task and latent state via
TaskGenerator - clears history, outputs, discoveries, conclusions, and cumulative reward
On step() it:
- checks rules
- calls the transition engine
- computes reward
- appends a
PipelineStepRecord - updates discovered markers and candidate mechanisms
- stores conclusion claims if the action is
synthesize_conclusion - builds the next
ExperimentObservation
This file is the best place to read if you want the end-to-end control flow.
What actually happens on one step
Here is the concrete order of operations for env.step(action):
- Increment the step counter.
- Copy the previous latent state for reward comparison.
- Run rule checks and split violations into hard vs soft.
- If there is a hard violation, return a failure report without applying the action.
- Otherwise deduct budget and time based on
ACTION_COSTS. - Update latent progress flags like
samples_collected,qc_performed, orde_performed. - Generate a structured simulated output for the chosen action.
- If there were soft violations, degrade output quality (Γ0.5) and attach warnings.
- Propagate artifacts back into latent state, such as discovered DE genes or cluster names.
- Compute decomposed reward from state transition plus output quality.
- If the episode is ending, compute terminal reward from completeness and conclusion calibration.
- Return an observation that exposes the visible summary but not the hidden truth.
Action costs
Each action deducts from the episode's budget and time. Computational steps also accrue compute hours.
| Action | Budget | Time (days) |
|---|---|---|
sequence_cells |
$15,000 | 5 |
prepare_library |
$8,000 | 3 |
collect_sample |
$5,000 | 7 |
validate_marker |
$5,000 | 14 |
culture_cells |
$3,000 | 14 |
perturb_gene |
$2,000 | 3 |
perturb_compound |
$1,000 | 2 |
select_cohort |
$500 | 1 |
run_qc |
$100 | 0.5 |
integrate_batches |
$300 | 1 |
regulatory_network_inference |
$200 | 1 |
cluster_cells |
$150 | 0.5 |
differential_expression, trajectory_analysis, pathway_enrichment |
$100β200 | 0.5β1 |
filter_data, normalize_data, marker_selection |
$50β100 | 0.25β0.5 |
synthesize_conclusion, design_followup_experiment, request_subagent_review |
$0 | 0.25β0.5 |
Typical successful pipeline
Most scenarios reward a sensible experiment order similar to:
collect_sampleprepare_librarysequence_cellsrun_qcfilter_datanormalize_datacluster_cells- one or more of:
differential_expression,trajectory_analysis,pathway_enrichment,regulatory_network_inference,marker_selection,validate_marker synthesize_conclusion
The exact best sequence depends on the scenario:
- trajectory scenarios benefit from
trajectory_analysisand regulatory inference - biomarker scenarios benefit from DE, marker selection, and validation
- perturbation scenarios benefit from pathway-level interpretation
Episode termination
An episode ends when one of the following happens:
- the agent chooses
synthesize_conclusion - resources are exhausted
- the environment reaches
MAX_STEPSwhich is currently30
Installation
Dependencies are managed with uv. The package requires Python β₯ 3.10.
H100 Jupyter notebook setup: See H100_JUPYTER_SETUP.md for environment setup on NVIDIA H100 instances with Jupyter.
# Core environment only
uv sync
# With dev/test tools
uv sync --extra dev
# With training dependencies (TRL, transformers, torch)
uv sync --extra train
# With bioinformatics extras (scanpy, biopython, gseapy)
uv sync --extra bio
Key dependency groups from pyproject.toml:
| Group | Key packages |
|---|---|
| core | openenv-core[core]>=0.2.0, numpy, scipy, pydantic>=2.0 |
| train | trl>=0.29, transformers>=5.3, accelerate, datasets, torch, matplotlib |
| bio | scanpy, biopython, gseapy |
| dev | pytest, pytest-cov |
Interfaces you can use
1. In-process environment
Use BioExperimentEnvironment when you want direct Python access with full structured observations:
from models import ActionType, ExperimentAction
from server.hackathon_environment import BioExperimentEnvironment
env = BioExperimentEnvironment(scenario_name="biomarker_validation_lung")
obs = env.reset()
obs = env.step(ExperimentAction(
action_type=ActionType.COLLECT_SAMPLE,
parameters={"n_samples": 8},
justification="Collect enough material for downstream single-cell analysis.",
confidence=0.8,
))
print(obs.task.problem_statement)
print(obs.latest_output.summary if obs.latest_output else "No output yet")
print(obs.reward)
The constructor accepts:
scenario_name: Optional[str]β pin to a specific scenario;Nonepicks randomly each episodedomain_randomise: bool = Trueβ perturbs scenario parameters for generalization
2. OpenEnv client/server mode
Use the FastAPI app when you want to serve the environment over HTTP and WebSocket:
uv sync --extra dev
uv run uvicorn server.app:app --reload
The server exposes five endpoints:
| Method | Path | Description |
|---|---|---|
POST |
/reset |
Start a new episode |
POST |
/step |
Execute one action |
GET |
/state |
Current environment state |
GET |
/schema |
Action/observation JSON schemas |
WS |
/ws |
WebSocket for persistent sessions |
Then connect with the client:
from client import BioExperimentEnv
from models import ActionType, ExperimentAction
with BioExperimentEnv(base_url="http://localhost:8000") as env:
result = env.reset()
result = env.step(ExperimentAction(action_type=ActionType.COLLECT_SAMPLE))
print(result.observation.latest_output.summary)
The environment class supports concurrent sessions, but the bundled server is currently configured with max_concurrent_envs=1 in server/app.py.
3. Running a local agent
run_agent.py runs a single interactive episode using a local Hugging Face model:
uv run python run_agent.py
For H100 and other large-GPU workflows, prefer the quantized Unsloth path:
uv sync --extra train
uv run python run_agent_unsloth.py
Configuration is via environment variables:
| Variable | Default | Description |
|---|---|---|
RUN_AGENT_USE_PIPELINE |
0 |
Use HF pipeline() path instead of direct generate |
RUN_AGENT_MAX_EPISODE_STEPS |
12 |
Maximum number of planning steps |
The local model defaults to Qwen/Qwen3.5-0.8B with sampling parameters temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1.3. The episode runs up to MAX_EPISODE_STEPS = 12 steps. When action parsing fails, the script falls back to an observation-aware action that respects prerequisites.
PowerShell note: older PowerShell versions do not support &&. Run commands from the target directory directly, or use ; as the command separator.
Windows runtime warnings:
- If you see HuggingFace symlink-cache warnings, functionality is unaffected; optionally set
HF_HUB_DISABLE_SYMLINKS_WARNING=1. - If you see flash attention / causal-conv fallback warnings, execution continues with a slower PyTorch path.
4. GRPO training
training_script.py follows the TRL GRPO pattern and uses OpenEnv rewards to score generated action JSON against this environment.
uv sync --extra train
uv run python training_script.py --dry-run
uv run python training_script.py --model-id Qwen/Qwen3.5-0.8B
For H100, the preferred entrypoint is training_unsloth.py, which uses Unsloth 4-bit loading plus LoRA for faster quantized GRPO training:
uv sync --extra train
uv run python training_unsloth.py --dry-run
uv run python training_unsloth.py --model-id Qwen/Qwen3.5-4B
Laptop / mid-range GPU (e.g. 12GB VRAM): Use reduced batch size and sequence length to avoid OOM:
uv sync --extra train
uv pip install unsloth unsloth_zoo --no-deps # if using training_unsloth.py
uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dataset-episodes 12 --rollout-steps 6 --per-device-train-batch-size 1 --num-generations 2 --gradient-accumulation-steps 4 --max-seq-length 1024 --trust-remote-code
If you still hit OOM, try --max-seq-length 768 or --num-generations 1.
PyTorch CUDA: Use the PyTorch index that matches your GPU. For older cards (RTX 20/30/40 series): uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121. For RTX 50 series (Blackwell, sm_120) you need a CUDA 12.8 build:
uv pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
uv pip install triton-windows # required by Unsloth on Windows
Key arguments:
| Argument | Default | Description |
|---|---|---|
--model-id |
Qwen/Qwen2.5-7B-Instruct |
Base model to fine-tune |
--output-dir |
training/grpo-output |
Save directory |
--dataset-episodes |
8 |
Rollout episodes for prompt dataset |
--rollout-steps |
6 |
Steps per episode during collection |
--collection-policy |
heuristic |
random or heuristic |
--reward-backend |
local |
local (in-process) or remote (live server) |
--base-url |
http://localhost:8000 |
Server URL for remote backend |
--scenario-name |
all | Repeatable; restricts which scenarios are used |
--domain-randomise |
off | Enable domain randomisation |
--num-generations |
4 |
GRPO generations per prompt |
--max-completion-length |
160 |
Max tokens for model completions |
--max-prompt-length |
768 |
Max tokens for prompts |
--learning-rate |
5e-6 |
AdamW learning rate |
--dry-run |
off | Build data and test reward without training |
By default the reward function reconstructs prompt states locally so the prompt and reward stay aligned. Switch to a live server-backed reward loop with --reward-backend remote --base-url http://localhost:8000.
training_unsloth.py adds H100-oriented options such as --max-seq-length, --disable-4bit, and LoRA settings (--lora-r, --lora-alpha, --lora-dropout). vLLM fast inference is disabled to avoid dependency conflicts.
After training, the script saves plots to the output directory:
training_loss.pngtraining_reward.pngtraining_metric.pngtraining_dashboard.pngtraining_plot_manifest.json
Use --plot-metric-key <logged_key> to force a specific extra metric on the third chart; otherwise the script auto-selects a useful logged metric such as KL or gradient norm.
5. Rollout collection
training/rollout_collection.py collects direct environment rollouts into trajectory files:
uv run python -m training.rollout_collection
This runs N episodes with a random or heuristic policy, saves JSON trajectories, and prints evaluation metrics.
6. Benchmark and scripted agents
training/literature_benchmark.pyruns paper-aligned action sequences and compares outcomes against curated expected findingstraining/rollout_collection.pycollects direct environment rollouts into trajectory filestraining_script.pytrains a GRPO policy with OpenEnv reward callstraining_unsloth.pytrains a quantized GRPO policy with Unsloth on H100-class GPUsrun_agent.pyruns a local language model planner against the environmentrun_agent_unsloth.pyruns the planner with Unsloth 4-bit loading for faster inferencetraining/trajectory.pystores trajectories for offline RL, imitation learning, replay, and evaluationtraining/evaluation.pycomputes online, benchmark, expert-review, and fidelity-oriented metrics
Training utilities
training/trajectory.py
Provides TrajectoryStep, Trajectory, and TrajectoryDataset for episode serialization.
TrajectoryStepstoresaction,observation,reward,done,reward_breakdown, and an optionallatent_snapshotTrajectoryaccumulates steps withadd_step(), computestotal_reward, and exposessave(path)/load(path)TrajectoryDatasetwraps a list of trajectories withfilter_successful(),save_dir(),load_dir(), andsummary()(n, success_rate, mean_reward, mean_length, max/min reward)
training/evaluation.py
EvaluationSuite is a stateless class with four families of @staticmethod methods:
| Family | Method | Metrics |
|---|---|---|
| Online RL | online_metrics(trajectories) |
mean_return, median_return, std_return, mean_episode_length, success_rate |
| Offline benchmark | benchmark_metrics(dataset) |
pipeline_validity_rate, ordering_score, action_diversity, mean_conclusion_confidence |
| Expert review | expert_review_metrics(...) |
Placeholder; averages provided scores |
| Simulator fidelity | simulator_fidelity_metrics(sim, real) |
reward_distribution_gap |
training/literature_benchmark.py
run_paper_benchmark(problem_statement, scenario_name, domain_randomise) runs a paper-aligned action pipeline and scores against expected_findings using keyword matching. Returns a PaperBenchmarkResult with match_ratio.
Docker deployment
The server ships with a server/Dockerfile. It uses a multi-stage build based on openenv-base, installs dependencies via uv, and starts uvicorn server.app:app on port 8000.
docker build -f server/Dockerfile -t bio-experiment-env .
docker run -p 8000:8000 bio-experiment-env
The openenv.yaml file configures the deployment for the OpenEnv platform:
spec_version: 1
name: hackathon
type: space
runtime: fastapi
app: server.app:app
port: 8000
Why this is useful
This environment is trying to model a realistic scientific planning loop rather than a toy decision problem:
- actions have prerequisites
- outputs are noisy and imperfect
- budget and time matter
- not every correct-looking answer is well supported
- final conclusions are scored against hidden ground truth
That makes it suitable for:
- agent planning benchmarks
- RL experiments on long-horizon scientific reasoning
- literature-grounded evaluation
- comparing structured policies against LLM-driven planners
Project map
.
βββ client.py # OpenEnv HTTP/WebSocket client
βββ models.py # Shared action / observation / task schemas
βββ openenv.yaml # OpenEnv platform deployment config
βββ pyproject.toml # Package metadata and dependency groups
βββ run_agent.py # Single-episode interactive agent runner
βββ run_agent_unsloth.py # Quantized Unsloth interactive agent runner
βββ server/
β βββ app.py # FastAPI/OpenEnv server entry point
β βββ Dockerfile # Multi-stage Docker build
β βββ hackathon_environment.py # Main environment orchestration
β βββ requirements.txt # Minimal server dependencies
β βββ rewards/
β β βββ reward.py # Decomposed reward model
β βββ rules/
β β βββ engine.py # Biological constraint checking
β βββ simulator/
β β βββ latent_state.py # Hidden biological, technical, progress, resource state
β β βββ noise.py # Seeded stochastic noise model
β β βββ output_generator.py # Per-action simulated output generation
β β βββ transition.py # State transition engine and ACTION_COSTS table
β βββ subagents/ # Placeholder for future sub-agent integration
β βββ tasks/
β βββ generator.py # TaskGenerator with domain randomisation
β βββ scenarios.py # SCENARIO_LIBRARY with 4 curated scenarios
βββ training/
β βββ evaluation.py # EvaluationSuite metrics
β βββ literature_benchmark.py # Paper-backed benchmark flow
β βββ rollout_collection.py # Direct rollout collection helper
β βββ trajectory.py # Trajectory serialization and dataset utilities
βββ training_script.py # TRL GRPO training entry point
βββ training_unsloth.py # Unsloth quantized GRPO training entry point
βββ tests/
βββ test_environment.py
βββ test_literature_benchmark.py
βββ test_models.py
βββ test_rewards.py
βββ test_rules.py
βββ test_simulator.py
βββ test_training_script.py
Quick sanity check
uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q
Those tests verify:
- reset and step lifecycle
- valid vs invalid pipeline behavior
- conclusion termination
- literature-backed scenario selection
- benchmark matching for curated expected findings
Run the full suite with coverage:
uv run pytest tests/ --cov -q