---
title: Bio Experiment Environment Server
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - bioinformatics
---
# Bio Experiment Environment

This repository implements an OpenEnv-compatible reinforcement learning environment for planning biological experiment pipelines. The agent does not directly see the true biological state. Instead, it proposes one structured experiment or analysis step at a time, receives a noisy simulated output, and is rewarded for valid, informative, efficient, and well-calibrated plans.

The environment is designed as a partially observable Markov decision process (POMDP) with:

- hidden ground-truth biology
- hidden technical noise and failure conditions
- visible task metadata, resource usage, step history, and intermediate outputs
- dense step-wise reward plus a terminal reward for conclusion quality
## How it works

At a high level, each episode looks like this:

1. `reset()` picks a biological scenario and seeds the simulator.
2. The agent receives an `ExperimentObservation` describing the task and current visible state.
3. The agent submits an `ExperimentAction` such as `collect_sample`, `run_qc`, or `differential_expression`.
4. The rule engine checks whether the action is valid at this point in the pipeline.
5. The transition engine updates hidden state, spends resources, and asks the output generator to simulate the result.
6. The reward computer scores the step for validity, ordering, information gain, efficiency, novelty, and penalties.
7. The environment returns a new observation with updated history, outputs, discoveries, violations, and reward.
8. The episode ends when the agent synthesizes a conclusion, exhausts resources, or reaches the step limit.
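The loop above can be sketched with a toy stand-in for the environment. `ToyEnv` and `ToyObservation` below are hypothetical simplified placeholders for illustration only, not the real classes (those live in `server/hackathon_environment.py` and `models.py`):

```python
from dataclasses import dataclass, field

@dataclass
class ToyObservation:
    """Simplified stand-in for ExperimentObservation."""
    step: int
    reward: float
    done: bool
    history: list = field(default_factory=list)

class ToyEnv:
    """Minimal mock of the reset()/step() episode protocol."""
    MAX_STEPS = 3  # the real environment uses MAX_STEPS = 30

    def reset(self) -> ToyObservation:
        self.step_count = 0
        self.history = []
        return ToyObservation(step=0, reward=0.0, done=False)

    def step(self, action: str) -> ToyObservation:
        self.step_count += 1
        self.history.append(action)
        # Episodes end on a conclusion or when the step limit is hit.
        done = action == "synthesize_conclusion" or self.step_count >= self.MAX_STEPS
        return ToyObservation(step=self.step_count, reward=0.1, done=done,
                              history=list(self.history))

env = ToyEnv()
obs = env.reset()
for action in ["collect_sample", "run_qc", "synthesize_conclusion"]:
    obs = env.step(action)
    if obs.done:
        break
print(obs.step, obs.done)  # 3 True
```

The real environment returns far richer observations, but the control flow is the same reset-then-step loop.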
## The core mental model

### Hidden state

The simulator keeps a `FullLatentState` that the agent never directly sees. It contains:

- true cell populations and marker genes
- true DE genes, pathways, trajectories, and regulatory networks
- technical factors such as dropout, doublets, ambient RNA, and batch effects
- experiment progress flags
- remaining budget and time
- hidden failure conditions

### Visible state

The agent only sees `ExperimentObservation`, which includes:

- the current `TaskSpec`
- pipeline history
- available assays and tools
- resource usage
- the latest and cumulative intermediate outputs
- discovered markers and candidate mechanisms
- rule violations
- a per-step reward breakdown

This separation is what makes the environment a POMDP rather than a fully observed simulator.
## Main building blocks

### `models.py`

Defines the contracts that all other modules use:

- `ActionType`: 21 discrete experiment steps, grouped into three frozensets: `WET_LAB_ACTIONS` (8), `COMPUTATIONAL_ACTIONS` (10), and `META_ACTIONS` (3)
- `SubagentType`: 9 sub-agent delegate roles (e.g. `wet_lab_planner`, `computational_analyst`, `causal_reasoning_agent`)
- `ExperimentAction`: one structured step chosen by the agent; fields include `action_type`, `method`, `parameters`, `justification`, `confidence` (clamped to `[0, 1]`), `invoked_subagent`, `tool_call_spec`, `input_targets`
- `ExperimentObservation`: what the agent can see after each step; includes `task`, `pipeline_history`, `resource_usage`, `latest_output`, `all_outputs`, `discovered_markers`, `candidate_mechanisms`, `conclusions`, `rule_violations`, `step_reward_breakdown`
- `TaskSpec`: the problem statement, organism, tissue, conditions, budget, time limit, assays, tools, paper references, and expected findings
- `IntermediateOutput`: the simulated artifact returned by a step; carries `output_type`, `success`, `quality_score`, `summary`, `data`, `uncertainty`, `warnings`, `artifacts_available`
- `ConclusionClaim`: structured claims used for final synthesis; carries `claim`, `evidence_steps`, `confidence`, `claim_type`, `supporting_data`
- `PipelineStepRecord`: compact observable record of one past step stored in history
- `ResourceUsage`: budget and time tracking visible to the agent

The action vocabulary is intentionally broad enough to mix wet-lab, computational, and meta-planning actions.
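As one concrete detail, the `[0, 1]` clamp on `confidence` can be sketched in isolation (a hypothetical re-implementation for illustration, not the actual validator in `models.py`):

```python
def clamp_confidence(value: float) -> float:
    """Clamp an agent-supplied confidence into [0, 1], mirroring the
    behaviour described for ExperimentAction.confidence."""
    return max(0.0, min(1.0, value))

print(clamp_confidence(1.7))   # 1.0
print(clamp_confidence(-0.2))  # 0.0
print(clamp_confidence(0.85))  # 0.85
```

Clamping rather than rejecting keeps agent-generated JSON usable even when a model emits out-of-range confidences.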
### `server/tasks/`

This is where episodes come from.

- `scenarios.py` defines a curated library of four biological scenarios as `Scenario` dataclass objects, each bundling a `TaskSpec`, a `LatentBiologicalState`, a `TechnicalState`, hidden failure conditions, and tags
- `generator.py` turns a scenario into a `(TaskSpec, FullLatentState)` pair via `TaskGenerator.generate()`; optional domain randomisation perturbs budget (±30%), time (±20%), technical noise, batch effects, cell proportions, and effect sizes
The four scenarios are:

| Name | Difficulty | Tissue | Problem | Budget | Time |
|---|---|---|---|---|---|
| `cardiac_disease_de` | easy | heart | Differential expression between healthy and dilated cardiomyopathy cardiomyocytes | $80K | 120 days |
| `hematopoiesis_trajectory` | medium | bone marrow | Infer HSC → mature lineage trajectory with three branches | $100K | 150 days |
| `perturbation_immune` | hard | synovial fluid | JAK inhibitor effect on T-cell states in rheumatoid arthritis | $120K | 180 days |
| `biomarker_validation_lung` | medium | lung | Validate SPP1 as a biomarker for pro-fibrotic macrophages in IPF | $90K | 150 days |

Each scenario carries paper references with DOIs, true DE genes with log2FC values, true pathway activities, true regulatory networks, and ground-truth causal mechanisms used for terminal reward calibration.
### `server/simulator/`

This is the simulator itself.

- `latent_state.py` defines `FullLatentState`, the root aggregate of all hidden state. Key sub-structures are `LatentBiologicalState` (true DE genes, pathways, gene programs, trajectory, regulatory network, markers, causal mechanisms), `TechnicalState` (dropout, doublets, ambient RNA, sample quality), `ExperimentProgress` (18 boolean milestone flags plus counts), and `ResourceState` (internal budget and time tracking with exhaustion properties)
- `noise.py` centralises stochasticity in `NoiseModel`. All randomness flows through a single seeded `numpy.Generator`. Methods include `add_expression_noise`, `sample_effect_sizes`, `sample_p_values`, `generate_false_positives`, `generate_false_negatives`, `quality_degradation`, `sample_qc_metric`, `sample_cluster_count`, `shuffle_ranking`, and `coin_flip`
- `output_generator.py` turns an action plus hidden state into a realistic `IntermediateOutput`. Every action type has a dedicated handler conditioned on the latent state; noise is then injected: dropout in expression data, false positives and false negatives in DE and marker results, over/under-clustering, and pathway contamination
- `transition.py` applies action costs from `ACTION_COSTS`, updates progress flags, calls the output generator, degrades quality on soft violations, propagates discovered DE genes and cluster names back into latent state, and decides whether the episode is done

The output generator does not simply echo the action. It conditions outputs on the hidden state, then injects realistic noise.
### `server/rules/engine.py`

The rule engine enforces scientific and procedural constraints before each action is applied.

- hard violations block the action entirely
- soft violations allow the action, but reduce output quality and add reward penalties

The four rule families are:

1. **Prerequisites (HARD)**: each computational step requires the appropriate upstream milestone flag. For example: `normalize_data` requires `data_filtered`, `differential_expression` requires `data_normalized`, `validate_marker` requires `markers_discovered`
2. **Resource constraints (HARD/SOFT)**: budget or time exhaustion is a hard block; an action cost exceeding the remaining budget (when budget > 0) is a soft warning
3. **Redundancy (SOFT)**: repeating an already-completed step such as `run_qc` or `normalize_data`
4. **Causal validity (SOFT)**: synthesizing conclusions without prior DE or clustering; making causal claims without validation evidence; running pathway enrichment before DE
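The prerequisite family can be sketched as a lookup from action to required milestone flag. This is a deliberately simplified, hypothetical version; the real engine in `server/rules/engine.py` handles all four rule families:

```python
# Prerequisite examples taken from the rule descriptions above.
PREREQUISITES = {
    "normalize_data": "data_filtered",
    "differential_expression": "data_normalized",
    "validate_marker": "markers_discovered",
}

def check_prerequisite(action: str, progress_flags: set[str]) -> list[str]:
    """Return hard-violation messages for an action, given the set of
    milestone flags the episode has already completed."""
    required = PREREQUISITES.get(action)
    if required and required not in progress_flags:
        return [f"{action} requires milestone '{required}'"]
    return []

# DE before normalization is a hard violation; after it, the action is allowed.
print(check_prerequisite("differential_expression", {"data_filtered"}))
print(check_prerequisite("differential_expression",
                         {"data_filtered", "data_normalized"}))  # []
```

A flat flag-to-flag mapping like this is enough to express a DAG of pipeline prerequisites, because each milestone flag is itself only set after its own prerequisites were satisfied.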
### `server/rewards/reward.py`

Rewards are decomposed rather than being a single opaque number.

Per-step reward formula:

```
R_t = r_validity + r_ordering + r_info_gain + r_efficiency + r_novelty + r_penalty + γ·[φ(s_{t+1}) − φ(s_t)]
```

| Component | Weight | Description |
|---|---|---|
| `validity` | 0.3 | `1.0` if output succeeded, `−1.0` if hard violation |
| `ordering` | 0.2 | `1.0` if natural next step, `0.3` otherwise |
| `info_gain` | 0.4 | `quality_score × (1 − uncertainty)` |
| `efficiency` | 0.3 | `max(0, 1 − 5 × budget_fraction_used)` |
| `novelty` | +0.1 | Bonus when no soft violations |
| `penalty` | −0.15/violation | Per soft violation |
| `shaping` | γ = 0.99 | Potential-based over 12 progress milestones |
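Read together, the formula and table can be sketched as a single function. This is a hypothetical re-implementation of the weights above for illustration, not the code in `server/rewards/reward.py`:

```python
def step_reward(success: bool, hard_violation: bool, natural_order: bool,
                quality: float, uncertainty: float, budget_fraction_used: float,
                n_soft_violations: int, potential_prev: float,
                potential_next: float, gamma: float = 0.99) -> float:
    """Combine the per-step reward components with the weights from the table."""
    r = 0.3 * (-1.0 if hard_violation else (1.0 if success else 0.0))
    r += 0.2 * (1.0 if natural_order else 0.3)
    r += 0.4 * quality * (1.0 - uncertainty)
    r += 0.3 * max(0.0, 1.0 - 5.0 * budget_fraction_used)
    r += 0.1 if n_soft_violations == 0 else 0.0    # novelty bonus
    r -= 0.15 * n_soft_violations                  # soft-violation penalty
    r += gamma * (potential_next - potential_prev) # potential-based shaping
    return r

# A clean, well-ordered step that advances one of the 12 milestones.
r = step_reward(success=True, hard_violation=False, natural_order=True,
                quality=0.9, uncertainty=0.1, budget_fraction_used=0.1,
                n_soft_violations=0, potential_prev=0.0, potential_next=1 / 12)
```

Potential-based shaping of this form is reward-neutral for the optimal policy, which is why it can be layered on top of the component sum without changing what the agent should ultimately do.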
Terminal reward adds:

| Component | Weight | Description |
|---|---|---|
| Pipeline completeness | 3.0 | Fraction of 7 core milestones completed |
| Calibration | 4.0 | How well conclusions match hidden markers and mechanisms |
| Budget + time efficiency | 1.0 | Average fraction of budget and time remaining |
| Overconfidence penalty | −0.5/claim | For high-confidence claims (`> 0.8`) that are wrong |
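The terminal components combine the same way (again a hypothetical sketch of the weights above, not the actual implementation):

```python
def terminal_reward(milestones_done: int, calibration: float,
                    budget_left_frac: float, time_left_frac: float,
                    n_overconfident_wrong: int) -> float:
    """Combine the terminal reward components with the weights from the table."""
    r = 3.0 * (milestones_done / 7)                      # pipeline completeness
    r += 4.0 * calibration                               # conclusion calibration
    r += 1.0 * (budget_left_frac + time_left_frac) / 2   # resource efficiency
    r -= 0.5 * n_overconfident_wrong                     # overconfidence penalty
    return r

# Complete pipeline, decent calibration, some resources left, one bad claim.
r = terminal_reward(milestones_done=7, calibration=0.75,
                    budget_left_frac=0.4, time_left_frac=0.2,
                    n_overconfident_wrong=1)
```

The 4.0 weight on calibration makes matching the hidden ground truth worth more than completing every milestone, which discourages box-ticking pipelines with weak conclusions.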
This makes the environment easier to debug, benchmark, and train against.

### `server/hackathon_environment.py`

This is the orchestration layer that ties everything together.

On `reset()` it:

- seeds the noise model
- generates a task and latent state via `TaskGenerator`
- clears history, outputs, discoveries, conclusions, and cumulative reward

On `step()` it:

- checks rules
- calls the transition engine
- computes reward
- appends a `PipelineStepRecord`
- updates discovered markers and candidate mechanisms
- stores conclusion claims if the action is `synthesize_conclusion`
- builds the next `ExperimentObservation`

This file is the best place to read if you want the end-to-end control flow.
## What actually happens on one step

Here is the concrete order of operations for `env.step(action)`:

1. Increment the step counter.
2. Copy the previous latent state for reward comparison.
3. Run rule checks and split violations into hard vs soft.
4. If there is a hard violation, return a failure report without applying the action.
5. Otherwise deduct budget and time based on `ACTION_COSTS`.
6. Update latent progress flags like `samples_collected`, `qc_performed`, or `de_performed`.
7. Generate a structured simulated output for the chosen action.
8. If there were soft violations, degrade output quality (×0.5) and attach warnings.
9. Propagate artifacts back into latent state, such as discovered DE genes or cluster names.
10. Compute the decomposed reward from the state transition plus output quality.
11. If the episode is ending, compute terminal reward from completeness and conclusion calibration.
12. Return an observation that exposes the visible summary but not the hidden truth.
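Step 8, the soft-violation degradation, can be sketched in isolation (a hypothetical helper for illustration; the real logic lives in `server/simulator/transition.py`):

```python
def apply_soft_violations(quality_score: float, warnings: list[str],
                          soft_violations: list[str]) -> tuple[float, list[str]]:
    """Halve output quality and attach one warning per soft violation,
    as described in step 8 above."""
    if soft_violations:
        quality_score *= 0.5
        warnings = warnings + [f"soft violation: {v}" for v in soft_violations]
    return quality_score, warnings

quality, warnings = apply_soft_violations(0.8, [], ["redundant run_qc"])
print(quality)   # 0.4
print(warnings)  # ['soft violation: redundant run_qc']
```

Because degraded quality feeds straight into the `info_gain` reward component, a soft violation is penalised twice: once explicitly, and once through the lower-quality output it produces.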
## Action costs

Each action deducts from the episode's budget and time. Computational steps also accrue compute hours.

| Action | Budget | Time (days) |
|---|---|---|
| `sequence_cells` | $15,000 | 5 |
| `prepare_library` | $8,000 | 3 |
| `collect_sample` | $5,000 | 7 |
| `validate_marker` | $5,000 | 14 |
| `culture_cells` | $3,000 | 14 |
| `perturb_gene` | $2,000 | 3 |
| `perturb_compound` | $1,000 | 2 |
| `select_cohort` | $500 | 1 |
| `run_qc` | $100 | 0.5 |
| `integrate_batches` | $300 | 1 |
| `regulatory_network_inference` | $200 | 1 |
| `cluster_cells` | $150 | 0.5 |
| `differential_expression`, `trajectory_analysis`, `pathway_enrichment` | $100–200 | 0.5–1 |
| `filter_data`, `normalize_data`, `marker_selection` | $50–100 | 0.25–0.5 |
| `synthesize_conclusion`, `design_followup_experiment`, `request_subagent_review` | $0 | 0.25–0.5 |
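The bookkeeping implied by this table can be sketched as follows (the dictionary below reproduces a subset of the costs above; the authoritative `ACTION_COSTS` table lives in `server/simulator/transition.py`):

```python
# (budget dollars, time in days) for a few actions from the table above.
ACTION_COSTS = {
    "collect_sample": (5_000, 7.0),
    "sequence_cells": (15_000, 5.0),
    "run_qc": (100, 0.5),
    "synthesize_conclusion": (0, 0.25),
}

def spend(budget: float, time_days: float, action: str) -> tuple[float, float]:
    """Deduct an action's cost from the remaining budget and time."""
    cost, days = ACTION_COSTS[action]
    return budget - cost, time_days - days

# Three steps into the easy cardiac scenario ($80K, 120 days):
budget, days = 80_000, 120.0
for action in ["collect_sample", "sequence_cells", "run_qc"]:
    budget, days = spend(budget, days, action)
print(budget, days)  # 59900 107.5
```

Note the cost structure: wet-lab actions dominate the budget, so an efficient agent front-loads cheap computational steps wherever prerequisites allow.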
## Typical successful pipeline

Most scenarios reward a sensible experiment order similar to:

1. `collect_sample`
2. `prepare_library`
3. `sequence_cells`
4. `run_qc`
5. `filter_data`
6. `normalize_data`
7. `cluster_cells`
8. one or more of:
   `differential_expression`, `trajectory_analysis`, `pathway_enrichment`,
   `regulatory_network_inference`, `marker_selection`, `validate_marker`
9. `synthesize_conclusion`

The exact best sequence depends on the scenario:

- trajectory scenarios benefit from `trajectory_analysis` and regulatory inference
- biomarker scenarios benefit from DE, marker selection, and validation
- perturbation scenarios benefit from pathway-level interpretation
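A quick way to sanity-check an agent's trace against the canonical order is a relative-order test. This helper is hypothetical tooling for your own analysis scripts, not part of the environment (the in-environment ordering bonus uses the reward model's own logic):

```python
# Canonical backbone from the numbered list above.
CANONICAL_ORDER = [
    "collect_sample", "prepare_library", "sequence_cells", "run_qc",
    "filter_data", "normalize_data", "cluster_cells",
    "differential_expression", "synthesize_conclusion",
]

def follows_canonical_order(actions: list[str]) -> bool:
    """True if the known actions in a trace appear in the same relative
    order as the canonical pipeline (unknown actions are ignored)."""
    ranks = [CANONICAL_ORDER.index(a) for a in actions if a in CANONICAL_ORDER]
    return ranks == sorted(ranks)

print(follows_canonical_order(
    ["collect_sample", "run_qc", "differential_expression"]))  # True
print(follows_canonical_order(
    ["differential_expression", "collect_sample"]))            # False
```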
## Episode termination

An episode ends when one of the following happens:

- the agent chooses `synthesize_conclusion`
- resources are exhausted
- the environment reaches `MAX_STEPS`, which is currently `30`

## Installation

Dependencies are managed with `uv`. The package requires Python ≥ 3.10.

> **H100 Jupyter notebook setup:** See [H100_JUPYTER_SETUP.md](H100_JUPYTER_SETUP.md) for environment setup on NVIDIA H100 instances with Jupyter.

```bash
# Core environment only
uv sync

# With dev/test tools
uv sync --extra dev

# With training dependencies (TRL, transformers, torch)
uv sync --extra train

# With bioinformatics extras (scanpy, biopython, gseapy)
uv sync --extra bio
```

Key dependency groups from `pyproject.toml`:

| Group | Key packages |
|---|---|
| core | `openenv-core[core]>=0.2.0`, `numpy`, `scipy`, `pydantic>=2.0` |
| train | `trl>=0.29`, `transformers>=5.3`, `accelerate`, `datasets`, `torch`, `matplotlib` |
| bio | `scanpy`, `biopython`, `gseapy` |
| dev | `pytest`, `pytest-cov` |
## Interfaces you can use

### 1. In-process environment

Use `BioExperimentEnvironment` when you want direct Python access with full structured observations:

```python
from models import ActionType, ExperimentAction
from server.hackathon_environment import BioExperimentEnvironment

env = BioExperimentEnvironment(scenario_name="biomarker_validation_lung")
obs = env.reset()

obs = env.step(ExperimentAction(
    action_type=ActionType.COLLECT_SAMPLE,
    parameters={"n_samples": 8},
    justification="Collect enough material for downstream single-cell analysis.",
    confidence=0.8,
))

print(obs.task.problem_statement)
print(obs.latest_output.summary if obs.latest_output else "No output yet")
print(obs.reward)
```

The constructor accepts:

- `scenario_name: Optional[str]`: pin to a specific scenario; `None` picks randomly each episode
- `domain_randomise: bool = True`: perturbs scenario parameters for generalization
### 2. OpenEnv client/server mode

Use the FastAPI app when you want to serve the environment over HTTP and WebSocket:

```bash
uv sync --extra dev
uv run uvicorn server.app:app --reload
```

The server exposes five endpoints:

| Method | Path | Description |
|---|---|---|
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Execute one action |
| `GET` | `/state` | Current environment state |
| `GET` | `/schema` | Action/observation JSON schemas |
| `WS` | `/ws` | WebSocket for persistent sessions |

Then connect with the client:

```python
from client import BioExperimentEnv
from models import ActionType, ExperimentAction

with BioExperimentEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    result = env.step(ExperimentAction(action_type=ActionType.COLLECT_SAMPLE))
    print(result.observation.latest_output.summary)
```

The environment class supports concurrent sessions, but the bundled server is currently configured with `max_concurrent_envs=1` in `server/app.py`.
### 3. Running a local agent

`run_agent.py` runs a single interactive episode using a local Hugging Face model:

```bash
uv run python run_agent.py
```

For H100 and other large-GPU workflows, prefer the quantized Unsloth path:

```bash
uv sync --extra train
uv run python run_agent_unsloth.py
```

Configuration is via environment variables:

| Variable | Default | Description |
|---|---|---|
| `RUN_AGENT_USE_PIPELINE` | `0` | Use the HF `pipeline()` path instead of direct generate |
| `RUN_AGENT_MAX_EPISODE_STEPS` | `12` | Maximum number of planning steps |

The local model defaults to `Qwen/Qwen3.5-0.8B` with sampling parameters `temperature=0.7`, `top_p=0.8`, `top_k=20`, `repetition_penalty=1.3`. The episode runs up to `MAX_EPISODE_STEPS = 12` steps. When action parsing fails, the script falls back to an observation-aware action that respects prerequisites.

PowerShell note: older PowerShell versions do not support `&&`. Run commands from the target directory directly, or use `;` as the command separator.

Windows runtime warnings:

- If you see Hugging Face symlink-cache warnings, functionality is unaffected; optionally set `HF_HUB_DISABLE_SYMLINKS_WARNING=1`.
- If you see flash attention / causal-conv fallback warnings, execution continues with a slower PyTorch path.
### 4. GRPO training

`training_script.py` follows the TRL GRPO pattern and uses OpenEnv rewards to score generated action JSON against this environment.

```bash
uv sync --extra train
uv run python training_script.py --dry-run
uv run python training_script.py --model-id Qwen/Qwen3.5-0.8B
```

For H100, the preferred entrypoint is `training_unsloth.py`, which uses Unsloth 4-bit loading plus LoRA for faster quantized GRPO training:

```bash
uv sync --extra train
uv run python training_unsloth.py --dry-run
uv run python training_unsloth.py --model-id Qwen/Qwen3.5-4B
```

**Laptop / mid-range GPU (e.g. 12 GB VRAM):** use a reduced batch size and sequence length to avoid OOM:

```bash
uv sync --extra train
uv pip install unsloth unsloth_zoo --no-deps  # if using training_unsloth.py
uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base \
  --output-dir training/grpo-unsloth-qwen3-4b \
  --dataset-episodes 12 --rollout-steps 6 \
  --per-device-train-batch-size 1 --num-generations 2 \
  --gradient-accumulation-steps 4 --max-seq-length 1024 \
  --trust-remote-code
```

If you still hit OOM, try `--max-seq-length 768` or `--num-generations 1`.

**PyTorch CUDA:** use the PyTorch index that matches your GPU. For older cards (RTX 20/30/40 series): `uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121`. For **RTX 50 series (Blackwell, sm_120)** you need a CUDA 12.8 build:

```bash
uv pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
uv pip install triton-windows  # required by Unsloth on Windows
```
Key arguments:

| Argument | Default | Description |
|---|---|---|
| `--model-id` | `Qwen/Qwen2.5-7B-Instruct` | Base model to fine-tune |
| `--output-dir` | `training/grpo-output` | Save directory |
| `--dataset-episodes` | `8` | Rollout episodes for the prompt dataset |
| `--rollout-steps` | `6` | Steps per episode during collection |
| `--collection-policy` | `heuristic` | `random` or `heuristic` |
| `--reward-backend` | `local` | `local` (in-process) or `remote` (live server) |
| `--base-url` | `http://localhost:8000` | Server URL for the remote backend |
| `--scenario-name` | all | Repeatable; restricts which scenarios are used |
| `--domain-randomise` | off | Enable domain randomisation |
| `--num-generations` | `4` | GRPO generations per prompt |
| `--max-completion-length` | `160` | Max tokens for model completions |
| `--max-prompt-length` | `768` | Max tokens for prompts |
| `--learning-rate` | `5e-6` | AdamW learning rate |
| `--dry-run` | off | Build data and test reward without training |

By default the reward function reconstructs prompt states locally so the prompt and reward stay aligned. Switch to a live server-backed reward loop with `--reward-backend remote --base-url http://localhost:8000`.
`training_unsloth.py` adds H100-oriented options such as `--max-seq-length`, `--disable-4bit`, and LoRA settings (`--lora-r`, `--lora-alpha`, `--lora-dropout`). vLLM fast inference is disabled to avoid dependency conflicts.

After training, the script saves plots to the output directory:

- `training_loss.png`
- `training_reward.png`
- `training_metric.png`
- `training_dashboard.png`
- `training_plot_manifest.json`

Use `--plot-metric-key <logged_key>` to force a specific extra metric on the third chart; otherwise the script auto-selects a useful logged metric such as KL or gradient norm.
### 5. Rollout collection

`training/rollout_collection.py` collects direct environment rollouts into trajectory files:

```bash
uv run python -m training.rollout_collection
```

This runs N episodes with a `random` or `heuristic` policy, saves JSON trajectories, and prints evaluation metrics.

### 6. Benchmark and scripted agents

- `training/literature_benchmark.py` runs paper-aligned action sequences and compares outcomes against curated expected findings
- `training/rollout_collection.py` collects direct environment rollouts into trajectory files
- `training_script.py` trains a GRPO policy with OpenEnv reward calls
- `training_unsloth.py` trains a quantized GRPO policy with Unsloth on H100-class GPUs
- `run_agent.py` runs a local language model planner against the environment
- `run_agent_unsloth.py` runs the planner with Unsloth 4-bit loading for faster inference
- `training/trajectory.py` stores trajectories for offline RL, imitation learning, replay, and evaluation
- `training/evaluation.py` computes online, benchmark, expert-review, and fidelity-oriented metrics
## Training utilities

### `training/trajectory.py`

Provides `TrajectoryStep`, `Trajectory`, and `TrajectoryDataset` for episode serialization.

- `TrajectoryStep` stores `action`, `observation`, `reward`, `done`, `reward_breakdown`, and an optional `latent_snapshot`
- `Trajectory` accumulates steps with `add_step()`, computes `total_reward`, and exposes `save(path)` / `load(path)`
- `TrajectoryDataset` wraps a list of trajectories with `filter_successful()`, `save_dir()`, `load_dir()`, and `summary()` (n, success_rate, mean_reward, mean_length, max/min reward)
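The shape of these utilities can be sketched with heavily simplified stand-ins (hypothetical; the real classes in `training/trajectory.py` store full observations and reward breakdowns):

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field

@dataclass
class TrajectoryStep:
    """Simplified stand-in: the real step also stores observation,
    reward_breakdown, and an optional latent_snapshot."""
    action: str
    reward: float
    done: bool

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

    def add_step(self, step: TrajectoryStep) -> None:
        self.steps.append(step)

    @property
    def total_reward(self) -> float:
        return sum(s.reward for s in self.steps)

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump([asdict(s) for s in self.steps], f)

    @classmethod
    def load(cls, path: str) -> "Trajectory":
        with open(path) as f:
            return cls(steps=[TrajectoryStep(**d) for d in json.load(f)])

traj = Trajectory()
traj.add_step(TrajectoryStep("collect_sample", 1.0, False))
traj.add_step(TrajectoryStep("synthesize_conclusion", 0.5, True))
path = os.path.join(tempfile.gettempdir(), "trajectory.json")
traj.save(path)
print(Trajectory.load(path).total_reward)  # 1.5
```

JSON round-tripping like this is what makes the collected rollouts usable for offline RL and imitation learning later.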
### `training/evaluation.py`

`EvaluationSuite` is a stateless class with four families of `@staticmethod` methods:

| Family | Method | Metrics |
|---|---|---|
| Online RL | `online_metrics(trajectories)` | `mean_return`, `median_return`, `std_return`, `mean_episode_length`, `success_rate` |
| Offline benchmark | `benchmark_metrics(dataset)` | `pipeline_validity_rate`, `ordering_score`, `action_diversity`, `mean_conclusion_confidence` |
| Expert review | `expert_review_metrics(...)` | Placeholder; averages provided scores |
| Simulator fidelity | `simulator_fidelity_metrics(sim, real)` | `reward_distribution_gap` |
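The online-RL family reduces to standard aggregation over episode returns. Here is a standalone sketch of those metrics (a hypothetical re-implementation operating on plain lists rather than `Trajectory` objects):

```python
import statistics

def online_metrics(returns: list[float], lengths: list[int],
                   successes: list[bool]) -> dict:
    """Aggregate per-episode statistics into the online-RL metrics above."""
    return {
        "mean_return": statistics.mean(returns),
        "median_return": statistics.median(returns),
        "std_return": statistics.pstdev(returns),
        "mean_episode_length": statistics.mean(lengths),
        "success_rate": sum(successes) / len(successes),
    }

m = online_metrics(returns=[1.0, 2.0, 3.0], lengths=[10, 12, 14],
                   successes=[True, True, False])
print(m["mean_return"], m["success_rate"])
```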
### `training/literature_benchmark.py`

`run_paper_benchmark(problem_statement, scenario_name, domain_randomise)` runs a paper-aligned action pipeline and scores it against `expected_findings` using keyword matching. It returns a `PaperBenchmarkResult` with a `match_ratio`.
## Docker deployment

The server ships with a `server/Dockerfile`. It uses a multi-stage build based on `openenv-base`, installs dependencies via `uv`, and starts `uvicorn server.app:app` on port 8000.

```bash
docker build -f server/Dockerfile -t bio-experiment-env .
docker run -p 8000:8000 bio-experiment-env
```

The `openenv.yaml` file configures the deployment for the OpenEnv platform:

```yaml
spec_version: 1
name: hackathon
type: space
runtime: fastapi
app: server.app:app
port: 8000
```
## Why this is useful

This environment models a realistic scientific planning loop rather than a toy decision problem:

- actions have prerequisites
- outputs are noisy and imperfect
- budget and time matter
- not every correct-looking answer is well supported
- final conclusions are scored against hidden ground truth

That makes it suitable for:

- agent planning benchmarks
- RL experiments on long-horizon scientific reasoning
- literature-grounded evaluation
- comparing structured policies against LLM-driven planners
## Project map

```text
.
├── client.py                    # OpenEnv HTTP/WebSocket client
├── models.py                    # Shared action / observation / task schemas
├── openenv.yaml                 # OpenEnv platform deployment config
├── pyproject.toml               # Package metadata and dependency groups
├── run_agent.py                 # Single-episode interactive agent runner
├── run_agent_unsloth.py         # Quantized Unsloth interactive agent runner
├── server/
│   ├── app.py                   # FastAPI/OpenEnv server entry point
│   ├── Dockerfile               # Multi-stage Docker build
│   ├── hackathon_environment.py # Main environment orchestration
│   ├── requirements.txt         # Minimal server dependencies
│   ├── rewards/
│   │   └── reward.py            # Decomposed reward model
│   ├── rules/
│   │   └── engine.py            # Biological constraint checking
│   ├── simulator/
│   │   ├── latent_state.py      # Hidden biological, technical, progress, resource state
│   │   ├── noise.py             # Seeded stochastic noise model
│   │   ├── output_generator.py  # Per-action simulated output generation
│   │   └── transition.py        # State transition engine and ACTION_COSTS table
│   ├── subagents/               # Placeholder for future sub-agent integration
│   └── tasks/
│       ├── generator.py         # TaskGenerator with domain randomisation
│       └── scenarios.py         # SCENARIO_LIBRARY with 4 curated scenarios
├── training/
│   ├── evaluation.py            # EvaluationSuite metrics
│   ├── literature_benchmark.py  # Paper-backed benchmark flow
│   ├── rollout_collection.py    # Direct rollout collection helper
│   └── trajectory.py            # Trajectory serialization and dataset utilities
├── training_script.py           # TRL GRPO training entry point
├── training_unsloth.py          # Unsloth quantized GRPO training entry point
└── tests/
    ├── test_environment.py
    ├── test_literature_benchmark.py
    ├── test_models.py
    ├── test_rewards.py
    ├── test_rules.py
    ├── test_simulator.py
    └── test_training_script.py
```
## Quick sanity check

```bash
uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q
```

Those tests verify:

- the reset and step lifecycle
- valid vs invalid pipeline behavior
- conclusion termination
- literature-backed scenario selection
- benchmark matching for curated expected findings

Run the full suite with coverage:

```bash
uv run pytest tests/ --cov -q
```