---
title: Bio Experiment Environment Server
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - bioinformatics
---


# Bio Experiment Environment

This repository implements an OpenEnv-compatible reinforcement learning environment for planning biological experiment pipelines. The agent does not directly see the true biological state. Instead, it proposes one structured experiment or analysis step at a time, receives a noisy simulated output, and is rewarded for valid, informative, efficient, and well-calibrated plans.

The environment is designed as a partially observable Markov decision process (POMDP) with:

- hidden ground-truth biology
- hidden technical noise and failure conditions
- visible task metadata, resource usage, step history, and intermediate outputs
- dense step-wise reward plus terminal reward for conclusion quality

## How it works

At a high level, each episode looks like this:

1. `reset()` picks a biological scenario and seeds the simulator.
2. The agent receives an `ExperimentObservation` describing the task and current visible state.
3. The agent submits an `ExperimentAction` such as `collect_sample`, `run_qc`, or `differential_expression`.
4. The rule engine checks whether the action is valid at this point in the pipeline.
5. The transition engine updates hidden state, spends resources, and asks the output generator to simulate the result.
6. The reward computer scores the step for validity, ordering, information gain, efficiency, novelty, and penalties.
7. The environment returns a new observation with updated history, outputs, discoveries, violations, and reward.
8. The episode ends when the agent synthesizes a conclusion, exhausts resources, or reaches the step limit.
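
A minimal in-process sketch of that loop, using the API described under "Interfaces you can use" below (the `RUN_QC` member name is an assumption inferred from the `run_qc` action string):

```python
# Minimal episode-loop sketch; not a full agent, just the reset/step cycle.
# Assumes ActionType members are the uppercased action strings (e.g. RUN_QC).
from models import ActionType, ExperimentAction
from server.hackathon_environment import BioExperimentEnvironment

env = BioExperimentEnvironment(scenario_name="cardiac_disease_de")
obs = env.reset()
for step_type in (ActionType.COLLECT_SAMPLE, ActionType.RUN_QC):
    obs = env.step(ExperimentAction(
        action_type=step_type,
        justification="Follow the canonical wet-lab opening.",
        confidence=0.7,
    ))
    print(obs.reward, obs.rule_violations)
```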

## The core mental model

### Hidden state

The simulator keeps a `FullLatentState` that the agent never directly sees. It contains:

- true cell populations and marker genes
- true DE genes, pathways, trajectories, and regulatory networks
- technical factors such as dropout, doublets, ambient RNA, and batch effects
- experiment progress flags
- remaining budget and time
- hidden failure conditions

### Visible state

The agent only sees `ExperimentObservation`, which includes:

- the current `TaskSpec`
- pipeline history
- available assays and tools
- resource usage
- the latest and cumulative intermediate outputs
- discovered markers and candidate mechanisms
- rule violations
- per-step reward breakdown

This separation is what makes the environment a POMDP rather than a fully observed simulator.

## Main building blocks

### `models.py`

Defines the contracts that all other modules use:

- `ActionType`: 21 discrete experiment steps, grouped into three frozensets – `WET_LAB_ACTIONS` (8), `COMPUTATIONAL_ACTIONS` (10), and `META_ACTIONS` (3)
- `SubagentType`: 9 sub-agent delegate roles (e.g. `wet_lab_planner`, `computational_analyst`, `causal_reasoning_agent`)
- `ExperimentAction`: one structured step chosen by the agent; fields include `action_type`, `method`, `parameters`, `justification`, `confidence` (clamped to `[0, 1]`), `invoked_subagent`, `tool_call_spec`, `input_targets`
- `ExperimentObservation`: what the agent can see after each step; includes `task`, `pipeline_history`, `resource_usage`, `latest_output`, `all_outputs`, `discovered_markers`, `candidate_mechanisms`, `conclusions`, `rule_violations`, `step_reward_breakdown`
- `TaskSpec`: the problem statement, organism, tissue, conditions, budget, time limit, assays, tools, paper references, and expected findings
- `IntermediateOutput`: the simulated artifact returned by a step; carries `output_type`, `success`, `quality_score`, `summary`, `data`, `uncertainty`, `warnings`, `artifacts_available`
- `ConclusionClaim`: structured claims used for final synthesis; carries `claim`, `evidence_steps`, `confidence`, `claim_type`, `supporting_data`
- `PipelineStepRecord`: compact observable record of one past step stored in history
- `ResourceUsage`: budget and time tracking visible to the agent

The action vocabulary is intentionally broad enough to mix wet-lab, computational, and meta-planning actions.
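
As a sketch of the action contract (only the lowercase action strings and the "clamped to `[0, 1]`" behaviour are documented above; the uppercase member names and eager clamping at construction are assumptions):

```python
# Sketch of a fully populated ExperimentAction; field names come from the
# contract above, enum member names are assumed from the lowercase strings.
from models import ActionType, ExperimentAction, SubagentType

action = ExperimentAction(
    action_type=ActionType.DIFFERENTIAL_EXPRESSION,
    method="wilcoxon",                    # hypothetical method label
    parameters={"groupby": "condition"},  # hypothetical parameter
    justification="Compare healthy vs DCM cardiomyocytes.",
    confidence=1.4,                       # documented to be clamped to [0, 1]
    invoked_subagent=SubagentType.COMPUTATIONAL_ANALYST,
)
print(action.confidence)  # expected 1.0 if clamping happens at construction
```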

### `server/tasks/`

This is where episodes come from.

- `scenarios.py` defines a curated library of four biological scenarios as `Scenario` dataclass objects, each bundling a `TaskSpec`, a `LatentBiologicalState`, a `TechnicalState`, hidden failure conditions, and tags
- `generator.py` turns a scenario into a `(TaskSpec, FullLatentState)` pair via `TaskGenerator.generate()`; optional domain randomisation perturbs budget (±30%), time (±20%), technical noise, batch effects, cell proportions, and effect sizes

The four scenarios are:

| Name | Difficulty | Tissue | Problem | Budget | Time |
|---|---|---|---|---|---|
| `cardiac_disease_de` | easy | heart | Differential expression between healthy and dilated cardiomyopathy cardiomyocytes | $80K | 120 days |
| `hematopoiesis_trajectory` | medium | bone marrow | Infer HSC → mature lineage trajectory with three branches | $100K | 150 days |
| `perturbation_immune` | hard | synovial fluid | JAK inhibitor effect on T-cell states in rheumatoid arthritis | $120K | 180 days |
| `biomarker_validation_lung` | medium | lung | Validate SPP1 as biomarker for pro-fibrotic macrophages in IPF | $90K | 150 days |

Each scenario carries paper references with DOIs, true DE genes with log2FC values, true pathway activities, true regulatory networks, and ground-truth causal mechanisms used for terminal reward calibration.
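
A hedged sketch of producing an episode from a scenario (`TaskGenerator.generate()` is named above; the constructor arguments here mirror the environment's own and are an assumption):

```python
# Sketch: turn a curated scenario into a (TaskSpec, FullLatentState) pair.
# Constructor arguments are assumed; generate() is documented above.
from server.tasks.generator import TaskGenerator

gen = TaskGenerator(scenario_name="hematopoiesis_trajectory", domain_randomise=True)
task, latent = gen.generate()
print(task.problem_statement)  # visible to the agent
# `latent` holds the hidden ground truth and never reaches the observation
```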

### `server/simulator/`

This is the simulator itself.

- `latent_state.py` defines `FullLatentState`, the root aggregate of all hidden state. Key sub-structures are `LatentBiologicalState` (true DE genes, pathways, gene programs, trajectory, regulatory network, markers, causal mechanisms), `TechnicalState` (dropout, doublets, ambient RNA, sample quality), `ExperimentProgress` (18 boolean milestone flags plus counts), and `ResourceState` (internal budget and time tracking with exhaustion properties)
- `noise.py` centralises stochasticity in `NoiseModel`. All randomness flows through a single seeded `numpy.Generator`. Methods include `add_expression_noise`, `sample_effect_sizes`, `sample_p_values`, `generate_false_positives`, `generate_false_negatives`, `quality_degradation`, `sample_qc_metric`, `sample_cluster_count`, `shuffle_ranking`, and `coin_flip`
- `output_generator.py` turns an action plus hidden state into a realistic `IntermediateOutput`. Every action type has a dedicated handler conditioned on the latent state; noise is then injected: dropout in expression data, false positives and false negatives in DE and marker results, over/under-clustering, and pathway contamination
- `transition.py` applies action costs from `ACTION_COSTS`, updates progress flags, calls the output generator, degrades quality on soft violations, propagates discovered DE genes and cluster names back into latent state, and decides whether the episode is done

The output generator does not simply echo the action. It conditions outputs on the hidden state, then injects realistic noise.
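
Because every random draw goes through one seeded `numpy.Generator`, two simulators built with the same seed should replay identically; a sketch (the `NoiseModel` constructor and `coin_flip` signature are assumptions):

```python
# Determinism sketch: same seed, same draws. Signatures are assumed.
from server.simulator.noise import NoiseModel

a = NoiseModel(seed=42)
b = NoiseModel(seed=42)
print(a.coin_flip(0.5) == b.coin_flip(0.5))  # expected: True for matching seeds
```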

### `server/rules/engine.py`

The rule engine enforces scientific and procedural constraints before each action is applied.

- hard violations block the action entirely
- soft violations allow the action, but reduce output quality and add reward penalties

The four rule families are:

1. **Prerequisites (HARD)**: each computational step requires the appropriate upstream milestone flag. For example, `normalize_data` requires `data_filtered`, `differential_expression` requires `data_normalized`, and `validate_marker` requires `markers_discovered`
2. **Resource constraints (HARD/SOFT)**: an exhausted budget or time limit is a hard block; an action cost exceeding the remaining budget (when budget > 0) is a soft warning
3. **Redundancy (SOFT)**: repeating an already-completed step such as `run_qc` or `normalize_data`
4. **Causal validity (SOFT)**: synthesizing conclusions without prior DE or clustering; making causal claims without validation evidence; running pathway enrichment before DE
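
The prerequisite family is essentially a lookup from action to required milestone flag; an illustrative-only sketch (not the engine's actual code) built from the examples above:

```python
# Illustrative prerequisite check; flag names come from the examples above.
PREREQUISITES = {
    "normalize_data": "data_filtered",
    "differential_expression": "data_normalized",
    "validate_marker": "markers_discovered",
}

def violates_prerequisite(action_type: str, flags: dict[str, bool]) -> bool:
    """True when the action's upstream milestone has not been reached (HARD)."""
    required = PREREQUISITES.get(action_type)
    return required is not None and not flags.get(required, False)

print(violates_prerequisite("differential_expression", {"data_filtered": True}))
# True: normalization has not happened yet, so the action is hard-blocked
```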

### `server/rewards/reward.py`

Rewards are decomposed rather than being a single opaque number.

Per-step reward formula:

```
R_t = r_validity + r_ordering + r_info_gain + r_efficiency + r_novelty + r_penalty + γ [φ(s_{t+1}) − φ(s_t)]
```

| Component | Weight | Description |
|---|---|---|
| `validity` | 0.3 | `1.0` if output succeeded, `−1.0` if hard violation |
| `ordering` | 0.2 | `1.0` if natural next step, `0.3` otherwise |
| `info_gain` | 0.4 | `quality_score × (1 − uncertainty)` |
| `efficiency` | 0.3 | `max(0, 1 − 5 × budget_fraction_used)` |
| `novelty` | +0.1 | Bonus when no soft violations |
| `penalty` | −0.15/violation | Per soft violation |
| `shaping` | γ = 0.99 | Potential-based over 12 progress milestones |
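
As a worked example (assuming each weight simply multiplies its raw component, which the table implies but `server/rewards/reward.py` would confirm), a clean, well-ordered step with quality `0.9`, uncertainty `0.2`, and 10% of the budget spent scores:

```python
# Worked per-step reward, before the potential-based shaping term.
# Assumes a weight x raw-value combination; see server/rewards/reward.py.
validity   = 0.3 * 1.0                    # output succeeded
ordering   = 0.2 * 1.0                    # natural next step
info_gain  = 0.4 * (0.9 * (1 - 0.2))      # quality x (1 - uncertainty)
efficiency = 0.3 * max(0, 1 - 5 * 0.10)   # 10% of budget used
novelty    = 0.1                          # no soft violations
penalty    = 0.0
total = validity + ordering + info_gain + efficiency + novelty + penalty
print(round(total, 3))  # 1.038
```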

Terminal reward adds:

| Component | Weight | Description |
|---|---|---|
| Pipeline completeness | 3.0 | Fraction of 7 core milestones completed |
| Calibration | 4.0 | How well conclusions match hidden markers and mechanisms |
| Budget + time efficiency | 1.0 | Average fraction of budget and time remaining |
| Overconfidence penalty | −0.5/claim | For high-confidence claims (`> 0.8`) that are wrong |

This makes the environment easier to debug, benchmark, and train against.

### `server/hackathon_environment.py`

This is the orchestration layer that ties everything together.

On `reset()` it:

- seeds the noise model
- generates a task and latent state via `TaskGenerator`
- clears history, outputs, discoveries, conclusions, and cumulative reward

On `step()` it:

- checks rules
- calls the transition engine
- computes reward
- appends a `PipelineStepRecord`
- updates discovered markers and candidate mechanisms
- stores conclusion claims if the action is `synthesize_conclusion`
- builds the next `ExperimentObservation`

This file is the best place to read if you want the end-to-end control flow.

## What actually happens on one step

Here is the concrete order of operations for `env.step(action)`:

1. Increment the step counter.
2. Copy the previous latent state for reward comparison.
3. Run rule checks and split violations into hard vs soft.
4. If there is a hard violation, return a failure report without applying the action.
5. Otherwise deduct budget and time based on `ACTION_COSTS`.
6. Update latent progress flags like `samples_collected`, `qc_performed`, or `de_performed`.
7. Generate a structured simulated output for the chosen action.
8. If there were soft violations, degrade output quality (×0.5) and attach warnings.
9. Propagate artifacts back into latent state, such as discovered DE genes or cluster names.
10. Compute decomposed reward from state transition plus output quality.
11. If the episode is ending, compute terminal reward from completeness and conclusion calibration.
12. Return an observation that exposes the visible summary but not the hidden truth.

## Action costs

Each action deducts from the episode's budget and time. Computational steps also accrue compute hours.

| Action | Budget | Time (days) |
|---|---|---|
| `sequence_cells` | $15,000 | 5 |
| `prepare_library` | $8,000 | 3 |
| `collect_sample` | $5,000 | 7 |
| `validate_marker` | $5,000 | 14 |
| `culture_cells` | $3,000 | 14 |
| `perturb_gene` | $2,000 | 3 |
| `perturb_compound` | $1,000 | 2 |
| `select_cohort` | $500 | 1 |
| `run_qc` | $100 | 0.5 |
| `integrate_batches` | $300 | 1 |
| `regulatory_network_inference` | $200 | 1 |
| `cluster_cells` | $150 | 0.5 |
| `differential_expression`, `trajectory_analysis`, `pathway_enrichment` | $100–200 | 0.5–1 |
| `filter_data`, `normalize_data`, `marker_selection` | $50–100 | 0.25–0.5 |
| `synthesize_conclusion`, `design_followup_experiment`, `request_subagent_review` | $0 | 0.25–0.5 |
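
The numbers above come from the `ACTION_COSTS` table in `server/simulator/transition.py`; a plausible shape for that table (the exact keys and layout are assumptions) looks like:

```python
# Hedged sketch of the cost table's shape; the real mapping lives in
# server/simulator/transition.py and may be structured differently.
ACTION_COSTS: dict[str, tuple[float, float]] = {
    "collect_sample": (5_000, 7),     # ($ budget, days)
    "sequence_cells": (15_000, 5),
    "run_qc": (100, 0.5),
    "synthesize_conclusion": (0, 0.25),
}

cost, days = ACTION_COSTS["sequence_cells"]
print(f"sequence_cells: ${cost:,.0f}, {days} days")
```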

## Typical successful pipeline

Most scenarios reward a sensible experiment order similar to:

1. `collect_sample`
2. `prepare_library`
3. `sequence_cells`
4. `run_qc`
5. `filter_data`
6. `normalize_data`
7. `cluster_cells`
8. one or more of:
   `differential_expression`, `trajectory_analysis`, `pathway_enrichment`,
   `regulatory_network_inference`, `marker_selection`, `validate_marker`
9. `synthesize_conclusion`
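
This order makes a reasonable scripted baseline; a sketch that replays it in-process (assuming `ActionType` members are the uppercased action strings):

```python
# Replay the canonical order as a scripted baseline and read the final reward.
from models import ActionType, ExperimentAction
from server.hackathon_environment import BioExperimentEnvironment

CANONICAL = [
    "collect_sample", "prepare_library", "sequence_cells", "run_qc",
    "filter_data", "normalize_data", "cluster_cells",
    "differential_expression", "synthesize_conclusion",
]

env = BioExperimentEnvironment(scenario_name="cardiac_disease_de")
obs = env.reset()
for name in CANONICAL:
    obs = env.step(ExperimentAction(action_type=ActionType[name.upper()],
                                    justification=f"Canonical step: {name}"))
print(obs.reward)
```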

The exact best sequence depends on the scenario:

- trajectory scenarios benefit from `trajectory_analysis` and regulatory inference
- biomarker scenarios benefit from DE, marker selection, and validation
- perturbation scenarios benefit from pathway-level interpretation

## Episode termination

An episode ends when one of the following happens:

- the agent chooses `synthesize_conclusion`
- resources are exhausted
- the environment reaches `MAX_STEPS`, which is currently `30`

## Installation

Dependencies are managed with `uv`. The package requires Python ≥ 3.10.

> **H100 Jupyter notebook setup:** See [H100_JUPYTER_SETUP.md](H100_JUPYTER_SETUP.md) for environment setup on NVIDIA H100 instances with Jupyter.

```bash
# Core environment only
uv sync

# With dev/test tools
uv sync --extra dev

# With training dependencies (TRL, transformers, torch)
uv sync --extra train

# With bioinformatics extras (scanpy, biopython, gseapy)
uv sync --extra bio
```

Key dependency groups from `pyproject.toml`:

| Group | Key packages |
|---|---|
| core | `openenv-core[core]>=0.2.0`, `numpy`, `scipy`, `pydantic>=2.0` |
| train | `trl>=0.29`, `transformers>=4.53`, `accelerate`, `datasets`, `torch`, `matplotlib` |
| bio | `scanpy`, `biopython`, `gseapy` |
| dev | `pytest`, `pytest-cov` |

## Interfaces you can use

### 1. In-process environment

Use `BioExperimentEnvironment` when you want direct Python access with full structured observations:

```python
from models import ActionType, ExperimentAction
from server.hackathon_environment import BioExperimentEnvironment

env = BioExperimentEnvironment(scenario_name="biomarker_validation_lung")
obs = env.reset()

obs = env.step(ExperimentAction(
    action_type=ActionType.COLLECT_SAMPLE,
    parameters={"n_samples": 8},
    justification="Collect enough material for downstream single-cell analysis.",
    confidence=0.8,
))

print(obs.task.problem_statement)
print(obs.latest_output.summary if obs.latest_output else "No output yet")
print(obs.reward)
```

The constructor accepts:
- `scenario_name: Optional[str]` – pin to a specific scenario; `None` picks randomly each episode
- `domain_randomise: bool = True` – perturbs scenario parameters for generalization

### 2. OpenEnv client/server mode

Use the FastAPI app when you want to serve the environment over HTTP and WebSocket:

```bash
uv sync --extra dev
uv run uvicorn server.app:app --reload
```

The server exposes five endpoints:

| Method | Path | Description |
|---|---|---|
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Execute one action |
| `GET` | `/state` | Current environment state |
| `GET` | `/schema` | Action/observation JSON schemas |
| `WS` | `/ws` | WebSocket for persistent sessions |
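
For quick manual testing without the bundled client, plain HTTP also works; this sketch assumes the `requests` package and guesses the `/step` payload shape (consult `GET /schema` for the authoritative contract):

```python
# Raw-HTTP sketch; payload shapes are assumptions, GET /schema is authoritative.
import requests

BASE = "http://localhost:8000"

obs = requests.post(f"{BASE}/reset").json()
result = requests.post(
    f"{BASE}/step",
    json={"action": {"action_type": "collect_sample", "confidence": 0.8}},
).json()
print(result)
```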

Then connect with the client:

```python
from client import BioExperimentEnv
from models import ActionType, ExperimentAction

with BioExperimentEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    result = env.step(ExperimentAction(action_type=ActionType.COLLECT_SAMPLE))
    print(result.observation.latest_output.summary)
```

The environment class supports concurrent sessions, but the bundled server is currently configured with `max_concurrent_envs=1` in `server/app.py`.

### 3. Running a local agent

`run_agent.py` runs a single interactive episode using a local Hugging Face model:

```bash
uv run python run_agent.py
```

For H100 and other large-GPU workflows, prefer the quantized Unsloth path:

```bash
uv sync --extra train
uv run python run_agent_unsloth.py
```

Configuration is via environment variables:

| Variable | Default | Description |
|---|---|---|
| `RUN_AGENT_USE_PIPELINE` | `0` | Use HF `pipeline()` path instead of direct generate |
| `RUN_AGENT_MAX_EPISODE_STEPS` | `12` | Maximum number of planning steps |

The local model defaults to `Qwen/Qwen3-0.6B` with sampling parameters `temperature=0.7`, `top_p=0.8`, `top_k=20`, `repetition_penalty=1.3`. The episode runs up to `MAX_EPISODE_STEPS = 12` steps. When action parsing fails, the script falls back to an observation-aware action that respects prerequisites.

PowerShell note: older PowerShell versions do not support `&&`. Run commands from the target directory directly, or use `;` as the command separator.

Windows runtime warnings:
- If you see HuggingFace symlink-cache warnings, functionality is unaffected; optionally set `HF_HUB_DISABLE_SYMLINKS_WARNING=1`.
- If you see flash attention / causal-conv fallback warnings, execution continues with a slower PyTorch path.

### 4. GRPO training

`training_script.py` follows the TRL GRPO pattern and uses OpenEnv rewards to score generated action JSON against this environment.

```bash
uv sync --extra train
uv run python training_script.py --dry-run
uv run python training_script.py --model-id Qwen/Qwen3-0.6B
```

For H100, the preferred entrypoint is `training_unsloth.py`, which uses Unsloth 4-bit loading plus LoRA for faster quantized GRPO training:

```bash
uv sync --extra train
uv run python training_unsloth.py --dry-run
uv run python training_unsloth.py --model-id Qwen/Qwen3-4B
```

**Laptop / mid-range GPU (e.g. 12GB VRAM):** Use reduced batch size and sequence length to avoid OOM:

```bash
uv sync --extra train
uv pip install unsloth unsloth_zoo --no-deps   # if using training_unsloth.py
uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dataset-episodes 12 --rollout-steps 6 --per-device-train-batch-size 1 --num-generations 2 --gradient-accumulation-steps 4 --max-seq-length 1024 --trust-remote-code
```

If you still hit OOM, try `--max-seq-length 768` or `--num-generations 1`.

**PyTorch CUDA:** Use the PyTorch index that matches your GPU. For older cards (RTX 20/30/40 series): `uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121`. For **RTX 50 series (Blackwell, sm_120)** you need a CUDA 12.8 build:

```bash
uv pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
uv pip install triton-windows   # required by Unsloth on Windows
```

Key arguments:

| Argument | Default | Description |
|---|---|---|
| `--model-id` | `Qwen/Qwen2.5-7B-Instruct` | Base model to fine-tune |
| `--output-dir` | `training/grpo-output` | Save directory |
| `--dataset-episodes` | `8` | Rollout episodes for prompt dataset |
| `--rollout-steps` | `6` | Steps per episode during collection |
| `--collection-policy` | `heuristic` | `random` or `heuristic` |
| `--reward-backend` | `local` | `local` (in-process) or `remote` (live server) |
| `--base-url` | `http://localhost:8000` | Server URL for remote backend |
| `--scenario-name` | all | Repeatable; restricts which scenarios are used |
| `--domain-randomise` | off | Enable domain randomisation |
| `--num-generations` | `4` | GRPO generations per prompt |
| `--max-completion-length` | `160` | Max tokens for model completions |
| `--max-prompt-length` | `768` | Max tokens for prompts |
| `--learning-rate` | `5e-6` | AdamW learning rate |
| `--dry-run` | off | Build data and test reward without training |

By default the reward function reconstructs prompt states locally so the prompt and reward stay aligned. Switch to a live server-backed reward loop with `--reward-backend remote --base-url http://localhost:8000`.

`training_unsloth.py` adds H100-oriented options such as `--max-seq-length`, `--disable-4bit`, and LoRA settings (`--lora-r`, `--lora-alpha`, `--lora-dropout`). vLLM fast inference is disabled to avoid dependency conflicts.

After training, the script saves plots to the output directory:

- `training_loss.png`
- `training_reward.png`
- `training_metric.png`
- `training_dashboard.png`
- `training_plot_manifest.json`

Use `--plot-metric-key <logged_key>` to force a specific extra metric on the third chart; otherwise the script auto-selects a useful logged metric such as KL or gradient norm.



### 5. Rollout collection

`training/rollout_collection.py` collects direct environment rollouts into trajectory files:

```bash
uv run python -m training.rollout_collection
```

This runs N episodes with a `random` or `heuristic` policy, saves JSON trajectories, and prints evaluation metrics.



### 6. Benchmark and scripted agents

- `training/literature_benchmark.py` runs paper-aligned action sequences and compares outcomes against curated expected findings
- `training/rollout_collection.py` collects direct environment rollouts into trajectory files
- `training_script.py` trains a GRPO policy with OpenEnv reward calls
- `training_unsloth.py` trains a quantized GRPO policy with Unsloth on H100-class GPUs
- `run_agent.py` runs a local language model planner against the environment
- `run_agent_unsloth.py` runs the planner with Unsloth 4-bit loading for faster inference
- `training/trajectory.py` stores trajectories for offline RL, imitation learning, replay, and evaluation
- `training/evaluation.py` computes online, benchmark, expert-review, and fidelity-oriented metrics



## Training utilities

### `training/trajectory.py`

Provides `TrajectoryStep`, `Trajectory`, and `TrajectoryDataset` for episode serialization.

- `TrajectoryStep` stores `action`, `observation`, `reward`, `done`, `reward_breakdown`, and an optional `latent_snapshot`
- `Trajectory` accumulates steps with `add_step()`, computes `total_reward`, and exposes `save(path)` / `load(path)`
- `TrajectoryDataset` wraps a list of trajectories with `filter_successful()`, `save_dir()`, `load_dir()`, and `summary()` (n, success_rate, mean_reward, mean_length, max/min reward)
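
A sketch of round-tripping trajectories with these utilities (method names are documented above; the `TrajectoryStep` constructor arguments are assumed to match its listed fields, and `load_dir` is assumed to be a constructor-style classmethod):

```python
# Hedged trajectory round-trip sketch; see training/trajectory.py for the
# actual signatures. Placeholder values stand in for a real action/observation.
from training.trajectory import Trajectory, TrajectoryDataset, TrajectoryStep

traj = Trajectory()
traj.add_step(TrajectoryStep(action=..., observation=..., reward=0.9,
                             done=False, reward_breakdown={}))
traj.save("rollouts/episode_000.json")

dataset = TrajectoryDataset.load_dir("rollouts/")
print(dataset.summary())  # n, success_rate, mean_reward, mean_length, ...
```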



### `training/evaluation.py`

`EvaluationSuite` is a stateless class with four families of `@staticmethod` methods:

| Family | Method | Metrics |
|---|---|---|
| Online RL | `online_metrics(trajectories)` | `mean_return`, `median_return`, `std_return`, `mean_episode_length`, `success_rate` |
| Offline benchmark | `benchmark_metrics(dataset)` | `pipeline_validity_rate`, `ordering_score`, `action_diversity`, `mean_conclusion_confidence` |
| Expert review | `expert_review_metrics(...)` | Placeholder; averages provided scores |
| Simulator fidelity | `simulator_fidelity_metrics(sim, real)` | `reward_distribution_gap` |
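
A usage sketch (the methods are static per the table above; the `dataset.trajectories` attribute is an assumption about how `TrajectoryDataset` exposes its list):

```python
# Metrics sketch; EvaluationSuite methods are @staticmethod, so no instance.
from training.evaluation import EvaluationSuite
from training.trajectory import TrajectoryDataset

dataset = TrajectoryDataset.load_dir("rollouts/")
print(EvaluationSuite.online_metrics(dataset.trajectories))  # mean_return, ...
print(EvaluationSuite.benchmark_metrics(dataset))  # pipeline_validity_rate, ...
```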



### `training/literature_benchmark.py`

`run_paper_benchmark(problem_statement, scenario_name, domain_randomise)` runs a paper-aligned action pipeline and scores against `expected_findings` using keyword matching. Returns a `PaperBenchmarkResult` with `match_ratio`.
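
A usage sketch (the signature comes from the paragraph above; the problem statement string is illustrative):

```python
# Benchmark sketch; run_paper_benchmark's signature is documented above.
from training.literature_benchmark import run_paper_benchmark

result = run_paper_benchmark(
    problem_statement="Validate SPP1 as a biomarker for pro-fibrotic macrophages.",
    scenario_name="biomarker_validation_lung",
    domain_randomise=False,
)
print(result.match_ratio)  # keyword-match fraction against expected_findings
```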



## Docker deployment

The server ships with a `server/Dockerfile`. It uses a multi-stage build based on `openenv-base`, installs dependencies via `uv`, and starts `uvicorn server.app:app` on port 8000.

```bash
docker build -f server/Dockerfile -t bio-experiment-env .
docker run -p 8000:8000 bio-experiment-env
```

The `openenv.yaml` file configures the deployment for the OpenEnv platform:

```yaml
spec_version: 1
name: hackathon
type: space
runtime: fastapi
app: server.app:app
port: 8000
```



## Why this is useful

This environment is trying to model a realistic scientific planning loop rather than a toy decision problem:

- actions have prerequisites
- outputs are noisy and imperfect
- budget and time matter
- not every correct-looking answer is well supported
- final conclusions are scored against hidden ground truth

That makes it suitable for:

- agent planning benchmarks
- RL experiments on long-horizon scientific reasoning
- literature-grounded evaluation
- comparing structured policies against LLM-driven planners



## Project map

```text
.
├── client.py                     # OpenEnv HTTP/WebSocket client
├── models.py                     # Shared action / observation / task schemas
├── openenv.yaml                  # OpenEnv platform deployment config
├── pyproject.toml                # Package metadata and dependency groups
├── run_agent.py                  # Single-episode interactive agent runner
├── run_agent_unsloth.py          # Quantized Unsloth interactive agent runner
├── server/
│   ├── app.py                    # FastAPI/OpenEnv server entry point
│   ├── Dockerfile                # Multi-stage Docker build
│   ├── hackathon_environment.py  # Main environment orchestration
│   ├── requirements.txt          # Minimal server dependencies
│   ├── rewards/
│   │   └── reward.py             # Decomposed reward model
│   ├── rules/
│   │   └── engine.py             # Biological constraint checking
│   ├── simulator/
│   │   ├── latent_state.py       # Hidden biological, technical, progress, resource state
│   │   ├── noise.py              # Seeded stochastic noise model
│   │   ├── output_generator.py   # Per-action simulated output generation
│   │   └── transition.py         # State transition engine and ACTION_COSTS table
│   ├── subagents/                # Placeholder for future sub-agent integration
│   └── tasks/
│       ├── generator.py          # TaskGenerator with domain randomisation
│       └── scenarios.py          # SCENARIO_LIBRARY with 4 curated scenarios
├── training/
│   ├── evaluation.py             # EvaluationSuite metrics
│   ├── literature_benchmark.py   # Paper-backed benchmark flow
│   ├── rollout_collection.py     # Direct rollout collection helper
│   └── trajectory.py             # Trajectory serialization and dataset utilities
├── training_script.py            # TRL GRPO training entry point
├── training_unsloth.py           # Unsloth quantized GRPO training entry point
└── tests/
    ├── test_environment.py
    ├── test_literature_benchmark.py
    ├── test_models.py
    ├── test_rewards.py
    ├── test_rules.py
    ├── test_simulator.py
    └── test_training_script.py
```



## Quick sanity check

```bash
uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q
```

Those tests verify:

- reset and step lifecycle
- valid vs invalid pipeline behavior
- conclusion termination
- literature-backed scenario selection
- benchmark matching for curated expected findings

Run the full suite with coverage:

```bash
uv run pytest tests/ --cov -q
```